# Worked Example: LLM Memory Estimation on Dell Pro Max GB10

## Problem
Estimate whether Llama 3.3 70B can run on a single Dell Pro Max GB10, and at what precision.
## Given
- Model: Llama 3.3 70B (70 billion parameters)
- Available memory: 128 GB unified LPDDR5X
- Usable memory: ~110 GB (after reserving headroom for the OS, framework, and runtime overhead)
## Calculation

### Step 1: Raw Model Weight Memory
| Precision | Bytes/Param | Memory for 70B |
|---|---|---|
| FP4 | 0.5 | 70 × 0.5 = 35 GB |
| FP8/INT8 | 1.0 | 70 × 1.0 = 70 GB |
| FP16 | 2.0 | 70 × 2.0 = 140 GB |
| FP32 | 4.0 | 70 × 4.0 = 280 GB |
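The weights arithmetic in the table is just parameters × bytes per parameter. A minimal Python sketch (using decimal GB, 1 GB = 10⁹ bytes, to match the table's 70 × bytes/param figures):

```python
# Bytes per parameter at each precision, as listed in Step 1.
BYTES_PER_PARAM = {"FP4": 0.5, "FP8/INT8": 1.0, "FP16": 2.0, "FP32": 4.0}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Estimated raw weight memory in GB: parameters (billions) x bytes/param."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(70, precision):.0f} GB")
```

Running this reproduces the table's 35 / 70 / 140 / 280 GB column for a 70B-parameter model.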
### Step 2: Total Memory with Overhead (1.3x multiplier)
| Precision | Weights | Total (~1.3x) | Fits in 110 GB? |
|---|---|---|---|
| FP4 | 35 GB | ~46 GB | Yes |
| FP8/INT8 | 70 GB | ~91 GB | Yes |
| FP16 | 140 GB | ~182 GB | No |
| FP32 | 280 GB | ~364 GB | No |
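The fits-in-memory check can be sketched the same way. The 1.3x overhead multiplier and the ~110 GB usable budget are the assumptions stated above, not measured values:

```python
OVERHEAD = 1.3    # Step 2 multiplier covering KV cache, activations, framework overhead
USABLE_GB = 110   # usable unified memory on one GB10 after OS/framework overhead

def fits_in_memory(weights_gb: float, usable_gb: float = USABLE_GB) -> bool:
    """True if weights, scaled by the overhead multiplier, fit the usable budget."""
    return weights_gb * OVERHEAD <= usable_gb

for name, weights in [("FP4", 35), ("FP8/INT8", 70), ("FP16", 140), ("FP32", 280)]:
    total = weights * OVERHEAD
    verdict = "fits" if fits_in_memory(weights) else "does not fit"
    print(f"{name}: ~{total:.0f} GB total -> {verdict}")
```

This reproduces the table: ~46 and ~91 GB fit in 110 GB; ~182 and ~364 GB do not.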
### Step 3: Conclusion
- FP4 quantized: Fits comfortably (46/110 GB = 42% utilization). Plenty of room for large KV cache and batch sizes.
- FP8/INT8 quantized: Fits (91/110 GB = 83% utilization). Tight but workable for single-request inference.
- FP16 (half precision): Does NOT fit in a single unit. Would require 2-unit stacking (see multi-unit-stacking).
- FP32 (full precision): Does NOT fit even with stacking.
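As a rough extension of the same arithmetic, the minimum number of stacked units follows from dividing the total by the per-unit usable budget. This idealizes stacking as usable memory scaling linearly with unit count, which real interconnects only approximate:

```python
import math

def units_needed(total_gb: float, usable_per_unit_gb: float = 110) -> int:
    """Minimum number of units whose combined usable memory covers total_gb,
    assuming usable memory adds linearly across units (an idealization)."""
    return math.ceil(total_gb / usable_per_unit_gb)

print(units_needed(182))  # FP16 total: 2 units, consistent with 2-unit stacking
print(units_needed(364))  # FP32 total: 4 units, beyond a 2-unit stack
```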
## Verification
NVIDIA confirms that Llama 3.3 70B runs locally on a single GB10 unit. This is consistent with FP8 or FP4 quantized inference, both of which our calculation shows fit within the memory budget.
## Sources
- Memory specs: memory-and-storage
- Estimation formulas: equations-and-bounds
- Model capabilities: ai-workloads