# Worked Example: LLM Memory Estimation on Dell Pro Max GB10

## Problem
Estimate whether Llama 3.3 70B can run on a single Dell Pro Max GB10, and at what precision.
## Given
- Model: Llama 3.3 70B (70 billion parameters)
- Available memory: 128 GB unified LPDDR5X
- Usable memory: ~110 GB (after reserving headroom for the OS, framework, and runtime overhead)
## Calculation

### Step 1: Raw Model Weight Memory
| Precision | Bytes/Param | Memory for 70B |
|---|---|---|
| FP4 | 0.5 | 70 × 0.5 = 35 GB |
| FP8/INT8 | 1.0 | 70 × 1.0 = 70 GB |
| FP16 | 2.0 | 70 × 2.0 = 140 GB |
| FP32 | 4.0 | 70 × 4.0 = 280 GB |
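The weights arithmetic in the table is just parameters × bytes per parameter. A minimal Python sketch (using decimal GB, 1 GB = 10⁹ bytes, to match the table's 70 × bytes/param figures):

```python
# Bytes per parameter at each precision, as listed in Step 1.
BYTES_PER_PARAM = {"FP4": 0.5, "FP8/INT8": 1.0, "FP16": 2.0, "FP32": 4.0}

def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Estimated raw weight memory in GB: parameters (billions) x bytes/param."""
    return params_billions * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(70, precision):.0f} GB")
```

Running this reproduces the table's 35 / 70 / 140 / 280 GB column for a 70B-parameter model.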
### Step 2: Total Memory with Overhead (1.3x multiplier)
| Precision | Weights | Total (~1.3x) | Fits in 110 GB? |
|---|---|---|---|
| FP4 | 35 GB | ~46 GB | Yes |
| FP8/INT8 | 70 GB | ~91 GB | Yes |
| FP16 | 140 GB | ~182 GB | No |
| FP32 | 280 GB | ~364 GB | No |
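The fits-in-memory check can be sketched the same way. The 1.3x overhead multiplier and the ~110 GB usable budget are the assumptions stated above, not measured values:

```python
OVERHEAD = 1.3    # Step 2 multiplier covering KV cache, activations, framework overhead
USABLE_GB = 110   # usable unified memory on one GB10 after OS/framework overhead

def fits_in_memory(weights_gb: float, usable_gb: float = USABLE_GB) -> bool:
    """True if weights, scaled by the overhead multiplier, fit the usable budget."""
    return weights_gb * OVERHEAD <= usable_gb

for name, weights in [("FP4", 35), ("FP8/INT8", 70), ("FP16", 140), ("FP32", 280)]:
    total = weights * OVERHEAD
    verdict = "fits" if fits_in_memory(weights) else "does not fit"
    print(f"{name}: ~{total:.0f} GB total -> {verdict}")
```

This reproduces the table: ~46 and ~91 GB fit in 110 GB; ~182 and ~364 GB do not.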
### Step 3: Conclusion
- FP4 quantized: Fits comfortably (46/110 GB = 42% utilization). Plenty of room for large KV cache and batch sizes.
- FP8/INT8 quantized: Fits (91/110 GB = 83% utilization). Tight but workable for single-request inference.
- FP16 (half precision): Does NOT fit in a single unit. Would require 2-unit stacking (see multi-unit-stacking).
- FP32 (full precision): Does NOT fit even with stacking.
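As a rough extension of the same arithmetic, the minimum number of stacked units follows from dividing the total by the per-unit usable budget. This idealizes stacking as usable memory scaling linearly with unit count, which real interconnects only approximate:

```python
import math

def units_needed(total_gb: float, usable_per_unit_gb: float = 110) -> int:
    """Minimum number of units whose combined usable memory covers total_gb,
    assuming usable memory adds linearly across units (an idealization)."""
    return math.ceil(total_gb / usable_per_unit_gb)

print(units_needed(182))  # FP16 total: 2 units, consistent with 2-unit stacking
print(units_needed(364))  # FP32 total: 4 units, beyond a 2-unit stack
```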
## Verification
NVIDIA confirms that Llama 3.3 70B runs locally on a single GB10 unit. This is consistent with FP8 or FP4 quantized inference, both of which our calculation shows fit within the memory budget.
## Sources
- Memory specs: memory-and-storage
- Estimation formulas: equations-and-bounds
- Model capabilities: ai-workloads