Worked Example: LLM Memory Estimation on Dell Pro Max GB10

Problem

Estimate whether Llama 3.3 70B can run on a single Dell Pro Max GB10, and at what precision.

Given

  • Model: Llama 3.3 70B (70 billion parameters)
  • Available memory: 128 GB unified LPDDR5X
  • Usable memory: ~110 GB (after OS, framework, and other overhead)

Calculation

Step 1: Raw Model Weight Memory

Precision   Bytes/Param   Memory for 70B
FP4         0.5           70 × 0.5 = 35 GB
FP8/INT8    1.0           70 × 1.0 = 70 GB
FP16        2.0           70 × 2.0 = 140 GB
FP32        4.0           70 × 4.0 = 280 GB
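The weight-memory arithmetic above can be sketched in a few lines of Python (the parameter count and bytes-per-parameter values come from the table; this is a back-of-envelope estimate, not a measured footprint):

```python
# Back-of-envelope weight memory for a 70B-parameter model.
PARAMS_BILLION = 70  # Llama 3.3 70B

# Bytes per parameter at each precision, per the table above.
BYTES_PER_PARAM = {"FP4": 0.5, "FP8/INT8": 1.0, "FP16": 2.0, "FP32": 4.0}

def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Raw weight memory in GB: parameter count x bytes per parameter."""
    return params_billion * bytes_per_param

for precision, bpp in BYTES_PER_PARAM.items():
    print(f"{precision:9s} {weight_memory_gb(PARAMS_BILLION, bpp):6.1f} GB")
```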

Step 2: Total Memory with Overhead (1.3x multiplier)

Precision   Weights   Total (~1.3×)   Fits in 110 GB?
FP4         35 GB     ~46 GB          Yes
FP8/INT8    70 GB     ~91 GB          Yes
FP16        140 GB    ~182 GB         No
FP32        280 GB    ~364 GB         No
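The overhead step folds into the same estimate. A minimal sketch, assuming the 1.3× multiplier and the ~110 GB usable figure from the problem statement (function names here are illustrative):

```python
OVERHEAD = 1.3    # rough multiplier for KV cache, activations, framework overhead
USABLE_GB = 110   # usable memory on a single unit, per the problem statement

def total_memory_gb(weights_gb: float, overhead: float = OVERHEAD) -> float:
    """Weights plus estimated runtime overhead."""
    return weights_gb * overhead

def fits_single_unit(weights_gb: float, budget_gb: float = USABLE_GB) -> bool:
    """Does the overhead-adjusted total fit in one unit's usable memory?"""
    return total_memory_gb(weights_gb) <= budget_gb

for precision, weights in [("FP4", 35), ("FP8/INT8", 70), ("FP16", 140), ("FP32", 280)]:
    print(f"{precision:9s} ~{total_memory_gb(weights):5.1f} GB  fits: {fits_single_unit(weights)}")
```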

Step 3: Conclusion

  • FP4 quantized: Fits comfortably (46/110 GB = 42% utilization). Plenty of room for large KV cache and batch sizes.
  • FP8/INT8 quantized: Fits (91/110 GB = 83% utilization). Tight but workable for single-request inference.
  • FP16 (half precision): Does NOT fit on a single unit; the ~182 GB total would require 2-unit stacking (see multi-unit-stacking).
  • FP32 (full precision): Does NOT fit even with 2-unit stacking (~364 GB vs. ~220 GB usable across two units).
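The stacking check in the conclusion reduces to a units-needed calculation. A sketch, assuming usable memory scales linearly with unit count and ignoring inter-unit communication overhead:

```python
import math

USABLE_PER_UNIT_GB = 110  # usable memory per unit, per the problem statement

def units_needed(total_gb: float, per_unit_gb: float = USABLE_PER_UNIT_GB) -> int:
    """Minimum number of units whose combined usable memory covers the total."""
    return math.ceil(total_gb / per_unit_gb)

print(units_needed(182))  # FP16: 2 units, within reach of 2-unit stacking
print(units_needed(364))  # FP32: 4 units, beyond a 2-unit stack
```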

Verification

NVIDIA confirms that Llama 3.3 70B runs locally on a single GB10 unit. This is consistent with FP8 or FP4 quantized inference, which our calculation shows fits within the memory budget.

Sources