| id | title | status | source_sections | related_topics | key_equations | key_terms | images | examples | open_questions |
|---|---|---|---|---|---|---|---|---|---|
| ai-workloads | AI Workloads and Model Capabilities | established | Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware, build.nvidia.com/spark playbooks | [gb10-superchip memory-and-storage ai-frameworks multi-unit-stacking] | [model-memory-estimate] | [llm inference fine-tuning quantization fp4 fp8 fp16 parameter-count lora qlora sft fsdp speculative-decoding nemotron comfyui rapids-singlecell] | [] | [llm-memory-estimation.md] | [Tokens/sec for Llama 3.3 70B specifically (only 3B and GPT-OSS-120B benchmarked so far); Maximum batch size for inference at various model sizes; Training from scratch — is it practical for any meaningful model size?; Speculative decoding speedup factor (tokens/sec improvement not quantified yet)] |
AI Workloads and Model Capabilities
The Dell Pro Max GB10 is designed primarily for local AI inference and fine-tuning, bringing capabilities that previously required cloud or data-center hardware to a desktop form factor.
1. Headline Capabilities
- Up to 200 billion parameter models locally (with quantization)
- 1 PFLOP (1,000 TFLOPS) at FP4 precision with sparsity
- Llama 3.3 70B confirmed to run locally (single unit)
- Up to 405B parameter models with two-unit stacking (see multi-unit-stacking)
1a. Measured Performance (T2 Benchmarked)
| Model / Workload | Performance | Precision | Source |
|---|---|---|---|
| Llama 3.2 3B | ~100 tokens/sec | — | Jeff Geerling |
| GPT-OSS-120B | ~14.5 tokens/sec | INT4 | ServeTheHome |
| Llama 3.1 70B | Competitive w/ Ryzen AI Max+ 395 | — | Jeff Geerling |
| Nemotron-3-Nano 30B | Runs (MoE, 3B active) | Q8_K | build.nvidia.com/spark |
| HPL (Linpack) FP64 | ~675 GFLOPS | FP64 | Jeff Geerling |
| Geekbench 6 | Comparable to Ryzen AI Max+ 395; trails Apple M3 Ultra | — | Jeff Geerling |
Prompt processing is noted as a particular strength of the system (T2, Jeff Geerling).
INT4 inference on GPT-OSS-120B is roughly equivalent to an RTX 5070's performance (T2, ServeTheHome).
Nemotron-3-Nano 30B is an MoE architecture (30B total, 3B active params) requiring ~38 GB GPU memory at Q8. It provides an OpenAI-compatible API via the llama.cpp server. (T1, build.nvidia.com/spark)
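The ~38 GB figure is consistent with a simple weight-memory estimate plus runtime overhead. A rough sketch (the overhead split is an assumption, and Q8_K is in practice slightly more than 1 byte per parameter):

```python
# Back-of-envelope check on the ~38 GB requirement: Q8 stores roughly
# one byte per parameter, and all 30B MoE parameters must be resident
# even though only 3B are active per token. The remainder is KV cache
# plus framework/activation overhead (assumed split, not measured).
total_params = 30e9
weights_gb = total_params * 1 / 1e9     # ~30 GB of weights at Q8
overhead_gb = 38 - weights_gb           # ~8 GB implied runtime overhead
print(weights_gb, overhead_gb)          # 30.0 8.0
```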
2. Model Size vs. Memory
With 128 GB of unified memory, the system can hold:
| Precision | Bytes/Param | Max Params (approx) | Example Models |
|---|---|---|---|
| FP4 | 0.5 | ~200B+ | Large quantized models |
| FP8/INT8 | 1 | ~100B | Llama 3.3 70B, Mixtral |
| FP16 | 2 | ~50-55B | Medium models at full precision |
| FP32 | 4 | ~25-28B | Small models, debugging |
Note: Actual usable capacity is less than 128 GB due to OS, KV cache, framework overhead, and activation memory. Estimates assume ~85-90% of memory is available for model weights.
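The ceilings in the table follow from a one-line estimate: usable bytes divided by bytes per parameter. A minimal sketch, using the lower bound (85%) of the note's usable-memory assumption:

```python
def max_params(total_gb: float = 128, bytes_per_param: float = 2,
               usable_frac: float = 0.85) -> float:
    """Approximate parameter ceiling for a given precision, leaving
    headroom for the OS, KV cache, framework, and activations."""
    usable_bytes = total_gb * 1e9 * usable_frac
    return usable_bytes / bytes_per_param

for prec, b in [("FP4", 0.5), ("FP8/INT8", 1), ("FP16", 2), ("FP32", 4)]:
    print(f"{prec:>8}: ~{max_params(bytes_per_param=b) / 1e9:.0f}B params")
```

This reproduces the table's rows to within rounding (~218B at FP4, ~109B at FP8, ~54B at FP16, ~27B at FP32).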
3. Primary Use Cases
Local LLM Inference
- Run large language models privately, no cloud dependency
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)
Fine-Tuning (T1 Documented, build.nvidia.com/spark)
NVIDIA provides official fine-tuning scripts with four approaches:
| Script | Model | Method | Notes |
|---|---|---|---|
| Full SFT | Llama 3.2 3B | All parameters trainable | Fits in memory at bfloat16 |
| LoRA | Llama 3.1 8B | Parameter-efficient adapters | lora_rank=8 default |
| LoRA + FSDP | Llama 3.1 70B | Distributed across 2 units | Multi-node via Docker Swarm |
| QLoRA (4-bit) | Llama 3.1 70B | Quantized base + LoRA | Fits on single unit |
- Container: nvcr.io/nvidia/pytorch:25.11-py3
- Dependencies: transformers, peft, datasets, trl, bitsandbytes
- Key params: --batch_size, --seq_length (default 2048), --num_epochs, --gradient_checkpointing
- Dataset: Alpaca (configurable --dataset_size, default 512 samples)
- Multi-node: Docker Swarm + FSDP for 2-unit distributed training
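A quick way to see why LoRA and QLoRA fit on this hardware where full fine-tuning of a 70B model would not: the adapter's trainable-parameter count grows linearly with the layer dimensions rather than quadratically. A minimal sketch using the table's rank-8 default (the 8192 hidden size is an illustrative stand-in, not the exact Llama layer shape):

```python
# LoRA freezes a d_in x d_out weight matrix and trains only two
# low-rank factors: A (d_in x rank) and B (rank x d_out).

def lora_params(d_in: int, d_out: int, rank: int = 8) -> int:
    """Trainable parameters for one rank-r LoRA adapter."""
    return rank * (d_in + d_out)

d = 8192                    # hidden size of a large transformer (illustrative)
full = d * d                # trainable params if the matrix itself is tuned
lora = lora_params(d, d)    # trainable params with a rank-8 adapter

print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

At rank 8 the adapter trains ~0.2% of the parameters of the matrix it adapts; QLoRA additionally quantizes the frozen base weights to 4-bit, which is what lets the 70B variant fit on a single unit.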
AI Prototyping
- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation
Image Generation (T1 Documented, build.nvidia.com/spark)
- ComfyUI confirmed working — node-based UI for Stable Diffusion, SDXL, Flux
- Runs natively on Blackwell GPU with CUDA 13.0
- See ai-frameworks §4 for setup details
Speculative Decoding (T1 Documented, build.nvidia.com/spark)
- Accelerates LLM inference by using a small draft model to predict tokens verified by the large model
- EAGLE-3: Built-in drafting head (no separate model needed)
- Draft-Target: Pairs 8B draft + 70B target with FP4 quantization
- Uses TensorRT-LLM container (tensorrt-llm/release:1.2.0rc6)
- Configurable max_draft_len (1-8 tokens) and KV cache memory fraction
Data Science
- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics
Scientific Computing (T1 Documented, build.nvidia.com/spark)
Single-cell RNA Sequencing:
- RAPIDS-singlecell library (GPU-accelerated, follows Scanpy API)
- Full scRNA-seq pipeline in ~130 seconds (preprocessing ~21s, clustering/DE ~104s)
- Requires ~40 GB unified memory
- Computes exact nearest-neighbor graph (vs. Scanpy's approximate)
Portfolio Optimization:
- cuOpt LP/MILP solvers + cuML for GPU-accelerated KDE
- Mean-CVaR (Conditional Value-at-Risk) modeling
- Full pipeline in ~7 minutes
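For reference on the objective, CVaR at level alpha is the expected loss in the worst (1 - alpha) tail beyond the VaR quantile. A plain-Python illustration with made-up loss samples (the playbook itself solves the optimization with cuOpt and estimates densities with cuML on GPU):

```python
def cvar(losses, alpha=0.95):
    """Conditional Value-at-Risk: mean loss in the worst (1 - alpha) tail."""
    s = sorted(losses)
    cut = int(len(s) * alpha)      # index of the VaR quantile
    tail = s[cut:] or [s[-1]]      # worst (1 - alpha) fraction of samples
    return sum(tail) / len(tail)

losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical loss scenarios
print(cvar(losses, alpha=0.8))  # mean of the worst 20%: (9 + 10) / 2 = 9.5
```

Mean-CVaR optimization then maximizes expected return subject to a bound on this tail statistic, which is what makes it an LP/MILP problem cuOpt can solve.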
Gaming (bonus, not primary use case)
Surprisingly, ARM Linux gaming works via FEX (x86-to-ARM translation) + Steam/Proton:
- Cyberpunk 2077: ~100 fps at 1080p, low settings (T2, Jeff Geerling)
- Doom Eternal: ~200 fps (T2, Jeff Geerling)
Not recommended as a gaming machine — this is a development tool, not a GeForce.
4. Target Users
- AI researchers and developers
- Privacy-conscious organizations
- Academic institutions
- AI prototyping teams
- Independent developers building AI applications
Key Relationships
- Compute provided by: gb10-superchip
- Memory constraints: memory-and-storage
- Frameworks used: ai-frameworks
- Scaling beyond single unit: multi-unit-stacking