---
id: ai-workloads
title: "AI Workloads and Model Capabilities"
status: established
source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware, build.nvidia.com/spark playbooks"
related_topics: [gb10-superchip, memory-and-storage, ai-frameworks, multi-unit-stacking]
key_equations: [model-memory-estimate]
key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count, lora, qlora, sft, fsdp, speculative-decoding, nemotron, comfyui, rapids-singlecell]
images: []
examples: [llm-memory-estimation.md]
open_questions:
  - "Tokens/sec for Llama 3.3 70B specifically (only 3B and GPT-OSS-120B benchmarked so far)"
  - "Maximum batch size for inference at various model sizes"
  - "Training from scratch — is it practical for any meaningful model size?"
  - "Speculative decoding speedup factor (tokens/sec improvement not quantified yet)"
---

# AI Workloads and Model Capabilities

The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tuning**, bringing capabilities that previously required cloud or data center hardware to a desktop form factor.

## 1. Headline Capabilities

- **Up to 200 billion parameter models** locally (with quantization)
- **1 PFLOP (1,000 TFLOPS)** at FP4 precision with sparsity
- **Llama 3.3 70B** confirmed to run locally (single unit)
- **Up to 405B parameter models** with two-unit stacking (see [[multi-unit-stacking]])

## 1a. Measured Performance (T2 Benchmarked)

| Model / Workload | Performance | Precision | Source |
|-----------------------|-------------------|-----------|----------------|
| Llama 3.2 3B | ~100 tokens/sec | — | Jeff Geerling |
| GPT-OSS-120B | ~14.5 tokens/sec | INT4 | ServeTheHome |
| Llama 3.1 70B | Competitive w/ Ryzen AI Max+ 395 | — | Jeff Geerling |
| Nemotron-3-Nano 30B | Runs (MoE, 3B active) | Q8_K | build.nvidia.com/spark |
| HPL (Linpack) FP64 | ~675 GFLOPS | FP64 | Jeff Geerling |
| Geekbench 6 | Comparable to Ryzen AI Max+ 395; trails Apple M3 Ultra | — | Jeff Geerling |

**Prompt processing** is noted as a particular strength of the system (T2, Jeff Geerling).

**INT4 inference** on GPT-OSS-120B is roughly equivalent to an RTX 5070's performance (T2, ServeTheHome).

**Nemotron-3-Nano 30B** is a MoE architecture (30B total, 3B active params) requiring ~38 GB GPU memory at Q8. It provides an OpenAI-compatible API via the llama.cpp server. (T1, build.nvidia.com/spark)

## 2. Model Size vs. Memory

With 128 GB of unified memory, the system can hold:

| Precision | Bytes/Param | Max Params (approx) | Example Models |
|-----------|-------------|----------------------|---------------------------|
| FP4 | 0.5 | ~200B+ | Large quantized models |
| FP8/INT8 | 1 | ~100B | Llama 3.3 70B, Mixtral |
| FP16 | 2 | ~50-55B | Medium models at full precision |
| FP32 | 4 | ~25-28B | Small models, debugging |

*Note: Actual usable capacity is less than 128 GB due to OS, KV cache, framework overhead, and activation memory. Estimates assume ~85-90% of memory is available for model weights.*

## 3. Primary Use Cases

### Local LLM Inference

- Run large language models privately, with no cloud dependency
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)

### Fine-Tuning (T1 Documented, build.nvidia.com/spark)

NVIDIA provides official fine-tuning scripts with four approaches:

| Script | Model | Method | Notes |
|--------|-------|--------|-------|
| Full SFT | Llama 3.2 3B | All parameters trainable | Fits in memory at bfloat16 |
| LoRA | Llama 3.1 8B | Parameter-efficient adapters | `lora_rank=8` default |
| LoRA + FSDP | Llama 3.1 70B | Distributed across 2 units | Multi-node via Docker Swarm |
| QLoRA (4-bit) | Llama 3.1 70B | Quantized base + LoRA | Fits on single unit |

- Container: `nvcr.io/nvidia/pytorch:25.11-py3`
- Dependencies: `transformers`, `peft`, `datasets`, `trl`, `bitsandbytes`
- Key params: `--batch_size`, `--seq_length` (default 2048), `--num_epochs`, `--gradient_checkpointing`
- Dataset: Alpaca (configurable `--dataset_size`, default 512 samples)
- Multi-node: Docker Swarm + FSDP for 2-unit distributed training

### AI Prototyping

- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation

### Image Generation (T1 Documented, build.nvidia.com/spark)

- **ComfyUI** confirmed working — node-based UI for Stable Diffusion, SDXL, Flux
- Runs natively on the Blackwell GPU with CUDA 13.0
- See [[ai-frameworks]] §4 for setup details

### Speculative Decoding (T1 Documented, build.nvidia.com/spark)

- Accelerates LLM inference by using a small draft model to predict tokens that are then verified by the large model
- **EAGLE-3:** Built-in drafting head (no separate draft model needed)
- **Draft-Target:** Pairs an 8B draft with a 70B target, using FP4 quantization
- Uses the TensorRT-LLM container (`tensorrt-llm/release:1.2.0rc6`)
- Configurable `max_draft_len` (1-8 tokens) and KV cache memory fraction

### Data Science

- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics

### Scientific Computing (T1 Documented, build.nvidia.com/spark)

**Single-cell RNA Sequencing:**

- RAPIDS-singlecell library (GPU-accelerated, follows the Scanpy API)
- Full scRNA-seq pipeline in ~130 seconds (preprocessing ~21s, clustering/DE ~104s)
- Requires ~40 GB unified memory
- Computes an exact nearest-neighbor graph (vs. Scanpy's approximate)

**Portfolio Optimization:**

- cuOpt LP/MILP solvers + cuML for GPU-accelerated KDE
- Mean-CVaR (Conditional Value-at-Risk) modeling
- Full pipeline in ~7 minutes

### Gaming (bonus, not a primary use case)

Surprisingly, ARM Linux gaming works via FEX (x86-to-ARM translation) plus Steam/Proton:

- Cyberpunk 2077: ~100 fps at 1080p, low settings (T2, Jeff Geerling)
- Doom Eternal: ~200 fps (T2, Jeff Geerling)

**Not recommended as a gaming machine** — this is a development tool, not a GeForce.

## 4. Target Users

- AI researchers and developers
- Privacy-conscious organizations
- Academic institutions
- AI prototyping teams
- Independent developers building AI applications

## Key Relationships

- Compute provided by: [[gb10-superchip]]
- Memory constraints: [[memory-and-storage]]
- Frameworks used: [[ai-frameworks]]
- Scaling beyond single unit: [[multi-unit-stacking]]
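## Appendix: Worked Example (model-memory-estimate)

The §2 sizing table follows from the `model-memory-estimate` rule of thumb: usable weight memory ≈ total memory × usable fraction, and max parameters ≈ usable bytes ÷ bytes per parameter. A minimal sketch, assuming decimal gigabytes and the ~85% end of the usable range; `max_params_billions` is an illustrative helper, not part of any NVIDIA playbook:

```python
GB = 1e9  # decimal gigabyte, as in marketing "128 GB"

def max_params_billions(total_mem_gb: float,
                        bytes_per_param: float,
                        usable_fraction: float = 0.85) -> float:
    """Approximate max parameter count (in billions) that fits in memory
    after reserving headroom for OS, KV cache, and framework overhead."""
    usable_bytes = total_mem_gb * GB * usable_fraction
    return usable_bytes / bytes_per_param / 1e9

# Reproduce the Section 2 table for a 128 GB unified-memory system.
for precision, bpp in [("FP4", 0.5), ("FP8/INT8", 1.0),
                       ("FP16", 2.0), ("FP32", 4.0)]:
    print(f"{precision:9s} ~{max_params_billions(128, bpp):.0f}B params")
```

At FP16 this gives 128 × 0.85 ÷ 2 ≈ 54B, matching the ~50-55B row; the other rows follow the same way.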
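The draft-target loop behind §3's speculative decoding can be sketched with toy stand-in models: a cheap draft proposes up to `max_draft_len` tokens, the target checks them, and the agreed prefix plus one corrected token is kept. This is a conceptual sketch only — the "models" are plain functions over integer tokens, not TensorRT-LLM APIs, and a real verifier scores all draft positions in a single batched forward pass:

```python
def speculative_step(seq, draft_next, target_next, max_draft_len=4):
    """One decoding step: return the tokens accepted this step."""
    # 1. Draft model autoregressively proposes a short continuation.
    proposal, ctx = [], list(seq)
    for _ in range(max_draft_len):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. Target model verifies position by position; on the first
    #    mismatch, keep the target's token instead and stop.
    accepted, ctx = [], list(seq)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy stand-ins: the target counts up by 1; the draft agrees except
# that it stumbles whenever the last token is a multiple of 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + (2 if ctx[-1] % 3 == 0 else 1)

seq = [0]
while len(seq) < 10:
    seq += speculative_step(seq, draft, target)
print(seq)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because every accepted run costs roughly one target pass, output matches the target model exactly while most steps emit several tokens — which is why the open question above is the speedup factor, not correctness.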