---
id: ai-workloads
title: "AI Workloads and Model Capabilities"
status: established
source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware, build.nvidia.com/spark playbooks"
related_topics: [gb10-superchip, memory-and-storage, ai-frameworks, multi-unit-stacking]
key_equations: [model-memory-estimate]
key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count, lora, qlora, sft, fsdp, speculative-decoding, nemotron, comfyui, rapids-singlecell]
images: []
examples: [llm-memory-estimation.md]
open_questions:
  - "Tokens/sec for Llama 3.3 70B specifically (only 3B and GPT-OSS-120B benchmarked so far)"
  - "Maximum batch size for inference at various model sizes"
  - "Training from scratch — is it practical for any meaningful model size?"
  - "Speculative decoding speedup factor (tokens/sec improvement not quantified yet)"
---

# AI Workloads and Model Capabilities

The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tuning**, bringing capabilities that previously required cloud or data center hardware to a desktop form factor.

## 1. Headline Capabilities

- **Up to 200 billion parameter models** locally (with quantization)
- **1 PFLOP (1,000 TFLOPS)** at FP4 precision with sparsity
- **Llama 3.3 70B** confirmed to run locally (single unit)
- **Up to 405B parameter models** with two-unit stacking (see [[multi-unit-stacking]])

## 1a. Measured Performance (T2 Benchmarked)

| Model / Workload | Performance | Precision | Source |
|-----------------------|-------------------|-----------|----------------|
| Llama 3.2 3B | ~100 tokens/sec | — | Jeff Geerling |
| GPT-OSS-120B | ~14.5 tokens/sec | INT4 | ServeTheHome |
| Llama 3.1 70B | Competitive w/ Ryzen AI Max+ 395 | — | Jeff Geerling |
| Nemotron-3-Nano 30B | Runs (MoE, 3B active) | Q8_K | build.nvidia.com/spark |
| HPL (Linpack) FP64 | ~675 GFLOPS | FP64 | Jeff Geerling |
| Geekbench 6 | Comparable to Ryzen AI Max+ 395; trails Apple M3 Ultra | — | Jeff Geerling |

**Prompt processing** is noted as a particular strength of the system (T2, Jeff Geerling).

**INT4 inference** on GPT-OSS-120B is roughly equivalent to an RTX 5070's performance (T2, ServeTheHome).

**Nemotron-3-Nano 30B** is a MoE architecture (30B total, 3B active params) requiring ~38 GB GPU memory at Q8. It provides an OpenAI-compatible API via the llama.cpp server. (T1, build.nvidia.com/spark)

## 2. Model Size vs. Memory

With 128 GB of unified memory, the system can hold:

| Precision | Bytes/Param | Max Params (approx) | Example Models |
|-----------|-------------|----------------------|---------------------------|
| FP4 | 0.5 | ~200B+ | Large quantized models |
| FP8/INT8 | 1 | ~100B | Llama 3.3 70B, Mixtral |
| FP16 | 2 | ~50-55B | Medium models at full precision |
| FP32 | 4 | ~25-28B | Small models, debugging |

*Note: Actual usable capacity is less than 128 GB due to OS, KV cache, framework overhead, and activation memory. Estimates assume ~85-90% of memory is available for model weights.*

## 3. Primary Use Cases

### Local LLM Inference

- Run large language models privately, with no cloud dependency
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)

### Fine-Tuning (T1 Documented, build.nvidia.com/spark)

NVIDIA provides official fine-tuning scripts with four approaches:

| Script | Model | Method | Notes |
|--------|-------|--------|-------|
| Full SFT | Llama 3.2 3B | All parameters trainable | Fits in memory at bfloat16 |
| LoRA | Llama 3.1 8B | Parameter-efficient adapters | `lora_rank=8` default |
| LoRA + FSDP | Llama 3.1 70B | Distributed across 2 units | Multi-node via Docker Swarm |
| QLoRA (4-bit) | Llama 3.1 70B | Quantized base + LoRA | Fits on single unit |

- Container: `nvcr.io/nvidia/pytorch:25.11-py3`
- Dependencies: `transformers`, `peft`, `datasets`, `trl`, `bitsandbytes`
- Key params: `--batch_size`, `--seq_length` (default 2048), `--num_epochs`, `--gradient_checkpointing`
- Dataset: Alpaca (configurable `--dataset_size`, default 512 samples)
- Multi-node: Docker Swarm + FSDP for 2-unit distributed training

### AI Prototyping

- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation

### Image Generation (T1 Documented, build.nvidia.com/spark)

- **ComfyUI** confirmed working — node-based UI for Stable Diffusion, SDXL, Flux
- Runs natively on the Blackwell GPU with CUDA 13.0
- See [[ai-frameworks]] §4 for setup details

### Speculative Decoding (T1 Documented, build.nvidia.com/spark)

- Accelerates LLM inference by using a small draft model to predict tokens that are then verified by the large model
- **EAGLE-3:** Built-in drafting head (no separate draft model needed)
- **Draft-Target:** Pairs an 8B draft with a 70B target, using FP4 quantization
- Uses the TensorRT-LLM container (`tensorrt-llm/release:1.2.0rc6`)
- Configurable `max_draft_len` (1-8 tokens) and KV cache memory fraction

### Data Science

- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics

### Scientific Computing (T1 Documented, build.nvidia.com/spark)

**Single-cell RNA Sequencing:**

- RAPIDS-singlecell library (GPU-accelerated, follows the Scanpy API)
- Full scRNA-seq pipeline in ~130 seconds (preprocessing ~21s, clustering/DE ~104s)
- Requires ~40 GB unified memory
- Computes an exact nearest-neighbor graph (vs. Scanpy's approximate)

**Portfolio Optimization:**

- cuOpt LP/MILP solvers + cuML for GPU-accelerated KDE
- Mean-CVaR (Conditional Value-at-Risk) modeling
- Full pipeline in ~7 minutes

### Gaming (bonus, not a primary use case)

Surprisingly, ARM Linux gaming works via FEX (x86-to-ARM translation) plus Steam/Proton:

- Cyberpunk 2077: ~100 fps at 1080p, low settings (T2, Jeff Geerling)
- Doom Eternal: ~200 fps (T2, Jeff Geerling)

**Not recommended as a gaming machine** — this is a development tool, not a GeForce.

## 4. Target Users

- AI researchers and developers
- Privacy-conscious organizations
- Academic institutions
- AI prototyping teams
- Independent developers building AI applications

## Key Relationships

- Compute provided by: [[gb10-superchip]]
- Memory constraints: [[memory-and-storage]]
- Frameworks used: [[ai-frameworks]]
- Scaling beyond single unit: [[multi-unit-stacking]]
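## Appendix: Worked Example (model-memory-estimate)

The §2 sizing table follows from the `model-memory-estimate` rule of thumb: usable weight memory ≈ total memory × usable fraction, and max parameters ≈ usable bytes ÷ bytes per parameter. A minimal sketch, assuming decimal gigabytes and the ~85% end of the usable range; `max_params_billions` is an illustrative helper, not part of any NVIDIA playbook:

```python
GB = 1e9  # decimal gigabyte, as in marketing "128 GB"

def max_params_billions(total_mem_gb: float,
                        bytes_per_param: float,
                        usable_fraction: float = 0.85) -> float:
    """Approximate max parameter count (in billions) that fits in memory
    after reserving headroom for OS, KV cache, and framework overhead."""
    usable_bytes = total_mem_gb * GB * usable_fraction
    return usable_bytes / bytes_per_param / 1e9

# Reproduce the Section 2 table for a 128 GB unified-memory system.
for precision, bpp in [("FP4", 0.5), ("FP8/INT8", 1.0),
                       ("FP16", 2.0), ("FP32", 4.0)]:
    print(f"{precision:9s} ~{max_params_billions(128, bpp):.0f}B params")
```

At FP16 this gives 128 × 0.85 ÷ 2 ≈ 54B, matching the ~50-55B row; the other rows follow the same way.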
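The draft-target loop behind §3's speculative decoding can be sketched with toy stand-in models: a cheap draft proposes up to `max_draft_len` tokens, the target checks them, and the agreed prefix plus one corrected token is kept. This is a conceptual sketch only — the "models" are plain functions over integer tokens, not TensorRT-LLM APIs, and a real verifier scores all draft positions in a single batched forward pass:

```python
def speculative_step(seq, draft_next, target_next, max_draft_len=4):
    """One decoding step: return the tokens accepted this step."""
    # 1. Draft model autoregressively proposes a short continuation.
    proposal, ctx = [], list(seq)
    for _ in range(max_draft_len):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2. Target model verifies position by position; on the first
    #    mismatch, keep the target's token instead and stop.
    accepted, ctx = [], list(seq)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

# Toy stand-ins: the target counts up by 1; the draft agrees except
# that it stumbles whenever the last token is a multiple of 3.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + (2 if ctx[-1] % 3 == 0 else 1)

seq = [0]
while len(seq) < 10:
    seq += speculative_step(seq, draft, target)
print(seq)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Because every accepted run costs roughly one target pass, output matches the target model exactly while most steps emit several tokens — which is why the open question above is the speedup factor, not correctness.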