| id | title | status | source_sections | related_topics | key_equations | key_terms | images | examples | open_questions |
|---|---|---|---|---|---|---|---|---|---|
| ai-workloads | AI Workloads and Model Capabilities | established | Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware, build.nvidia.com/spark playbooks | [gb10-superchip memory-and-storage ai-frameworks multi-unit-stacking] | [model-memory-estimate] | [llm inference fine-tuning quantization fp4 fp8 fp16 parameter-count lora qlora sft fsdp speculative-decoding nemotron comfyui rapids-singlecell] | [] | [llm-memory-estimation.md] | [Tokens/sec for Llama 3.3 70B specifically (only 3B and GPT-OSS-120B benchmarked so far); Maximum batch size for inference at various model sizes; Training from scratch — is it practical for any meaningful model size?; Speculative decoding speedup factor (tokens/sec improvement not quantified yet)] |
AI Workloads and Model Capabilities
The Dell Pro Max GB10 is designed primarily for local AI inference and fine-tuning, bringing capabilities that previously required cloud or data-center hardware to a desktop form factor.
1. Headline Capabilities
- Up to 200 billion parameter models locally (with quantization)
- 1 PFLOP (1,000 TFLOPS) at FP4 precision with sparsity
- Llama 3.3 70B confirmed to run locally (single unit)
- Up to 405B parameter models with two-unit stacking (see multi-unit-stacking)
1a. Measured Performance (T2 Benchmarked)
| Model / Workload | Performance | Precision | Source |
|---|---|---|---|
| Llama 3.2 3B | ~100 tokens/sec | — | Jeff Geerling |
| GPT-OSS-120B | ~14.5 tokens/sec | INT4 | ServeTheHome |
| Llama 3.1 70B | Competitive w/ Ryzen AI Max+ 395 | — | Jeff Geerling |
| Nemotron-3-Nano 30B | Runs (MoE, 3B active) | Q8_K | build.nvidia.com/spark |
| HPL (Linpack) FP64 | ~675 GFLOPS | FP64 | Jeff Geerling |
| Geekbench 6 | Comparable to Ryzen AI Max+ 395; trails Apple M3 Ultra | — | Jeff Geerling |
Prompt processing is noted as a particular strength of the system (T2, Jeff Geerling).
INT4 inference on GPT-OSS-120B is roughly equivalent to an RTX 5070's performance (T2, ServeTheHome).
Nemotron-3-Nano 30B is an MoE architecture (30B total, 3B active params) requiring ~38 GB GPU memory at Q8. It provides an OpenAI-compatible API via the llama.cpp server. (T1, build.nvidia.com/spark)
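The ~38 GB figure is consistent with a simple weight-memory estimate plus runtime overhead. A rough sketch (the overhead split is an assumption, and Q8_K is in practice slightly more than 1 byte per parameter):

```python
# Back-of-envelope check on the ~38 GB requirement: Q8 stores roughly
# one byte per parameter, and all 30B MoE parameters must be resident
# even though only 3B are active per token. The remainder is KV cache
# plus framework/activation overhead (assumed split, not measured).
total_params = 30e9
weights_gb = total_params * 1 / 1e9     # ~30 GB of weights at Q8
overhead_gb = 38 - weights_gb           # ~8 GB implied runtime overhead
print(weights_gb, overhead_gb)          # 30.0 8.0
```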
2. Model Size vs. Memory
With 128 GB of unified memory, the system can hold:
| Precision | Bytes/Param | Max Params (approx) | Example Models |
|---|---|---|---|
| FP4 | 0.5 | ~200B+ | Large quantized models |
| FP8/INT8 | 1 | ~100B | Llama 3.3 70B, Mixtral |
| FP16 | 2 | ~50-55B | Medium models at full precision |
| FP32 | 4 | ~25-28B | Small models, debugging |
Note: Actual usable capacity is less than 128 GB due to OS, KV cache, framework overhead, and activation memory. Estimates assume ~85-90% of memory is available for model weights.
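The ceilings in the table follow from a one-line estimate: usable bytes divided by bytes per parameter. A minimal sketch, using the lower bound (85%) of the note's usable-memory assumption:

```python
def max_params(total_gb: float = 128, bytes_per_param: float = 2,
               usable_frac: float = 0.85) -> float:
    """Approximate parameter ceiling for a given precision, leaving
    headroom for the OS, KV cache, framework, and activations."""
    usable_bytes = total_gb * 1e9 * usable_frac
    return usable_bytes / bytes_per_param

for prec, b in [("FP4", 0.5), ("FP8/INT8", 1), ("FP16", 2), ("FP32", 4)]:
    print(f"{prec:>8}: ~{max_params(bytes_per_param=b) / 1e9:.0f}B params")
```

This reproduces the table's rows to within rounding (~218B at FP4, ~109B at FP8, ~54B at FP16, ~27B at FP32).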
3. Primary Use Cases
Local LLM Inference
- Run large language models privately, no cloud dependency
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)
Fine-Tuning (T1 Documented, build.nvidia.com/spark)
NVIDIA provides official fine-tuning scripts with four approaches:
| Script | Model | Method | Notes |
|---|---|---|---|
| Full SFT | Llama 3.2 3B | All parameters trainable | Fits in memory at bfloat16 |
| LoRA | Llama 3.1 8B | Parameter-efficient adapters | lora_rank=8 default |
| LoRA + FSDP | Llama 3.1 70B | Distributed across 2 units | Multi-node via Docker Swarm |
| QLoRA (4-bit) | Llama 3.1 70B | Quantized base + LoRA | Fits on single unit |
- Container: nvcr.io/nvidia/pytorch:25.11-py3
- Dependencies: transformers, peft, datasets, trl, bitsandbytes
- Key params: --batch_size, --seq_length (default 2048), --num_epochs, --gradient_checkpointing
- Dataset: Alpaca (configurable --dataset_size, default 512 samples)
- Multi-node: Docker Swarm + FSDP for 2-unit distributed training
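A quick way to see why LoRA and QLoRA fit on this hardware where full fine-tuning of a 70B model would not: the adapter's trainable-parameter count grows linearly with the layer dimensions rather than quadratically. A minimal sketch using the table's rank-8 default (the 8192 hidden size is an illustrative stand-in, not the exact Llama layer shape):

```python
# LoRA freezes a d_in x d_out weight matrix and trains only two
# low-rank factors: A (d_in x rank) and B (rank x d_out).

def lora_params(d_in: int, d_out: int, rank: int = 8) -> int:
    """Trainable parameters for one rank-r LoRA adapter."""
    return rank * (d_in + d_out)

d = 8192                    # hidden size of a large transformer (illustrative)
full = d * d                # trainable params if the matrix itself is tuned
lora = lora_params(d, d)    # trainable params with a rank-8 adapter

print(f"full: {full:,}  lora: {lora:,}  ratio: {full // lora}x")
```

At rank 8 the adapter trains ~0.2% of the parameters of the matrix it adapts; QLoRA additionally quantizes the frozen base weights to 4-bit, which is what lets the 70B variant fit on a single unit.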
AI Prototyping
- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation
Image Generation (T1 Documented, build.nvidia.com/spark)
- ComfyUI confirmed working — node-based UI for Stable Diffusion, SDXL, Flux
- Runs natively on Blackwell GPU with CUDA 13.0
- See ai-frameworks §4 for setup details
Speculative Decoding (T1 Documented, build.nvidia.com/spark)
- Accelerates LLM inference by using a small draft model to predict tokens verified by the large model
- EAGLE-3: Built-in drafting head (no separate model needed)
- Draft-Target: Pairs 8B draft + 70B target with FP4 quantization
- Uses TensorRT-LLM container (tensorrt-llm/release:1.2.0rc6)
- Configurable max_draft_len (1-8 tokens) and KV cache memory fraction
Data Science
- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics
Scientific Computing (T1 Documented, build.nvidia.com/spark)
Single-cell RNA Sequencing:
- RAPIDS-singlecell library (GPU-accelerated, follows Scanpy API)
- Full scRNA-seq pipeline in ~130 seconds (preprocessing ~21s, clustering/DE ~104s)
- Requires ~40 GB unified memory
- Computes exact nearest-neighbor graph (vs. Scanpy's approximate)
Portfolio Optimization:
- cuOpt LP/MILP solvers + cuML for GPU-accelerated KDE
- Mean-CVaR (Conditional Value-at-Risk) modeling
- Full pipeline in ~7 minutes
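For reference on the objective, CVaR at level alpha is the expected loss in the worst (1 - alpha) tail beyond the VaR quantile. A plain-Python illustration with made-up loss samples (the playbook itself solves the optimization with cuOpt and estimates densities with cuML on GPU):

```python
def cvar(losses, alpha=0.95):
    """Conditional Value-at-Risk: mean loss in the worst (1 - alpha) tail."""
    s = sorted(losses)
    cut = int(len(s) * alpha)      # index of the VaR quantile
    tail = s[cut:] or [s[-1]]      # worst (1 - alpha) fraction of samples
    return sum(tail) / len(tail)

losses = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # hypothetical loss scenarios
print(cvar(losses, alpha=0.8))  # mean of the worst 20%: (9 + 10) / 2 = 9.5
```

Mean-CVaR optimization then maximizes expected return subject to a bound on this tail statistic, which is what makes it an LP/MILP problem cuOpt can solve.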
Gaming (bonus, not primary use case)
Surprisingly, ARM Linux gaming works via FEX (x86-to-ARM translation) + Steam/Proton:
- Cyberpunk 2077: ~100 fps at 1080p, low settings (T2, Jeff Geerling)
- Doom Eternal: ~200 fps (T2, Jeff Geerling)
Not recommended as a gaming machine — this is a development tool, not a GeForce.
4. Target Users
- AI researchers and developers
- Privacy-conscious organizations
- Academic institutions
- AI prototyping teams
- Independent developers building AI applications
Key Relationships
- Compute provided by: gb10-superchip
- Memory constraints: memory-and-storage
- Frameworks used: ai-frameworks
- Scaling beyond single unit: multi-unit-stacking