---
id: ai-workloads
title: "AI Workloads and Model Capabilities"
status: established
source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech"
related_topics: [gb10-superchip, memory-and-storage, ai-frameworks, multi-unit-stacking]
key_equations: [model-memory-estimate]
key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count]
images: []
examples: [llm-memory-estimation.md]
open_questions:
  - "Actual tokens/sec benchmarks for common models (Llama 3.3 70B, Mixtral, etc.)"
  - "Maximum batch size for inference at various model sizes"
  - "Fine-tuning performance — how long to SFT a 7B model on this hardware?"
  - "Stable Diffusion / image generation performance"
  - "Training from scratch — is it practical for any meaningful model size?"
---

# AI Workloads and Model Capabilities

The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tuning**, bringing capabilities that previously required cloud or data-center hardware to a desktop form factor.

## 1. Headline Capabilities

- **Up to 200 billion parameter models** run locally (with quantization)
- **1 PFLOPS (1,000 TFLOPS)** at FP4 precision
- **Llama 3.3 70B** confirmed to run locally on a single unit
- **Up to 400B parameter models** with two-unit stacking (see [[multi-unit-stacking]])

## 2. Model Size vs. Memory

With 128 GB of unified memory, the system can hold approximately:

| Precision | Bytes per parameter | Max parameters (approx.) | Example models |
|-----------|---------------------|--------------------------|----------------|
| FP4 | 0.5 | ~200B+ | Large quantized models |
| FP8/INT8 | 1 | ~100B | Llama 3.3 70B, Mixtral |
| FP16 | 2 | ~50-55B | Medium models at native precision |
| FP32 | 4 | ~25-28B | Small models, debugging |

*Note: Actual usable capacity is less than 128 GB due to OS, KV cache, framework overhead, and activation memory. Estimates assume ~85-90% of memory is available for model weights.*

## 3. Primary Use Cases

### Local LLM Inference

- Run large language models privately, with no cloud dependency
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)

### Fine-Tuning

- Supervised fine-tuning (SFT) of models using NVIDIA NeMo
- LoRA/QLoRA for parameter-efficient fine-tuning of larger models
- Custom domain adaptation

### AI Prototyping

- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation

### Data Science

- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics

## 4. Target Users

- AI researchers and developers
- Privacy-conscious organizations
- Academic institutions
- AI prototyping teams
- Independent developers building AI applications

## Key Relationships

- Compute provided by: [[gb10-superchip]]
- Memory constraints: [[memory-and-storage]]
- Frameworks used: [[ai-frameworks]]
- Scaling beyond single unit: [[multi-unit-stacking]]
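
## Example: Model-Memory Estimate

The sizing rule behind the model-size table (max parameters ≈ usable memory ÷ bytes per parameter) can be sketched in Python. The 0.875 usable fraction (the midpoint of the ~85-90% assumption above) and the helper names are illustrative choices, not measured or official values:

```python
# Rough model-memory estimate for 128 GB of unified memory.
# Assumption (not from the source): model weights dominate, and only a
# fraction of total memory is usable after OS, framework, KV-cache, and
# activation overhead. 1 GB is taken as 1e9 bytes.

BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "fp16": 2.0, "fp32": 4.0}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory needed to hold the weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[precision]

def max_params_billion(total_gb: float = 128.0,
                       precision: str = "fp8",
                       usable_fraction: float = 0.875) -> float:
    """Largest parameter count (in billions) that fits in usable memory."""
    return total_gb * usable_fraction / BYTES_PER_PARAM[precision]

if __name__ == "__main__":
    # Llama 3.3 70B at FP8: 70 GB of weights, leaving headroom in 128 GB.
    print(f"70B @ fp8 needs ~{weight_memory_gb(70, 'fp8'):.0f} GB of weights")
    for p in ("fp4", "fp8", "fp16", "fp32"):
        print(f"{p}: ~{max_params_billion(precision=p):.0f}B params max")
```

With these assumptions the function reproduces the table's ranges: ~224B at FP4, ~112B at FP8, ~56B at FP16, and ~28B at FP32, each falling near or inside the quoted estimates.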