---
id: ai-workloads
title: "AI Workloads and Model Capabilities"
status: established
source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech"
related_topics: [gb10-superchip, memory-and-storage, ai-frameworks, multi-unit-stacking]
key_equations: [model-memory-estimate]
key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count]
images: []
examples: [llm-memory-estimation.md]
open_questions:
  - "Actual tokens/sec benchmarks for common models (Llama 3.3 70B, Mixtral, etc.)"
  - "Maximum batch size for inference at various model sizes"
  - "Fine-tuning performance — how long to SFT a 7B model on this hardware?"
  - "Stable Diffusion / image generation performance"
  - "Training from scratch — is it practical for any meaningful model size?"
---

# AI Workloads and Model Capabilities

The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tuning**, bringing capabilities that previously required cloud or data-center hardware to a desktop form factor.

## 1. Headline Capabilities

- **Up to 200 billion parameter models** run locally (with quantization)
- **1 PFLOPS (1,000 TFLOPS)** at FP4 precision
- **Llama 3.3 70B** confirmed to run locally on a single unit
- **Up to 400B parameter models** with two-unit stacking (see [[multi-unit-stacking]])

## 2. Model Size vs. Memory

With 128 GB of unified memory, the system can hold approximately:

| Precision | Bytes per parameter | Max parameters (approx.) | Example models |
|-----------|---------------------|--------------------------|----------------|
| FP4 | 0.5 | ~200B+ | Large quantized models |
| FP8/INT8 | 1 | ~100B | Llama 3.3 70B, Mixtral |
| FP16 | 2 | ~50-55B | Medium models at native precision |
| FP32 | 4 | ~25-28B | Small models, debugging |

*Note: Actual usable capacity is less than 128 GB due to OS, KV cache, framework overhead, and activation memory. Estimates assume ~85-90% of memory is available for model weights.*

## 3. Primary Use Cases

### Local LLM Inference

- Run large language models privately, with no cloud dependency
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)

### Fine-Tuning

- Supervised fine-tuning (SFT) of models using NVIDIA NeMo
- LoRA/QLoRA for parameter-efficient fine-tuning of larger models
- Custom domain adaptation

### AI Prototyping

- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation

### Data Science

- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics

## 4. Target Users

- AI researchers and developers
- Privacy-conscious organizations
- Academic institutions
- AI prototyping teams
- Independent developers building AI applications

## Key Relationships

- Compute provided by: [[gb10-superchip]]
- Memory constraints: [[memory-and-storage]]
- Frameworks used: [[ai-frameworks]]
- Scaling beyond single unit: [[multi-unit-stacking]]
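
## Example: Model-Memory Estimate

The sizing rule behind the model-size table (max parameters ≈ usable memory ÷ bytes per parameter) can be sketched in Python. The 0.875 usable fraction (the midpoint of the ~85-90% assumption above) and the helper names are illustrative choices, not measured or official values:

```python
# Rough model-memory estimate for 128 GB of unified memory.
# Assumption (not from the source): model weights dominate, and only a
# fraction of total memory is usable after OS, framework, KV-cache, and
# activation overhead. 1 GB is taken as 1e9 bytes.

BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "fp16": 2.0, "fp32": 4.0}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Memory needed to hold the weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[precision]

def max_params_billion(total_gb: float = 128.0,
                       precision: str = "fp8",
                       usable_fraction: float = 0.875) -> float:
    """Largest parameter count (in billions) that fits in usable memory."""
    return total_gb * usable_fraction / BYTES_PER_PARAM[precision]

if __name__ == "__main__":
    # Llama 3.3 70B at FP8: 70 GB of weights, leaving headroom in 128 GB.
    print(f"70B @ fp8 needs ~{weight_memory_gb(70, 'fp8'):.0f} GB of weights")
    for p in ("fp4", "fp8", "fp16", "fp32"):
        print(f"{p}: ~{max_params_billion(precision=p):.0f}B params max")
```

With these assumptions the function reproduces the table's ranges: ~224B at FP4, ~112B at FP8, ~56B at FP16, and ~28B at FP32, each falling near or inside the quoted estimates.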