| id | title | status | source_sections | related_topics | key_equations | key_terms | images | examples | open_questions |
|---|---|---|---|---|---|---|---|---|---|
| ai-workloads | AI Workloads and Model Capabilities | established | Web research: NVIDIA newsroom, Dell product page, WCCFTech | [gb10-superchip memory-and-storage ai-frameworks multi-unit-stacking] | [model-memory-estimate] | [llm inference fine-tuning quantization fp4 fp8 fp16 parameter-count] | [] | [llm-memory-estimation.md] | [Actual tokens/sec benchmarks for common models (Llama 3.3 70B, Mixtral, etc.); maximum batch size for inference at various model sizes; fine-tuning performance: how long to SFT a 7B model on this hardware?; Stable Diffusion / image generation performance; training from scratch: is it practical for any meaningful model size?] |
# AI Workloads and Model Capabilities
The Dell Pro Max GB10 is designed primarily for local AI inference and fine-tuning, bringing capabilities that previously required cloud or data-center hardware to a desktop form factor.
## 1. Headline Capabilities
- Up to 200 billion parameter models locally (with quantization)
- 1 PFLOP (1,000 TFLOPS) at FP4 precision
- Llama 3.3 70B confirmed to run locally (single unit)
- Up to 400B parameter models with two-unit stacking (see multi-unit-stacking)
## 2. Model Size vs. Memory
With 128 GB of unified memory, the system can hold:
| Precision | Bytes per Parameter | Max Params (approx.) | Example Models |
|---|---|---|---|
| FP4 | 0.5 | ~200B+ | Large quantized models |
| FP8/INT8 | 1 | ~100B | Llama 3.3 70B, Mixtral |
| FP16 | 2 | ~50-55B | Medium models at full precision |
| FP32 | 4 | ~25-28B | Small models, debugging |
Note: Actual usable capacity is less than 128 GB due to OS, KV cache, framework overhead, and activation memory. Estimates assume ~85-90% of memory available for model weights.
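The estimates above follow directly from dividing usable memory by bytes per parameter. A minimal sketch, assuming decimal gigabytes and the ~85% usable-memory figure from the note (both are assumptions chosen to match the table, not vendor-published numbers):

```python
def max_params(total_gb=128, bytes_per_param=2.0, usable_fraction=0.85):
    """Estimate how many model parameters fit in unified memory.

    Assumes ~85% of memory is available for weights after OS, KV cache,
    framework overhead, and activations (per the note above).
    """
    usable_bytes = total_gb * 1e9 * usable_fraction
    return usable_bytes / bytes_per_param

# FP16 on a 128 GB system: ~54 billion parameters of raw weight
# capacity, consistent with the ~50-55B row in the table.
print(f"{max_params(bytes_per_param=2.0) / 1e9:.0f}B")   # prints "54B"
# FP4: ~218B, consistent with the "~200B+ (with quantization)" claim.
print(f"{max_params(bytes_per_param=0.5) / 1e9:.0f}B")   # prints "218B"
```

Varying `usable_fraction` between 0.85 and 0.90 reproduces the ranges given in the table.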
## 3. Primary Use Cases
### Local LLM Inference
- Run large language models privately, no cloud dependency
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)
### Fine-Tuning
- Supervised fine-tuning (SFT) of models using NVIDIA NeMo
- LoRA/QLoRA for parameter-efficient fine-tuning of larger models
- Custom domain adaptation
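LoRA keeps fine-tuning feasible within 128 GB because it trains only small low-rank adapter matrices (A: r x d_in, B: d_out x r) instead of the full weights. A hedged back-of-envelope count; the hidden size, layer count, and "four projections per block" figures below are illustrative assumptions for a 70B-class transformer, not GB10-specific data:

```python
def lora_trainable_params(d_model, rank, n_layers, matrices_per_layer=4):
    """Rough count of LoRA trainable parameters.

    Assumes square projections (d_out == d_in == d_model) and that
    adapters are attached to `matrices_per_layer` weight matrices per
    transformer block (e.g. the Q, K, V, and O projections).
    """
    per_matrix = rank * (d_model + d_model)  # A is r x d_model, B is d_model x r
    return per_matrix * matrices_per_layer * n_layers

# Illustrative 70B-class shape: d_model=8192, 80 layers, rank 16
# -> ~84M trainable parameters, a tiny fraction of the base model.
print(lora_trainable_params(8192, 16, 80))
```

Because only these adapter weights need gradients and optimizer state, the bulk of the 128 GB remains available for the (optionally quantized, in QLoRA) frozen base model.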
### AI Prototyping
- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation
### Data Science
- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics
## 4. Target Users
- AI researchers and developers
- Privacy-conscious organizations
- Academic institutions
- AI prototyping teams
- Independent developers building AI applications
## Key Relationships
- Compute provided by: gb10-superchip
- Memory constraints: memory-and-storage
- Frameworks used: ai-frameworks
- Scaling beyond single unit: multi-unit-stacking