
Integrate NVIDIA Spark playbooks: CUDA sm_121, TensorRT-LLM, fine-tuning, Ollama, ComfyUI

Phase 4: Parsed all 9 playbooks from build.nvidia.com/spark.
Key findings: CUDA compute capability sm_121, toolkit 13.0, TensorRT-LLM
confirmed, fine-tuning scripts (SFT/LoRA/QLoRA up to 70B), Nemotron-3-Nano
30B MoE, speculative decoding (EAGLE-3/Draft-Target), ComfyUI image gen,
Ollama+Open WebUI, RAPIDS scientific computing, DGX Dashboard on port 11000,
NVIDIA Sync full documentation. 11 questions resolved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Branch: master — Joe DiPrima, 1 month ago — commit fbdcc807b3
Changed files (8):
1. `CLAUDE.md` (1)
2. `context/ai-frameworks.md` (67)
3. `context/ai-workloads.md` (55)
4. `context/dgx-os-software.md` (24)
5. `context/equations-and-bounds.md` (2)
6. `context/gb10-superchip.md` (1)
7. `context/open-questions.md` (17)
8. `phases/phase-04-spark-playbooks.md` (70)

`CLAUDE.md` (1 change)

@@ -167,3 +167,4 @@ Dell Pro Max GB10 (product)
| 1 | 2026-02-14 | Initial knowledge base created from web research |
| 2 | 2026-02-14 | Deep research: NVIDIA docs, reviews, 18 questions resolved |
| 3 | 2026-02-14 | Dell Owner's Manual (Rev A01) integrated, critical corrections applied |
| 4 | 2026-02-14 | NVIDIA Spark playbooks: CUDA sm_121, TensorRT-LLM, fine-tuning, Sync, Dashboard, ComfyUI, Ollama |

`context/ai-frameworks.md` (67 changes)

@@ -2,10 +2,10 @@
id: ai-frameworks
title: "AI Frameworks and Development Tools"
status: established
source_sections: "Web research: NVIDIA newsroom, Arm learning paths, NVIDIA DGX Spark User Guide"
source_sections: "Web research: NVIDIA newsroom, Arm learning paths, NVIDIA DGX Spark User Guide, build.nvidia.com/spark playbooks"
related_topics: [dgx-os-software, gb10-superchip, ai-workloads]
key_equations: []
key_terms: [pytorch, nemo, rapids, cuda, ngc, jupyter, tensorrt, llama-cpp, docker, nvidia-container-runtime, fex]
key_terms: [pytorch, nemo, rapids, cuda, ngc, jupyter, tensorrt, tensorrt-llm, llama-cpp, docker, nvidia-container-runtime, fex, ollama, comfyui, sm_121, cu130, speculative-decoding]
images: []
examples: []
open_questions:
@@ -38,31 +38,74 @@ The Dell Pro Max GB10 supports a broad AI software ecosystem, pre-configured thr
## 2. Inference Tools
### CUDA Toolkit
- Low-level GPU compute API
- Compiler (nvcc) for custom CUDA kernels
- Profiling and debugging tools
### CUDA Toolkit (v13.0)
- **CUDA compute capability:** `sm_121` (Blackwell on GB10) — use `-DCMAKE_CUDA_ARCHITECTURES="121"` when compiling
- **PyTorch CUDA wheels:** `cu130` (e.g., `pip3 install torch --index-url https://download.pytorch.org/whl/cu130`)
- Low-level GPU compute API, compiler (nvcc), profiling and debugging tools
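A quick sanity check after installing the cu130 wheel can confirm the toolkit and architecture (a minimal sketch, assuming CUDA 13.0 and a GB10 GPU are present; these checks are illustrative, not from the playbook):

```shell
# Hypothetical post-install checks -- both assume CUDA 13.0 is installed.
nvcc --version          # should report "release 13.0"
python3 - <<'PY'
import torch
print(torch.version.cuda)                  # expect "13.0" with the cu130 wheel
print(torch.cuda.get_device_capability())  # expect (12, 1), i.e. sm_121, on GB10
PY
```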
### llama.cpp
- Quantized LLM inference engine
- ARM-optimized builds available for GB10
- Supports GGUF model format
- Build with CUDA: `cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"` (T1, build.nvidia.com/spark)
- Provides **OpenAI-compatible API** via `llama-server` (chat completions, streaming, function calling)
- Documented in [Arm Learning Path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/)
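Once built, the server can be exercised with a standard OpenAI-style request (a sketch; the model path, port, and prompt are illustrative, not taken from the playbook):

```shell
# Launch llama-server with a hypothetical GGUF path on an illustrative port.
./build/bin/llama-server -m models/model.gguf --port 8080 &

# Query the OpenAI-compatible chat-completions endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```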
### TensorRT (expected)
- NVIDIA's inference optimizer
- Blackwell architecture support expected
### TensorRT-LLM
- NVIDIA's LLM inference optimizer — **confirmed available** (T1, build.nvidia.com/spark)
- Container: `tensorrt-llm/release:1.2.0rc6`
- Supports **speculative decoding** for faster inference:
  - **EAGLE-3:** Built-in drafting head, no separate draft model needed
  - **Draft-Target:** Pairs small (8B) and large (70B) models, uses FP4 quantization
- Configurable KV cache memory fraction for memory management
### Ollama
- LLM runtime with model library — runs via Docker on GB10 (T1, build.nvidia.com/spark)
- Container: `ghcr.io/open-webui/open-webui:ollama` (bundles Open WebUI + Ollama)
- Models available from ollama.com/library (e.g., `gpt-oss:20b`)
- Port: 12000 (via NVIDIA Sync) or 8080 (direct)
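The bundled container can be started with a single Docker command (a sketch; the host port mapping, volume name, and example model are assumptions, and `--gpus all` presumes the NVIDIA container runtime is active):

```shell
# Hypothetical launch of the bundled Open WebUI + Ollama container.
docker run -d --gpus all \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:ollama

# Pull a model through the bundled Ollama runtime.
docker exec open-webui ollama pull gpt-oss:20b
```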
## 3. Development Environment
- **DGX Dashboard** — web-based system monitor with integrated JupyterLab (T0 Spec)
- **DGX Dashboard** — web-based system monitor at `http://localhost:11000` with integrated JupyterLab (T0 Spec). JupyterLab ports configured in `/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml`.
- **VS Code** — ARM64 .deb available; also remote SSH via NVIDIA Sync or manual SSH (T1, build.nvidia.com/spark)
- **Cursor** — supported via NVIDIA Sync remote SSH launch (T1, build.nvidia.com/spark)
- **NVIDIA AI Workbench** — launchable via NVIDIA Sync (T1, build.nvidia.com/spark)
- **Python** — system Python with AI/ML package ecosystem
- **NVIDIA NGC Catalog** — library of pre-trained models, containers, and SDKs
- **Docker + NVIDIA Container Runtime** — pre-installed for containerized workflows (T0 Spec)
- **NVIDIA AI Enterprise** — enterprise-grade AI software and services
- **Tutorials:** https://build.nvidia.com/spark
- **Tutorials & Playbooks:** https://build.nvidia.com/spark
### Key NGC Containers (confirmed ARM64)
| Container | Tag | Use Case |
|-----------|-----|----------|
| `nvcr.io/nvidia/pytorch` | `25.11-py3` | PyTorch training & fine-tuning |
| `tensorrt-llm/release` | `1.2.0rc6` | Optimized LLM inference |
| RAPIDS | `25.10` | GPU-accelerated data science |
| `ghcr.io/open-webui/open-webui` | `ollama` | Open WebUI + Ollama LLM chat |
## 4. Image Generation
### ComfyUI
- Node-based image generation UI for Stable Diffusion, SDXL, Flux, etc. (T1, build.nvidia.com/spark)
- Runs natively on GB10 Blackwell GPU
- Requires: Python 3.8+, CUDA toolkit, PyTorch with `cu130`
- Port: 8188 (`--listen 0.0.0.0` for remote access)
- Storage: ~20 GB minimum (plus model files, e.g., SD 1.5 ~2 GB)
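A from-source setup might look like the following (a sketch under stated assumptions: the clone location and venv name are illustrative, and model files must still be downloaded separately):

```shell
# Hypothetical ComfyUI setup on GB10; paths are illustrative.
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv && . .venv/bin/activate
pip3 install torch --index-url https://download.pytorch.org/whl/cu130
pip3 install -r requirements.txt

# Listen on all interfaces so the UI is reachable from another machine.
python3 main.py --listen 0.0.0.0 --port 8188
```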
## 5. UMA Memory Management Tip
DGX Spark uses Unified Memory Architecture (UMA) — CPU and GPU share the same LPDDR5X pool. If GPU memory appears low due to filesystem buffer cache:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
This frees cached memory back to the unified pool without data loss. (T1, build.nvidia.com/spark)
## 4. Software Compatibility Notes
## 6. Software Compatibility Notes
Since the GB10 is an ARM system:

`context/ai-workloads.md` (55 changes)

@@ -2,18 +2,17 @@
id: ai-workloads
title: "AI Workloads and Model Capabilities"
status: established
source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware"
source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware, build.nvidia.com/spark playbooks"
related_topics: [gb10-superchip, memory-and-storage, ai-frameworks, multi-unit-stacking]
key_equations: [model-memory-estimate]
key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count]
key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count, lora, qlora, sft, fsdp, speculative-decoding, nemotron, comfyui, rapids-singlecell]
images: []
examples: [llm-memory-estimation.md]
open_questions:
- "Tokens/sec for Llama 3.3 70B specifically (only 3B and GPT-OSS-120B benchmarked so far)"
- "Maximum batch size for inference at various model sizes"
- "Fine-tuning performance — how long to SFT a 7B model on this hardware?"
- "Stable Diffusion / image generation performance"
- "Training from scratch — is it practical for any meaningful model size?"
- "Speculative decoding speedup factor (tokens/sec improvement not quantified yet)"
---
# AI Workloads and Model Capabilities
@@ -34,6 +33,7 @@ The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tu
| Llama 3.2 3B | ~100 tokens/sec | — | Jeff Geerling |
| GPT-OSS-120B | ~14.5 tokens/sec | INT4 | ServeTheHome |
| Llama 3.1 70B | Competitive w/ Ryzen AI Max+ 395 | — | Jeff Geerling |
| Nemotron-3-Nano 30B | Runs (MoE, 3B active) | Q8_K | build.nvidia.com/spark |
| HPL (Linpack) FP64 | ~675 GFLOPS | FP64 | Jeff Geerling |
| Geekbench 6 | Comparable to Ryzen AI Max+ 395; trails Apple M3 Ultra | — | Jeff Geerling |
@@ -41,6 +41,8 @@ The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tu
**INT4 inference** on GPT-OSS-120B is roughly equivalent to an RTX 5070's performance (T2, ServeTheHome).
**Nemotron-3-Nano 30B** is a MoE architecture (30B total, 3B active params) requiring ~38 GB GPU memory at Q8. Provides OpenAI-compatible API via llama.cpp server. (T1, build.nvidia.com/spark)
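The ~38 GB figure is consistent with a simple bytes-per-parameter estimate plus runtime overhead (an illustrative back-of-envelope calculation, not from the playbook; attributing the remainder to KV cache and activations is an assumption):

```shell
# Rough memory estimate: total params x bits per param, weights only.
params_b=30        # Nemotron-3-Nano: 30B total parameters
bits=8             # Q8 quantization ~ 8 bits per parameter
weights_gb=$(( params_b * bits / 8 ))   # 30 GB of weights
echo "weights: ${weights_gb} GB"
# The remaining ~8 GB of the observed ~38 GB would be KV cache,
# activations, and runtime buffers (assumption).
```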
## 2. Model Size vs. Memory
With 128 GB of unified memory, the system can hold:
@@ -61,21 +63,58 @@ With 128 GB of unified memory:
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)
### Fine-Tuning
- Supervised fine-tuning (SFT) of models using NVIDIA NeMo
- LoRA/QLoRA for parameter-efficient fine-tuning of larger models
- Custom domain adaptation
### Fine-Tuning (T1 Documented, build.nvidia.com/spark)
NVIDIA provides official fine-tuning scripts with four approaches:
| Script | Model | Method | Notes |
|--------|-------|--------|-------|
| Full SFT | Llama 3.2 3B | All parameters trainable | Fits in memory at bfloat16 |
| LoRA | Llama 3.1 8B | Parameter-efficient adapters | `lora_rank=8` default |
| LoRA + FSDP | Llama 3.1 70B | Distributed across 2 units | Multi-node via Docker Swarm |
| QLoRA (4-bit) | Llama 3.1 70B | Quantized base + LoRA | Fits on single unit |
- Container: `nvcr.io/nvidia/pytorch:25.11-py3`
- Dependencies: `transformers`, `peft`, `datasets`, `trl`, `bitsandbytes`
- Key params: `--batch_size`, `--seq_length` (default 2048), `--num_epochs`, `--gradient_checkpointing`
- Dataset: Alpaca (configurable `--dataset_size`, default 512 samples)
- Multi-node: Docker Swarm + FSDP for 2-unit distributed training
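A typical invocation would start the documented PyTorch container and run one of the scripts inside it (a sketch; the mount point and the script name `lora_finetune.py` are placeholders, since the playbook's exact file names are not reproduced here):

```shell
# Hypothetical: launch the documented NGC PyTorch container with GPU access.
docker run --rm -it --gpus all \
  -v "$PWD":/workspace/finetune \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers peft datasets trl bitsandbytes && \
           python /workspace/finetune/lora_finetune.py \
             --batch_size 1 --seq_length 2048 --num_epochs 1 \
             --gradient_checkpointing"
```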
### AI Prototyping
- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation
### Image Generation (T1 Documented, build.nvidia.com/spark)
- **ComfyUI** confirmed working — node-based UI for Stable Diffusion, SDXL, Flux
- Runs natively on Blackwell GPU with CUDA 13.0
- See [[ai-frameworks]] §4 for setup details
### Speculative Decoding (T1 Documented, build.nvidia.com/spark)
- Accelerates LLM inference by using a small draft model to predict tokens verified by the large model
- **EAGLE-3:** Built-in drafting head (no separate model needed)
- **Draft-Target:** Pairs 8B draft + 70B target with FP4 quantization
- Uses TensorRT-LLM container (`tensorrt-llm/release:1.2.0rc6`)
- Configurable `max_draft_len` (1-8 tokens) and KV cache memory fraction
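The achievable speedup depends on draft length and acceptance rate; the standard expectation for tokens produced per target-model step with per-token acceptance rate `a` and draft length `k` is `(1 - a^(k+1)) / (1 - a)`. The numbers below are illustrative assumptions, not GB10 measurements:

```shell
# Expected tokens per target-model step in speculative decoding.
a=0.8   # assumed per-token draft acceptance rate
k=4     # draft length (within the documented 1-8 range)
awk -v a="$a" -v k="$k" 'BEGIN { printf "%.2f\n", (1 - a^(k+1)) / (1 - a) }'
```

With these assumed values the formula gives roughly 3.4 tokens per step, i.e. a theoretical upper bound on speedup before accounting for draft-model cost.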
### Data Science
- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics
### Scientific Computing (T1 Documented, build.nvidia.com/spark)
**Single-cell RNA Sequencing:**
- RAPIDS-singlecell library (GPU-accelerated, follows Scanpy API)
- Full scRNA-seq pipeline in ~130 seconds (preprocessing ~21s, clustering/DE ~104s)
- Requires ~40 GB unified memory
- Computes exact nearest-neighbor graph (vs. Scanpy's approximate)
**Portfolio Optimization:**
- cuOpt LP/MILP solvers + cuML for GPU-accelerated KDE
- Mean-CVaR (Conditional Value-at-Risk) modeling
- Full pipeline in ~7 minutes
### Gaming (bonus, not primary use case)
Surprisingly, ARM Linux gaming works via FEX (x86-to-ARM translation) + Steam/Proton:
- Cyberpunk 2077: ~100 fps at 1080p, low settings (T2, Jeff Geerling)

`context/dgx-os-software.md` (24 changes)

@@ -41,7 +41,7 @@ The system ships ready to run AI workloads with:
- **NVIDIA drivers** — optimized for GB10 Blackwell GPU
- **Docker + NVIDIA Container Runtime** — container support out of the box (T0 Spec)
- **NVIDIA Sync** — cross-platform desktop app for remote device management (see §8)
- **DGX Dashboard** — system monitoring with integrated JupyterLab
- **DGX Dashboard** — system monitoring web UI at `http://localhost:11000` with integrated JupyterLab (see §9)
- **NGC** — access to NVIDIA GPU Cloud containerized applications and models
- **AI Enterprise** — enterprise-grade AI software assets and services
- **Python** — system Python plus development environments
@@ -145,6 +145,28 @@ NVIDIA Sync is a **cross-platform desktop application** (macOS, Windows, Linux)
- **Connection timeout during boot:** Wait for device to fully boot
- **Authentication failure:** Reconfigure connection in Sync app
## 9. DGX Dashboard (T1 Documented, build.nvidia.com/spark)
DGX Dashboard is a locally-hosted web application for system management and development.
### Access
- **Local:** `http://localhost:11000` or desktop shortcut in Ubuntu app launcher
- **Remote via NVIDIA Sync:** Automatic SSH tunnel (recommended)
- **Remote via manual SSH:** `ssh -L 11000:localhost:11000 user@spark-ip`
For JupyterLab remote access, also forward the user-specific port from:
`/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml`
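Both forwards can go in a single SSH invocation (a sketch; `18888` is a placeholder for the per-user port listed in that YAML file, and `user@spark-ip` stands in for real credentials):

```shell
# Forward the Dashboard (11000) plus a placeholder JupyterLab port in one tunnel.
ssh -L 11000:localhost:11000 -L 18888:localhost:18888 user@spark-ip
```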
### Features
- **GPU/system monitoring** — real-time resource utilization panels and telemetry
- **JupyterLab** — one-click launch with pre-configured Python virtual environments
  - Working directory: `/home/<USERNAME>/jupyterlab`
  - Requirements tracking via `requirements.txt`
- **System updates** — package and firmware update management via GUI
- **Settings** — system configuration interface
## Key Relationships
- Runs on: [[gb10-superchip]]

`context/equations-and-bounds.md` (2 changes)

@@ -38,6 +38,8 @@ Reference for all quantitative specifications, formulas, and validation ranges f
- **Copy engines:** 2 (T0 Spec)
- **NVENC:** 1 (T0 Spec)
- **NVDEC:** 1 (T0 Spec)
- **CUDA compute capability:** sm_121 (T1, build.nvidia.com/spark)
- **CUDA toolkit:** 13.0 / cu130 (T1, build.nvidia.com/spark)
## 2. Memory

`context/gb10-superchip.md` (1 change)

@@ -48,6 +48,7 @@ The Blackwell GPU portion features:
- **4th-generation RT Cores** — ray tracing acceleration (T0 Spec)
- **1x NVENC / 1x NVDEC** — hardware video encode/decode engines (T0 Spec)
- **2 copy engines** (T0 Spec)
- **CUDA compute capability:** `sm_121` (T1 Documented, build.nvidia.com/spark — required when compiling CUDA kernels with `-DCMAKE_CUDA_ARCHITECTURES="121"`)
- Peak performance: **1 PFLOP (1,000 TFLOPS) at FP4 precision with sparsity**
The Tensor Cores are the key differentiator for AI workloads, providing hardware acceleration for mixed-precision matrix operations used in deep learning.

`context/open-questions.md` (17 changes)

@@ -86,11 +86,14 @@ Catalog of known unknowns, research gaps, and unresolved questions about the Del
- *Status:* Only Llama 3.2 3B (~100 tok/s) and GPT-OSS-120B (~14.5 tok/s) benchmarked.
- *Would resolve:* Most common use case performance
- **Q:** Fine-tuning time estimates for common model sizes?
- *Status:* Unknown.
- *Status:* Partially resolved — scripts and methods documented (Full SFT 3B, LoRA 8B, QLoRA 70B) but wall-clock times not published.
- *Would resolve:* Training workflow planning
- **Q:** Stable Diffusion / image generation performance?
- *Status:* Unknown.
- *Status:* **Partially resolved** — ComfyUI confirmed working with SD 1.5. Quantitative benchmarks (images/sec) not published.
- *Would resolve:* Non-LLM AI workload suitability
- **Q:** Speculative decoding speedup factor?
- *Status:* EAGLE-3 and Draft-Target methods documented. Quantitative speedup (tokens/sec improvement) not published.
- *Would resolve:* Inference optimization ROI
---
@@ -130,3 +133,13 @@ Catalog of known unknowns, research gaps, and unresolved questions about the Del
| 2026-02-14 | Power adapter dimensions? | 23 x 78 x 162 mm, multi-voltage output (5V-48V) | Dell Owner's Manual Rev A01 |
| 2026-02-14 | USB-C MST support? | Not supported (single display per port only) | Dell Owner's Manual Rev A01 |
| 2026-02-14 | Service tools required? | Phillips #0, T5 or T8 Torx screwdriver | Dell Owner's Manual Rev A01 |
| 2026-02-14 | CUDA compute capability / SM architecture? | sm_121 (compile with `-DCMAKE_CUDA_ARCHITECTURES="121"`) | build.nvidia.com/spark |
| 2026-02-14 | CUDA toolkit version? | CUDA 13.0 (PyTorch wheels: cu130) | build.nvidia.com/spark |
| 2026-02-14 | DGX Dashboard URL/port? | `http://localhost:11000` | build.nvidia.com/spark |
| 2026-02-14 | TensorRT-LLM availability? | Confirmed — container `tensorrt-llm/release:1.2.0rc6` | build.nvidia.com/spark |
| 2026-02-14 | Fine-tuning methods supported? | Full SFT (3B), LoRA (8B), QLoRA 4-bit (70B), FSDP multi-node | build.nvidia.com/spark |
| 2026-02-14 | Image generation support? | ComfyUI confirmed (SD, SDXL, Flux) on port 8188 | build.nvidia.com/spark |
| 2026-02-14 | Ollama / Open WebUI support? | Yes — Docker container, port 12000 (Sync) or 8080 (direct) | build.nvidia.com/spark |
| 2026-02-14 | NVIDIA Sync details? | Cross-platform app, SSH key automation, VS Code/Cursor/Dashboard launch, port forwarding | build.nvidia.com/spark |
| 2026-02-14 | PyTorch NGC container? | `nvcr.io/nvidia/pytorch:25.11-py3` (ARM64) | build.nvidia.com/spark |
| 2026-02-14 | Speculative decoding methods? | EAGLE-3 (built-in drafting) and Draft-Target (8B+70B) | build.nvidia.com/spark |

`phases/phase-04-spark-playbooks.md` (70 changes)

@@ -0,0 +1,70 @@
# Phase 4: NVIDIA Spark Playbooks Integration
**Date:** 2026-02-14
**Goal:** Integrate official NVIDIA playbooks from build.nvidia.com/spark into knowledge base
## Source
- https://build.nvidia.com/spark (main page, 9 playbooks + connection guide)
## Key Discoveries
### Critical Technical Facts (previously unknown)
1. **CUDA compute capability: `sm_121`** — required for compiling CUDA kernels on Blackwell GB10 (`-DCMAKE_CUDA_ARCHITECTURES="121"`)
2. **CUDA toolkit version: 13.0** — PyTorch wheels use `cu130` index
3. **DGX Dashboard runs on port 11000** — JupyterLab ports in `/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml`
4. **TensorRT-LLM confirmed** — container `tensorrt-llm/release:1.2.0rc6`
5. **PyTorch NGC container:** `nvcr.io/nvidia/pytorch:25.11-py3`
6. **RAPIDS container:** version 25.10
7. **UMA buffer cache flush:** `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`
### Fine-Tuning (fully documented)
- **Full SFT:** Llama 3.2 3B (all parameters, bfloat16)
- **LoRA:** Llama 3.1 8B (rank 8 default)
- **LoRA + FSDP:** Llama 3.1 70B (multi-node via Docker Swarm)
- **QLoRA 4-bit:** Llama 3.1 70B (single unit)
- Dependencies: transformers, peft, datasets, trl, bitsandbytes
### Inference Tools
- **llama.cpp:** Build with CUDA sm_121, provides OpenAI-compatible API (streaming, function calling)
- **Nemotron-3-Nano 30B:** MoE (3B active), ~38 GB at Q8, built-in reasoning/tool-calling
- **Speculative Decoding:** EAGLE-3 (built-in drafting) and Draft-Target (8B+70B, FP4)
- **Ollama + Open WebUI:** Docker container, ports 12000 (Sync) or 8080 (direct)
### Image Generation
- **ComfyUI** confirmed working (SD, SDXL, Flux) on port 8188
- Native Blackwell GPU acceleration with CUDA 13.0
### Scientific Computing
- **scRNA-seq:** RAPIDS-singlecell, ~130s full pipeline, exact nearest-neighbor graph
- **Portfolio Optimization:** cuOpt + cuML, Mean-CVaR model, ~7 min pipeline
### Development Environment
- **VS Code:** ARM64 .deb install or remote SSH via Sync
- **Cursor:** Remote SSH via Sync
- **NVIDIA AI Workbench:** Launchable via Sync
- **NVIDIA Sync:** Full details documented (SSH key automation, mDNS, port forwarding)
## Files Updated
- `context/gb10-superchip.md` — sm_121 CUDA architecture
- `context/ai-frameworks.md` — Major expansion: CUDA 13.0, TensorRT-LLM, Ollama, ComfyUI, NGC containers, UMA tip
- `context/ai-workloads.md` — Fine-tuning scripts, Nemotron, speculative decoding, image gen, scientific computing
- `context/dgx-os-software.md` — NVIDIA Sync §8 (full detail), DGX Dashboard §9 (port, features)
- `context/setup-and-config.md` — NVIDIA Sync cross-reference
- `context/equations-and-bounds.md` — sm_121, CUDA 13.0
- `context/open-questions.md` — 11 new resolved questions, 1 new open question
- `CLAUDE.md` — Phase 4 added to history
## Remaining Gaps
- Quantitative speculative decoding speedup (tokens/sec improvement not published)
- ComfyUI image generation benchmarks (images/sec)
- Fine-tuning wall-clock times
- Full list of Ollama-compatible models tested on GB10