
Integrate NVIDIA Spark playbooks: CUDA sm_121, TensorRT-LLM, fine-tuning, Ollama, ComfyUI

Phase 4: Parsed all 9 playbooks from build.nvidia.com/spark.
Key findings: CUDA compute capability sm_121, toolkit 13.0, TensorRT-LLM
confirmed, fine-tuning scripts (SFT/LoRA/QLoRA up to 70B), Nemotron-3-Nano
30B MoE, speculative decoding (EAGLE-3/Draft-Target), ComfyUI image gen,
Ollama+Open WebUI, RAPIDS scientific computing, DGX Dashboard on port 11000,
NVIDIA Sync full documentation. 11 questions resolved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Branch: master — Joe DiPrima, 1 month ago — commit fbdcc807b3
Changed files (8):
1. `CLAUDE.md` (1)
2. `context/ai-frameworks.md` (67)
3. `context/ai-workloads.md` (55)
4. `context/dgx-os-software.md` (24)
5. `context/equations-and-bounds.md` (2)
6. `context/gb10-superchip.md` (1)
7. `context/open-questions.md` (17)
8. `phases/phase-04-spark-playbooks.md` (70)

`CLAUDE.md` (1 change)

@@ -167,3 +167,4 @@ Dell Pro Max GB10 (product)
| 1 | 2026-02-14 | Initial knowledge base created from web research |
| 2 | 2026-02-14 | Deep research: NVIDIA docs, reviews, 18 questions resolved |
| 3 | 2026-02-14 | Dell Owner's Manual (Rev A01) integrated, critical corrections applied |
| 4 | 2026-02-14 | NVIDIA Spark playbooks: CUDA sm_121, TensorRT-LLM, fine-tuning, Sync, Dashboard, ComfyUI, Ollama |

`context/ai-frameworks.md` (67 changes)

@@ -2,10 +2,10 @@
id: ai-frameworks
title: "AI Frameworks and Development Tools"
status: established
source_sections: "Web research: NVIDIA newsroom, Arm learning paths, NVIDIA DGX Spark User Guide"
source_sections: "Web research: NVIDIA newsroom, Arm learning paths, NVIDIA DGX Spark User Guide, build.nvidia.com/spark playbooks"
related_topics: [dgx-os-software, gb10-superchip, ai-workloads]
key_equations: []
key_terms: [pytorch, nemo, rapids, cuda, ngc, jupyter, tensorrt, llama-cpp, docker, nvidia-container-runtime, fex]
key_terms: [pytorch, nemo, rapids, cuda, ngc, jupyter, tensorrt, tensorrt-llm, llama-cpp, docker, nvidia-container-runtime, fex, ollama, comfyui, sm_121, cu130, speculative-decoding]
images: []
examples: []
open_questions:
@@ -38,31 +38,74 @@ The Dell Pro Max GB10 supports a broad AI software ecosystem, pre-configured thr
## 2. Inference Tools
### CUDA Toolkit
- Low-level GPU compute API
- Compiler (nvcc) for custom CUDA kernels
- Profiling and debugging tools
### CUDA Toolkit (v13.0)
- **CUDA compute capability:** `sm_121` (Blackwell on GB10) — use `-DCMAKE_CUDA_ARCHITECTURES="121"` when compiling
- **PyTorch CUDA wheels:** `cu130` (e.g., `pip3 install torch --index-url https://download.pytorch.org/whl/cu130`)
- Low-level GPU compute API, compiler (nvcc), profiling and debugging tools
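A quick sanity check after installing the cu130 wheel can confirm the toolkit and architecture (a minimal sketch, assuming CUDA 13.0 and a GB10 GPU are present; these checks are illustrative, not from the playbook):

```shell
# Hypothetical post-install checks -- both assume CUDA 13.0 is installed.
nvcc --version          # should report "release 13.0"
python3 - <<'PY'
import torch
print(torch.version.cuda)                  # expect "13.0" with the cu130 wheel
print(torch.cuda.get_device_capability())  # expect (12, 1), i.e. sm_121, on GB10
PY
```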
### llama.cpp
- Quantized LLM inference engine
- ARM-optimized builds available for GB10
- Supports GGUF model format
- Build with CUDA: `cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="121"` (T1, build.nvidia.com/spark)
- Provides **OpenAI-compatible API** via `llama-server` (chat completions, streaming, function calling)
- Documented in [Arm Learning Path](https://learn.arm.com/learning-paths/laptops-and-desktops/dgx_spark_llamacpp/)
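Once built, the server can be exercised with a standard OpenAI-style request (a sketch; the model path, port, and prompt are illustrative, not taken from the playbook):

```shell
# Launch llama-server with a hypothetical GGUF path on an illustrative port.
./build/bin/llama-server -m models/model.gguf --port 8080 &

# Query the OpenAI-compatible chat-completions endpoint.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
```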
### TensorRT (expected)
- NVIDIA's inference optimizer
- Blackwell architecture support expected
### TensorRT-LLM
- NVIDIA's LLM inference optimizer — **confirmed available** (T1, build.nvidia.com/spark)
- Container: `tensorrt-llm/release:1.2.0rc6`
- Supports **speculative decoding** for faster inference:
  - **EAGLE-3:** Built-in drafting head, no separate draft model needed
  - **Draft-Target:** Pairs small (8B) and large (70B) models, uses FP4 quantization
- Configurable KV cache memory fraction for memory management
### Ollama
- LLM runtime with model library — runs via Docker on GB10 (T1, build.nvidia.com/spark)
- Container: `ghcr.io/open-webui/open-webui:ollama` (bundles Open WebUI + Ollama)
- Models available from ollama.com/library (e.g., `gpt-oss:20b`)
- Port: 12000 (via NVIDIA Sync) or 8080 (direct)
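The bundled container can be started with a single Docker command (a sketch; the host port mapping, volume name, and example model are assumptions, and `--gpus all` presumes the NVIDIA container runtime is active):

```shell
# Hypothetical launch of the bundled Open WebUI + Ollama container.
docker run -d --gpus all \
  -p 8080:8080 \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:ollama

# Pull a model through the bundled Ollama runtime.
docker exec open-webui ollama pull gpt-oss:20b
```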
## 3. Development Environment
- **DGX Dashboard** — web-based system monitor with integrated JupyterLab (T0 Spec)
- **DGX Dashboard** — web-based system monitor at `http://localhost:11000` with integrated JupyterLab (T0 Spec). JupyterLab ports configured in `/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml`.
- **VS Code** — ARM64 .deb available; also remote SSH via NVIDIA Sync or manual SSH (T1, build.nvidia.com/spark)
- **Cursor** — supported via NVIDIA Sync remote SSH launch (T1, build.nvidia.com/spark)
- **NVIDIA AI Workbench** — launchable via NVIDIA Sync (T1, build.nvidia.com/spark)
- **Python** — system Python with AI/ML package ecosystem
- **NVIDIA NGC Catalog** — library of pre-trained models, containers, and SDKs
- **Docker + NVIDIA Container Runtime** — pre-installed for containerized workflows (T0 Spec)
- **NVIDIA AI Enterprise** — enterprise-grade AI software and services
- **Tutorials:** https://build.nvidia.com/spark
- **Tutorials & Playbooks:** https://build.nvidia.com/spark
### Key NGC Containers (confirmed ARM64)
| Container | Tag | Use Case |
|-----------|-----|----------|
| `nvcr.io/nvidia/pytorch` | `25.11-py3` | PyTorch training & fine-tuning |
| `tensorrt-llm/release` | `1.2.0rc6` | Optimized LLM inference |
| RAPIDS | `25.10` | GPU-accelerated data science |
| `ghcr.io/open-webui/open-webui` | `ollama` | Open WebUI + Ollama LLM chat |
## 4. Image Generation
### ComfyUI
- Node-based image generation UI for Stable Diffusion, SDXL, Flux, etc. (T1, build.nvidia.com/spark)
- Runs natively on GB10 Blackwell GPU
- Requires: Python 3.8+, CUDA toolkit, PyTorch with `cu130`
- Port: 8188 (`--listen 0.0.0.0` for remote access)
- Storage: ~20 GB minimum (plus model files, e.g., SD 1.5 ~2 GB)
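A from-source setup might look like the following (a sketch under stated assumptions: the clone location and venv name are illustrative, and model files must still be downloaded separately):

```shell
# Hypothetical ComfyUI setup on GB10; paths are illustrative.
git clone https://github.com/comfyanonymous/ComfyUI
cd ComfyUI
python3 -m venv .venv && . .venv/bin/activate
pip3 install torch --index-url https://download.pytorch.org/whl/cu130
pip3 install -r requirements.txt

# Listen on all interfaces so the UI is reachable from another machine.
python3 main.py --listen 0.0.0.0 --port 8188
```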
## 5. UMA Memory Management Tip
DGX Spark uses Unified Memory Architecture (UMA) — CPU and GPU share the same LPDDR5X pool. If GPU memory appears low due to filesystem buffer cache:
```bash
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
```
This frees cached memory back to the unified pool without data loss. (T1, build.nvidia.com/spark)
## 4. Software Compatibility Notes
## 6. Software Compatibility Notes
Since the GB10 is an ARM system:

`context/ai-workloads.md` (55 changes)

@@ -2,18 +2,17 @@
id: ai-workloads
title: "AI Workloads and Model Capabilities"
status: established
source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware"
source_sections: "Web research: NVIDIA newsroom, Dell product page, WCCFTech, Jeff Geerling, ServeTheHome, Tom's Hardware, build.nvidia.com/spark playbooks"
related_topics: [gb10-superchip, memory-and-storage, ai-frameworks, multi-unit-stacking]
key_equations: [model-memory-estimate]
key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count]
key_terms: [llm, inference, fine-tuning, quantization, fp4, fp8, fp16, parameter-count, lora, qlora, sft, fsdp, speculative-decoding, nemotron, comfyui, rapids-singlecell]
images: []
examples: [llm-memory-estimation.md]
open_questions:
- "Tokens/sec for Llama 3.3 70B specifically (only 3B and GPT-OSS-120B benchmarked so far)"
- "Maximum batch size for inference at various model sizes"
- "Fine-tuning performance — how long to SFT a 7B model on this hardware?"
- "Stable Diffusion / image generation performance"
- "Training from scratch — is it practical for any meaningful model size?"
- "Speculative decoding speedup factor (tokens/sec improvement not quantified yet)"
---
# AI Workloads and Model Capabilities
@@ -34,6 +33,7 @@ The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tu
| Llama 3.2 3B | ~100 tokens/sec | — | Jeff Geerling |
| GPT-OSS-120B | ~14.5 tokens/sec | INT4 | ServeTheHome |
| Llama 3.1 70B | Competitive w/ Ryzen AI Max+ 395 | — | Jeff Geerling |
| Nemotron-3-Nano 30B | Runs (MoE, 3B active) | Q8_K | build.nvidia.com/spark |
| HPL (Linpack) FP64 | ~675 GFLOPS | FP64 | Jeff Geerling |
| Geekbench 6 | Comparable to Ryzen AI Max+ 395; trails Apple M3 Ultra | — | Jeff Geerling |
@@ -41,6 +41,8 @@ The Dell Pro Max GB10 is designed primarily for **local AI inference and fine-tu
**INT4 inference** on GPT-OSS-120B is roughly equivalent to an RTX 5070's performance (T2, ServeTheHome).
**Nemotron-3-Nano 30B** is a MoE architecture (30B total, 3B active params) requiring ~38 GB GPU memory at Q8. Provides OpenAI-compatible API via llama.cpp server. (T1, build.nvidia.com/spark)
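The ~38 GB figure is consistent with a simple bytes-per-parameter estimate plus runtime overhead (an illustrative back-of-envelope calculation, not from the playbook; attributing the remainder to KV cache and activations is an assumption):

```shell
# Rough memory estimate: total params x bits per param, weights only.
params_b=30        # Nemotron-3-Nano: 30B total parameters
bits=8             # Q8 quantization ~ 8 bits per parameter
weights_gb=$(( params_b * bits / 8 ))   # 30 GB of weights
echo "weights: ${weights_gb} GB"
# The remaining ~8 GB of the observed ~38 GB would be KV cache,
# activations, and runtime buffers (assumption).
```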
## 2. Model Size vs. Memory
With 128 GB of unified memory, the system can hold:
@@ -61,21 +63,58 @@ With 128 GB of unified memory:
- Interactive chat, code generation, document analysis
- Privacy-sensitive applications (medical, legal, financial)
### Fine-Tuning
- Supervised fine-tuning (SFT) of models using NVIDIA NeMo
- LoRA/QLoRA for parameter-efficient fine-tuning of larger models
- Custom domain adaptation
### Fine-Tuning (T1 Documented, build.nvidia.com/spark)
NVIDIA provides official fine-tuning scripts with four approaches:
| Script | Model | Method | Notes |
|--------|-------|--------|-------|
| Full SFT | Llama 3.2 3B | All parameters trainable | Fits in memory at bfloat16 |
| LoRA | Llama 3.1 8B | Parameter-efficient adapters | `lora_rank=8` default |
| LoRA + FSDP | Llama 3.1 70B | Distributed across 2 units | Multi-node via Docker Swarm |
| QLoRA (4-bit) | Llama 3.1 70B | Quantized base + LoRA | Fits on single unit |
- Container: `nvcr.io/nvidia/pytorch:25.11-py3`
- Dependencies: `transformers`, `peft`, `datasets`, `trl`, `bitsandbytes`
- Key params: `--batch_size`, `--seq_length` (default 2048), `--num_epochs`, `--gradient_checkpointing`
- Dataset: Alpaca (configurable `--dataset_size`, default 512 samples)
- Multi-node: Docker Swarm + FSDP for 2-unit distributed training
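A typical invocation would start the documented PyTorch container and run one of the scripts inside it (a sketch; the mount point and the script name `lora_finetune.py` are placeholders, since the playbook's exact file names are not reproduced here):

```shell
# Hypothetical: launch the documented NGC PyTorch container with GPU access.
docker run --rm -it --gpus all \
  -v "$PWD":/workspace/finetune \
  nvcr.io/nvidia/pytorch:25.11-py3 \
  bash -c "pip install transformers peft datasets trl bitsandbytes && \
           python /workspace/finetune/lora_finetune.py \
             --batch_size 1 --seq_length 2048 --num_epochs 1 \
             --gradient_checkpointing"
```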
### AI Prototyping
- Rapid iteration on model architectures
- Dataset preprocessing with RAPIDS
- Experiment tracking and evaluation
### Image Generation (T1 Documented, build.nvidia.com/spark)
- **ComfyUI** confirmed working — node-based UI for Stable Diffusion, SDXL, Flux
- Runs natively on Blackwell GPU with CUDA 13.0
- See [[ai-frameworks]] §4 for setup details
### Speculative Decoding (T1 Documented, build.nvidia.com/spark)
- Accelerates LLM inference by using a small draft model to predict tokens verified by the large model
- **EAGLE-3:** Built-in drafting head (no separate model needed)
- **Draft-Target:** Pairs 8B draft + 70B target with FP4 quantization
- Uses TensorRT-LLM container (`tensorrt-llm/release:1.2.0rc6`)
- Configurable `max_draft_len` (1-8 tokens) and KV cache memory fraction
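The achievable speedup depends on draft length and acceptance rate; the standard expectation for tokens produced per target-model step with per-token acceptance rate `a` and draft length `k` is `(1 - a^(k+1)) / (1 - a)`. The numbers below are illustrative assumptions, not GB10 measurements:

```shell
# Expected tokens per target-model step in speculative decoding.
a=0.8   # assumed per-token draft acceptance rate
k=4     # draft length (within the documented 1-8 range)
awk -v a="$a" -v k="$k" 'BEGIN { printf "%.2f\n", (1 - a^(k+1)) / (1 - a) }'
```

With these assumed values the formula gives roughly 3.4 tokens per step, i.e. a theoretical upper bound on speedup before accounting for draft-model cost.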
### Data Science
- GPU-accelerated analytics with RAPIDS
- Large-scale data processing
- Graph analytics
### Scientific Computing (T1 Documented, build.nvidia.com/spark)
**Single-cell RNA Sequencing:**
- RAPIDS-singlecell library (GPU-accelerated, follows Scanpy API)
- Full scRNA-seq pipeline in ~130 seconds (preprocessing ~21s, clustering/DE ~104s)
- Requires ~40 GB unified memory
- Computes exact nearest-neighbor graph (vs. Scanpy's approximate)
**Portfolio Optimization:**
- cuOpt LP/MILP solvers + cuML for GPU-accelerated KDE
- Mean-CVaR (Conditional Value-at-Risk) modeling
- Full pipeline in ~7 minutes
### Gaming (bonus, not primary use case)
Surprisingly, ARM Linux gaming works via FEX (x86-to-ARM translation) + Steam/Proton:
- Cyberpunk 2077: ~100 fps at 1080p, low settings (T2, Jeff Geerling)

`context/dgx-os-software.md` (24 changes)

@@ -41,7 +41,7 @@ The system ships ready to run AI workloads with:
- **NVIDIA drivers** — optimized for GB10 Blackwell GPU
- **Docker + NVIDIA Container Runtime** — container support out of the box (T0 Spec)
- **NVIDIA Sync** — cross-platform desktop app for remote device management (see §8)
- **DGX Dashboard** — system monitoring with integrated JupyterLab
- **DGX Dashboard** — system monitoring web UI at `http://localhost:11000` with integrated JupyterLab (see §9)
- **NGC** — access to NVIDIA GPU Cloud containerized applications and models
- **AI Enterprise** — enterprise-grade AI software assets and services
- **Python** — system Python plus development environments
@@ -145,6 +145,28 @@ NVIDIA Sync is a **cross-platform desktop application** (macOS, Windows, Linux)
- **Connection timeout during boot:** Wait for device to fully boot
- **Authentication failure:** Reconfigure connection in Sync app
## 9. DGX Dashboard (T1 Documented, build.nvidia.com/spark)
DGX Dashboard is a locally-hosted web application for system management and development.
### Access
- **Local:** `http://localhost:11000` or desktop shortcut in Ubuntu app launcher
- **Remote via NVIDIA Sync:** Automatic SSH tunnel (recommended)
- **Remote via manual SSH:** `ssh -L 11000:localhost:11000 user@spark-ip`
For JupyterLab remote access, also forward the user-specific port from:
`/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml`
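Both forwards can go in a single SSH invocation (a sketch; `18888` is a placeholder for the per-user port listed in that YAML file, and `user@spark-ip` stands in for real credentials):

```shell
# Forward the Dashboard (11000) plus a placeholder JupyterLab port in one tunnel.
ssh -L 11000:localhost:11000 -L 18888:localhost:18888 user@spark-ip
```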
### Features
- **GPU/system monitoring** — real-time resource utilization panels and telemetry
- **JupyterLab** — one-click launch with pre-configured Python virtual environments
  - Working directory: `/home/<USERNAME>/jupyterlab`
  - Requirements tracking via `requirements.txt`
- **System updates** — package and firmware update management via GUI
- **Settings** — system configuration interface
## Key Relationships
- Runs on: [[gb10-superchip]]

`context/equations-and-bounds.md` (2 changes)

@@ -38,6 +38,8 @@ Reference for all quantitative specifications, formulas, and validation ranges f
- **Copy engines:** 2 (T0 Spec)
- **NVENC:** 1 (T0 Spec)
- **NVDEC:** 1 (T0 Spec)
- **CUDA compute capability:** sm_121 (T1, build.nvidia.com/spark)
- **CUDA toolkit:** 13.0 / cu130 (T1, build.nvidia.com/spark)
## 2. Memory

`context/gb10-superchip.md` (1 change)

@@ -48,6 +48,7 @@ The Blackwell GPU portion features:
- **4th-generation RT Cores** — ray tracing acceleration (T0 Spec)
- **1x NVENC / 1x NVDEC** — hardware video encode/decode engines (T0 Spec)
- **2 copy engines** (T0 Spec)
- **CUDA compute capability:** `sm_121` (T1 Documented, build.nvidia.com/spark — required when compiling CUDA kernels with `-DCMAKE_CUDA_ARCHITECTURES="121"`)
- Peak performance: **1 PFLOP (1,000 TFLOPS) at FP4 precision with sparsity**
The Tensor Cores are the key differentiator for AI workloads, providing hardware acceleration for mixed-precision matrix operations used in deep learning.

`context/open-questions.md` (17 changes)

@@ -86,11 +86,14 @@ Catalog of known unknowns, research gaps, and unresolved questions about the Del
- *Status:* Only Llama 3.2 3B (~100 tok/s) and GPT-OSS-120B (~14.5 tok/s) benchmarked.
- *Would resolve:* Most common use case performance
- **Q:** Fine-tuning time estimates for common model sizes?
- *Status:* Unknown.
- *Status:* Partially resolved — scripts and methods documented (Full SFT 3B, LoRA 8B, QLoRA 70B) but wall-clock times not published.
- *Would resolve:* Training workflow planning
- **Q:** Stable Diffusion / image generation performance?
- *Status:* Unknown.
- *Status:* **Partially resolved** — ComfyUI confirmed working with SD 1.5. Quantitative benchmarks (images/sec) not published.
- *Would resolve:* Non-LLM AI workload suitability
- **Q:** Speculative decoding speedup factor?
- *Status:* EAGLE-3 and Draft-Target methods documented. Quantitative speedup (tokens/sec improvement) not published.
- *Would resolve:* Inference optimization ROI
---
@@ -130,3 +133,13 @@ Catalog of known unknowns, research gaps, and unresolved questions about the Del
| 2026-02-14 | Power adapter dimensions? | 23 x 78 x 162 mm, multi-voltage output (5V-48V) | Dell Owner's Manual Rev A01 |
| 2026-02-14 | USB-C MST support? | Not supported (single display per port only) | Dell Owner's Manual Rev A01 |
| 2026-02-14 | Service tools required? | Phillips #0, T5 or T8 Torx screwdriver | Dell Owner's Manual Rev A01 |
| 2026-02-14 | CUDA compute capability / SM architecture? | sm_121 (compile with `-DCMAKE_CUDA_ARCHITECTURES="121"`) | build.nvidia.com/spark |
| 2026-02-14 | CUDA toolkit version? | CUDA 13.0 (PyTorch wheels: cu130) | build.nvidia.com/spark |
| 2026-02-14 | DGX Dashboard URL/port? | `http://localhost:11000` | build.nvidia.com/spark |
| 2026-02-14 | TensorRT-LLM availability? | Confirmed — container `tensorrt-llm/release:1.2.0rc6` | build.nvidia.com/spark |
| 2026-02-14 | Fine-tuning methods supported? | Full SFT (3B), LoRA (8B), QLoRA 4-bit (70B), FSDP multi-node | build.nvidia.com/spark |
| 2026-02-14 | Image generation support? | ComfyUI confirmed (SD, SDXL, Flux) on port 8188 | build.nvidia.com/spark |
| 2026-02-14 | Ollama / Open WebUI support? | Yes — Docker container, port 12000 (Sync) or 8080 (direct) | build.nvidia.com/spark |
| 2026-02-14 | NVIDIA Sync details? | Cross-platform app, SSH key automation, VS Code/Cursor/Dashboard launch, port forwarding | build.nvidia.com/spark |
| 2026-02-14 | PyTorch NGC container? | `nvcr.io/nvidia/pytorch:25.11-py3` (ARM64) | build.nvidia.com/spark |
| 2026-02-14 | Speculative decoding methods? | EAGLE-3 (built-in drafting) and Draft-Target (8B+70B) | build.nvidia.com/spark |

`phases/phase-04-spark-playbooks.md` (70 changes)

@@ -0,0 +1,70 @@
# Phase 4: NVIDIA Spark Playbooks Integration
**Date:** 2026-02-14
**Goal:** Integrate official NVIDIA playbooks from build.nvidia.com/spark into knowledge base
## Source
- https://build.nvidia.com/spark (main page, 9 playbooks + connection guide)
## Key Discoveries
### Critical Technical Facts (previously unknown)
1. **CUDA compute capability: `sm_121`** — required for compiling CUDA kernels on Blackwell GB10 (`-DCMAKE_CUDA_ARCHITECTURES="121"`)
2. **CUDA toolkit version: 13.0** — PyTorch wheels use `cu130` index
3. **DGX Dashboard runs on port 11000** — JupyterLab ports in `/opt/nvidia/dgx-dashboard-service/jupyterlab_ports.yaml`
4. **TensorRT-LLM confirmed** — container `tensorrt-llm/release:1.2.0rc6`
5. **PyTorch NGC container:** `nvcr.io/nvidia/pytorch:25.11-py3`
6. **RAPIDS container:** version 25.10
7. **UMA buffer cache flush:** `sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'`
### Fine-Tuning (fully documented)
- **Full SFT:** Llama 3.2 3B (all parameters, bfloat16)
- **LoRA:** Llama 3.1 8B (rank 8 default)
- **LoRA + FSDP:** Llama 3.1 70B (multi-node via Docker Swarm)
- **QLoRA 4-bit:** Llama 3.1 70B (single unit)
- Dependencies: transformers, peft, datasets, trl, bitsandbytes
### Inference Tools
- **llama.cpp:** Build with CUDA sm_121, provides OpenAI-compatible API (streaming, function calling)
- **Nemotron-3-Nano 30B:** MoE (3B active), ~38 GB at Q8, built-in reasoning/tool-calling
- **Speculative Decoding:** EAGLE-3 (built-in drafting) and Draft-Target (8B+70B, FP4)
- **Ollama + Open WebUI:** Docker container, ports 12000 (Sync) or 8080 (direct)
### Image Generation
- **ComfyUI** confirmed working (SD, SDXL, Flux) on port 8188
- Native Blackwell GPU acceleration with CUDA 13.0
### Scientific Computing
- **scRNA-seq:** RAPIDS-singlecell, ~130s full pipeline, exact nearest-neighbor graph
- **Portfolio Optimization:** cuOpt + cuML, Mean-CVaR model, ~7 min pipeline
### Development Environment
- **VS Code:** ARM64 .deb install or remote SSH via Sync
- **Cursor:** Remote SSH via Sync
- **NVIDIA AI Workbench:** Launchable via Sync
- **NVIDIA Sync:** Full details documented (SSH key automation, mDNS, port forwarding)
## Files Updated
- `context/gb10-superchip.md` — sm_121 CUDA architecture
- `context/ai-frameworks.md` — Major expansion: CUDA 13.0, TensorRT-LLM, Ollama, ComfyUI, NGC containers, UMA tip
- `context/ai-workloads.md` — Fine-tuning scripts, Nemotron, speculative decoding, image gen, scientific computing
- `context/dgx-os-software.md` — NVIDIA Sync §8 (full detail), DGX Dashboard §9 (port, features)
- `context/setup-and-config.md` — NVIDIA Sync cross-reference
- `context/equations-and-bounds.md` — sm_121, CUDA 13.0
- `context/open-questions.md` — 11 new resolved questions, 1 new open question
- `CLAUDE.md` — Phase 4 added to history
## Remaining Gaps
- Quantitative speculative decoding speedup (tokens/sec improvement not published)
- ComfyUI image generation benchmarks (images/sec)
- Fine-tuning wall-clock times
- Full list of Ollama-compatible models tested on GB10