| id | title | status | source_sections | related_topics | key_equations | key_terms | images | examples | open_questions |
|---|---|---|---|---|---|---|---|---|---|
| multi-unit-stacking | Multi-Unit Stacking | established | NVIDIA DGX Spark User Guide: Spark Stacking, Jeff Geerling review, ServeTheHome review | [connectivity gb10-superchip ai-workloads memory-and-storage] | [] | [connectx-7 smartnic qsfp stacking mpi nccl slurm kubernetes] | [] | [] | [Performance overhead of inter-unit communication vs. single unit (quantified); Can more than 2 units be stacked?; Actual tokens/sec for 405B models on stacked configuration] |
Multi-Unit Stacking
Two Dell Pro Max GB10 units can be connected together to create a distributed compute cluster, effectively doubling the available compute and memory for running larger AI models.
1. How It Works
Each Dell Pro Max GB10 has 2x QSFP56 200 Gbps ports powered by the NVIDIA ConnectX-7 SmartNIC. These ports enable direct unit-to-unit connection:
- Combined memory: 256 GB (128 GB per unit, NOT unified — distributed across nodes)
- Combined compute: 2 PFLOP FP4 (1 PFLOP per unit)
- Interconnect: 200GbE RDMA via QSFP56 DAC cable
- CX-7 ports support Ethernet configuration only — no InfiniBand (T1 Documented)
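A quick way to confirm the ConnectX-7 SmartNIC is present before cabling is to grep the PCI device list. This is a generic sketch; the exact device string reported depends on firmware and driver version.

```shell
# List PCI devices and look for the Mellanox/ConnectX NIC.
# (ConnectX-7 is a Mellanox design; the lspci description string may vary.)
lspci | grep -i -e mellanox -e connectx
```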
2. Required Hardware
Approved QSFP DAC Cables (T1 Documented, NVIDIA DGX Spark User Guide)
| Manufacturer | Part Number | Description |
|---|---|---|
| Amphenol | NJAAKK-N911 | QSFP to QSFP112, 32AWG, 400mm, LSZH |
| Amphenol | NJAAKK0006 | 0.5m variant |
| Luxshare | LMTQF022-SD-R | QSFP112 400G DAC Cable, 400mm, 30AWG |
These are short DAC (Direct Attach Copper) cables. The units are designed to sit directly on top of each other.
3. Software Configuration (T1 Documented, NVIDIA DGX Spark User Guide)
Prerequisites
- Two DGX Spark / Dell Pro Max GB10 systems
- Both running Ubuntu 24.04 (or later) with NVIDIA drivers installed
- Internet connectivity for initial setup
- Root/sudo access on both systems
Network Setup
Option 1 — Automatic (Recommended):
Use NVIDIA's netplan playbook, downloaded from their repository and applied via the standard netplan apply workflow.
Option 2 — Manual static IP:
- Interface name: enP2p1s0f1np1
- Node 1: 192.168.100.10/24
- Node 2: 192.168.100.11/24
- Verify with a ping test between the nodes
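For reference, the manual static-IP option above can be done ad hoc with `ip` commands. This is an illustrative sketch (not persistent across reboots, unlike a netplan configuration); run it on node 1 and swap the address on node 2.

```shell
# Assign the static address from the guide to the QSFP interface (node 1).
sudo ip addr add 192.168.100.10/24 dev enP2p1s0f1np1
sudo ip link set enP2p1s0f1np1 up

# Once node 2 is configured with 192.168.100.11/24, verify connectivity.
ping -c 3 192.168.100.11
```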
SSH Configuration
The NVIDIA discovery script automates passwordless SSH between nodes, required for MPI communication.
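The discovery script's SSH step can also be done by hand. A minimal sketch of what it automates, assuming a user account of the same name on both nodes (`user` is a placeholder):

```shell
# Generate a key if one does not already exist (no passphrase for automation).
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519

# Copy the public key to the peer node, then confirm passwordless login works.
ssh-copy-id user@192.168.100.11
ssh user@192.168.100.11 hostname   # should print the peer's hostname with no password prompt
```

Repeat in the other direction so each node can reach the other, which MPI requires.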
Communication Frameworks
- MPI — inter-process CPU communication between nodes
- NCCL v2.28.3 — GPU-accelerated collective operations across nodes
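A common first sanity check once MPI is installed is to launch a trivial command on both nodes. This sketch assumes OpenMPI and working passwordless SSH; the host list matches the static IPs above.

```shell
# Run one process per node; each prints its own hostname.
# Two different hostnames in the output confirms MPI spans both nodes.
mpirun -np 2 -H 192.168.100.10,192.168.100.11 hostname
```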
Verification
- Ping connectivity test between nodes
- Interface verification: ip a and ethtool
- NCCL test suite execution (via NVIDIA playbook)
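The NCCL verification step can be run by hand with the standard nccl-tests suite (github.com/NVIDIA/nccl-tests, built locally). A sketch, assuming OpenMPI and the interface name used above; `NCCL_SOCKET_IFNAME` pins NCCL's bootstrap traffic to the QSFP link:

```shell
# All-reduce bandwidth test across the two nodes, one GPU per node.
# -b/-e set the message size sweep (8 bytes to 1 GiB), -f 2 doubles each step.
mpirun -np 2 -H 192.168.100.10,192.168.100.11 \
  -x NCCL_SOCKET_IFNAME=enP2p1s0f1np1 \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

Bus bandwidth approaching the 200 GbE line rate in the large-message rows indicates the RDMA link is being used.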
4. How It Appears to Software
Stacking does NOT present as a single logical device. It creates a 2-node distributed cluster requiring explicit multi-node code:
- Frameworks must use distributed execution (e.g., PyTorch Distributed, Megatron-LM)
- MPI handles inter-process communication
- NCCL handles GPU-to-GPU tensor transfers across the 200GbE link
- This is fundamentally different from a single larger GPU — there is communication overhead
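Concretely, a PyTorch Distributed job is launched once per node rather than once per cluster. An illustrative launch, where `train.py` is a placeholder for your own distributed training script:

```shell
# On node 1 (the rendezvous master, 192.168.100.10):
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
  --master_addr=192.168.100.10 --master_port=29500 train.py

# On node 2, run the same command with --node_rank=1.
```

The script itself must use the distributed APIs (e.g. `torch.distributed.init_process_group` with the NCCL backend); single-device code does not transparently benefit from the second unit.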
5. Model Capacity
| Configuration | Memory | Max Model Size (approx) |
|---|---|---|
| Single unit | 128 GB | ~200B parameters (FP4) |
| Dual stacked | 256 GB | ~405B parameters (FP4) |
This enables running models like Llama 3.1 405B (with quantization) that would not fit in a single unit's memory.
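The table's sizing is consistent with FP4 weights taking 0.5 bytes per parameter, which leaves headroom in 256 GB for KV cache and activations. A back-of-envelope check:

```shell
# Weights-only footprint of a 405B-parameter model at FP4 (0.5 bytes/param).
awk 'BEGIN { printf "%.1f GB\n", 405e9 * 0.5 / 1e9 }'
```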
6. Scaling Beyond 2 Units
The documentation mentions potential for:
- Job orchestration with Slurm or Kubernetes
- Containerized execution with Singularity or Docker
Whether >2 units can be practically clustered is not explicitly documented, but the 200GbE RDMA networking and Slurm/K8s support suggest it is architecturally possible.
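If such a cluster were managed with Slurm, a job submission might look like the following. This is purely hypothetical, since the documentation mentions Slurm orchestration without giving a recipe; the node count and job script name are placeholders.

```shell
# Hypothetical 2-node Slurm job, one task per unit.
sbatch <<'EOF'
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
srun ./my_distributed_job
EOF
</imports>
```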
7. Physical Configuration
The compact form factor (150x150x51mm per unit) is designed to be stackable — two units sit on top of each other on a desk, connected via short (400-500mm) QSFP DAC cables.
Key Relationships
- Connected via: connectivity (QSFP56/ConnectX-7 ports)
- Extends capacity of: ai-workloads
- Doubles resources from: gb10-superchip, memory-and-storage
- Software stack: dgx-os-software