id: multi-unit-stacking
title: Multi-Unit Stacking
status: established
source_sections: NVIDIA DGX Spark User Guide: Spark Stacking; Jeff Geerling review; ServeTheHome review
related_topics: connectivity, gb10-superchip, ai-workloads, memory-and-storage
key_equations: (none)
key_terms: connectx-7, smartnic, qsfp, stacking, mpi, nccl, slurm, kubernetes
images: (none)
examples: (none)
open_questions: performance overhead of inter-unit communication vs. single unit (quantified); can more than 2 units be stacked?; actual tokens/sec for 405B models on stacked configuration

Multi-Unit Stacking

Two Dell Pro Max GB10 units can be connected to form a two-node distributed compute cluster, effectively doubling the compute and memory available for running larger AI models.

1. How It Works

Each Dell Pro Max GB10 has 2x QSFP56 200 Gbps ports powered by the NVIDIA ConnectX-7 SmartNIC. These ports enable direct unit-to-unit connection:

  • Combined memory: 256 GB (128 GB per unit, NOT unified — distributed across nodes)
  • Combined compute: 2 PFLOP FP4 (1 PFLOP per unit)
  • Interconnect: 200GbE RDMA via QSFP56 DAC cable
  • CX-7 ports support Ethernet configuration only — no InfiniBand (T1 Documented)
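
The headline numbers above reduce to simple arithmetic; the sketch below is just that back-of-the-envelope math (taking "GB" as 10^9 bytes), not a measurement.

```python
# Back-of-the-envelope figures for a 2-unit stack, from the bullets above.

def link_bandwidth_gb_per_s(link_gbps: float = 200.0) -> float:
    """Convert the 200 Gbps QSFP56 link rate into gigabytes per second."""
    return link_gbps / 8  # 8 bits per byte

combined_memory_gb = 2 * 128          # distributed across nodes, not unified
combined_compute_pflops_fp4 = 2 * 1   # 1 PFLOP FP4 per unit

print(link_bandwidth_gb_per_s())      # 25.0 GB/s raw per 200GbE link
print(combined_memory_gb)             # 256
```

Note what the first figure implies: 25 GB/s over the link is far slower than local memory access, which is why the communication-overhead caveat in section 4 matters.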

2. Required Hardware

Approved QSFP DAC Cables (T1 Documented, NVIDIA DGX Spark User Guide)

  Manufacturer   Part Number      Description
  Amphenol       NJAAKK-N911      QSFP to QSFP112, 32AWG, 400mm, LSZH
  Amphenol       NJAAKK0006       0.5m variant
  Luxshare       LMTQF022-SD-R    QSFP112 400G DAC Cable, 400mm, 30AWG

These are short DAC (Direct Attach Copper) cables. The units are designed to sit directly on top of each other.

3. Software Configuration (T1 Documented, NVIDIA DGX Spark User Guide)

Prerequisites

  • Two DGX Spark / Dell Pro Max GB10 systems
  • Both running Ubuntu 24.04 (or later) with NVIDIA drivers installed
  • Internet connectivity for initial setup
  • Root/sudo access on both systems

Network Setup

Option 1 — Automatic (Recommended): Use NVIDIA's netplan playbook, downloaded from their repository; the resulting configuration is applied with the standard netplan apply command.

Option 2 — Manual static IP:

  • Interface name: enP2p1s0f1np1
  • Node 1: 192.168.100.10/24
  • Node 2: 192.168.100.11/24
  • Verify with ping test between nodes
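
For the manual route, the static addressing above maps onto a netplan file along these lines (a sketch for Node 1; the filename 99-cluster.yaml is an arbitrary choice, and only the interface name and address come from the list above):

```yaml
# /etc/netplan/99-cluster.yaml  (filename is illustrative)
# Node 1; use 192.168.100.11/24 on Node 2.
network:
  version: 2
  ethernets:
    enP2p1s0f1np1:
      addresses:
        - 192.168.100.10/24
```

Apply with sudo netplan apply, then run the ping test between nodes.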

SSH Configuration

NVIDIA's discovery script automates setting up passwordless SSH between the two nodes; MPI requires this to launch and coordinate processes on the remote node.
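
If you are not using NVIDIA's script, passwordless SSH is conventionally set up by hand along these lines (run on Node 1 and repeat from Node 2; the username and the Node 2 address are assumptions carried over from the addressing above):

```shell
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519   # skip if a key already exists
ssh-copy-id user@192.168.100.11                    # install the public key on Node 2
ssh user@192.168.100.11 hostname                   # should succeed without a password prompt
```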

Communication Frameworks

  • MPI — inter-process CPU communication between nodes
  • NCCL v2.28.3 — GPU-accelerated collective operations across nodes

Verification

  1. Ping connectivity test between nodes
  2. Interface verification: ip a and ethtool
  3. NCCL test suite execution (via NVIDIA playbook)
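
Run by hand, steps 1 and 2 are ordinary networking commands; step 3 is normally driven by NVIDIA's playbook, but the underlying nccl-tests benchmark can be invoked directly. The sketch below assumes the addressing above and an already-built all_reduce_perf binary in the current directory:

```shell
ping -c 3 192.168.100.11            # step 1: node-to-node reachability
ip a show enP2p1s0f1np1             # step 2: address assigned to the QSFP port?
sudo ethtool enP2p1s0f1np1          # step 2: link up and negotiated at 200 Gb/s?

# step 3: two-node NCCL all-reduce benchmark (from the nccl-tests suite)
mpirun -np 2 -H 192.168.100.10,192.168.100.11 ./all_reduce_perf -b 8 -e 1G -f 2 -g 1
```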

4. How It Appears to Software

Stacking does NOT present as a single logical device. It creates a 2-node distributed cluster requiring explicit multi-node code:

  • Frameworks must use distributed execution (e.g., PyTorch Distributed, Megatron-LM)
  • MPI handles inter-process communication
  • NCCL handles GPU-to-GPU tensor transfers across the 200GbE link
  • This is fundamentally different from a single larger GPU — there is communication overhead
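
As a concrete illustration of that "explicit multi-node code" (a sketch, assuming PyTorch with the NCCL backend; the addresses and port are carried over from the setup above, and the script name is arbitrary):

```python
# allreduce_demo.py -- launch the SAME script on both nodes, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0|1> \
#            --master_addr=192.168.100.10 --master_port=29500 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL moves tensors over the 200GbE link
rank = dist.get_rank()
t = torch.ones(4, device="cuda") * (rank + 1)  # rank 0 holds 1s, rank 1 holds 2s
dist.all_reduce(t, op=dist.ReduceOp.SUM)       # with 2 ranks, both end up with 3s
print(f"rank {rank}: {t.tolist()}")
dist.destroy_process_group()
```

Every cross-node collective in this pattern pays the link cost, which is exactly the communication overhead the last bullet refers to.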

5. Model Capacity

  Configuration   Memory   Max Model Size (approx.)
  Single unit     128 GB   ~200B parameters (FP4)
  Dual stacked    256 GB   ~405B parameters (FP4)

This enables running models like Llama 3.1 405B (with quantization) that would not fit in a single unit's memory.
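
The table's figures follow from simple arithmetic. A sketch, assuming FP4 means 4 bits per parameter and reserving roughly 20% of memory for KV cache, activations, and runtime overhead (the 20% figure is an assumption chosen to land near the quoted ~200B/~405B numbers, not a documented value):

```python
def max_params_billions(memory_gb: float, bits_per_param: float = 4.0,
                        overhead_frac: float = 0.20) -> float:
    """Rough upper bound on model size fitting in `memory_gb` of memory.

    bits_per_param=4.0 models FP4 quantization; overhead_frac reserves
    memory for KV cache, activations, and buffers (assumed figure).
    """
    usable_bytes = memory_gb * 1e9 * (1 - overhead_frac)
    bytes_per_param = bits_per_param / 8
    return usable_bytes / bytes_per_param / 1e9

print(round(max_params_billions(128)))  # single unit -> ~205B parameters
print(round(max_params_billions(256)))  # dual stack  -> ~410B parameters
```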

6. Scaling Beyond 2 Units

The documentation mentions potential for:

  • Job orchestration with Slurm or Kubernetes
  • Containerized execution with Singularity or Docker

Whether >2 units can be practically clustered is not explicitly documented, but the 200GbE RDMA networking and Slurm/K8s support suggest it is architecturally possible.
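
If Slurm were used as the orchestrator, a job spanning both units might be sketched as below; everything here (one task per node, the inference.py placeholder, default partition) is an assumption, since the source only names Slurm and Kubernetes as options:

```shell
#!/bin/bash
#SBATCH --job-name=llm-inference   # illustrative name
#SBATCH --nodes=2                  # one GB10 unit per node
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1               # one GPU per unit

# srun starts one process per node; NCCL then carries GPU traffic
# over the 200GbE interconnect as described in section 4.
srun python inference.py           # inference.py is a placeholder
```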

7. Physical Configuration

The compact form factor (150x150x51mm per unit) is designed to be stackable — two units sit on top of each other on a desk, connected via short (400-500mm) QSFP DAC cables.

Key Relationships