id: multi-unit-stacking
title: Multi-Unit Stacking
status: established
source_sections: NVIDIA DGX Spark User Guide: Spark Stacking; Jeff Geerling review; ServeTheHome review
related_topics: connectivity, gb10-superchip, ai-workloads, memory-and-storage
key_equations: (none)
key_terms: connectx-7, smartnic, qsfp, stacking, mpi, nccl, slurm, kubernetes
images: (none)
examples: (none)
open_questions: performance overhead of inter-unit communication vs. single unit (quantified); can more than 2 units be stacked?; actual tokens/sec for 405B models on stacked configuration

Multi-Unit Stacking

Two Dell Pro Max GB10 units can be connected to form a two-node distributed compute cluster, effectively doubling the compute and memory available for running larger AI models.

1. How It Works

Each Dell Pro Max GB10 has 2x QSFP56 200 Gbps ports powered by the NVIDIA ConnectX-7 SmartNIC. These ports enable direct unit-to-unit connection:

  • Combined memory: 256 GB (128 GB per unit, NOT unified — distributed across nodes)
  • Combined compute: 2 PFLOP FP4 (1 PFLOP per unit)
  • Interconnect: 200GbE RDMA via QSFP56 DAC cable
  • CX-7 ports support Ethernet configuration only — no InfiniBand (T1 Documented)
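
The headline numbers above reduce to simple arithmetic; the sketch below is just that back-of-the-envelope math (taking "GB" as 10^9 bytes), not a measurement.

```python
# Back-of-the-envelope figures for a 2-unit stack, from the bullets above.

def link_bandwidth_gb_per_s(link_gbps: float = 200.0) -> float:
    """Convert the 200 Gbps QSFP56 link rate into gigabytes per second."""
    return link_gbps / 8  # 8 bits per byte

combined_memory_gb = 2 * 128          # distributed across nodes, not unified
combined_compute_pflops_fp4 = 2 * 1   # 1 PFLOP FP4 per unit

print(link_bandwidth_gb_per_s())      # 25.0 GB/s raw per 200GbE link
print(combined_memory_gb)             # 256
```

Note what the first figure implies: 25 GB/s over the link is far slower than local memory access, which is why the communication-overhead caveat in section 4 matters.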

2. Required Hardware

Approved QSFP DAC Cables (T1 Documented, NVIDIA DGX Spark User Guide)

  Manufacturer   Part Number      Description
  Amphenol       NJAAKK-N911      QSFP to QSFP112, 32AWG, 400mm, LSZH
  Amphenol       NJAAKK0006       0.5m variant
  Luxshare       LMTQF022-SD-R    QSFP112 400G DAC Cable, 400mm, 30AWG

These are short DAC (Direct Attach Copper) cables. The units are designed to sit directly on top of each other.

3. Software Configuration (T1 Documented, NVIDIA DGX Spark User Guide)

Prerequisites

  • Two DGX Spark / Dell Pro Max GB10 systems
  • Both running Ubuntu 24.04 (or later) with NVIDIA drivers installed
  • Internet connectivity for initial setup
  • Root/sudo access on both systems

Network Setup

Option 1 — Automatic (Recommended): Use NVIDIA's netplan playbook, downloaded from their repository; the resulting configuration is applied with the standard netplan apply command.

Option 2 — Manual static IP:

  • Interface name: enP2p1s0f1np1
  • Node 1: 192.168.100.10/24
  • Node 2: 192.168.100.11/24
  • Verify with ping test between nodes
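
For the manual route, the static addressing above maps onto a netplan file along these lines (a sketch for Node 1; the filename 99-cluster.yaml is an arbitrary choice, and only the interface name and address come from the list above):

```yaml
# /etc/netplan/99-cluster.yaml  (filename is illustrative)
# Node 1; use 192.168.100.11/24 on Node 2.
network:
  version: 2
  ethernets:
    enP2p1s0f1np1:
      addresses:
        - 192.168.100.10/24
```

Apply with sudo netplan apply, then run the ping test between nodes.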

SSH Configuration

NVIDIA's discovery script automates setting up passwordless SSH between the two nodes; MPI requires this to launch and coordinate processes on the remote node.
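
If you are not using NVIDIA's script, passwordless SSH is conventionally set up by hand along these lines (run on Node 1 and repeat from Node 2; the username and the Node 2 address are assumptions carried over from the addressing above):

```shell
ssh-keygen -t ed25519 -N "" -f ~/.ssh/id_ed25519   # skip if a key already exists
ssh-copy-id user@192.168.100.11                    # install the public key on Node 2
ssh user@192.168.100.11 hostname                   # should succeed without a password prompt
```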

Communication Frameworks

  • MPI — inter-process CPU communication between nodes
  • NCCL v2.28.3 — GPU-accelerated collective operations across nodes

Verification

  1. Ping connectivity test between nodes
  2. Interface verification: ip a and ethtool
  3. NCCL test suite execution (via NVIDIA playbook)
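
Run by hand, steps 1 and 2 are ordinary networking commands; step 3 is normally driven by NVIDIA's playbook, but the underlying nccl-tests benchmark can be invoked directly. The sketch below assumes the addressing above and an already-built all_reduce_perf binary in the current directory:

```shell
ping -c 3 192.168.100.11            # step 1: node-to-node reachability
ip a show enP2p1s0f1np1             # step 2: address assigned to the QSFP port?
sudo ethtool enP2p1s0f1np1          # step 2: link up and negotiated at 200 Gb/s?

# step 3: two-node NCCL all-reduce benchmark (from the nccl-tests suite)
mpirun -np 2 -H 192.168.100.10,192.168.100.11 ./all_reduce_perf -b 8 -e 1G -f 2 -g 1
```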

4. How It Appears to Software

Stacking does NOT present as a single logical device. It creates a 2-node distributed cluster requiring explicit multi-node code:

  • Frameworks must use distributed execution (e.g., PyTorch Distributed, Megatron-LM)
  • MPI handles inter-process communication
  • NCCL handles GPU-to-GPU tensor transfers across the 200GbE link
  • This is fundamentally different from a single larger GPU — there is communication overhead
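
As a concrete illustration of that "explicit multi-node code" (a sketch, assuming PyTorch with the NCCL backend; the addresses and port are carried over from the setup above, and the script name is arbitrary):

```python
# allreduce_demo.py -- launch the SAME script on both nodes, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0|1> \
#            --master_addr=192.168.100.10 --master_port=29500 allreduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")        # NCCL moves tensors over the 200GbE link
rank = dist.get_rank()
t = torch.ones(4, device="cuda") * (rank + 1)  # rank 0 holds 1s, rank 1 holds 2s
dist.all_reduce(t, op=dist.ReduceOp.SUM)       # with 2 ranks, both end up with 3s
print(f"rank {rank}: {t.tolist()}")
dist.destroy_process_group()
```

Every cross-node collective in this pattern pays the link cost, which is exactly the communication overhead the last bullet refers to.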

5. Model Capacity

  Configuration   Memory   Max Model Size (approx.)
  Single unit     128 GB   ~200B parameters (FP4)
  Dual stacked    256 GB   ~405B parameters (FP4)

This enables running models like Llama 3.1 405B (with quantization) that would not fit in a single unit's memory.
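
The table's figures follow from simple arithmetic. A sketch, assuming FP4 means 4 bits per parameter and reserving roughly 20% of memory for KV cache, activations, and runtime overhead (the 20% figure is an assumption chosen to land near the quoted ~200B/~405B numbers, not a documented value):

```python
def max_params_billions(memory_gb: float, bits_per_param: float = 4.0,
                        overhead_frac: float = 0.20) -> float:
    """Rough upper bound on model size fitting in `memory_gb` of memory.

    bits_per_param=4.0 models FP4 quantization; overhead_frac reserves
    memory for KV cache, activations, and buffers (assumed figure).
    """
    usable_bytes = memory_gb * 1e9 * (1 - overhead_frac)
    bytes_per_param = bits_per_param / 8
    return usable_bytes / bytes_per_param / 1e9

print(round(max_params_billions(128)))  # single unit -> ~205B parameters
print(round(max_params_billions(256)))  # dual stack  -> ~410B parameters
```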

6. Scaling Beyond 2 Units

The documentation mentions potential for:

  • Job orchestration with Slurm or Kubernetes
  • Containerized execution with Singularity or Docker

Whether >2 units can be practically clustered is not explicitly documented, but the 200GbE RDMA networking and Slurm/K8s support suggest it is architecturally possible.
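
If Slurm were used as the orchestrator, a job spanning both units might be sketched as below; everything here (one task per node, the inference.py placeholder, default partition) is an assumption, since the source only names Slurm and Kubernetes as options:

```shell
#!/bin/bash
#SBATCH --job-name=llm-inference   # illustrative name
#SBATCH --nodes=2                  # one GB10 unit per node
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1               # one GPU per unit

# srun starts one process per node; NCCL then carries GPU traffic
# over the 200GbE interconnect as described in section 4.
srun python inference.py           # inference.py is a placeholder
```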

7. Physical Configuration

The compact form factor (150x150x51mm per unit) is designed to be stackable — two units sit on top of each other on a desk, connected via short (400-500mm) QSFP DAC cables.

Key Relationships