This is the most technically interesting part of the cluster, and the area where I’m actively learning the most. The goal is multi-node GPU training across the two DGX Sparks using the 200 Gbps ConnectX-7 interconnect with GPU-Direct RDMA.
The hardware stack#
Each DGX Spark has:
- GB10 Grace Blackwell SoC — 1 PFLOP FP4, 128 GB unified memory
- ConnectX-7 NIC — 200 Gbps QSFP56, SR-IOV capable, RDMA (RoCE v2)
The two units are connected point-to-point via a QSFP56 DAC cable. No switch, no hops.
The Ryzen node also has an RTX 5070 Ti (Blackwell GB203), giving the cluster an all-Blackwell GPU setup. This means uniform support for FP4/FP8 precision, Transformer Engine, and consistent CUDA compute capabilities across every node — useful for developing on the 5070 Ti and deploying to the DGX Sparks without compatibility concerns.
The software stack#
Getting GPUs and RDMA working in Kubernetes on Talos Linux required assembling several components:
NVIDIA Device Plugin#
The NVIDIA Device Plugin advertises GPU resources to the Kubernetes scheduler. Pods request GPUs via nvidia.com/gpu: 1 in their resource spec. A corresponding RuntimeClass ensures GPU pods use the NVIDIA container runtime.
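For reference, the RuntimeClass itself is only a few lines. This is a sketch, assuming the containerd runtime on the GPU nodes is registered under the handler name `nvidia`:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia  # must match the runtime name configured in containerd
```

GPU pods then opt in with `runtimeClassName: nvidia` in their spec.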
SR-IOV Device Plugin (standalone)#
This is where things get interesting. The standard approach for SR-IOV in Kubernetes is the SR-IOV Network Operator, but it doesn’t work on Talos Linux. The operator’s config daemon expects to write to /etc/sriov-operator/, which doesn’t exist on Talos’s read-only root filesystem.
Instead, I built a standalone DaemonSet that:
- Runs an init container that creates 4 SR-IOV Virtual Functions on each DGX Spark’s ConnectX-7 NIC using sysfs
- Runs the SR-IOV Device Plugin container that discovers the VFs and advertises them to Kubernetes as `nvidia.com/cx7_qsfp` resources (4 per node)
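The init container’s work amounts to a couple of sysfs writes. Here’s a dry-run sketch — it prints the commands rather than executing them, and the PF interface name is a placeholder (check yours with `ip -br link`):

```shell
#!/bin/sh
# Dry-run sketch: prints the sysfs writes the init container would perform.
# PF is a placeholder interface name, not the real one on a DGX Spark.
PF="${PF:-enp1s0f0np0}"
NUM_VFS=4

# sriov_numvfs rejects a new value unless the current count is 0, so reset first.
echo "echo 0 > /sys/class/net/${PF}/device/sriov_numvfs"
echo "echo ${NUM_VFS} > /sys/class/net/${PF}/device/sriov_numvfs"
```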
See the journal entry on SR-IOV on Talos for the full story.
Multus CNI#
Multus is a meta-CNI that wraps Cilium and enables pods to have multiple network interfaces. On the DGX Sparks, pods can get:
- `eth0` — primary interface via Cilium (home network)
- `net1` — secondary interface via SR-IOV VF (200 Gbps QSFP link)
Multus runs as a “thick plugin” in kube-system with Talos-specific patches for the /var/run/netns/ path and resource limits.
NetworkAttachmentDefinition#
The dgx-qsfp NetworkAttachmentDefinition configures the secondary interface:
- IPAM: `host-local` on subnet `10.100.0.0/24`
- RDMA: enabled
- Type: SR-IOV
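Assuming the SR-IOV CNI binary is present on the nodes, the manifest looks roughly like this — the `resourceName` annotation ties the attachment to the VF pool from the device plugin; the exact `cniVersion` and field layout are a best-guess sketch:

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: dgx-qsfp
  annotations:
    # Binds this attachment to the VFs advertised by the device plugin
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/cx7_qsfp
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": {
        "type": "host-local",
        "subnet": "10.100.0.0/24"
      }
    }
```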
How it fits together#
```mermaid
graph TD
    subgraph "DGX Spark #1"
        Pod1["Training Pod"] --> VF1["SR-IOV VF<br/>net1: 10.100.0.x"]
        Pod1 --> GPU1["Blackwell GPU<br/>128 GB"]
        VF1 --> CX1["ConnectX-7<br/>PF"]
    end
    subgraph "DGX Spark #2"
        Pod2["Training Pod"] --> VF2["SR-IOV VF<br/>net1: 10.100.0.x"]
        Pod2 --> GPU2["Blackwell GPU<br/>128 GB"]
        VF2 --> CX2["ConnectX-7<br/>PF"]
    end
    CX1 ---|"200 Gbps QSFP DAC<br/>RDMA / RoCE v2"| CX2
```
When a pod on DGX Spark #1 needs to communicate with a pod on DGX Spark #2 for distributed training:
- NCCL detects the `net1` RDMA interface
- Data transfers bypass the kernel via RDMA (RoCE v2)
- GPU-Direct RDMA allows the GPU to read/write directly to the NIC’s memory, skipping the CPU entirely
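A quick way to sanity-check this path from inside a pod, assuming the container image ships the rdma-core tools:

```shell
# Run inside a pod that has the dgx-qsfp attachment
ibv_devices        # the VF should appear as an RDMA device (an mlx5_* entry)
rdma link show     # link state, and which netdev (net1) each RDMA device maps to
ip addr show net1  # the 10.100.0.x address handed out by host-local IPAM
```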
Example pod spec#
```yaml
resources:
  requests:
    nvidia.com/gpu: 1
    nvidia.com/cx7_qsfp: 1
  limits:
    nvidia.com/gpu: 1
    nvidia.com/cx7_qsfp: 1
```

With the annotation:
```yaml
annotations:
  k8s.v1.cni.cncf.io/networks: dgx-qsfp
```

NCCL environment variables#
```shell
NCCL_SOCKET_IFNAME=net1   # use the SR-IOV secondary interface
NCCL_IB_DISABLE=0         # keep the InfiniBand/RoCE transport enabled
NCCL_NET_GDR_LEVEL=5      # allow GPU-Direct RDMA at any topology distance
NCCL_DEBUG=INFO
```

Current status#
This is a work in progress. I’ve gotten:
- SR-IOV VFs created and advertised to Kubernetes
- Multus attaching secondary interfaces to pods
- Basic RDMA connectivity between pods on the two DGX Sparks
What I’m still working on:
- Validating GPU-Direct RDMA end-to-end with NCCL benchmarks
- Tuning NCCL parameters for optimal throughput on the 200 Gbps link
- Building a reproducible multi-node training workflow
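For the NCCL validation step, the launch I have in mind is the nccl-tests `all_reduce_perf` benchmark over MPI; the hostnames and binary path below are placeholders:

```shell
# Hypothetical two-node run of nccl-tests' all_reduce_perf (one GPU per node)
mpirun -np 2 -H spark-1:1,spark-2:1 \
  -x NCCL_SOCKET_IFNAME=net1 \
  -x NCCL_IB_DISABLE=0 \
  -x NCCL_NET_GDR_LEVEL=5 \
  -x NCCL_DEBUG=INFO \
  ./build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
```

If GPU-Direct is actually in play, the `NCCL_DEBUG=INFO` output should mention GDRDMA on the transport lines.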
Follow along in the journal.