Nodes#
| Node | Hardware | CPU | RAM | Architecture |
|---|---|---|---|---|
talos-76w-3r0 | Custom-built PC | AMD Ryzen 9 9950X3D (16c/32t, 3D V-Cache) | 192 GB DDR5 | AMD64 |
talos-7aj-lwl | NVIDIA DGX Spark (ASUS OEM) | GB10 Grace Blackwell SoC (20c) | 128 GB unified | ARM64 |
talos-ysi-4k0 | NVIDIA DGX Spark (ASUS OEM) | GB10 Grace Blackwell SoC (20c) | 128 GB unified | ARM64 |
All three nodes serve as both control-plane and worker nodes.
NVIDIA DGX Spark#
The DGX Spark is NVIDIA’s desktop AI workstation, built around the GB10 Grace Blackwell superchip:
- 1 PFLOP FP4 AI compute per unit (plus ~1.4 PFLOPS from the RTX 5070 Ti — ~3.4 PFLOPS total across the cluster)
- 128 GB unified memory shared between CPU and GPU — no PCIe bottleneck
- ARM64 (Grace CPU) — 20 Cortex-A720 cores
- ConnectX-7 NIC — Mellanox 200 Gbps QSFP56, SR-IOV capable
The unified memory architecture means the GPU can access the full 128 GB without transfers. This matters for large model inference and fine-tuning workloads that would be memory-constrained on discrete GPU systems.
The 200 Gbps interconnect#
The two DGX Sparks are directly connected via a QSFP56 DAC (Direct Attach Copper) cable between their ConnectX-7 NICs. No switch in the path — it’s a point-to-point 200 Gbps link.
graph LR
DGX1["DGX Spark #1
talos-7aj-lwl
128 GB"] --- QSFP["200 Gbps QSFP DAC
ConnectX-7 ↔ ConnectX-7"] --- DGX2["DGX Spark #2
talos-ysi-4k0
128 GB"]
This link supports RDMA (Remote Direct Memory Access) via RoCE v2, enabling GPU-Direct RDMA for NCCL multi-node training. The ConnectX-7 NICs are configured with SR-IOV to present Virtual Functions to pods, so containers can get direct access to the 200 Gbps link without going through the kernel network stack.
See GPU & AI for the full software stack that makes this work in Kubernetes.
The Ryzen node#
The custom PC fills a different role:
- AMD Ryzen 9 9950X3D — 16 cores / 32 threads with 3D V-Cache for cache-heavy workloads
- 192 GB DDR5 — the largest memory pool in the cluster
- NVIDIA RTX 5070 Ti — for AMD64 GPU workloads (CUDA, inference, dev/test)
- AMD64 — runs workloads that don’t have ARM64 builds
It handles most of the platform services (Keycloak, ArgoCD, Grafana, etc.) and can also run GPU workloads on the 5070 Ti, while the DGX Sparks focus on heavier AI training.
Operating system#
All nodes run Talos Linux — an immutable, API-driven Kubernetes OS. Key characteristics:
- No SSH. Everything is managed via the Talos API or
kubectl debug. - Read-only root filesystem. This has real implications — the SR-IOV Network Operator doesn’t work on Talos because it expects to write to
/etc/. I had to build a standalone SR-IOV device plugin instead. - Managed via Omni (Sidero Labs hosted platform) for lifecycle management and upgrades.
- Kubernetes with containerd.
Network topology#
graph TD
Internet["Internet"] --> Router["Home Router"]
Router --> Switch["Network Switch
192.168.4.0/24"]
Switch -->|"1 GbE"| PC["Custom PC
enp12s0"]
Switch -->|"1 GbE"| DGX1["DGX Spark #1
enP7s7"]
Switch -->|"1 GbE"| DGX2["DGX Spark #2
enP7s7"]
DGX1 ---|"200 Gbps QSFP DAC
(point-to-point)"| DGX2
style DGX1 fill:#76b900,color:#000
style DGX2 fill:#76b900,color:#000
style PC fill:#0078d4,color:#fff
Each node connects to the home network via 1 GbE for management and general traffic. The 200 Gbps link is exclusively for inter-DGX GPU communication.
All-Blackwell GPU cluster#
Every node in the cluster has a Blackwell-generation GPU:
| Node | GPU | Architecture | Compute Capability |
|---|---|---|---|
talos-76w-3r0 | RTX 5070 Ti | Blackwell (GB203) | sm_120 |
talos-7aj-lwl | GB10 (DGX Spark) | Blackwell (GB10B) | sm_100 |
talos-ysi-4k0 | GB10 (DGX Spark) | Blackwell (GB10B) | sm_100 |
Having a uniform GPU generation across the cluster — despite mixed CPU architectures (AMD64 + ARM64) — means every node supports the same Blackwell features:
- FP4 inference — Blackwell’s headline precision, delivering up to 2x throughput over FP8. Useful for experimenting with 4-bit quantized model serving.
- FP8 training and inference — second-gen FP8 support with improved accuracy via the Transformer Engine.
- NVIDIA Transformer Engine — automatic mixed-precision at the layer level. Available on every node, so I can test TE-accelerated training on both the DGX Sparks (multi-node) and the 5070 Ti (single-node dev iteration).
- Consistent CUDA toolkit — no need for multiple code paths or per-node compatibility workarounds. A container built for Blackwell runs on any node.
This also makes the cluster a useful testbed for Blackwell-specific features as NVIDIA rolls them out — any new CUDA capability that targets Blackwell compute capability is immediately available across the entire cluster.