A 3-node Kubernetes cluster running on Talos Linux, managed entirely through GitOps with FluxCD.

Note: This site is a living document. I use it to track what I’m building, what I’m learning, and what I still don’t understand.

The hardware

Node            Hardware          CPU                              RAM          Role
talos-76w-3r0   Custom PC         AMD Ryzen 9 9950X3D (16c/32t)    192 GB DDR5  Control plane + workloads
talos-7aj-lwl   NVIDIA DGX Spark  GB10 Grace Blackwell (20c ARM)   128 GB       Control plane + GPU workloads
talos-ysi-4k0   NVIDIA DGX Spark  GB10 Grace Blackwell (20c ARM)   128 GB       Control plane + GPU workloads

Totals: 56 cores, 448 GB RAM, 2 PFLOPS FP4 AI compute.

The two DGX Sparks are connected back-to-back via a 200 Gbps QSFP direct-attach copper (DAC) cable between their ConnectX-7 NICs. With SR-IOV exposing the link to pods, this enables GPUDirect RDMA for multi-node training with NCCL.
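Exposing that link to pods typically means a Multus NetworkAttachmentDefinition backed by the SR-IOV CNI. A minimal sketch of what that could look like — the resource name, network name, and subnet here are assumptions, not the actual cluster config:

```yaml
# Hypothetical secondary network for the ConnectX-7 link.
# The resourceName must match what the SR-IOV device plugin advertises.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: rdma-net
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/cx7_vf
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "sriov",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.100.0/24"
      }
    }
```

Pods then request the network via the `k8s.v1.cni.cncf.io/networks: rdma-net` annotation and get a VF-backed secondary interface alongside the Cilium-managed primary one.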

What’s running

The cluster runs a full platform stack — not because everything is needed, but because each component is something I wanted to understand deeply.

  • GitOps: FluxCD reconciles everything from Git. No kubectl apply, ever.
  • Networking: Cilium (eBPF CNI), MetalLB (L2), kgateway (Gateway API / Envoy), Tailscale subnet routing
  • Storage: Longhorn with the v2 data engine
  • Identity: Keycloak providing OIDC SSO for every web UI in the cluster
  • Observability: VictoriaMetrics, Grafana, Jaeger (OpenTelemetry), Hubble
  • Messaging: Kafka (Strimzi, KRaft mode), NATS (JetStream), RabbitMQ
  • Serverless: Knative Serving + Eventing (backed by Kafka)
  • Dev Platform: Coder, Temporal, GitHub Actions runners (ARC), HCP Terraform agents, Camel K
  • GPU/AI: NVIDIA device plugin, SR-IOV device plugin, Multus CNI for RDMA interfaces
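The "no kubectl apply, ever" rule means everything above enters the cluster through Flux resources roughly like this — the names, path, and interval are illustrative, not the repo's actual layout:

```yaml
# Hypothetical Flux Kustomization: Flux watches the Git repo and
# continuously reconciles the manifests under the given path.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure
  namespace: flux-system
spec:
  interval: 10m          # re-reconcile at least every 10 minutes
  sourceRef:
    kind: GitRepository
    name: flux-system
  path: ./infrastructure
  prune: true            # resources removed from Git get removed from the cluster
```

With `prune: true`, deleting a manifest from the repo deletes the object from the cluster — which is what makes "if it's not in the repo, it doesn't exist" literally true.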

What I’m learning right now

My current focus is on RDMA and SR-IOV — figuring out how to get multi-node GPU training working efficiently across the two DGX Sparks using the 200 Gbps ConnectX-7 interconnect. It’s been a deep rabbit hole involving:

  • SR-IOV Virtual Functions on Talos Linux (which has a read-only root filesystem)
  • Multus CNI for secondary network interfaces
  • NCCL configuration for GPUDirect RDMA
  • The gap between “it works in a single node” and “it works across nodes”
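Most of the cross-node tuning ends up as NCCL environment variables and extended resources on the training pods. A sketch under assumed names — the HCA device, interface names, and the SR-IOV resource name all vary by deployment:

```yaml
# Hypothetical container fragment for a multi-node NCCL job.
env:
  - name: NCCL_IB_HCA          # restrict NCCL to the RDMA-capable ConnectX-7
    value: mlx5_0
  - name: NCCL_SOCKET_IFNAME   # interface used for NCCL's TCP bootstrap
    value: eth0
  - name: NCCL_NET_GDR_LEVEL   # allow GPUDirect RDMA regardless of PCIe topology
    value: SYS
  - name: NCCL_DEBUG           # surface transport selection in the logs
    value: INFO
resources:
  limits:
    nvidia.com/gpu: "1"
    nvidia.com/cx7_vf: "1"     # assumed SR-IOV VF resource name
```

Setting `NCCL_DEBUG=INFO` is the quickest way to verify the gap between single-node and multi-node: the logs show whether NCCL actually picked the IB/RoCE transport with GDR, or silently fell back to TCP sockets.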

Read more in the journal.

Design philosophy

A few deliberate choices:

  1. All nodes are control-plane. All three nodes run both the control plane and workloads. Typical for a homelab, but it means every component must tolerate being scheduled onto control-plane nodes.
  2. Mixed architecture. AMD64 (Ryzen) and ARM64 (DGX Spark). CI runners and Terraform agents have separate scale sets for each arch.
  3. GitOps only. Every change goes through Git. FluxCD reconciles. If it’s not in the repo, it doesn’t exist.
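Choices 1 and 2 show up concretely in almost every workload spec: tolerate the control-plane taint, and pin to an architecture when the image isn't multi-arch. A generic sketch (the labels and taint key are standard Kubernetes ones, not cluster-specific):

```yaml
# Hypothetical pod spec fragment for this topology.
spec:
  tolerations:
    - key: node-role.kubernetes.io/control-plane
      operator: Exists
      effect: NoSchedule        # required when all nodes carry the taint
  nodeSelector:
    kubernetes.io/arch: arm64   # DGX Sparks; use amd64 for the Ryzen node
```

Multi-arch images can drop the `nodeSelector` entirely, which is why separate per-arch scale sets are only needed for things like CI runners that ship single-arch binaries.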