[{"content":" Nodes # Node Hardware CPU RAM Architecture talos-76w-3r0 Custom-built PC AMD Ryzen 9 9950X3D (16c/32t, 3D V-Cache) 192 GB DDR5 AMD64 talos-7aj-lwl NVIDIA DGX Spark (ASUS OEM) GB10 Grace Blackwell SoC (20c) 128 GB unified ARM64 talos-ysi-4k0 NVIDIA DGX Spark (ASUS OEM) GB10 Grace Blackwell SoC (20c) 128 GB unified ARM64 All three nodes serve as both control-plane and worker nodes.\nNVIDIA DGX Spark # The DGX Spark is NVIDIA\u0026rsquo;s desktop AI workstation, built around the GB10 Grace Blackwell superchip:\n1 PFLOP FP4 AI compute per unit (plus ~1.4 PFLOPS from the RTX 5070 Ti — ~3.4 PFLOPS total across the cluster) 128 GB unified memory shared between CPU and GPU — no PCIe bottleneck ARM64 (Grace CPU) — 20 Cortex-A720 cores ConnectX-7 NIC — Mellanox 200 Gbps QSFP56, SR-IOV capable The unified memory architecture means the GPU can access the full 128 GB without transfers. This matters for large model inference and fine-tuning workloads that would be memory-constrained on discrete GPU systems.\nThe 200 Gbps interconnect # The two DGX Sparks are directly connected via a QSFP56 DAC (Direct Attach Copper) cable between their ConnectX-7 NICs. No switch in the path — it\u0026rsquo;s a point-to-point 200 Gbps link.\ngraph LR DGX1[\"DGX Spark #1talos-7aj-lwl128 GB\"] --- QSFP[\"200 Gbps QSFP DACConnectX-7 ↔ ConnectX-7\"] --- DGX2[\"DGX Spark #2talos-ysi-4k0128 GB\"] This link supports RDMA (Remote Direct Memory Access) via RoCE v2, enabling GPU-Direct RDMA for NCCL multi-node training. 
The ConnectX-7 NICs are configured with SR-IOV to present Virtual Functions to pods, so containers can get direct access to the 200 Gbps link without going through the kernel network stack.\nSee GPU \u0026amp; AI for the full software stack that makes this work in Kubernetes.\nThe Ryzen node # The custom PC fills a different role:\nAMD Ryzen 9 9950X3D — 16 cores / 32 threads with 3D V-Cache for cache-heavy workloads 192 GB DDR5 — the largest memory pool in the cluster NVIDIA RTX 5070 Ti — for AMD64 GPU workloads (CUDA, inference, dev/test) AMD64 — runs workloads that don\u0026rsquo;t have ARM64 builds It handles most of the platform services (Keycloak, ArgoCD, Grafana, etc.) and can also run GPU workloads on the 5070 Ti, while the DGX Sparks focus on heavier AI training.\nOperating system # All nodes run Talos Linux — an immutable, API-driven Kubernetes OS. Key characteristics:\nNo SSH. Everything is managed via the Talos API or kubectl debug. Read-only root filesystem. This has real implications — the SR-IOV Network Operator doesn\u0026rsquo;t work on Talos because it expects to write to /etc/. I had to build a standalone SR-IOV device plugin instead. Managed via Omni (Sidero Labs hosted platform) for lifecycle management and upgrades. Kubernetes with containerd. Network topology # graph TD Internet[\"Internet\"] --\u003e Router[\"Home Router\"] Router --\u003e Switch[\"Network Switch192.168.4.0/24\"] Switch --\u003e|\"1 GbE\"| PC[\"Custom PCenp12s0\"] Switch --\u003e|\"1 GbE\"| DGX1[\"DGX Spark #1enP7s7\"] Switch --\u003e|\"1 GbE\"| DGX2[\"DGX Spark #2enP7s7\"] DGX1 ---|\"200 Gbps QSFP DAC(point-to-point)\"| DGX2 style DGX1 fill:#76b900,color:#000 style DGX2 fill:#76b900,color:#000 style PC fill:#0078d4,color:#fff Each node connects to the home network via 1 GbE for management and general traffic. 
The 200 Gbps link is exclusively for inter-DGX GPU communication.\nAll-Blackwell GPU cluster # Every node in the cluster has a Blackwell-generation GPU:\nNode GPU Architecture Compute Capability talos-76w-3r0 RTX 5070 Ti Blackwell (GB203) sm_120 talos-7aj-lwl GB10 (DGX Spark) Blackwell (GB10B) sm_100 talos-ysi-4k0 GB10 (DGX Spark) Blackwell (GB10B) sm_100 Having a uniform GPU generation across the cluster — despite mixed CPU architectures (AMD64 + ARM64) — means every node supports the same Blackwell features:\nFP4 inference — Blackwell\u0026rsquo;s headline precision, delivering up to 2x throughput over FP8. Useful for experimenting with 4-bit quantized model serving. FP8 training and inference — second-gen FP8 support with improved accuracy via the Transformer Engine. NVIDIA Transformer Engine — automatic mixed-precision at the layer level. Available on every node, so I can test TE-accelerated training on both the DGX Sparks (multi-node) and the 5070 Ti (single-node dev iteration). Consistent CUDA toolkit — no need for multiple code paths or per-node compatibility workarounds. A container built for Blackwell runs on any node. This also makes the cluster a useful testbed for Blackwell-specific features as NVIDIA rolls them out — any new CUDA capability that targets Blackwell compute capability is immediately available across the entire cluster.\n","date":"1 March 2026","externalUrl":null,"permalink":"/infrastructure/hardware/","section":"Infrastructure","summary":"Three nodes, two architectures, one 200 Gbps cable.","title":"Hardware","type":"infrastructure"},{"content":" Overview # Keycloak provides centralized identity management via the wcloud realm. 
Nearly every web UI in the cluster authenticates through Keycloak using OIDC.\nDeployed via the Keycloak Operator (OLM-managed) in the keycloak namespace.\nURL: keycloak.wcloud.sh\nOIDC consumers # Service Auth method ArgoCD OIDC client Argo Workflows OIDC client Grafana OIDC client Temporal OIDC client Kafbat (Kafka UI) OIDC client Headlamp kgateway OAuth2 policy → Keycloak RabbitMQ OAuth2 backend plugin External identity provider # Google OAuth is configured as an upstream identity provider in the wcloud realm. Users can sign in with their Google account, and Keycloak handles the federation and token issuance.\nSecrets # All OIDC client secrets are managed via the External Secrets Operator, synced from Infisical. No secrets live in Git.\n","date":"1 March 2026","externalUrl":null,"permalink":"/services/keycloak/","section":"Services","summary":"Centralized OIDC SSO for every service in the cluster.","title":"Identity: Keycloak","type":"services"},{"content":" FluxCD dependency chain # Everything in the cluster is deployed via FluxCD Kustomizations, organized in a dependency chain that ensures components are created in the right order:\ngraph TD NS[\"namespaces\"] --\u003e CRDs[\"crds\"] CRDs --\u003e Routes[\"routes\"] CRDs --\u003e Base[\"basecert-manager, ESO, MetalLB,Longhorn, kgateway\"] Base --\u003e Operators[\"operatorsStrimzi, OLM, Percona,KEDA, Dragonfly, etc.\"] Operators --\u003e Network[\"networkMetalLB pools, L2Adv,Gateway, HTTPRoutes\"] Operators --\u003e Messaging[\"messagingKafka, NATS, RabbitMQ\"] Operators --\u003e DevPlatform[\"dev-platformCoder, ARC, TFC agents,Camel K\"] Operators --\u003e Observability[\"observabilityVMKS, Jaeger,Headlamp\"] Operators --\u003e Security[\"securityKeycloak\"] Operators --\u003e Argo[\"argoArgoCD, Workflows,Events\"] Observability --\u003e VPA[\"vpaVPA resources\"] Each box is a FluxCD Kustomization pointing to a directory in the Git repository. 
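One node in that chain might be declared like this — a minimal sketch; the path, repo, and dependency names are illustrative, based on the directory layout described on this page rather than copied from the actual manifests:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: operators
  namespace: flux-system
spec:
  interval: 10m
  path: ./operators        # directory in the Git repository
  prune: true
  sourceRef:
    kind: GitRepository
    name: k8s-config       # repo name assumed from the repository structure section
  dependsOn:
    - name: base           # will not reconcile until base is healthy
```
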
Arrows represent dependsOn relationships — a Kustomization won\u0026rsquo;t reconcile until its dependencies are healthy.\nWhy this ordering matters # Namespaces first — every other resource needs its namespace to exist. CRDs before base — operators like cert-manager need their CRDs registered before the HelmRelease can create CRs. Base before operators — cert-manager needs to be running before anything that needs TLS certificates. ESO needs to be running before anything that needs secrets from Infisical. Security and Argo deploy in parallel — ArgoCD configures Keycloak OIDC endpoints, but doesn\u0026rsquo;t need Keycloak running to deploy. Login works once Keycloak comes up naturally. Traffic flow # External traffic enters the cluster through a single path:\ngraph LR Client[\"Client(on Tailnet)\"] --\u003e DNS[\"Cloudflare DNS*.wcloud.sh\"] DNS --\u003e TS[\"TailscaleSubnet Router\"] TS --\u003e VIP[\"MetalLB VIP192.168.4.2\"] VIP --\u003e GW[\"kgateway(Envoy)\"] GW --\u003e|\"SNI routing\"| Svc[\"KubernetesService\"] Svc --\u003e Pod[\"Pod\"] CM[\"cert-manager\"] -.-\u003e|\"TLS certs\"| GW ExtDNS[\"external-dns\"] -.-\u003e|\"DNS records\"| DNS Key points:\nAll services are exposed under *.wcloud.sh and are only reachable via Tailscale. external-dns watches Gateway API HTTPRoute resources and automatically creates Cloudflare DNS records. cert-manager provisions Let\u0026rsquo;s Encrypt TLS certificates using Cloudflare DNS-01 challenges. kgateway (Envoy-based) performs TLS termination and SNI-based routing. MetalLB advertises the VIP (192.168.4.2) via L2/ARP on the home network. 
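The per-service routing step in that flow can be sketched as a Gateway API HTTPRoute — the Gateway reference and backend Service name here are illustrative assumptions, while the hostname and namespaces come from this site's stack reference:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: vmks
spec:
  parentRefs:
    - name: gateway          # the kgateway Gateway (name assumed)
      namespace: kgateway
  hostnames:
    - grafana.wcloud.sh      # external-dns creates the Cloudflare record from this
  rules:
    - backendRefs:
        - name: grafana      # backend Service name assumed
          port: 80
```

external-dns and cert-manager both key off resources like this one, which is what makes adding a new service a single-manifest operation.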
Repository structure # k8s-config/ ├── clusters/willingham-k8s/ # FluxCD Kustomization entry points ├── namespaces/ # Namespace definitions ├── crds/ # Custom Resource Definitions ├── base/ # Foundational operators │ ├── cert-manager/ │ ├── external-secrets/ │ ├── longhorn/ │ ├── metallb/ │ └── kgateway/ ├── operators/ # Infrastructure operators ├── network/ # MetalLB pools, Gateway, routes ├── messaging/ # Kafka, NATS, RabbitMQ ├── security/ # Keycloak ├── observability/ # VMKS, Jaeger, Headlamp ├── argo/ # ArgoCD, Workflows, Events ├── dev-platform/ # Coder, ARC, TFC, Camel K ├── vpa/ # VPA resources ├── routes/ # Gateway API HTTPRoutes └── docs/ # Documentation Every directory corresponds to a FluxCD Kustomization. The cluster reconciles the master branch continuously.\n","date":"1 March 2026","externalUrl":null,"permalink":"/infrastructure/architecture/","section":"Infrastructure","summary":"How the cluster is organized — FluxCD dependency chains, traffic flow, and component relationships.","title":"Architecture","type":"infrastructure"},{"content":" Metrics: VictoriaMetrics # The VictoriaMetrics Kubernetes Stack (VMKS) replaces the traditional Prometheus stack. It includes:\nVictoriaMetrics as the metrics backend (metrics.wcloud.sh) Grafana for dashboards (grafana.wcloud.sh), authenticated via Keycloak OIDC AlertManager for alert routing (alerts.wcloud.sh) Tracing: Jaeger + OpenTelemetry # The OpenTelemetry Operator manages a Jaeger collector deployed as an OpenTelemetryCollector CR. Uses Badger embedded storage on a 10Gi PVC with 7-day trace retention.\nURL: jaeger.wcloud.sh\nNetwork observability: Hubble # Hubble provides L3/L4/L7 network flow visibility, powered by Cilium\u0026rsquo;s eBPF datapath. 
Useful for debugging network policies and understanding traffic patterns.\nURL: hubble.wcloud.sh\nKubernetes dashboard: Headlamp # Headlamp serves as the Kubernetes dashboard with plugins for:\nFlux cert-manager KEDA Authentication is handled via a kgateway OAuth2 policy that redirects to Keycloak.\nURL: k8s.wcloud.sh\n","date":"1 March 2026","externalUrl":null,"permalink":"/services/observability/","section":"Services","summary":"Metrics, tracing, and dashboards — VictoriaMetrics, Grafana, Jaeger, Hubble, and Headlamp.","title":"Observability","type":"services"},{"content":" Why three brokers? # Each broker excels at a different messaging pattern:\nBroker Strength Use case Kafka Durable event streaming, replay, ordering Event sourcing, Knative Eventing backend NATS Ultra-low latency, lightweight pub/sub Service-to-service messaging, request/reply RabbitMQ Flexible routing, dead-letter queues Task queues, complex routing topologies Running all three is partly practical, partly educational — understanding the trade-offs between them is the point.\nKafka (Strimzi) # Strimzi manages a Kafka cluster:\nKRaft mode — no ZooKeeper dependency 3 combined controller+broker nodes — each node runs both roles Replication factor 3, min ISR 2 — tolerates one node failure 50 GiB Longhorn storage per node Kafbat UI at kafka.wcloud.sh with Keycloak OIDC auth Kafka also serves as the default broker backend for Knative Eventing — CloudEvents flow through Kafka topics.\nNATS # NATS runs as a 3-node cluster with JetStream enabled:\nJetStream provides persistence, exactly-once delivery, and key-value store 20 GiB Longhorn storage per node Lightweight and fast — ideal for internal service communication RabbitMQ # RabbitMQ is deployed via the RabbitMQ Cluster Operator (OLM-managed):\n3 replicas with topology spread constraints across nodes 20 GiB Longhorn storage per node OAuth2 backend plugin — authenticates management UI users via Keycloak Management UI at rabbitmq.wcloud.sh ","date":"1 
March 2026","externalUrl":null,"permalink":"/services/messaging/","section":"Services","summary":"Three message brokers for three different patterns — Kafka, NATS, and RabbitMQ.","title":"Messaging","type":"services"},{"content":" CNI: Cilium # Cilium provides the primary CNI, installed by Talos/Omni (not managed in the GitOps repo). It runs in eBPF mode, which means network policy enforcement happens in the kernel without iptables.\nOne important configuration: cni.exclusive=false. This allows Multus to attach secondary network interfaces to pods — critical for the SR-IOV RDMA setup on the DGX Sparks.\nHubble provides network flow observability, accessible at hubble.wcloud.sh.\nLoad balancing: MetalLB # MetalLB runs in L2 mode, advertising service IPs via ARP on the home network.\nIP pool: 192.168.4.2 – 192.168.4.9\nA subtle challenge with MetalLB on this cluster: the DGX Sparks have two network interfaces — the 1 GbE enP7s7 (home network) and the 200 Gbps ConnectX-7 QSFP interface. MetalLB must only advertise on the home network interface, not the QSFP link. And the interface names differ between node types (enp12s0 on the Ryzen, enP7s7 on the DGX Sparks).\nThe solution is separate L2Advertisement resources per node type, each specifying the correct interface name via nodeSelectors and interfaces.\nIngress: kgateway (Gateway API) # kgateway provides the ingress layer using the Kubernetes Gateway API. It runs Envoy as the data plane.\nExternal VIP: 192.168.4.2 (from MetalLB) TLS termination with SNI-based routing HTTPRoutes define how traffic reaches each service All services are exposed under *.wcloud.sh. A few examples:\nService URL Grafana grafana.wcloud.sh ArgoCD argo-cd.wcloud.sh Keycloak keycloak.wcloud.sh Temporal temporal.wcloud.sh Longhorn longhorn.wcloud.sh TLS: cert-manager # cert-manager provisions TLS certificates from Let\u0026rsquo;s Encrypt using Cloudflare DNS-01 challenges. 
The Cloudflare API token is synced from Infisical via the External Secrets Operator.\nDNS: external-dns # external-dns watches Gateway API HTTPRoute resources and automatically creates/updates CNAME records in Cloudflare. When I add a new HTTPRoute, the DNS record appears within minutes.\nRemote access: Tailscale # The Tailscale Operator runs a subnet router (3 replicas, one per node) that advertises:\n192.168.4.0/24 — home network 10.96.0.0/12 — Kubernetes ClusterIP range 10.244.0.0/16 — Pod CIDR This means from any device on my Tailnet, I can reach any service, pod, or node in the cluster directly.\nSR-IOV and Multus # The DGX Spark QSFP interconnect uses a separate networking stack for RDMA:\nMultus CNI — attaches a secondary net1 interface to pods alongside the primary Cilium eth0 SR-IOV Device Plugin — creates Virtual Functions on the ConnectX-7 NICs and advertises them as nvidia.com/cx7_qsfp resources NetworkAttachmentDefinition — defines the dgx-qsfp network with IPAM on 10.100.0.0/24 and RDMA enabled This is covered in detail on the GPU \u0026amp; AI page and in the SR-IOV on Talos journal entry.\n","date":"1 March 2026","externalUrl":null,"permalink":"/infrastructure/networking/","section":"Infrastructure","summary":"Cilium, MetalLB, Gateway API, Tailscale, and a 200 Gbps RDMA interconnect.","title":"Networking","type":"infrastructure"},{"content":" Knative Serving # Knative Serving provides serverless compute — containers that auto-scale based on traffic, including scaling to zero when idle.\nDomain template: {name}-{namespace}.kn.wcloud.sh Networking: net-gateway-api bridges Knative to kgateway, so Knative Services get real Gateway API HTTPRoutes Managed by: Knative Operator Knative Eventing # Knative Eventing provides event-driven architecture with a CloudEvents contract.\nThe default broker is Kafka-backed via eventing-kafka-broker, which means:\nEvents are durably stored in Kafka topics Triggers subscribe to event types and route them to Knative 
Services The full Strimzi Kafka cluster provides the backbone graph LR Source[\"Event Source\"] --\u003e|CloudEvent| Broker[\"Knative Broker(Kafka-backed)\"] Broker --\u003e|Trigger filter| Svc1[\"Knative Service A\"] Broker --\u003e|Trigger filter| Svc2[\"Knative Service B\"] This integration is one of the more elegant parts of the cluster — serverless functions triggered by durable event streams.\n","date":"1 March 2026","externalUrl":null,"permalink":"/services/serverless/","section":"Services","summary":"Knative Serving for scale-to-zero workloads, Eventing backed by Kafka.","title":"Serverless: Knative","type":"services"},{"content":" Longhorn # Longhorn provides distributed block storage with replicated volumes and snapshots.\nv1 data engine # The cluster runs with the v1 (iSCSI-based) data engine. The v2 engine (SPDK) is not in use.\nResource tuning # On the DGX Spark nodes (20 cores each), the default Longhorn instance manager CPU reservation was too aggressive. The setting guaranteedInstanceManagerCPU is tuned to 5 to reduce per-instance-manager reservation.\nConsumers # Stateful workloads using Longhorn PVCs:\nWorkload Storage per replica Kafka (Strimzi, 3 brokers) 50 GiB NATS (3 nodes) 20 GiB RabbitMQ (3 replicas) 20 GiB Grafana varies Web UI # The Longhorn dashboard is available at longhorn.wcloud.sh for volume management, snapshot creation, and monitoring.\n","date":"1 March 2026","externalUrl":null,"permalink":"/infrastructure/storage/","section":"Infrastructure","summary":"Longhorn v1 data engine for distributed block storage.","title":"Storage","type":"infrastructure"},{"content":" Coder # Coder provides Cloud Development Environments running on the cluster. 
Workspaces spin up as Kubernetes pods with persistent storage.\nBackend: Percona PostgreSQL URL: coder.wcloud.sh Temporal # Temporal provides durable workflow execution — workflows that survive process crashes, node failures, and deployments.\n3 server replicas + 2 web UI replicas Backend: Percona PostgreSQL Auth: Keycloak OIDC URL: temporal.wcloud.sh VPA-managed resource allocation GitHub Actions runners (ARC) # Actions Runner Controller provides self-hosted GitHub Actions runners:\nScale set Architecture Replicas AMD64 x86_64 0–4 (autoscaled) ARM64 aarch64 0–4 (autoscaled) Separate scale sets for each architecture reflect the mixed AMD64/ARM64 nature of the cluster. Authentication uses a GitHub App.\nHCP Terraform agents # Self-hosted Terraform Cloud agents for running Terraform plans and applies:\nPool Architecture Replicas AMD64 x86_64 3–10 ARM64 aarch64 3–10 Managed by the HCP Terraform Operator.\nCamel K + Kaoto # Apache Camel K provides cloud-native integration — deploy integration routes as lightweight serverless-style containers.\nKaoto adds a visual low-code designer for building integration flows.\nBoth deployed via OLM. URL: camel.wcloud.sh\n","date":"1 March 2026","externalUrl":null,"permalink":"/services/dev-platform/","section":"Services","summary":"Coder CDEs, Temporal workflows, GitHub Actions runners, Terraform agents, and Camel K.","title":"Dev Platform","type":"services"},{"content":" FluxCD # FluxCD is the GitOps engine for the cluster. It continuously reconciles the desired state in Git with the actual state in the cluster.\nComponents # helm-controller kustomize-controller notification-controller source-controller Workflow # The workflow is strict:\nEdit YAML in the Git repository Commit and push to master FluxCD detects the change and reconciles The cluster converges to the desired state kubectl apply is never used directly. 
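The Git source behind this loop can be declared as a Flux GitRepository — a minimal sketch; the URL is a placeholder, and the repo name is assumed from the repository structure described elsewhere on this site:

```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: k8s-config
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example/k8s-config   # placeholder URL
  ref:
    branch: master       # the branch the cluster reconciles continuously
```
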
If something needs to change, it goes through Git.\nDependency management # FluxCD Kustomizations are organized with explicit dependsOn relationships. See Architecture for the full dependency graph.\nAutomated updates # Renovate runs on the repository to automatically propose updates to Helm chart versions and container image tags. Updates arrive as pull requests for review before merging.\nMonitoring # # Check Kustomization status flux get kustomizations # Check HelmRelease status across all namespaces flux get helmreleases -A # Follow reconciliation logs flux logs --all-namespaces --follow ","date":"1 March 2026","externalUrl":null,"permalink":"/infrastructure/gitops/","section":"Infrastructure","summary":"FluxCD-driven reconciliation — if it’s not in Git, it doesn’t exist.","title":"GitOps","type":"infrastructure"},{"content":"All three Argo components run in the argo namespace.\nArgoCD # ArgoCD handles application-level GitOps deployments. While FluxCD manages the infrastructure layer (operators, CRDs, networking), ArgoCD is available for application workloads.\nAuth: Keycloak OIDC URL: argo-cd.wcloud.sh Argo Workflows # Argo Workflows provides DAG-based workflow execution on Kubernetes. Workflows are defined as Kubernetes CRs and execute as pods.\nAuth: Keycloak OIDC URL: argo-workflows.wcloud.sh Argo Events # Argo Events connects event sources to workflow triggers. Events from webhooks, message queues, or schedules can trigger Argo Workflows or Kubernetes resources.\n","date":"1 March 2026","externalUrl":null,"permalink":"/services/argo/","section":"Services","summary":"ArgoCD for application GitOps, Argo Workflows for DAG execution, Argo Events for triggers.","title":"Argo Suite","type":"services"},{"content":"This is the most technically interesting part of the cluster, and the area where I\u0026rsquo;m actively learning the most. 
The goal is multi-node GPU training across the two DGX Sparks using the 200 Gbps ConnectX-7 interconnect with GPU-Direct RDMA.\nThe hardware stack # Each DGX Spark has:\nGB10 Grace Blackwell SoC — 1 PFLOP FP4, 128 GB unified memory ConnectX-7 NIC — 200 Gbps QSFP56, SR-IOV capable, RDMA (RoCE v2) The two units are connected point-to-point via a QSFP56 DAC cable. No switch, no hops.\nThe Ryzen node also has an RTX 5070 Ti (Blackwell GB203), giving the cluster an all-Blackwell GPU setup. This means uniform support for FP4/FP8 precision, Transformer Engine, and consistent CUDA compute capabilities across every node — useful for developing on the 5070 Ti and deploying to the DGX Sparks without compatibility concerns.\nThe software stack # Getting GPUs and RDMA working in Kubernetes on Talos Linux required assembling several components:\nNVIDIA Device Plugin # The NVIDIA Device Plugin advertises GPU resources to the Kubernetes scheduler. Pods request GPUs via nvidia.com/gpu: 1 in their resource spec. A corresponding RuntimeClass ensures GPU pods use the NVIDIA container runtime.\nSR-IOV Device Plugin (standalone) # This is where things get interesting. The standard approach for SR-IOV in Kubernetes is the SR-IOV Network Operator, but it doesn\u0026rsquo;t work on Talos Linux. The operator\u0026rsquo;s config daemon expects to write to /etc/sriov-operator/, which doesn\u0026rsquo;t exist on Talos\u0026rsquo;s read-only root filesystem.\nInstead, I built a standalone DaemonSet that:\nRuns an init container that creates 4 SR-IOV Virtual Functions on each DGX Spark\u0026rsquo;s ConnectX-7 NIC using sysfs Runs the SR-IOV Device Plugin container that discovers the VFs and advertises them to Kubernetes as nvidia.com/cx7_qsfp resources (4 per node) See the journal entry on SR-IOV on Talos for the full story.\nMultus CNI # Multus is a meta-CNI that wraps Cilium and enables pods to have multiple network interfaces. 
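The VF-creation step from the standalone DaemonSet described above amounts to a sysfs write from a privileged init container. A rough sketch — the PF interface name and image are illustrative assumptions, not taken from the actual manifest:

```yaml
# Fragment of a DaemonSet pod spec: create 4 Virtual Functions on the ConnectX-7 PF.
initContainers:
  - name: create-vfs
    image: busybox          # placeholder image
    securityContext:
      privileged: true      # required to write to sysfs on the host
    command:
      - sh
      - -c
      - echo 4 > /sys/class/net/enp1s0f0/device/sriov_numvfs   # PF name assumed
```
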
On the DGX Sparks, pods can get:\neth0 — primary interface via Cilium (home network) net1 — secondary interface via SR-IOV VF (200 Gbps QSFP link) Multus runs as a \u0026ldquo;thick plugin\u0026rdquo; in kube-system with Talos-specific patches for the /var/run/netns/ path and resource limits.\nNetworkAttachmentDefinition # The dgx-qsfp NetworkAttachmentDefinition configures the secondary interface:\nIPAM: host-local on subnet 10.100.0.0/24 RDMA: enabled Type: SR-IOV How it fits together # graph TD subgraph \"DGX Spark #1\" Pod1[\"Training Pod\"] --\u003e VF1[\"SR-IOV VFnet1: 10.100.0.x\"] Pod1 --\u003e GPU1[\"Blackwell GPU128 GB\"] VF1 --\u003e CX1[\"ConnectX-7PF\"] end subgraph \"DGX Spark #2\" Pod2[\"Training Pod\"] --\u003e VF2[\"SR-IOV VFnet1: 10.100.0.x\"] Pod2 --\u003e GPU2[\"Blackwell GPU128 GB\"] VF2 --\u003e CX2[\"ConnectX-7PF\"] end CX1 ---|\"200 Gbps QSFP DACRDMA / RoCE v2\"| CX2 When a pod on DGX Spark #1 needs to communicate with a pod on DGX Spark #2 for distributed training:\nNCCL detects the net1 RDMA interface Data transfers bypass the kernel via RDMA (RoCE v2) GPU-Direct RDMA allows the GPU to read/write directly to the NIC\u0026rsquo;s memory, skipping the CPU entirely Example pod spec # resources: requests: nvidia.com/gpu: 1 nvidia.com/cx7_qsfp: 1 limits: nvidia.com/gpu: 1 nvidia.com/cx7_qsfp: 1 With the annotation:\nannotations: k8s.v1.cni.cncf.io/networks: dgx-qsfp NCCL environment variables # NCCL_SOCKET_IFNAME=net1 NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=5 # GPU-Direct RDMA NCCL_DEBUG=INFO Current status # This is a work in progress. 
I\u0026rsquo;ve gotten:\nSR-IOV VFs created and advertised to Kubernetes Multus attaching secondary interfaces to pods Basic RDMA connectivity between pods on the two DGX Sparks What I\u0026rsquo;m still working on:\nValidating GPU-Direct RDMA end-to-end with NCCL benchmarks Tuning NCCL parameters for optimal throughput on the 200 Gbps link Building a reproducible multi-node training workflow Follow along in the journal.\n","date":"1 March 2026","externalUrl":null,"permalink":"/infrastructure/gpu/","section":"Infrastructure","summary":"NVIDIA DGX Spark GPU stack, SR-IOV RDMA, and multi-node training with NCCL.","title":"GPU \u0026 AI","type":"infrastructure"},{"content":"A complete reference of every component running in the cluster.\nCore platform # Component Namespace Notes Talos Linux — Immutable OS, managed via Omni Kubernetes — containerd runtime FluxCD flux-system helm-controller, kustomize-controller, source-controller, notification-controller Cilium kube-system eBPF CNI, installed by Talos/Omni Networking # Component Namespace Notes MetalLB metallb-system L2 mode, pool 192.168.4.2-9 kgateway kgateway Gateway API / Envoy cert-manager cert-manager Let\u0026rsquo;s Encrypt, Cloudflare DNS-01 external-dns external-dns Cloudflare provider Tailscale Operator tailscale Subnet router, 3 replicas Multus CNI kube-system Thick plugin, secondary interfaces GPU \u0026amp; SR-IOV # Component Namespace Notes NVIDIA Device Plugin kube-system GPU resource advertising SR-IOV Device Plugin kube-system Standalone (no operator), 4 VFs/node Storage # Component Namespace Notes Longhorn longhorn-system v1 data engine Identity \u0026amp; secrets # Component Namespace Notes Keycloak keycloak wcloud realm, OIDC SSO External Secrets Operator external-secrets Infisical Cloud backend Observability # Component Namespace Notes VictoriaMetrics K8s Stack vmks Metrics + Grafana + AlertManager OpenTelemetry Operator otel-system Manages Jaeger collector Jaeger jaeger Distributed tracing, 
Badger storage Headlamp headlamp K8s dashboard, OIDC via kgateway Messaging # Component Namespace Notes Strimzi (Kafka) kafka KRaft mode, 3 combined nodes NATS nats 3-node, JetStream enabled RabbitMQ rabbitmq OLM-managed, 3 replicas Serverless # Component Namespace Notes Knative Operator knative-operator Manages Serving + Eventing Knative Serving knative-serving *.kn.wcloud.sh, net-gateway-api Knative Eventing knative-eventing Kafka as default broker Dev platform # Component Namespace Notes Coder coder CDEs, Percona PostgreSQL backend Temporal temporal-system Durable workflows, OIDC, PostgreSQL GitHub ARC arc-systems AMD64 + ARM64 scale sets (0–4 each) HCP Terraform agents tfc-operator-system AMD64 + ARM64 pools (3–10 each) Camel K + Kaoto camel Cloud-native integration, OLM Argo # Component Namespace Notes ArgoCD argo GitOps, Keycloak OIDC Argo Workflows argo DAG workflows, Keycloak OIDC Argo Events argo Event-driven triggers Databases \u0026amp; caching # Component Namespace Notes Percona PostgreSQL Operator postgres Managed PostgreSQL clusters Dragonfly Operator dragonfly Redis-compatible, in-memory Autoscaling # Component Namespace Notes VPA — InPlaceOrRecreate update mode KEDA keda Event-driven autoscaling Other # Component Namespace Notes Kubernetes Replicator kube-system Secret/ConfigMap replication OLM olm Operator catalog Metrics Server kube-system Core resource metrics HCP Terraform Operator tfc-operator-system Terraform workspace management ","date":"1 March 2026","externalUrl":null,"permalink":"/infrastructure/stack/","section":"Infrastructure","summary":"Every component running in the cluster, at a glance.","title":"Stack Reference","type":"infrastructure"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":"","date":"2 March 
2026","externalUrl":null,"permalink":"/tags/dcgm/","section":"Tags","summary":"","title":"Dcgm","type":"tags"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/tags/dgx-spark/","section":"Tags","summary":"","title":"Dgx-Spark","type":"tags"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/tags/gpu/","section":"Tags","summary":"","title":"Gpu","type":"tags"},{"content":" The Setup # The cluster has three GPU nodes:\n2x DGX Spark — NVIDIA GB10 (Grace Blackwell SoC, ARM64, 128 GB unified memory) 1x Ryzen 9 workstation — NVIDIA GeForce RTX 5070 Ti (16 GB VRAM) I wanted a single Grafana dashboard that shows all three GPUs side by side, with clear labels for which hardware is which.\nDCGM Exporter # DCGM (Data Center GPU Manager) is NVIDIA\u0026rsquo;s tool for GPU telemetry. The dcgm-exporter Helm chart runs as a DaemonSet, exposing Prometheus-format metrics from every GPU node.\nThe key metrics it exposes:\nDCGM_FI_DEV_GPU_TEMP — GPU temperature DCGM_FI_DEV_POWER_USAGE — power draw in watts DCGM_FI_DEV_GPU_UTIL — GPU compute utilization DCGM_FI_DEV_SM_CLOCK — SM clock frequency DCGM_FI_DEV_FB_USED — framebuffer memory used DCGM_FI_PROF_PIPE_TENSOR_ACTIVE — tensor core utilization One nice thing about DCGM: every metric includes a modelName label, so you get NVIDIA GeForce RTX 5070 Ti or NVIDIA GB10 right in the time series data. No manual mapping needed.\nScraping with VictoriaMetrics # The cluster runs VictoriaMetrics (via the victoria-metrics-k8s-stack Helm chart) instead of Prometheus. 
Rather than using Prometheus ServiceMonitor CRDs (which don\u0026rsquo;t exist in this stack), I created a VMPodScrape:\napiVersion: operator.victoriametrics.com/v1beta1 kind: VMPodScrape metadata: name: dcgm-exporter namespace: dcgm-exporter spec: podMetricsEndpoints: - port: metrics path: /metrics relabelConfigs: - sourceLabels: [__meta_kubernetes_pod_node_name] targetLabel: instance metricRelabelConfigs: - action: labeldrop regex: exported_pod|exported_container|exported_namespace|pod|container|endpoint|prometheus selector: matchLabels: app.kubernetes.io/name: dcgm-exporter Two important details here:\nNode name as instance label. By default, the instance label is the pod IP + port (10.244.0.136:9400), which is meaningless. Relabeling it to __meta_kubernetes_pod_node_name gives you talos-76w-3r0 instead.\nDropping high-cardinality labels. DCGM reports which pod is using the GPU via exported_pod / exported_container labels. Every time a workload starts or stops, that creates a brand new time series. Combined with DaemonSet rollovers changing the pod label, you end up with an explosion of stale series. 
Dropping these labels keeps cardinality under control.\nThe Dashboard # I started from Grafana dashboard 12239 and customized it heavily.\nChanges from the stock dashboard:\nInstance variable queries Hostname label instead of instance All queries filter on Hostname and include instance!~\u0026quot;.*:.*\u0026quot; to exclude stale IP-based series Legend format uses {{modelName}} (GPU {{gpu}}) instead of just GPU {{gpu}} 2-column layout for the bottom panels (SM Clocks + Utilization, Tensor Cores + Framebuffer) GPU Power Total gauge rescaled from 2400W (data center scale) to 600W (homelab scale), with lastNotNull calc instead of sum All dropdowns default to \u0026ldquo;All\u0026rdquo; selected Note: the DGX Sparks don\u0026rsquo;t report DCGM_FI_PROF_PIPE_TENSOR_ACTIVE or DCGM_FI_DEV_FB_USED — the GB10\u0026rsquo;s unified memory architecture doesn\u0026rsquo;t support these counters.\nProvisioning via GitOps # The dashboard is provisioned as a Kubernetes ConfigMap with the grafana_dashboard: \u0026quot;1\u0026quot; label. Grafana\u0026rsquo;s sidecar container watches for ConfigMaps with this label and automatically loads them.\nThis approach avoids a known issue with the victoria-metrics-k8s-stack chart, which doesn\u0026rsquo;t allow grafana.sidecar.dashboards.enabled and grafana.dashboards to be set at the same time. 
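The sidecar-discovered provisioning described above is just a labeled ConfigMap wrapping the dashboard JSON — a sketch with illustrative names:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-dashboard        # name illustrative
  namespace: vmks
  labels:
    grafana_dashboard: '1'   # the label the Grafana sidecar watches for
data:
  gpu-dashboard.json: |
    { ... full dashboard JSON ... }
```
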
By using a standalone ConfigMap instead of the chart\u0026rsquo;s built-in dashboard provisioning, the sidecar stays enabled and everything works.\nThe full dashboard JSON, VMPodScrape, and HelmRelease are all managed via FluxCD — no manual kubectl apply needed (well, after the initial debugging).\n","date":"2 March 2026","externalUrl":null,"permalink":"/posts/gpu-dashboard/","section":"Journal","summary":"Building a custom GPU monitoring dashboard for a mixed DGX Spark + RTX 5070 Ti cluster using DCGM exporter, VictoriaMetrics, and Grafana.","title":"GPU Observability with DCGM and Grafana","type":"posts"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/tags/grafana/","section":"Tags","summary":"","title":"Grafana","type":"tags"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/categories/infrastructure/","section":"Categories","summary":"","title":"Infrastructure","type":"categories"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/posts/","section":"Journal","summary":"","title":"Journal","type":"posts"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/tags/nvidia/","section":"Tags","summary":"","title":"Nvidia","type":"tags"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/tags/observability/","section":"Tags","summary":"","title":"Observability","type":"tags"},{"content":"","date":"2 March 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"A 3-node Kubernetes cluster running on Talos Linux, managed entirely through GitOps with FluxCD.\nNote: This site is a living document. 
I use it to track what I\u0026rsquo;m building, what I\u0026rsquo;m learning, and what I still don\u0026rsquo;t understand.\n🖥️ The hardware # Node Hardware CPU RAM Role talos-76w-3r0 Custom PC AMD Ryzen 9 9950X3D (16c/32t) 192 GB DDR5 Control plane + workloads + GPU talos-7aj-lwl NVIDIA DGX Spark GB10 Grace Blackwell (20c ARM) 128 GB Control plane + GPU workloads talos-ysi-4k0 NVIDIA DGX Spark GB10 Grace Blackwell (20c ARM) 128 GB Control plane + GPU workloads Totals: 56 cores, 448 GB RAM, ~3.4 PFLOPS FP4 AI compute.\nThe two DGX Sparks are connected via a 200 Gbps QSFP DAC direct-attach cable using Mellanox ConnectX-7 NICs with SR-IOV, enabling GPU-Direct RDMA for multi-node training with NCCL.\n🚀 What\u0026rsquo;s running # The cluster runs a full platform stack — not because everything is needed, but because each component is something I wanted to understand deeply.\nGitOps: FluxCD reconciles everything from Git. No kubectl apply, ever.\nNetworking: Cilium (eBPF CNI), MetalLB (L2), kgateway (Gateway API / Envoy), Tailscale subnet routing\nStorage: Longhorn (v1 data engine)\nIdentity: Keycloak providing OIDC SSO for every web UI in the cluster\nObservability: VictoriaMetrics, Grafana, Jaeger (OpenTelemetry), Hubble\nMessaging: Kafka (Strimzi, KRaft mode), NATS (JetStream), RabbitMQ\nServerless: Knative Serving + Eventing (backed by Kafka)\nDev Platform: Coder, Temporal, GitHub Actions runners (ARC), HCP Terraform agents, Camel K\nGPU/AI: NVIDIA device plugin, SR-IOV device plugin, Multus CNI for RDMA interfaces\n🧠 What I\u0026rsquo;m learning right now # My current focus is on RDMA and SR-IOV — figuring out how to get multi-node GPU training working efficiently across the two DGX Sparks using the 200 Gbps ConnectX-7 interconnect. 
It\u0026rsquo;s been a deep rabbit hole involving:\nSR-IOV Virtual Functions on Talos Linux (which has a read-only root filesystem)\nMultus CNI for secondary network interfaces\nNCCL configuration for GPU-Direct RDMA\nThe gap between \u0026ldquo;it works in a single node\u0026rdquo; and \u0026ldquo;it works across nodes\u0026rdquo;\nRead more in the journal.\n🏗️ Design philosophy # A few deliberate choices:\nAll nodes are control-plane. Three nodes, all running both control plane and workloads. Typical for a homelab, but it means every component needs to tolerate this. Mixed architecture. AMD64 (Ryzen) and ARM64 (DGX Spark). CI runners and Terraform agents have separate scale sets for each arch. GitOps only. Every change goes through Git. FluxCD reconciles. If it\u0026rsquo;s not in the repo, it doesn\u0026rsquo;t exist. ","date":"2 March 2026","externalUrl":null,"permalink":"/","section":"Willingham Cloud","summary":"A 3-node Kubernetes cluster running on Talos Linux, managed entirely through GitOps with FluxCD.\nNote: This site is a living document. 
I use it to track what I’m building, what I’m learning, and what I still don’t understand.\n🖥️ The hardware # Node Hardware CPU RAM Role talos-76w-3r0 Custom PC AMD Ryzen 9 9950X3D (16c/32t) 192 GB DDR5 Control plane + workloads + GPU talos-7aj-lwl NVIDIA DGX Spark GB10 Grace Blackwell (20c ARM) 128 GB Control plane + GPU workloads talos-ysi-4k0 NVIDIA DGX Spark GB10 Grace Blackwell (20c ARM) 128 GB Control plane + GPU workloads Totals: 56 cores, 448 GB RAM, ~3.4 PFLOPS FP4 AI compute.\n","title":"Willingham Cloud","type":"page"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/5070-ti/","section":"Tags","summary":"","title":"5070-Ti","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/arc/","section":"Tags","summary":"","title":"Arc","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/architecture/","section":"Tags","summary":"","title":"Architecture","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/argo-events/","section":"Tags","summary":"","title":"Argo-Events","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/argo-workflows/","section":"Tags","summary":"","title":"Argo-Workflows","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/argocd/","section":"Tags","summary":"","title":"Argocd","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/backlog/","section":"Backlog","summary":"","title":"Backlog","type":"backlog"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/blackwell/","section":"Tags","summary":"","title":"Blackwell","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/camel-k/","section":"Tags","summary":"","title":"Camel-K","type":"tags"},{"content":"","date":"1 March 
2026","externalUrl":null,"permalink":"/tags/cilium/","section":"Tags","summary":"","title":"Cilium","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/coder/","section":"Tags","summary":"","title":"Coder","type":"tags"},{"content":" The problem # Every time I add a new service that needs OIDC, I have to manually create a Keycloak client — essentially running a script against the Keycloak admin API. The client ID, redirect URIs, and scopes aren\u0026rsquo;t tracked in Git. It works, but it\u0026rsquo;s hacky and fragile.
I already have 7 OIDC consumers in the cluster, and the list keeps growing. Each one requires creating the client, setting redirect URIs, configuring scopes, and syncing the client secret into Infisical so ESO can distribute it. None of that is declarative or gitops-friendly.
What I want to build # A Kubernetes operator in Go that watches a CRD like:

apiVersion: keycloak.wcloud.sh/v1alpha1
kind: KeycloakClient
metadata:
  name: grafana
  namespace: keycloak
spec:
  realmRef: wcloud
  clientId: grafana
  protocol: openid-connect
  redirectUris:
    - "https://grafana.wcloud.sh/login/generic_oauth"
  defaultScopes:
    - openid
    - profile
    - email

The operator would reconcile this against the Keycloak admin API — creating, updating, or deleting clients to match the desired state.
Why I want to do this # Two reasons:
Solve a real problem — get Keycloak client management into Git where it belongs, alongside everything else in the cluster.
Learn to build a proper Kubernetes operator — I\u0026rsquo;ve never written one from scratch. I want to learn the modern Go stack: kubebuilder or operator-sdk, controller-runtime, custom resource definitions, finalizers, status conditions, the whole reconciliation loop.
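The core of any such reconciliation loop is a small decision: compare the desired state (the CRD spec) with the actual state (what the Keycloak admin API reports) and pick an action. A minimal sketch in Python — the field names follow the CRD spec above, but the actual/desired dict shapes are illustrative assumptions, not the real Keycloak API:

```python
def reconcile(desired, actual):
    """Decide the action for one KeycloakClient resource."""
    if actual is None:
        return "create"  # client doesn't exist in Keycloak yet
    drift = sorted(k for k, v in desired.items() if actual.get(k) != v)
    return f"update {drift}" if drift else "in-sync"

desired = {
    "clientId": "grafana",
    "redirectUris": ["https://grafana.wcloud.sh/login/generic_oauth"],
    "defaultScopes": ["openid", "profile", "email"],
}

print(reconcile(desired, None))                             # → create
print(reconcile(desired, dict(desired)))                    # → in-sync
print(reconcile(desired, {**desired, "redirectUris": []}))  # → update ['redirectUris']
```

Deletion is the mirror case, and it's where finalizers earn their keep: the finalizer keeps the custom resource around until the operator has confirmed the client is actually gone from Keycloak.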
This feels like the right project for it because the scope is well-defined and I\u0026rsquo;ll actually use it every day.\nOpen questions # kubebuilder vs operator-sdk — which one is the better starting point in 2026? They share a lot of code (both use controller-runtime), but I need to pick one to scaffold with. Client secret lifecycle — should the operator generate the client secret and write it to a Kubernetes Secret, or should it integrate with Infisical/ESO? Probably both as options. Existing operators — there are community Keycloak operators that handle client management (like the one from Epam). Worth evaluating whether one of those is good enough, or if the learning value justifies building my own. Realm management — should I keep scope tight (just clients) or also handle realm configuration, identity providers, and client scopes? ","date":"1 March 2026","externalUrl":null,"permalink":"/backlog/keycloak-operator/","section":"Backlog","summary":"Build a Kubernetes operator in Go to declaratively manage Keycloak OIDC clients via CRDs.","title":"Custom Keycloak Client Operator","type":"backlog"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/fluxcd/","section":"Tags","summary":"","title":"Fluxcd","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/gateway-api/","section":"Tags","summary":"","title":"Gateway-Api","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/gitops/","section":"Tags","summary":"","title":"Gitops","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/golang/","section":"Tags","summary":"","title":"Golang","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/hardware/","section":"Tags","summary":"","title":"Hardware","type":"tags"},{"content":"The foundational layers of the cluster — hardware, networking, storage, and the GitOps workflow that ties everything 
together.\n","date":"1 March 2026","externalUrl":null,"permalink":"/infrastructure/","section":"Infrastructure","summary":"The foundational layers of the cluster — hardware, networking, storage, and the GitOps workflow that ties everything together.\n","title":"Infrastructure","type":"infrastructure"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/jaeger/","section":"Tags","summary":"","title":"Jaeger","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/kafka/","section":"Tags","summary":"","title":"Kafka","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/keycloak/","section":"Tags","summary":"","title":"Keycloak","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/knative/","section":"Tags","summary":"","title":"Knative","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/kubernetes/","section":"Tags","summary":"","title":"Kubernetes","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/categories/learning/","section":"Categories","summary":"","title":"Learning","type":"categories"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/longhorn/","section":"Tags","summary":"","title":"Longhorn","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/messaging/","section":"Tags","summary":"","title":"Messaging","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/metallb/","section":"Tags","summary":"","title":"Metallb","type":"tags"},{"content":" Why DGX Spark? # I\u0026rsquo;d been running a single-node homelab for a while — the Ryzen 9 build. 
It handled everything I threw at it, but I kept bumping into the same wall: multi-node GPU training is fundamentally different from single-node, and I couldn\u0026rsquo;t learn it without actually doing it.\nThe DGX Spark hit a sweet spot:\nDesktop form factor (no rack, no 240V circuit) Grace Blackwell SoC — ARM64 CPU and Blackwell GPU with 128 GB unified memory ConnectX-7 NIC — 200 Gbps, SR-IOV, RDMA-capable The GPU and NIC can talk directly via GPU-Direct RDMA I bought two. Connected them with a QSFP56 DAC cable. And then the real learning started.\nWhat is RDMA? # RDMA (Remote Direct Memory Access) lets one machine read from or write to another machine\u0026rsquo;s memory without involving the remote CPU. The NIC handles the transfer directly.\nIn traditional networking:\nApplication calls send() Data copies from user space to kernel buffer Kernel hands it to the NIC NIC transmits Remote NIC receives Remote kernel copies to kernel buffer Remote kernel copies to user-space application With RDMA:\nApplication registers a memory region with the NIC NIC reads directly from application memory and transmits Remote NIC writes directly to the remote application\u0026rsquo;s memory No kernel involvement on the data path. No copies. The CPU is free to do other work while data moves at line rate.\nRoCE v2 # There are several RDMA transports. The ConnectX-7 supports RoCE v2 (RDMA over Converged Ethernet), which runs RDMA over standard Ethernet with UDP encapsulation. 
This means I don\u0026rsquo;t need InfiniBand hardware — just Ethernet (or in my case, a direct QSFP cable).\nWhy this matters for GPU training # Distributed training involves two main communication patterns:\nAll-Reduce — every GPU needs to aggregate gradients from every other GPU after each training step All-Gather — redistributing updated parameters to all GPUs With GPU-Direct RDMA, NCCL (NVIDIA\u0026rsquo;s collective communication library) can move data between GPUs on different nodes without touching the CPU or system memory:\nGPU (Node 1) → NIC (Node 1) → [200 Gbps link] → NIC (Node 2) → GPU (Node 2) The GPU\u0026rsquo;s memory is directly accessible to the NIC. No staging through CPU memory. This is critical for training performance — the faster the inter-node communication, the less time GPUs spend idle waiting for gradient synchronization.\nWhere I\u0026rsquo;m at # Working # Both DGX Sparks are running Talos Linux and joined to the Kubernetes cluster The ConnectX-7 NICs are detected and the QSFP link is up at 200 Gbps SR-IOV Virtual Functions are created via a custom init container (Talos\u0026rsquo;s read-only rootfs prevents the standard SR-IOV Network Operator from working) Multus attaches a secondary net1 interface to pods, connected to the QSFP link Basic pod-to-pod communication works across the 200 Gbps link In progress # Validating RDMA operations (ibv_rc_pingpong, etc.) between pods Getting NCCL to detect and use the RDMA interface Running NCCL benchmarks (all_reduce_perf) to measure actual throughput Open questions # What\u0026rsquo;s the real-world NCCL throughput I should expect on this link? The theoretical max is 200 Gbps (25 GB/s), but protocol overhead, message size, and GPU-Direct efficiency all affect this. Do I need to tune any RoCE v2 parameters (ECN, PFC) for optimal performance on a direct point-to-point link? Most RoCE tuning guides assume a switched fabric. 
How does the unified memory architecture of the GB10 interact with GPU-Direct RDMA? The GPU and CPU share the same physical memory — does this change how GPU-Direct works compared to discrete GPU systems? These are the questions I\u0026rsquo;m actively investigating. I\u0026rsquo;ll update this series as I figure them out.\n","date":"1 March 2026","externalUrl":null,"permalink":"/posts/rdma-journey/","section":"Journal","summary":"Why I bought two DGX Sparks, what RDMA actually is, and where I’m at with getting it working in Kubernetes.","title":"My RDMA Journey with DGX Spark","type":"posts"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/nats/","section":"Tags","summary":"","title":"Nats","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/nccl/","section":"Tags","summary":"","title":"Nccl","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/networking/","section":"Tags","summary":"","title":"Networking","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/oidc/","section":"Tags","summary":"","title":"Oidc","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/opentelemetry/","section":"Tags","summary":"","title":"Opentelemetry","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/operator/","section":"Tags","summary":"","title":"Operator","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/rabbitmq/","section":"Tags","summary":"","title":"Rabbitmq","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/rdma/","section":"Tags","summary":"","title":"Rdma","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/series/rdma-on-kubernetes/","section":"Series","summary":"","title":"RDMA on Kubernetes","type":"series"},{"content":"","date":"1 March 
2026","externalUrl":null,"permalink":"/tags/reference/","section":"Tags","summary":"","title":"Reference","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/series/","section":"Series","summary":"","title":"Series","type":"series"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/serverless/","section":"Tags","summary":"","title":"Serverless","type":"tags"},{"content":"Application-layer services built on top of the infrastructure — identity, observability, messaging, serverless, and developer tooling.\n","date":"1 March 2026","externalUrl":null,"permalink":"/services/","section":"Services","summary":"Application-layer services built on top of the infrastructure — identity, observability, messaging, serverless, and developer tooling.\n","title":"Services","type":"services"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/sr-iov/","section":"Tags","summary":"","title":"Sr-Iov","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/sso/","section":"Tags","summary":"","title":"Sso","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/storage/","section":"Tags","summary":"","title":"Storage","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/tailscale/","section":"Tags","summary":"","title":"Tailscale","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/temporal/","section":"Tags","summary":"","title":"Temporal","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/terraform/","section":"Tags","summary":"","title":"Terraform","type":"tags"},{"content":"","date":"1 March 2026","externalUrl":null,"permalink":"/tags/victoriametrics/","section":"Tags","summary":"","title":"Victoriametrics","type":"tags"},{"content":" The problem # The standard way to expose SR-IOV devices in Kubernetes is the SR-IOV Network 
Operator. It handles:
- Discovering SR-IOV-capable NICs
- Creating Virtual Functions (VFs) on those NICs
- Running the SR-IOV device plugin to advertise VFs to Kubernetes
- Managing NetworkAttachmentDefinitions for Multus

Sounds perfect. Except it crashes on Talos Linux.
The root cause # The operator\u0026rsquo;s sriov-config-daemon assumes it can write to /etc/sriov-operator/ and other paths on the host filesystem. On a normal Linux distro, this works fine. On Talos, the root filesystem is read-only and immutable. There\u0026rsquo;s no /etc/sriov-operator/, and you can\u0026rsquo;t create it.
The daemon enters a crash loop:

error creating /etc/sriov-operator/config.yaml: read-only file system

I looked into patching the operator to use a different path, but the assumption of a writable /etc/ is baked deep into the codebase. It wasn\u0026rsquo;t a quick fix.
The solution: standalone device plugin # Instead of fighting the operator, I stripped it down to just the parts I needed and built a standalone DaemonSet.
Step 1: Create VFs with an init container # SR-IOV Virtual Functions are created by writing to sysfs, which is writable on Talos (it\u0026rsquo;s a kernel-backed virtual filesystem, separate from the read-only root filesystem):

# Enable SR-IOV and create 4 VFs on the ConnectX-7
echo 4 > /sys/class/net/<pf-name>/device/sriov_numvfs

The init container:
- Finds the ConnectX-7 PF (Physical Function) on the DGX Spark nodes
- Writes to sriov_numvfs to create 4 Virtual Functions
- Waits for the VF netdevs to appear

This runs as a privileged init container in the DaemonSet, with a nodeSelector targeting only the DGX Spark nodes.
Step 2: Run the device plugin # The main container in the DaemonSet runs the SR-IOV Network Device Plugin directly, configured to discover the VFs and advertise them as nvidia.com/cx7_qsfp resources.
The device plugin config:

{
  "resourceList": [
    {
      "resourceName": "cx7_qsfp",
      "resourcePrefix": "nvidia.com",
      "selectors": {
        "vendors": ["15b3"],
        "devices": ["101e"],
        "drivers": ["mlx5_core"],
        "pfNames": ["<pf-name>#0-3"]
      }
    }
  ]
}

This discovers VFs 0–3 on the ConnectX-7 PF and registers them with Kubernetes.
Step 3: NetworkAttachmentDefinition # A NetworkAttachmentDefinition tells Multus how to configure the secondary interface:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: dgx-qsfp
  namespace: kube-system
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "dgx-qsfp",
      "type": "sriov",
      "ipam": {
        "type": "host-local",
        "subnet": "10.100.0.0/24"
      },
      "rdma": true
    }

RDMA is explicitly enabled, and IPAM assigns addresses from 10.100.0.0/24.
The result # After deploying the DaemonSet:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,QSFP:.status.allocatable.nvidia\\.com/cx7_qsfp
NAME            QSFP
talos-76w-3r0   <none>
talos-7aj-lwl   4
talos-ysi-4k0   4

The DGX Spark nodes each advertise 4 nvidia.com/cx7_qsfp resources. Pods can request one to get a secondary network interface on the 200 Gbps QSFP link.
Lessons learned # Don\u0026rsquo;t fight the operator. If an operator assumes a writable rootfs and your OS doesn\u0026rsquo;t have one, building a simpler standalone solution is faster than patching the operator. sysfs is your friend on Talos. The kernel interface is always writable, even when the root filesystem isn\u0026rsquo;t. SR-IOV VF creation works through sysfs, so we can work around the read-only /etc/ restriction. The device plugin is the easy part.
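The three init-container steps described above (find the PF, write sriov_numvfs, wait for VF netdevs) can be sketched as follows. This is a Python sketch for clarity, not the post's actual init container; the PF name is a placeholder (the real interface name is elided in the post), and on a machine without that interface the script just reports and exits:

```python
import time
from pathlib import Path

PF_NAME = "enp1s0f0np0"  # placeholder; the real ConnectX-7 PF name differs
NUM_VFS = 4

numvfs = Path(f"/sys/class/net/{PF_NAME}/device/sriov_numvfs")
print(f"would write {NUM_VFS} to {numvfs}")

if numvfs.exists():
    # Writing to sysfs creates the VFs, even though the rootfs is read-only.
    numvfs.write_text(str(NUM_VFS))
    # Wait for the virtfn0..virtfnN-1 symlinks (the VF devices) to appear.
    for _ in range(30):
        if len(list(numvfs.parent.glob("virtfn*"))) >= NUM_VFS:
            break
        time.sleep(1)
else:
    print("no such PF on this node; nothing to do")
```

Running this requires the same privileges the post's init container has (privileged, host sysfs mounted); without them it only prints what it would do.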
The SR-IOV device plugin is a relatively simple binary that watches sysfs for VFs and registers them with the kubelet. The complexity in the operator is mostly around automation that we handled with a simpler init container approach. Init containers for hardware setup. Using an init container to configure SR-IOV VFs is clean — it runs once at pod startup, before the device plugin starts discovering VFs. The ordering guarantee is exactly what we need. ","date":"15 February 2026","externalUrl":null,"permalink":"/posts/sriov-on-talos/","section":"Journal","summary":"The SR-IOV Network Operator crashes on Talos Linux. Here’s the standalone device plugin I built instead.","title":"SR-IOV on Talos: Why the Operator Doesn't Work","type":"posts"},{"content":"","date":"15 February 2026","externalUrl":null,"permalink":"/tags/talos/","section":"Tags","summary":"","title":"Talos","type":"tags"},{"content":" About this site # This is a documentation site and learning journal for willingham-k8s, a homelab Kubernetes cluster I built to learn about GPU infrastructure, RDMA networking, and platform engineering at a level deeper than what you can get from managed cloud services.\nThe cluster runs real workloads — development environments, CI runners, workflow orchestration, event-driven processing — on hardware I own and manage. When something breaks, there\u0026rsquo;s no support ticket to file. That\u0026rsquo;s the point.\nAbout the cluster # 3 nodes: 1 custom AMD64 PC + 2 NVIDIA DGX Sparks (ARM64) 448 GB RAM, ~3.4 PFLOPS FP4 AI compute 200 Gbps RDMA interconnect between the DGX Sparks Talos Linux — immutable, API-driven, no SSH GitOps via FluxCD — every change goes through Git Keycloak SSO for every service See the infrastructure section for the full breakdown.\nAbout me # I\u0026rsquo;m Michael Willingham — software engineer on the Fleet Automation team at NMC² (NorthMark Compute \u0026amp; Cloud), a private data center and high-performance compute infrastructure company. 
You might have spotted the logo on a Williams Racing F1 car. Outside of work, I run this homelab to go deeper on the same kinds of problems: Kubernetes, GPU scheduling, RDMA networking, and platform engineering.\nGitHub LinkedIn About the tech # This site is built with Hugo and the Blowfish theme, deployed to GitHub Pages via GitHub Actions. Source is at willingham-cloud/lab.wcloud.sh.\n","externalUrl":null,"permalink":"/about/","section":"About","summary":"About this site # This is a documentation site and learning journal for willingham-k8s, a homelab Kubernetes cluster I built to learn about GPU infrastructure, RDMA networking, and platform engineering at a level deeper than what you can get from managed cloud services.\nThe cluster runs real workloads — development environments, CI runners, workflow orchestration, event-driven processing — on hardware I own and manage. When something breaks, there’s no support ticket to file. That’s the point.\n","title":"About","type":"about"}]