# VLAN Node-IP Migration Plan

## Document Purpose

This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.

## Current State (2025-11-20)

### Cluster Status

- **n1** (control plane): `` - Ready ✅
- **n2** (worker): `` - Ready ✅
- **n3** (worker): `` - Ready ✅

### Current Configuration

All nodes are using **external IPs** for `node-ip`:

- n1: `node-ip: `
- n2: `node-ip: `
- n3: `node-ip: `

### Issues with Current Setup

1. ❌ Inter-node pod traffic uses the **public internet** (external IPs)
2. ❌ VLAN bandwidth (100Mbps dedicated) is **unused**
3. ❌ Less secure (traffic exposed on the public network)
4. ❌ Potentially slower for inter-pod communication

### What's Working

1. ✅ All nodes joined and operational
2. ✅ Cilium CNI deployed and functional
3. ✅ Global Talos API access enabled (ports 50000, 50001)
4. ✅ GitOps with Flux operational
5. ✅ Core infrastructure recovering

## Goal: VLAN Migration

### Target Configuration

All nodes using **VLAN IPs** for `node-ip`:

- n1: `` (control plane)
- n2: `` (worker)
- n3: `` (worker)

### Benefits

1. ✅ 100Mbps dedicated bandwidth for inter-node traffic
2. ✅ Private network (more secure)
3. ✅ Lower latency for pod-to-pod communication
4. ✅ Production-ready architecture

## Issues Encountered During Initial Attempt

### Issue 1: API Server Endpoint Mismatch

**Problem:**

- `api.keyboardvagabond.com` resolves to n1's external IP (``)
- Worker nodes with a VLAN node-ip couldn't reach the API server
- n3 failed to join the cluster

**Solution:** Must choose ONE of:

- **Option A:** Set `cluster.controlPlane.endpoint: https://:6443` in ALL machine configs
- **Option B:** Update DNS so `api.keyboardvagabond.com` resolves to `` (VLAN IP)

**Recommended:** Option A (simpler, no DNS changes needed)

### Issue 2: Cluster Lockout After n1 Migration

**Problem:**

- When n1 was changed to a VLAN node-ip, all access was lost
- Tailscale pods couldn't start (needed API server access)
- Cilium policies blocked external Talos API access
- Complete lockout - no `kubectl` or `talosctl` access

**Root Cause:**

- Tailscale requires the API server to be reachable from the external network
- Once n1 switched to VLAN-only, Tailscale couldn't connect
- Without Tailscale, no VPN access to the cluster

**Solution:**

- ✅ Enabled **global Talos API access** (ports 50000, 50001) in Cilium policies
- This prevents future lockouts during network migrations

### Issue 3: etcd Data Loss After Bootstrap

**Problem:**

- After multiple reboots/config changes, etcd lost its data
- The `/var/lib/etcd/member` directory was empty
- etcd was stuck waiting to join the cluster

**Solution:**

- Ran `talosctl bootstrap` to reinitialize etcd
- GitOps (Flux) automatically redeployed all workloads from Git
- Longhorn has S3 backups for persistent data recovery

### Issue 4: Machine Config Format Issues

**Problem:**

- `machineconfigs/n1.yaml` was in resource dump format (with a `spec: |` wrapper)
- YAML indentation errors in various config files
- SOPS encryption complications

**Solution:**

- Use `.decrypted~` files for direct manipulation
- Careful YAML indentation (list items with inline keys)
- Apply configs in maintenance mode with the `--insecure` flag (see the sketch below)
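For reference, a minimal sketch of the maintenance-mode apply; the node address and file name are placeholders, not values recorded in this plan:

```bash
# Hypothetical example - substitute the node's actual maintenance-mode address
# and the decrypted config file for that node.
# --insecure is needed because a node in maintenance mode has no client
# certificates yet, so talosctl cannot perform mutual TLS.
talosctl apply-config --insecure \
  --nodes <node-maintenance-ip> \
  --file machineconfigs/<node>.yaml.decrypted~
```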
## Migration Plan: Phased VLAN Rollout

### Prerequisites

1. ✅ All nodes in stable, working state (DONE)
2. ✅ Global Talos API access enabled (DONE)
3. ✅ GitOps with Flux operational (DONE)
4. ⏳ Verify Longhorn S3 backups are current
5. ⏳ Document current pod placement and workload state

### Phase 1: Prepare Configurations

#### 1.1 Update Machine Configs for VLAN

For each node, update the machine config:

**n1 (control plane):**

```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```

**n2 & n3 (workers):**

```yaml
cluster:
  controlPlane:
    endpoint: https://:6443 # Use n1's VLAN IP
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```

#### 1.2 Update Cilium Configuration

Verify Cilium is configured to use the VLAN interface:

```yaml
# manifests/infrastructure/cilium/release.yaml
values:
  kubeProxyReplacement: strict
  # Ensure Cilium detects and uses VLAN interface
```

### Phase 2: Test with Worker Node First

#### 2.1 Migrate n3 (Worker Node)

Test VLAN migration on a worker node first:

```bash
# Apply updated config to n3
cd /Users//src/keyboard-vagabond
talosctl -e -n apply-config \
  --file machineconfigs/n3-vlan.yaml

# Wait for n3 to reboot
sleep 60

# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP:
```

#### 2.2 Validate n3 Connectivity

```bash
# Check Cilium status on n3
kubectl exec -n kube-system ds/cilium -- cilium status

# Verify pod-to-pod communication
kubectl run test-pod --image=nginx --rm -it -- curl

# Check inter-node traffic is using VLAN
talosctl -e -n read /proc/net/dev | grep enp9s0
```

#### 2.3 Decision Point

- ✅ If successful: Proceed to Phase 3
- ❌ If issues: Revert n3 to external IP (rollback plan)

### Phase 3: Migrate Second Worker (n2)

Repeat Phase 2 steps for n2:

```bash
talosctl -e -n apply-config \
  --file machineconfigs/n2-vlan.yaml
```

Validate connectivity and inter-node traffic on the VLAN.

### Phase 4: Migrate Control Plane (n1)

**CRITICAL:** This is the most sensitive step.

#### 4.1 Prepare for Downtime

- ⚠️ **Expected downtime:** 2-5 minutes
- Inform users of maintenance window
- Ensure workers (n2, n3) are stable

#### 4.2 Apply Config to n1

```bash
talosctl -e -n apply-config \
  --file machineconfigs/n1-vlan.yaml
```

#### 4.3 Monitor API Server Recovery

```bash
# Watch for API server to come back online
watch -n 2 "kubectl get nodes"

# Check etcd health
talosctl -e -n service etcd status

# Verify all nodes on VLAN
kubectl get nodes -o wide
```

### Phase 5: Validation & Verification

#### 5.1 Verify VLAN Traffic

```bash
# Check network traffic on VLAN interface (enp9s0)
for node in ; do
  echo "=== $node ==="
  talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done
```

#### 5.2 Verify Pod Connectivity

```bash
# Deploy test pods across nodes
kubectl run test-n1 --image=nginx --overrides='{"spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx --overrides='{"spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx --overrides='{"spec":{"nodeName":"n3"}}'

# Test cross-node communication
kubectl exec test-n1 -- curl
kubectl exec test-n2 -- curl
```

#### 5.3 Monitor for 24 Hours

- Watch for network issues
- Monitor Longhorn replication
- Check application logs
- Verify external services (Mastodon, Pixelfed, etc.)

## Rollback Plan

### If Issues Occur During Migration

#### Rollback Individual Node

```bash
# Create rollback config with external IP
# Apply to affected node
talosctl -e -n apply-config \
  --file machineconfigs/-external.yaml
```
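A minimal sketch of what a per-node rollback fragment might contain, assuming the pre-migration setup pointed at the `api.keyboardvagabond.com` endpoint and that the node's public subnet is substituted for the placeholder below:

```yaml
# Hypothetical rollback fragment (worker example) - placeholder values,
# not recorded cluster values.
cluster:
  controlPlane:
    endpoint: https://api.keyboardvagabond.com:6443 # back to the DNS endpoint
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - <external-subnet> # the node's public subnet, not 10.132.0.0/24
```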
#### Complete Cluster Rollback

If systemic issues occur:

1. Revert n1 first (control plane is critical)
2. Revert n2 and n3
3. Verify all nodes back on external IPs
4. Investigate root cause before retry

### Emergency Recovery (If Locked Out)

If you lose access during migration:

1. **Access via NetCup Console:**
   - Boot node into maintenance mode via NetCup dashboard
   - Apply rollback config with `--insecure` flag
2. **Rescue Mode (Last Resort):**
   - Boot into NetCup rescue system
   - Mount XFS partitions (need `xfsprogs`)
   - Manually edit configs (complex, avoid if possible)

## Key Talos Configuration References

### Multihoming Configuration

According to [Talos Multihoming Docs](https://docs.siderolabs.com/talos/v1.10/networking/multihoming):

```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Selects IP from VLAN subnet
```

### Kubelet node-ip Setting

From [Kubernetes Kubelet Docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/):

- `--node-ip`: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
- Controls which IP kubelet advertises to the API server
- Determines routing for pod-to-pod traffic

### Network Connectivity Requirements

Per [Talos Network Connectivity Docs](https://docs.siderolabs.com/talos/v1.10/learn-more/talos-network-connectivity/):

**Control Plane Nodes:**

- TCP 50000: apid (used by talosctl, control plane nodes)
- TCP 50001: trustd (used by worker nodes)

**Worker Nodes:**

- TCP 50000: apid (used by control plane nodes)

## Lessons Learned

### What Went Wrong

1. **Incremental migration without proper planning** - Migrated n1 first without considering Tailscale dependencies
2. **Inadequate firewall policies** - Talos API blocked externally, causing lockout
3. **API endpoint mismatch** - DNS resolution didn't match node-ip configuration
4. **Config file format confusion** - Multiple formats caused application errors

### What Went Right

1. ✅ **Global Talos API access** - Prevents future lockouts
2. ✅ **GitOps with Flux** - Automatic workload recovery after etcd bootstrap
3. ✅ **Maintenance mode recovery** - Reliable way to regain access
4. ✅ **External IP baseline** - Stable configuration to fall back to

### Best Practices Going Forward

1. **Test on workers first** - Validate VLAN setup before touching control plane
2. **Document all configs** - Keep clear record of working configurations
3. **Monitor traffic** - Use `talosctl read /proc/net/dev` to verify VLAN usage
4. **Backup etcd** - Regular etcd backups to avoid data loss
5. **Plan for downtime** - Maintenance windows for control plane changes

## Success Criteria

Migration is successful when:

1. ✅ All nodes showing VLAN IPs in `kubectl get nodes -o wide`
2. ✅ Inter-node traffic flowing over enp9s0 (VLAN interface)
3. ✅ All pods healthy and communicating
4. ✅ Longhorn replication working
5. ✅ External services (Mastodon, Pixelfed, etc.) operational
6. ✅ No performance degradation
7. ✅ 24-hour stability test passed

## Additional Resources

- [Talos Multihoming Documentation](https://docs.siderolabs.com/talos/v1.10/networking/multihoming)
- [Talos Production Notes](https://docs.siderolabs.com/talos/v1.10/getting-started/prodnotes)
- [Kubernetes Kubelet Reference](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
- [Cilium Documentation](https://docs.cilium.io/)

## Contact & Maintenance

**Last Updated:** 2025-11-20
**Cluster:** keyboardvagabond.com
**Status:** Nodes operational on external IPs, VLAN migration pending