VLAN Node-IP Migration Plan
Document Purpose
This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.
Current State (2025-11-20)
Cluster Status
- n1 (control plane): <NODE_1_EXTERNAL_IP> - Ready ✅
- n2 (worker): <NODE_2_EXTERNAL_IP> - Ready ✅
- n3 (worker): <NODE_3_EXTERNAL_IP> - Ready ✅
Current Configuration
All nodes are using external IPs for node-ip:
- n1: node-ip: <NODE_1_EXTERNAL_IP>
- n2: node-ip: <NODE_2_EXTERNAL_IP>
- n3: node-ip: <NODE_3_EXTERNAL_IP>
Issues with Current Setup
- ❌ Inter-node pod traffic uses public internet (external IPs)
- ❌ VLAN bandwidth (100Mbps dedicated) is unused
- ❌ Less secure (traffic exposed on public network)
- ❌ Potentially slower for inter-pod communication
What's Working
- ✅ All nodes joined and operational
- ✅ Cilium CNI deployed and functional
- ✅ Global Talos API access enabled (ports 50000, 50001)
- ✅ GitOps with Flux operational
- ✅ Core infrastructure recovering
Goal: VLAN Migration
Target Configuration
All nodes using VLAN IPs for node-ip:
- n1: <NODE_1_IP> (control plane)
- n2: <NODE_2_IP> (worker)
- n3: <NODE_3_IP> (worker)
Benefits
- ✅ 100Mbps dedicated bandwidth for inter-node traffic
- ✅ Private network (more secure)
- ✅ Lower latency for pod-to-pod communication
- ✅ Production-ready architecture
Issues Encountered During Initial Attempt
Issue 1: API Server Endpoint Mismatch
Problem:
- api.keyboardvagabond.com resolves to n1's external IP (<NODE_1_EXTERNAL_IP>)
- Worker nodes with VLAN node-ip couldn't reach the API server
- n3 failed to join cluster
Solution: Must choose ONE of:
- Option A: Set cluster.controlPlane.endpoint: https://<NODE_1_IP>:6443 in ALL machine configs
- Option B: Update DNS so api.keyboardvagabond.com resolves to <NODE_1_IP> (VLAN IP)
Recommended: Option A (simpler, no DNS changes needed)
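Before choosing, it helps to compare what the API hostname currently resolves to against what the nodes advertise. A minimal sketch, assuming dig, kubectl, and netcat are available on the workstation (the VLAN reachability check must be run from a machine on 10.132.0.0/24):
# What does the API hostname resolve to today?
dig +short api.keyboardvagabond.com

# Which IPs are the nodes currently advertising as INTERNAL-IP?
kubectl get nodes -o wide

# From a host on the VLAN: is the API server reachable on the VLAN endpoint?
nc -zv -w 3 <NODE_1_IP> 6443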
Issue 2: Cluster Lockout After n1 Migration
Problem:
- When n1 was changed to the VLAN node-ip, all access was lost
- Tailscale pods couldn't start (needed API server access)
- Cilium policies blocked external Talos API access
- Complete lockout - no kubectl or talosctl access
Root Cause:
- Tailscale requires API server to be reachable from external network
- Once n1 switched to VLAN-only, Tailscale couldn't connect
- Without Tailscale, no VPN access to cluster
Solution:
- ✅ Enabled global Talos API access (ports 50000, 50001) in Cilium policies
- This prevents future lockouts during network migrations
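A quick way to confirm the Talos API remains reachable from outside the VLAN during future migrations; a sketch assuming netcat is installed and talosctl is configured with this cluster's talosconfig:
# apid should answer on every node's external IP
for ip in <NODE_1_EXTERNAL_IP> <NODE_2_EXTERNAL_IP> <NODE_3_EXTERNAL_IP>; do
  nc -zv -w 3 $ip 50000
done

# apid responds even when Kubernetes itself is unhealthy
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> version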
Issue 3: etcd Data Loss After Bootstrap
Problem:
- After multiple reboots/config changes, etcd lost its data
- /var/lib/etcd/member directory was empty
- etcd stuck waiting to join cluster
Solution:
- Ran talosctl bootstrap to reinitialize etcd
- GitOps (Flux) automatically redeployed all workloads from Git
- Longhorn has S3 backups for persistent data recovery
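For future incidents, a minimal sketch of the recovery sequence, plus taking an etcd snapshot beforehand (assuming talosctl reaches n1 on its external IP; the snapshot filename is arbitrary):
# Take an etcd snapshot before risky changes (control plane node)
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> etcd snapshot etcd-backup-$(date +%F).db

# If etcd comes up empty and waits to join, re-bootstrap and let Flux reconcile
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> bootstrap
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> service etcd status
kubectl get nodes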
Issue 4: Machine Config Format Issues
Problem:
- machineconfigs/n1.yaml was in resource dump format (with spec: | wrapper)
- YAML indentation errors in various config files
- SOPS encryption complications
Solution:
- Use .decrypted~ files for direct manipulation
- Careful YAML indentation (list items with inline keys)
- Apply configs in maintenance mode with the --insecure flag
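A sketch of the decrypt-edit-apply flow, assuming the machine configs are SOPS-encrypted in the repo and the target node is booted into maintenance mode (file names are illustrative):
# Decrypt to a working copy (the ~ suffix keeps it out of normal Git tracking)
sops -d machineconfigs/n1.yaml > machineconfigs/n1.yaml.decrypted~

# Edit the decrypted copy, then apply while the node is in maintenance mode
talosctl apply-config --insecure -n <NODE_1_EXTERNAL_IP> \
  --file machineconfigs/n1.yaml.decrypted~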
Migration Plan: Phased VLAN Rollout
Prerequisites
- ✅ All nodes in stable, working state (DONE)
- ✅ Global Talos API access enabled (DONE)
- ✅ GitOps with Flux operational (DONE)
- ⏳ Verify Longhorn S3 backups are current
- ⏳ Document current pod placement and workload state
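For the two pending items, a sketch of quick checks, assuming Longhorn runs in the default longhorn-system namespace (the output file name is arbitrary):
# Snapshot current pod placement for later comparison
kubectl get pods -A -o wide > pod-placement-$(date +%F).txt

# Confirm Longhorn backups exist and are recent
kubectl -n longhorn-system get backups.longhorn.io --sort-by=.metadata.creationTimestamp | tail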
Phase 1: Prepare Configurations
1.1 Update Machine Configs for VLAN
For each node, update the machine config:
n1 (control plane):
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
n2 & n3 (workers):
cluster:
  controlPlane:
    endpoint: https://<NODE_1_IP>:6443 # Use n1's VLAN IP
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
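One way to produce the n*-vlan.yaml files used in the phases below is to patch the decrypted configs rather than hand-editing them, which avoids the indentation mistakes from Issue 4. A sketch assuming a small patch file containing only the stanzas above (patch and input file names are illustrative):
# vlan-patch.yaml holds just the machine.kubelet.nodeIP.validSubnets (and, for workers,
# the cluster.controlPlane.endpoint) stanza shown above
talosctl machineconfig patch machineconfigs/n3.yaml.decrypted~ \
  --patch @vlan-patch.yaml \
  --output machineconfigs/n3-vlan.yaml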
1.2 Update Cilium Configuration
Verify Cilium is configured to use VLAN interface:
# manifests/infrastructure/cilium/release.yaml
values:
kubeProxyReplacement: strict
# Ensure Cilium detects and uses VLAN interface
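A sketch for checking which interfaces Cilium actually picked up, assuming the default cilium-config ConfigMap name and the cilium CLI inside the agent pod:
# Inspect the devices Cilium was configured with (empty usually means auto-detection)
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i devices

# Confirm the agent on a node sees the VLAN interface
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -iA2 devices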
Phase 2: Test with Worker Node First
2.1 Migrate n3 (Worker Node)
Test VLAN migration on a worker node first:
# Apply updated config to n3
cd /Users/<USERNAME>/src/keyboard-vagabond
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> apply-config \
--file machineconfigs/n3-vlan.yaml
# Wait for n3 to reboot
sleep 60
# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP: <NODE_3_IP>
2.2 Validate n3 Connectivity
# Check Cilium status on n3
kubectl exec -n kube-system ds/cilium -- cilium status
# Verify pod-to-pod communication
kubectl run test-pod --image=nginx --rm -it -- curl <service-on-n3>
# Check inter-node traffic is using VLAN
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> read /proc/net/dev | grep enp9s0
2.3 Decision Point
- ✅ If successful: Proceed to Phase 3
- ❌ If issues: Revert n3 to external IP (rollback plan)
Phase 3: Migrate Second Worker (n2)
Repeat Phase 2 steps for n2:
talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> apply-config \
--file machineconfigs/n2-vlan.yaml
Validate connectivity and inter-node traffic on VLAN.
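As in Phase 2, a quick sketch to confirm n2 re-registered with its VLAN address (standard kubectl jsonpath):
# n2 should now advertise its VLAN address as InternalIP
kubectl get node n2 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'; echo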
Phase 4: Migrate Control Plane (n1)
CRITICAL: This is the most sensitive step.
4.1 Prepare for Downtime
- ⚠️ Expected downtime: 2-5 minutes
- Inform users of maintenance window
- Ensure workers (n2, n3) are stable
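A pre-flight sketch before applying the n1 config, assuming talosctl still reaches n1 on its external IP (the snapshot filename is arbitrary):
# Both workers should be Ready on their VLAN IPs before touching the control plane
kubectl get nodes -o wide

# Confirm etcd is healthy and take a fresh snapshot as a safety net
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> service etcd status
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> etcd snapshot pre-vlan-$(date +%F).db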
4.2 Apply Config to n1
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> apply-config \
--file machineconfigs/n1-vlan.yaml
4.3 Monitor API Server Recovery
# Watch for API server to come back online
watch -n 2 "kubectl get nodes"
# Check etcd health
talosctl -e <NODE_1_IP> -n <NODE_1_IP> service etcd status
# Verify all nodes on VLAN
kubectl get nodes -o wide
Phase 5: Validation & Verification
5.1 Verify VLAN Traffic
# Check network traffic on VLAN interface (enp9s0)
for node in <NODE_1_IP> <NODE_2_IP> <NODE_3_IP>; do
echo "=== $node ==="
talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done
5.2 Verify Pod Connectivity
# Deploy test pods across nodes
kubectl run test-n1 --image=nginx --overrides='{"spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx --overrides='{"spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx --overrides='{"spec":{"nodeName":"n3"}}'
# Test cross-node communication
kubectl exec test-n1 -- curl <test-n2-pod-ip>
kubectl exec test-n2 -- curl <test-n3-pod-ip>
5.3 Monitor for 24 Hours
- Watch for network issues
- Monitor Longhorn replication
- Check application logs
- Verify external services (Mastodon, Pixelfed, etc.)
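A few spot checks that can be run periodically during the 24-hour window; a sketch assuming Longhorn runs in longhorn-system and the Flux CLI is installed:
# Node and pod health at a glance (second command lists pods not Running/Completed)
kubectl get nodes -o wide
kubectl get pods -A | grep -vE 'Running|Completed'

# Longhorn volume health and recent cluster events
kubectl -n longhorn-system get volumes.longhorn.io
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20

# GitOps reconciliation status
flux get kustomizations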
Rollback Plan
If Issues Occur During Migration
Rollback Individual Node
# Create rollback config with external IP
# Apply to affected node
talosctl -e <node-external-ip> -n <node-external-ip> apply-config \
--file machineconfigs/<node>-external.yaml
Complete Cluster Rollback
If systemic issues occur:
- Revert n1 first (control plane is critical)
- Revert n2 and n3
- Verify all nodes back on external IPs
- Investigate root cause before retry
Emergency Recovery (If Locked Out)
If you lose access during migration:
- Access via NetCup Console:
  - Boot node into maintenance mode via NetCup dashboard
  - Apply rollback config with the --insecure flag
- Rescue Mode (Last Resort):
  - Boot into NetCup rescue system
  - Mount XFS partitions (need xfsprogs)
  - Manually edit configs (complex, avoid if possible)
Key Talos Configuration References
Multihoming Configuration
According to Talos Multihoming Docs:
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Selects IP from VLAN subnet
Kubelet node-ip Setting
From Kubernetes Kubelet Docs:
- --node-ip: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
- Controls which IP kubelet advertises to API server
- Determines routing for pod-to-pod traffic
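A sketch for verifying which address each kubelet is actually advertising, using standard kubectl jsonpath (node names as used elsewhere in this document):
# Print the InternalIP each node registered with the API server
for n in n1 n2 n3; do
  echo -n "$n: "
  kubectl get node $n -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
  echo
done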
Network Connectivity Requirements
Per Talos Network Connectivity Docs:
Control Plane Nodes:
- TCP 50000: apid (used by talosctl, control plane nodes)
- TCP 50001: trustd (used by worker nodes)
Worker Nodes:
- TCP 50000: apid (used by control plane nodes)
Lessons Learned
What Went Wrong
- Incremental migration without proper planning - Migrated n1 first without considering Tailscale dependencies
- Inadequate firewall policies - Talos API blocked externally, causing lockout
- API endpoint mismatch - DNS resolution didn't match node-ip configuration
- Config file format confusion - Multiple formats caused application errors
What Went Right
- ✅ Global Talos API access - Prevents future lockouts
- ✅ GitOps with Flux - Automatic workload recovery after etcd bootstrap
- ✅ Maintenance mode recovery - Reliable way to regain access
- ✅ External IP baseline - Stable configuration to fall back to
Best Practices Going Forward
- Test on workers first - Validate VLAN setup before touching control plane
- Document all configs - Keep clear record of working configurations
- Monitor traffic - Use talosctl read /proc/net/dev to verify VLAN usage
- Backup etcd - Regular etcd backups to avoid data loss
- Plan for downtime - Maintenance windows for control plane changes
Success Criteria
Migration is successful when:
- ✅ All nodes showing VLAN IPs in kubectl get nodes -o wide
- ✅ Inter-node traffic flowing over enp9s0 (VLAN interface)
- ✅ All pods healthy and communicating
- ✅ Longhorn replication working
- ✅ External services (Mastodon, Pixelfed, etc.) operational
- ✅ No performance degradation
- ✅ 24-hour stability test passed
Additional Resources
- Talos Multihoming Documentation
- Talos Production Notes
- Kubernetes Kubelet Reference
- Cilium Documentation
Contact & Maintenance
Last Updated: 2025-11-20
Cluster: keyboardvagabond.com
Status: Nodes operational on external IPs, VLAN migration pending