
VLAN Node-IP Migration Plan

Document Purpose

This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.

Current State (2025-11-20)

Cluster Status

  • n1 (control plane): <NODE_1_EXTERNAL_IP> - Ready
  • n2 (worker): <NODE_2_EXTERNAL_IP> - Ready
  • n3 (worker): <NODE_3_EXTERNAL_IP> - Ready

Current Configuration

All nodes are using external IPs for node-ip:

  • n1: node-ip: <NODE_1_EXTERNAL_IP>
  • n2: node-ip: <NODE_2_EXTERNAL_IP>
  • n3: node-ip: <NODE_3_EXTERNAL_IP>

Issues with Current Setup

  1. Inter-node pod traffic uses public internet (external IPs)
  2. VLAN bandwidth (100Mbps dedicated) is unused
  3. Less secure (traffic exposed on public network)
  4. Potentially slower for inter-pod communication

What's Working

  1. All nodes joined and operational
  2. Cilium CNI deployed and functional
  3. Global Talos API access enabled (ports 50000, 50001)
  4. GitOps with Flux operational
  5. Core infrastructure recovering

Goal: VLAN Migration

Target Configuration

All nodes using VLAN IPs for node-ip:

  • n1: <NODE_1_IP> (control plane)
  • n2: <NODE_2_IP> (worker)
  • n3: <NODE_3_IP> (worker)

Benefits

  1. 100Mbps dedicated bandwidth for inter-node traffic
  2. Private network (more secure)
  3. Lower latency for pod-to-pod communication
  4. Production-ready architecture

Issues Encountered During Initial Attempt

Issue 1: API Server Endpoint Mismatch

Problem:

  • api.keyboardvagabond.com resolves to n1's external IP (<NODE_1_EXTERNAL_IP>)
  • Worker nodes with VLAN node-ip couldn't reach API server
  • n3 failed to join cluster

Solution: Must choose ONE of:

  • Option A: Set cluster.controlPlane.endpoint: https://<NODE_1_IP>:6443 in ALL machine configs
  • Option B: Update DNS so api.keyboardvagabond.com resolves to <NODE_1_IP> (VLAN IP)

Recommended: Option A (simpler, no DNS changes needed)
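
Before choosing, it helps to confirm what the DNS name currently resolves to (a quick check from any workstation with dig installed):

# Compare the answer against <NODE_1_EXTERNAL_IP> and <NODE_1_IP>
dig +short api.keyboardvagabond.com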

Issue 2: Cluster Lockout After n1 Migration

Problem:

  • When n1 was switched to a VLAN node-ip, all access was lost
  • Tailscale pods couldn't start (needed API server access)
  • Cilium policies blocked external Talos API access
  • Complete lockout - no kubectl or talosctl access

Root Cause:

  • Tailscale requires API server to be reachable from external network
  • Once n1 switched to VLAN-only, Tailscale couldn't connect
  • Without Tailscale, no VPN access to cluster

Solution:

  • Enabled global Talos API access (ports 50000, 50001) in Cilium policies
  • This prevents future lockouts during network migrations
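
For illustration, a Cilium host policy of roughly this shape achieves that. This is a minimal sketch, not the cluster's actual manifest; the policy name and selectors are illustrative:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-talos-api  # illustrative name
spec:
  nodeSelector: {}  # host policy applied to every node
  ingress:
    - fromEntities:
        - world  # "global" access, matching the solution above
      toPorts:
        - ports:
            - port: "50000"  # apid
              protocol: TCP
            - port: "50001"  # trustd
              protocol: TCP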

Issue 3: etcd Data Loss After Bootstrap

Problem:

  • After multiple reboots/config changes, etcd lost its data
  • /var/lib/etcd/member directory was empty
  • etcd stuck waiting to join cluster

Solution:

  • Ran talosctl bootstrap to reinitialize etcd
  • GitOps (Flux) automatically redeployed all workloads from Git
  • Longhorn has S3 backups for persistent data recovery
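
For reference, the recovery boiled down to two commands (run bootstrap only when etcd's data directory is genuinely empty, as it re-initializes cluster state):

# Confirm etcd is stuck waiting rather than healthy
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> service etcd status

# Re-initialize etcd on the control plane
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> bootstrap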

Issue 4: Machine Config Format Issues

Problem:

  • machineconfigs/n1.yaml was in resource dump format (with spec: | wrapper)
  • YAML indentation errors in various config files
  • SOPS encryption complications

Solution:

  • Use .decrypted~ files for direct manipulation
  • Keep YAML indentation consistent (especially for list items with inline keys)
  • Apply configs in maintenance mode with the --insecure flag (sketch below)
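
A sketch of that workflow, assuming SOPS-encrypted machine configs (filenames are illustrative):

# Decrypt to a .decrypted~ working copy for direct editing
sops -d machineconfigs/n1.yaml > machineconfigs/n1.yaml.decrypted~

# Apply in maintenance mode; --insecure is needed because the node
# has no client certificates yet
talosctl apply-config --insecure -n <NODE_1_EXTERNAL_IP> \
  --file machineconfigs/n1.yaml.decrypted~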

Migration Plan: Phased VLAN Rollout

Prerequisites

  1. All nodes in stable, working state (DONE)
  2. Global Talos API access enabled (DONE)
  3. GitOps with Flux operational (DONE)
  4. Verify Longhorn S3 backups are current
  5. Document current pod placement and workload state (snapshot sketch below)
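
For items 4-5, a plain snapshot of the pre-migration state is enough to diff against later (output filenames are arbitrary):

# Record node addresses and pod placement before any changes
kubectl get nodes -o wide > pre-migration-nodes.txt
kubectl get pods -A -o wide > pre-migration-pods.txt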

Phase 1: Prepare Configurations

1.1 Update Machine Configs for VLAN

For each node, update the machine config:

n1 (control plane):

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24  # Force VLAN IP selection

n2 & n3 (workers):

cluster:
  controlPlane:
    endpoint: https://<NODE_1_IP>:6443  # Use n1's VLAN IP

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24  # Force VLAN IP selection

1.2 Update Cilium Configuration

Verify Cilium is configured to use VLAN interface:

# manifests/infrastructure/cilium/release.yaml
values:
  kubeProxyReplacement: strict
  # Ensure Cilium detects and uses VLAN interface
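
One way to sanity-check which addresses and devices the agent detected (same exec pattern as the validation steps below; the grep is just a filter):

# Inspect the IPs and devices the Cilium agent is using
kubectl exec -n kube-system ds/cilium -- cilium status --verbose | grep -iE 'ip|device'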

Phase 2: Test with Worker Node First

2.1 Migrate n3 (Worker Node)

Test VLAN migration on a worker node first:

# Apply updated config to n3
cd /Users/<USERNAME>/src/keyboard-vagabond
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> apply-config \
  --file machineconfigs/n3-vlan.yaml

# Wait for n3 to reboot
sleep 60

# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP: <NODE_3_IP>

2.2 Validate n3 Connectivity

# Check Cilium status on n3
kubectl exec -n kube-system ds/cilium -- cilium status

# Verify pod-to-pod communication (curlimages/curl provides curl;
# --command makes the args the container command)
kubectl run test-pod --rm -it --restart=Never --image=curlimages/curl \
  --command -- curl <service-on-n3>

# Check inter-node traffic is using VLAN
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> read /proc/net/dev | grep enp9s0

2.3 Decision Point

  • If successful: Proceed to Phase 3
  • If issues: Revert n3 to external IP (rollback plan)

Phase 3: Migrate Second Worker (n2)

Repeat Phase 2 steps for n2:

talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> apply-config \
  --file machineconfigs/n2-vlan.yaml

Validate connectivity and inter-node traffic on VLAN.
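
At this point both workers should report VLAN addresses; a quick check:

# n2 and n3 should now show 10.132.0.x as INTERNAL-IP
kubectl get nodes -o wide

# Or print just the InternalIP per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'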

Phase 4: Migrate Control Plane (n1)

CRITICAL: This is the most sensitive step.

4.1 Prepare for Downtime

  • ⚠️ Expected downtime: 2-5 minutes
  • Inform users of maintenance window
  • Ensure workers (n2, n3) are stable

4.2 Apply Config to n1

talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> apply-config \
  --file machineconfigs/n1-vlan.yaml

4.3 Monitor API Server Recovery

# Watch for API server to come back online
watch -n 2 "kubectl get nodes"

# Check etcd health
talosctl -e <NODE_1_IP> -n <NODE_1_IP> service etcd status

# Verify all nodes on VLAN
kubectl get nodes -o wide

Phase 5: Validation & Verification

5.1 Verify VLAN Traffic

# Check network traffic on VLAN interface (enp9s0)
for node in <NODE_1_IP> <NODE_2_IP> <NODE_3_IP>; do
  echo "=== $node ==="
  talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done

5.2 Verify Pod Connectivity

# Deploy test pods across nodes (nginx:alpine includes BusyBox wget
# for the connectivity test below)
kubectl run test-n1 --image=nginx:alpine --overrides='{"spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx:alpine --overrides='{"spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx:alpine --overrides='{"spec":{"nodeName":"n3"}}'

# Test cross-node communication
kubectl exec test-n1 -- wget -qO- <test-n2-pod-ip>
kubectl exec test-n2 -- wget -qO- <test-n3-pod-ip>

5.3 Monitor for 24 Hours

  • Watch for network issues
  • Monitor Longhorn replication (spot-check below)
  • Check application logs
  • Verify external services (Mastodon, Pixelfed, etc.)
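
For the Longhorn spot-check, something like this is sufficient (assumes Longhorn's default longhorn-system namespace):

# Volumes should all report a healthy state
kubectl -n longhorn-system get volumes.longhorn.io

# Any pod not in Running phase is worth a look
kubectl get pods -A --field-selector=status.phase!=Running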

Rollback Plan

If Issues Occur During Migration

Rollback Individual Node

# Create rollback config with external IP
# Apply to affected node
talosctl -e <node-external-ip> -n <node-external-ip> apply-config \
  --file machineconfigs/<node>-external.yaml

Complete Cluster Rollback

If systemic issues occur:

  1. Revert n1 first (control plane is critical)
  2. Revert n2 and n3
  3. Verify all nodes back on external IPs
  4. Investigate root cause before retry

Emergency Recovery (If Locked Out)

If you lose access during migration:

  1. Access via NetCup Console:

    • Boot node into maintenance mode via NetCup dashboard
    • Apply rollback config with --insecure flag
  2. Rescue Mode (Last Resort):

    • Boot into NetCup rescue system
    • Mount XFS partitions (need xfsprogs)
    • Manually edit configs (complex, avoid if possible)

Key Talos Configuration References

Multihoming Configuration

According to Talos Multihoming Docs:

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24  # Selects IP from VLAN subnet

Kubelet node-ip Setting

From Kubernetes Kubelet Docs:

  • --node-ip: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
  • Controls which IP kubelet advertises to API server
  • Determines routing for pod-to-pod traffic
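
To see which address a kubelet actually registered:

# InternalIP is the address the kubelet advertised to the API server
kubectl get node n1 -o jsonpath='{.status.addresses}'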

Network Connectivity Requirements

Per Talos Network Connectivity Docs:

Control Plane Nodes:

  • TCP 50000: apid (used by talosctl, control plane nodes)
  • TCP 50001: trustd (used by worker nodes)

Worker Nodes:

  • TCP 50000: apid (used by control plane nodes)
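
These ports can be probed from the VLAN side before migrating (a netcat sketch):

# Verify Talos API ports are reachable over the VLAN
nc -zv <NODE_1_IP> 50000  # apid
nc -zv <NODE_1_IP> 50001  # trustd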

Lessons Learned

What Went Wrong

  1. Incremental migration without proper planning - Migrated n1 first without considering Tailscale dependencies
  2. Inadequate firewall policies - Talos API blocked externally, causing lockout
  3. API endpoint mismatch - DNS resolution didn't match node-ip configuration
  4. Config file format confusion - Multiple formats caused application errors

What Went Right

  1. Global Talos API access - Prevents future lockouts
  2. GitOps with Flux - Automatic workload recovery after etcd bootstrap
  3. Maintenance mode recovery - Reliable way to regain access
  4. External IP baseline - Stable configuration to fall back to

Best Practices Going Forward

  1. Test on workers first - Validate VLAN setup before touching control plane
  2. Document all configs - Keep clear record of working configurations
  3. Monitor traffic - Use talosctl read /proc/net/dev to verify VLAN usage
  4. Backup etcd - Regular etcd backups to avoid data loss (snapshot command below)
  5. Plan for downtime - Maintenance windows for control plane changes
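
For item 4, Talos can snapshot etcd directly (the output path is arbitrary):

# Take an etcd snapshot from the control plane node
talosctl -e <NODE_1_IP> -n <NODE_1_IP> etcd snapshot etcd-backup.db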

Success Criteria

Migration is successful when:

  1. All nodes showing VLAN IPs in kubectl get nodes -o wide
  2. Inter-node traffic flowing over enp9s0 (VLAN interface)
  3. All pods healthy and communicating
  4. Longhorn replication working
  5. External services (Mastodon, Pixelfed, etc.) operational
  6. No performance degradation
  7. 24-hour stability test passed

Additional Resources

Contact & Maintenance

Last Updated: 2025-11-20
Cluster: keyboardvagabond.com
Status: Nodes operational on external IPs, VLAN migration pending