
VLAN Node-IP Migration Plan

Document Purpose

This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.

Current State (2025-11-20)

Cluster Status

  • n1 (control plane): <NODE_1_EXTERNAL_IP> - Ready
  • n2 (worker): <NODE_2_EXTERNAL_IP> - Ready
  • n3 (worker): <NODE_3_EXTERNAL_IP> - Ready

Current Configuration

All nodes are using external IPs for node-ip:

  • n1: node-ip: <NODE_1_EXTERNAL_IP>
  • n2: node-ip: <NODE_2_EXTERNAL_IP>
  • n3: node-ip: <NODE_3_EXTERNAL_IP>

Issues with Current Setup

  1. Inter-node pod traffic uses public internet (external IPs)
  2. VLAN bandwidth (100Mbps dedicated) is unused
  3. Less secure (traffic exposed on public network)
  4. Potentially slower for inter-pod communication

What's Working

  1. All nodes joined and operational
  2. Cilium CNI deployed and functional
  3. Global Talos API access enabled (ports 50000, 50001)
  4. GitOps with Flux operational
  5. Core infrastructure recovering

Goal: VLAN Migration

Target Configuration

All nodes using VLAN IPs for node-ip:

  • n1: <NODE_1_IP> (control plane)
  • n2: <NODE_2_IP> (worker)
  • n3: <NODE_3_IP> (worker)

Benefits

  1. 100Mbps dedicated bandwidth for inter-node traffic
  2. Private network (more secure)
  3. Lower latency for pod-to-pod communication
  4. Production-ready architecture

Issues Encountered During Initial Attempt

Issue 1: API Server Endpoint Mismatch

Problem:

  • api.keyboardvagabond.com resolves to n1's external IP (<NODE_1_EXTERNAL_IP>)
  • Worker nodes with VLAN node-ip couldn't reach API server
  • n3 failed to join cluster

Solution: Must choose ONE of:

  • Option A: Set cluster.controlPlane.endpoint: https://<NODE_1_IP>:6443 in ALL machine configs
  • Option B: Update DNS so api.keyboardvagabond.com resolves to <NODE_1_IP> (VLAN IP)

Recommended: Option A (simpler, no DNS changes needed)
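
Before choosing, it helps to confirm what the DNS name currently resolves to (a quick check from any workstation with dig installed):

# Compare the answer against <NODE_1_EXTERNAL_IP> and <NODE_1_IP>
dig +short api.keyboardvagabond.com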

Issue 2: Cluster Lockout After n1 Migration

Problem:

  • When n1 was switched to a VLAN node-ip, all access was lost
  • Tailscale pods couldn't start (needed API server access)
  • Cilium policies blocked external Talos API access
  • Complete lockout - no kubectl or talosctl access

Root Cause:

  • Tailscale requires API server to be reachable from external network
  • Once n1 switched to VLAN-only, Tailscale couldn't connect
  • Without Tailscale, no VPN access to cluster

Solution:

  • Enabled global Talos API access (ports 50000, 50001) in Cilium policies
  • This prevents future lockouts during network migrations
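
For illustration, a Cilium host policy of roughly this shape achieves that. This is a minimal sketch, not the cluster's actual manifest; the policy name and selectors are illustrative:

apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-talos-api  # illustrative name
spec:
  nodeSelector: {}  # host policy applied to every node
  ingress:
    - fromEntities:
        - world  # "global" access, matching the solution above
      toPorts:
        - ports:
            - port: "50000"  # apid
              protocol: TCP
            - port: "50001"  # trustd
              protocol: TCP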

Issue 3: etcd Data Loss After Bootstrap

Problem:

  • After multiple reboots/config changes, etcd lost its data
  • /var/lib/etcd/member directory was empty
  • etcd stuck waiting to join cluster

Solution:

  • Ran talosctl bootstrap to reinitialize etcd
  • GitOps (Flux) automatically redeployed all workloads from Git
  • Longhorn has S3 backups for persistent data recovery
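
For reference, the recovery boiled down to two commands (run bootstrap only when etcd's data directory is genuinely empty, as it re-initializes cluster state):

# Confirm etcd is stuck waiting rather than healthy
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> service etcd status

# Re-initialize etcd on the control plane
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> bootstrap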

Issue 4: Machine Config Format Issues

Problem:

  • machineconfigs/n1.yaml was in resource dump format (with spec: | wrapper)
  • YAML indentation errors in various config files
  • SOPS encryption complications

Solution:

  • Use .decrypted~ files for direct manipulation
  • Keep YAML indentation consistent (especially for list items with inline keys)
  • Apply configs in maintenance mode with the --insecure flag (sketch below)
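
A sketch of that workflow, assuming SOPS-encrypted machine configs (filenames are illustrative):

# Decrypt to a .decrypted~ working copy for direct editing
sops -d machineconfigs/n1.yaml > machineconfigs/n1.yaml.decrypted~

# Apply in maintenance mode; --insecure is needed because the node
# has no client certificates yet
talosctl apply-config --insecure -n <NODE_1_EXTERNAL_IP> \
  --file machineconfigs/n1.yaml.decrypted~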

Migration Plan: Phased VLAN Rollout

Prerequisites

  1. All nodes in stable, working state (DONE)
  2. Global Talos API access enabled (DONE)
  3. GitOps with Flux operational (DONE)
  4. Verify Longhorn S3 backups are current
  5. Document current pod placement and workload state (snapshot sketch below)
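
For items 4-5, a plain snapshot of the pre-migration state is enough to diff against later (output filenames are arbitrary):

# Record node addresses and pod placement before any changes
kubectl get nodes -o wide > pre-migration-nodes.txt
kubectl get pods -A -o wide > pre-migration-pods.txt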

Phase 1: Prepare Configurations

1.1 Update Machine Configs for VLAN

For each node, update the machine config:

n1 (control plane):

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24  # Force VLAN IP selection

n2 & n3 (workers):

cluster:
  controlPlane:
    endpoint: https://<NODE_1_IP>:6443  # Use n1's VLAN IP

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24  # Force VLAN IP selection

1.2 Update Cilium Configuration

Verify Cilium is configured to use VLAN interface:

# manifests/infrastructure/cilium/release.yaml
values:
  kubeProxyReplacement: strict
  # Ensure Cilium detects and uses VLAN interface
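
One way to sanity-check which addresses and devices the agent detected (same exec pattern as the validation steps below; the grep is just a filter):

# Inspect the IPs and devices the Cilium agent is using
kubectl exec -n kube-system ds/cilium -- cilium status --verbose | grep -iE 'ip|device'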

Phase 2: Test with Worker Node First

2.1 Migrate n3 (Worker Node)

Test VLAN migration on a worker node first:

# Apply updated config to n3
cd /Users/<USERNAME>/src/keyboard-vagabond
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> apply-config \
  --file machineconfigs/n3-vlan.yaml

# Wait for n3 to reboot
sleep 60

# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP: <NODE_3_IP>

2.2 Validate n3 Connectivity

# Check Cilium status on n3
kubectl exec -n kube-system ds/cilium -- cilium status

# Verify pod-to-pod communication (curlimages/curl provides curl;
# --command makes the args the container command)
kubectl run test-pod --rm -it --restart=Never --image=curlimages/curl \
  --command -- curl <service-on-n3>

# Check inter-node traffic is using VLAN
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> read /proc/net/dev | grep enp9s0

2.3 Decision Point

  • If successful: Proceed to Phase 3
  • If issues: Revert n3 to external IP (rollback plan)

Phase 3: Migrate Second Worker (n2)

Repeat Phase 2 steps for n2:

talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> apply-config \
  --file machineconfigs/n2-vlan.yaml

Validate connectivity and inter-node traffic on VLAN.
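
At this point both workers should report VLAN addresses; a quick check:

# n2 and n3 should now show 10.132.0.x as INTERNAL-IP
kubectl get nodes -o wide

# Or print just the InternalIP per node
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'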

Phase 4: Migrate Control Plane (n1)

CRITICAL: This is the most sensitive step.

4.1 Prepare for Downtime

  • ⚠️ Expected downtime: 2-5 minutes
  • Inform users of maintenance window
  • Ensure workers (n2, n3) are stable

4.2 Apply Config to n1

talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> apply-config \
  --file machineconfigs/n1-vlan.yaml

4.3 Monitor API Server Recovery

# Watch for API server to come back online
watch -n 2 "kubectl get nodes"

# Check etcd health
talosctl -e <NODE_1_IP> -n <NODE_1_IP> service etcd status

# Verify all nodes on VLAN
kubectl get nodes -o wide

Phase 5: Validation & Verification

5.1 Verify VLAN Traffic

# Check network traffic on VLAN interface (enp9s0)
for node in <NODE_1_IP> <NODE_2_IP> <NODE_3_IP>; do
  echo "=== $node ==="
  talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done

5.2 Verify Pod Connectivity

# Deploy test pods across nodes (nginx:alpine includes BusyBox wget
# for the connectivity test below)
kubectl run test-n1 --image=nginx:alpine --overrides='{"spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx:alpine --overrides='{"spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx:alpine --overrides='{"spec":{"nodeName":"n3"}}'

# Test cross-node communication
kubectl exec test-n1 -- wget -qO- <test-n2-pod-ip>
kubectl exec test-n2 -- wget -qO- <test-n3-pod-ip>

5.3 Monitor for 24 Hours

  • Watch for network issues
  • Monitor Longhorn replication (spot-check below)
  • Check application logs
  • Verify external services (Mastodon, Pixelfed, etc.)
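
For the Longhorn spot-check, something like this is sufficient (assumes Longhorn's default longhorn-system namespace):

# Volumes should all report a healthy state
kubectl -n longhorn-system get volumes.longhorn.io

# Any pod not in Running phase is worth a look
kubectl get pods -A --field-selector=status.phase!=Running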

Rollback Plan

If Issues Occur During Migration

Rollback Individual Node

# Create rollback config with external IP
# Apply to affected node
talosctl -e <node-external-ip> -n <node-external-ip> apply-config \
  --file machineconfigs/<node>-external.yaml

Complete Cluster Rollback

If systemic issues occur:

  1. Revert n1 first (control plane is critical)
  2. Revert n2 and n3
  3. Verify all nodes back on external IPs
  4. Investigate root cause before retry

Emergency Recovery (If Locked Out)

If you lose access during migration:

  1. Access via NetCup Console:

    • Boot node into maintenance mode via NetCup dashboard
    • Apply rollback config with --insecure flag
  2. Rescue Mode (Last Resort):

    • Boot into NetCup rescue system
    • Mount XFS partitions (need xfsprogs)
    • Manually edit configs (complex, avoid if possible)

Key Talos Configuration References

Multihoming Configuration

According to Talos Multihoming Docs:

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24  # Selects IP from VLAN subnet

Kubelet node-ip Setting

From Kubernetes Kubelet Docs:

  • --node-ip: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
  • Controls which IP kubelet advertises to API server
  • Determines routing for pod-to-pod traffic
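
To see which address a kubelet actually registered:

# InternalIP is the address the kubelet advertised to the API server
kubectl get node n1 -o jsonpath='{.status.addresses}'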

Network Connectivity Requirements

Per Talos Network Connectivity Docs:

Control Plane Nodes:

  • TCP 50000: apid (used by talosctl, control plane nodes)
  • TCP 50001: trustd (used by worker nodes)

Worker Nodes:

  • TCP 50000: apid (used by control plane nodes)
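
These ports can be probed from the VLAN side before migrating (a netcat sketch):

# Verify Talos API ports are reachable over the VLAN
nc -zv <NODE_1_IP> 50000  # apid
nc -zv <NODE_1_IP> 50001  # trustd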

Lessons Learned

What Went Wrong

  1. Incremental migration without proper planning - Migrated n1 first without considering Tailscale dependencies
  2. Inadequate firewall policies - Talos API blocked externally, causing lockout
  3. API endpoint mismatch - DNS resolution didn't match node-ip configuration
  4. Config file format confusion - Multiple formats caused application errors

What Went Right

  1. Global Talos API access - Prevents future lockouts
  2. GitOps with Flux - Automatic workload recovery after etcd bootstrap
  3. Maintenance mode recovery - Reliable way to regain access
  4. External IP baseline - Stable configuration to fall back to

Best Practices Going Forward

  1. Test on workers first - Validate VLAN setup before touching control plane
  2. Document all configs - Keep clear record of working configurations
  3. Monitor traffic - Use talosctl read /proc/net/dev to verify VLAN usage
  4. Backup etcd - Regular etcd backups to avoid data loss (snapshot command below)
  5. Plan for downtime - Maintenance windows for control plane changes
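
For item 4, Talos can snapshot etcd directly (the output path is arbitrary):

# Take an etcd snapshot from the control plane node
talosctl -e <NODE_1_IP> -n <NODE_1_IP> etcd snapshot etcd-backup.db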

Success Criteria

Migration is successful when:

  1. All nodes showing VLAN IPs in kubectl get nodes -o wide
  2. Inter-node traffic flowing over enp9s0 (VLAN interface)
  3. All pods healthy and communicating
  4. Longhorn replication working
  5. External services (Mastodon, Pixelfed, etc.) operational
  6. No performance degradation
  7. 24-hour stability test passed

Additional Resources

Contact & Maintenance

Last Updated: 2025-11-20
Cluster: keyboardvagabond.com
Status: Nodes operational on external IPs, VLAN migration pending