# VLAN Node-IP Migration Plan
## Document Purpose
This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.
## Current State (2025-11-20)
### Cluster Status
- **n1** (control plane): `<NODE_1_EXTERNAL_IP>` - Ready ✅
- **n2** (worker): `<NODE_2_EXTERNAL_IP>` - Ready ✅
- **n3** (worker): `<NODE_3_EXTERNAL_IP>` - Ready ✅
### Current Configuration
All nodes are using **external IPs** for `node-ip`:
- n1: `node-ip: <NODE_1_EXTERNAL_IP>`
- n2: `node-ip: <NODE_2_EXTERNAL_IP>`
- n3: `node-ip: <NODE_3_EXTERNAL_IP>`
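A quick way to confirm what each node currently advertises (read-only, safe to run at any time):
```bash
# INTERNAL-IP column shows the active node-ip
kubectl get nodes -o wide

# Or query a single node's advertised addresses directly
kubectl get node n1 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
```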
### Issues with Current Setup
1. ❌ Inter-node pod traffic uses **public internet** (external IPs)
2. ❌ VLAN bandwidth (100Mbps dedicated) is **unused**
3. ❌ Less secure (traffic exposed on public network)
4. ❌ Potentially slower for inter-pod communication
### What's Working
1. ✅ All nodes joined and operational
2. ✅ Cilium CNI deployed and functional
3. ✅ Global Talos API access enabled (ports 50000, 50001)
4. ✅ GitOps with Flux operational
5. ✅ Core infrastructure recovering
## Goal: VLAN Migration
### Target Configuration
All nodes using **VLAN IPs** for `node-ip`:
- n1: `<NODE_1_IP>` (control plane)
- n2: `<NODE_2_IP>` (worker)
- n3: `<NODE_3_IP>` (worker)
### Benefits
1. ✅ 100Mbps dedicated bandwidth for inter-node traffic
2. ✅ Private network (more secure)
3. ✅ Lower latency for pod-to-pod communication
4. ✅ Production-ready architecture
## Issues Encountered During Initial Attempt
### Issue 1: API Server Endpoint Mismatch
**Problem:**
- `api.keyboardvagabond.com` resolves to n1's external IP (`<NODE_1_EXTERNAL_IP>`)
- Worker nodes with VLAN node-ip couldn't reach API server
- n3 failed to join cluster
**Solution:**
Must choose ONE of:
- **Option A:** Set `cluster.controlPlane.endpoint: https://<NODE_1_IP>:6443` in ALL machine configs
- **Option B:** Update DNS so `api.keyboardvagabond.com` resolves to `<NODE_1_IP>` (VLAN IP)
**Recommended:** Option A (simpler, no DNS changes needed)
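Before deciding, it helps to confirm what the endpoint actually resolves to today (a simple `dig` check, assuming it is available on the workstation):
```bash
# Should currently return n1's external IP (<NODE_1_EXTERNAL_IP>)
dig +short api.keyboardvagabond.com
```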
### Issue 2: Cluster Lockout After n1 Migration
**Problem:**
- When n1 was switched to a VLAN node-ip, all access was lost
- Tailscale pods couldn't start (needed API server access)
- Cilium policies blocked external Talos API access
- Complete lockout - no `kubectl` or `talosctl` access
**Root Cause:**
- Tailscale requires API server to be reachable from external network
- Once n1 switched to VLAN-only, Tailscale couldn't connect
- Without Tailscale, no VPN access to cluster
**Solution:**
- ✅ Enabled **global Talos API access** (ports 50000, 50001) in Cilium policies
- This prevents future lockouts during network migrations
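As a sketch of what such a policy can look like (the policy name and empty `nodeSelector` here are illustrative, not the exact manifest in this repo; Cilium's host firewall must be enabled for node-scoped policies to take effect):
```bash
kubectl apply -f - <<'EOF'
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-talos-api # hypothetical name
spec:
  nodeSelector: {} # selects all nodes
  ingress:
    - fromEntities:
        - world
        - cluster
      toPorts:
        - ports:
            - port: "50000" # apid
              protocol: TCP
            - port: "50001" # trustd
              protocol: TCP
EOF
```
Note that once any host policy selects a node, Cilium may default-deny other host traffic on that node, so a policy like this should be reviewed as part of the full policy set.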
### Issue 3: etcd Data Loss After Bootstrap
**Problem:**
- After multiple reboots/config changes, etcd lost its data
- `/var/lib/etcd/member` directory was empty
- etcd stuck waiting to join cluster
**Solution:**
- Ran `talosctl bootstrap` to reinitialize etcd
- GitOps (Flux) automatically redeployed all workloads from Git
- Longhorn has S3 backups for persistent data recovery
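The recovery sequence was roughly the following (bootstrapping is destructive on a healthy cluster, so this applies only when etcd has genuinely lost its member data):
```bash
# Reinitialize etcd on the control plane
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> bootstrap

# Watch Flux re-reconcile workloads from Git
flux get kustomizations --watch
```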
### Issue 4: Machine Config Format Issues
**Problem:**
- `machineconfigs/n1.yaml` was in resource dump format (with `spec: |` wrapper)
- YAML indentation errors in various config files
- SOPS encryption complications
**Solution:**
- Use `.decrypted~` files for direct manipulation
- Careful YAML indentation (list items with inline keys)
- Apply configs in maintenance mode with `--insecure` flag
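A minimal sketch of that workflow (assuming SOPS is already configured with the cluster's key):
```bash
# Decrypt to a working copy using the repo's .decrypted~ convention
sops -d machineconfigs/n1.yaml > machineconfigs/n1.yaml.decrypted~

# In maintenance mode the node has no client certs yet, hence --insecure
talosctl apply-config --insecure \
  -n <NODE_1_EXTERNAL_IP> \
  --file machineconfigs/n1.yaml.decrypted~
```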
## Migration Plan: Phased VLAN Rollout
### Prerequisites
1. ✅ All nodes in stable, working state (DONE)
2. ✅ Global Talos API access enabled (DONE)
3. ✅ GitOps with Flux operational (DONE)
4. ⏳ Verify Longhorn S3 backups are current
5. ⏳ Document current pod placement and workload state
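For prerequisites 4 and 5, a simple snapshot is enough to compare against after migration (the Longhorn resource names assume a standard Longhorn install):
```bash
# Record current workload placement and node state
kubectl get pods -A -o wide > pod-placement-$(date +%Y%m%d).txt
kubectl get nodes -o wide > node-state-$(date +%Y%m%d).txt

# Check that Longhorn S3 backups are recent
kubectl -n longhorn-system get backups.longhorn.io
```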
### Phase 1: Prepare Configurations
#### 1.1 Update Machine Configs for VLAN
For each node, update the machine config:
**n1 (control plane):**
```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```
**n2 & n3 (workers):**
```yaml
cluster:
  controlPlane:
    endpoint: https://<NODE_1_IP>:6443 # Use n1's VLAN IP
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```
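Before applying anything, the rendered configs can be linted with `talosctl` (mode `metal` matches these bare-metal/VPS installs):
```bash
for cfg in machineconfigs/n1-vlan.yaml machineconfigs/n2-vlan.yaml machineconfigs/n3-vlan.yaml; do
  talosctl validate --config "$cfg" --mode metal
done
```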
#### 1.2 Update Cilium Configuration
Verify Cilium is configured to use VLAN interface:
```yaml
# manifests/infrastructure/cilium/release.yaml
values:
  kubeProxyReplacement: strict
  # Ensure Cilium detects and uses VLAN interface
```
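One way to check which devices the running agent actually picked up (the exact output format varies across Cilium versions):
```bash
# Look for the detected devices in the agent status
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -i device

# Or inspect the rendered ConfigMap
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i devices
```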
### Phase 2: Test with Worker Node First
#### 2.1 Migrate n3 (Worker Node)
Test VLAN migration on a worker node first:
```bash
# Apply updated config to n3
cd /Users/<USERNAME>/src/keyboard-vagabond
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> apply-config \
  --file machineconfigs/n3-vlan.yaml
# Wait for n3 to reboot
sleep 60
# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP: <NODE_3_IP>
```
#### 2.2 Validate n3 Connectivity
```bash
# Check Cilium status on n3
kubectl exec -n kube-system ds/cilium -- cilium status
# Verify pod-to-pod communication
kubectl run test-pod --image=nginx --rm -it -- curl <service-on-n3>
# Check inter-node traffic is using VLAN
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> read /proc/net/dev | grep enp9s0
```
#### 2.3 Decision Point
- ✅ If successful: Proceed to Phase 3
- ❌ If issues: Revert n3 to external IP (rollback plan)
### Phase 3: Migrate Second Worker (n2)
Repeat Phase 2 steps for n2:
```bash
talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> apply-config \
  --file machineconfigs/n2-vlan.yaml
```
Validate connectivity and inter-node traffic on VLAN.
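The same spot checks used for n3 apply here, for example:
```bash
# INTERNAL-IP should now show <NODE_2_IP>
kubectl get node n2 -o wide

# Confirm traffic counters are moving on the VLAN interface
talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> read /proc/net/dev | grep enp9s0
```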
### Phase 4: Migrate Control Plane (n1)
**CRITICAL:** This is the most sensitive step.
#### 4.1 Prepare for Downtime
- ⚠️ **Expected downtime:** 2-5 minutes
- Inform users of maintenance window
- Ensure workers (n2, n3) are stable
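Taking an etcd snapshot first gives the fastest path back if this phase goes wrong:
```bash
# Snapshot etcd to a local file before touching n1
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> etcd snapshot db.snapshot
```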
#### 4.2 Apply Config to n1
```bash
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> apply-config \
  --file machineconfigs/n1-vlan.yaml
```
#### 4.3 Monitor API Server Recovery
```bash
# Watch for API server to come back online
watch -n 2 "kubectl get nodes"
# Check etcd health
talosctl -e <NODE_1_IP> -n <NODE_1_IP> service etcd status
# Verify all nodes on VLAN
kubectl get nodes -o wide
```
### Phase 5: Validation & Verification
#### 5.1 Verify VLAN Traffic
```bash
# Check network traffic on VLAN interface (enp9s0)
for node in <NODE_1_IP> <NODE_2_IP> <NODE_3_IP>; do
  echo "=== $node ==="
  talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done
```
#### 5.2 Verify Pod Connectivity
```bash
# Deploy test pods across nodes
kubectl run test-n1 --image=nginx --overrides='{"spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx --overrides='{"spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx --overrides='{"spec":{"nodeName":"n3"}}'
# Test cross-node communication
kubectl exec test-n1 -- curl <test-n2-pod-ip>
kubectl exec test-n2 -- curl <test-n3-pod-ip>
```
#### 5.3 Monitor for 24 Hours
- Watch for network issues
- Monitor Longhorn replication
- Check application logs
- Verify external services (Mastodon, Pixelfed, etc.)
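Example spot checks for that window (the Longhorn resource names assume a standard install):
```bash
# Longhorn volume health
kubectl -n longhorn-system get volumes.longhorn.io

# Recent cluster events, most recent last
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20
```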
## Rollback Plan
### If Issues Occur During Migration
#### Rollback Individual Node
```bash
# Create rollback config with external IP
# Apply to affected node
talosctl -e <node-external-ip> -n <node-external-ip> apply-config \
  --file machineconfigs/<node>-external.yaml
```
#### Complete Cluster Rollback
If systemic issues occur:
1. Revert n1 first (control plane is critical)
2. Revert n2 and n3
3. Verify all nodes back on external IPs
4. Investigate root cause before retry
### Emergency Recovery (If Locked Out)
If you lose access during migration:
1. **Access via NetCup Console:**
- Boot node into maintenance mode via NetCup dashboard
- Apply rollback config with `--insecure` flag
2. **Rescue Mode (Last Resort):**
- Boot into NetCup rescue system
- Mount XFS partitions (need `xfsprogs`)
- Manually edit configs (complex, avoid if possible); a rough sketch follows
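For the rescue-mode path, the rough shape is as follows (the device name is purely illustrative and must be confirmed with `lsblk` first; the Talos partition layout may differ per install):
```bash
# Rescue images often ship without XFS tools
apt-get install -y xfsprogs

# Identify the Talos partitions before mounting anything
lsblk -f

# Hypothetical device name; verify against the lsblk output above
mount -t xfs /dev/sda4 /mnt
```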
## Key Talos Configuration References
### Multihoming Configuration
According to [Talos Multihoming Docs](https://docs.siderolabs.com/talos/v1.10/networking/multihoming):
```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Selects IP from VLAN subnet
```
### Kubelet node-ip Setting
From [Kubernetes Kubelet Docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/):
- `--node-ip`: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
- Controls which IP kubelet advertises to API server
- Determines routing for pod-to-pod traffic
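To see what each kubelet actually advertised, the addresses can be listed per node (standard `kubectl` jsonpath, nothing cluster-specific):
```bash
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
```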
### Network Connectivity Requirements
Per [Talos Network Connectivity Docs](https://docs.siderolabs.com/talos/v1.10/learn-more/talos-network-connectivity/):
**Control Plane Nodes:**
- TCP 50000: apid (used by talosctl, control plane nodes)
- TCP 50001: trustd (used by worker nodes)
**Worker Nodes:**
- TCP 50000: apid (used by control plane nodes)
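A quick reachability check for these ports from any machine on the VLAN (assumes `nc`/netcat is available):
```bash
# -z: scan without sending data, -v: verbose
nc -zv <NODE_1_IP> 50000
nc -zv <NODE_1_IP> 50001
```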
## Lessons Learned
### What Went Wrong
1. **Incremental migration without proper planning** - Migrated n1 first without considering Tailscale dependencies
2. **Inadequate firewall policies** - Talos API blocked externally, causing lockout
3. **API endpoint mismatch** - DNS resolution didn't match node-ip configuration
4. **Config file format confusion** - Multiple formats caused application errors
### What Went Right
1. **Global Talos API access** - Prevents future lockouts
2. **GitOps with Flux** - Automatic workload recovery after etcd bootstrap
3. **Maintenance mode recovery** - Reliable way to regain access
4. **External IP baseline** - Stable configuration to fall back to
### Best Practices Going Forward
1. **Test on workers first** - Validate VLAN setup before touching control plane
2. **Document all configs** - Keep clear record of working configurations
3. **Monitor traffic** - Use `talosctl read /proc/net/dev` to verify VLAN usage
4. **Backup etcd** - Regular etcd backups to avoid data loss
5. **Plan for downtime** - Maintenance windows for control plane changes
## Success Criteria
Migration is successful when:
1. ✅ All nodes showing VLAN IPs in `kubectl get nodes -o wide`
2. ✅ Inter-node traffic flowing over enp9s0 (VLAN interface)
3. ✅ All pods healthy and communicating
4. ✅ Longhorn replication working
5. ✅ External services (Mastodon, Pixelfed, etc.) operational
6. ✅ No performance degradation
7. ✅ 24-hour stability test passed
## Additional Resources
- [Talos Multihoming Documentation](https://docs.siderolabs.com/talos/v1.10/networking/multihoming)
- [Talos Production Notes](https://docs.siderolabs.com/talos/v1.10/getting-started/prodnotes)
- [Kubernetes Kubelet Reference](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
- [Cilium Documentation](https://docs.cilium.io/)
## Contact & Maintenance
**Last Updated:** 2025-11-20
**Cluster:** keyboardvagabond.com
**Status:** Nodes operational on external IPs, VLAN migration pending