# VLAN Node-IP Migration Plan

## Document Purpose

This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.

## Current State (2025-11-20)

### Cluster Status

- **n1** (control plane): `` - Ready ✅
- **n2** (worker): `` - Ready ✅
- **n3** (worker): `` - Ready ✅

### Current Configuration

All nodes are using **external IPs** for `node-ip`:

- n1: `node-ip: `
- n2: `node-ip: `
- n3: `node-ip: `

### Issues with Current Setup

1. ❌ Inter-node pod traffic uses the **public internet** (external IPs)
2. ❌ VLAN bandwidth (100Mbps dedicated) is **unused**
3. ❌ Less secure (traffic exposed on the public network)
4. ❌ Potentially slower for inter-pod communication

### What's Working

1. ✅ All nodes joined and operational
2. ✅ Cilium CNI deployed and functional
3. ✅ Global Talos API access enabled (ports 50000, 50001)
4. ✅ GitOps with Flux operational
5. ✅ Core infrastructure recovering

## Goal: VLAN Migration

### Target Configuration

All nodes using **VLAN IPs** for `node-ip`:

- n1: `` (control plane)
- n2: `` (worker)
- n3: `` (worker)

### Benefits

1. ✅ 100Mbps dedicated bandwidth for inter-node traffic
2. ✅ Private network (more secure)
3. ✅ Lower latency for pod-to-pod communication
4. ✅ Production-ready architecture

## Issues Encountered During Initial Attempt

### Issue 1: API Server Endpoint Mismatch

**Problem:**

- `api.keyboardvagabond.com` resolves to n1's external IP (``)
- Worker nodes with a VLAN node-ip couldn't reach the API server
- n3 failed to join the cluster

**Solution:** Must choose ONE of:

- **Option A:** Set `cluster.controlPlane.endpoint: https://:6443` in ALL machine configs
- **Option B:** Update DNS so `api.keyboardvagabond.com` resolves to `` (VLAN IP)

**Recommended:** Option A (simpler, no DNS changes needed)

### Issue 2: Cluster Lockout After n1 Migration

**Problem:**

- When n1 was changed to a VLAN node-ip, all access was lost
- Tailscale pods couldn't start (needed API server access)
- Cilium policies blocked external Talos API access
- Complete lockout - no `kubectl` or `talosctl` access

**Root Cause:**

- Tailscale requires the API server to be reachable from the external network
- Once n1 switched to VLAN-only, Tailscale couldn't connect
- Without Tailscale, no VPN access to the cluster

**Solution:**

- ✅ Enabled **global Talos API access** (ports 50000, 50001) in Cilium policies
- This prevents future lockouts during network migrations

### Issue 3: etcd Data Loss After Bootstrap

**Problem:**

- After multiple reboots/config changes, etcd lost its data
- The `/var/lib/etcd/member` directory was empty
- etcd was stuck waiting to join the cluster

**Solution:**

- Ran `talosctl bootstrap` to reinitialize etcd
- GitOps (Flux) automatically redeployed all workloads from Git
- Longhorn has S3 backups for persistent data recovery

### Issue 4: Machine Config Format Issues

**Problem:**

- `machineconfigs/n1.yaml` was in resource dump format (with a `spec: |` wrapper)
- YAML indentation errors in various config files
- SOPS encryption complications

**Solution:**

- Use `.decrypted~` files for direct manipulation
- Careful YAML indentation (list items with inline keys)
- Apply configs in maintenance mode with the `--insecure` flag (see the sketch below)
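For reference, a minimal sketch of the maintenance-mode apply; the node address and file name are placeholders, not values recorded in this plan:

```bash
# Hypothetical example - substitute the node's actual maintenance-mode address
# and the decrypted config file for that node.
# --insecure is needed because a node in maintenance mode has no client
# certificates yet, so talosctl cannot perform mutual TLS.
talosctl apply-config --insecure \
  --nodes <node-maintenance-ip> \
  --file machineconfigs/<node>.yaml.decrypted~
```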
## Migration Plan: Phased VLAN Rollout

### Prerequisites

1. ✅ All nodes in stable, working state (DONE)
2. ✅ Global Talos API access enabled (DONE)
3. ✅ GitOps with Flux operational (DONE)
4. ⏳ Verify Longhorn S3 backups are current
5. ⏳ Document current pod placement and workload state

### Phase 1: Prepare Configurations

#### 1.1 Update Machine Configs for VLAN

For each node, update the machine config:

**n1 (control plane):**

```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```

**n2 & n3 (workers):**

```yaml
cluster:
  controlPlane:
    endpoint: https://:6443 # Use n1's VLAN IP
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```

#### 1.2 Update Cilium Configuration

Verify Cilium is configured to use the VLAN interface:

```yaml
# manifests/infrastructure/cilium/release.yaml
values:
  kubeProxyReplacement: strict
  # Ensure Cilium detects and uses VLAN interface
```

### Phase 2: Test with Worker Node First

#### 2.1 Migrate n3 (Worker Node)

Test VLAN migration on a worker node first:

```bash
# Apply updated config to n3
cd /Users//src/keyboard-vagabond
talosctl -e -n apply-config \
  --file machineconfigs/n3-vlan.yaml

# Wait for n3 to reboot
sleep 60

# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP:
```

#### 2.2 Validate n3 Connectivity

```bash
# Check Cilium status on n3
kubectl exec -n kube-system ds/cilium -- cilium status

# Verify pod-to-pod communication
kubectl run test-pod --image=nginx --rm -it -- curl

# Check inter-node traffic is using VLAN
talosctl -e -n read /proc/net/dev | grep enp9s0
```

#### 2.3 Decision Point

- ✅ If successful: Proceed to Phase 3
- ❌ If issues: Revert n3 to external IP (rollback plan)

### Phase 3: Migrate Second Worker (n2)

Repeat Phase 2 steps for n2:

```bash
talosctl -e -n apply-config \
  --file machineconfigs/n2-vlan.yaml
```

Validate connectivity and inter-node traffic on the VLAN.

### Phase 4: Migrate Control Plane (n1)

**CRITICAL:** This is the most sensitive step.

#### 4.1 Prepare for Downtime

- ⚠️ **Expected downtime:** 2-5 minutes
- Inform users of maintenance window
- Ensure workers (n2, n3) are stable

#### 4.2 Apply Config to n1

```bash
talosctl -e -n apply-config \
  --file machineconfigs/n1-vlan.yaml
```

#### 4.3 Monitor API Server Recovery

```bash
# Watch for API server to come back online
watch -n 2 "kubectl get nodes"

# Check etcd health
talosctl -e -n service etcd status

# Verify all nodes on VLAN
kubectl get nodes -o wide
```

### Phase 5: Validation & Verification

#### 5.1 Verify VLAN Traffic

```bash
# Check network traffic on VLAN interface (enp9s0)
for node in ; do
  echo "=== $node ==="
  talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done
```

#### 5.2 Verify Pod Connectivity

```bash
# Deploy test pods across nodes
kubectl run test-n1 --image=nginx --overrides='{"spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx --overrides='{"spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx --overrides='{"spec":{"nodeName":"n3"}}'

# Test cross-node communication
kubectl exec test-n1 -- curl
kubectl exec test-n2 -- curl
```

#### 5.3 Monitor for 24 Hours

- Watch for network issues
- Monitor Longhorn replication
- Check application logs
- Verify external services (Mastodon, Pixelfed, etc.)

## Rollback Plan

### If Issues Occur During Migration

#### Rollback Individual Node

```bash
# Create rollback config with external IP
# Apply to affected node
talosctl -e -n apply-config \
  --file machineconfigs/-external.yaml
```
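A minimal sketch of what a per-node rollback fragment might contain, assuming the pre-migration setup pointed at the `api.keyboardvagabond.com` endpoint and that the node's public subnet is substituted for the placeholder below:

```yaml
# Hypothetical rollback fragment (worker example) - placeholder values,
# not recorded cluster values.
cluster:
  controlPlane:
    endpoint: https://api.keyboardvagabond.com:6443 # back to the DNS endpoint
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - <external-subnet> # the node's public subnet, not 10.132.0.0/24
```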
#### Complete Cluster Rollback

If systemic issues occur:

1. Revert n1 first (control plane is critical)
2. Revert n2 and n3
3. Verify all nodes back on external IPs
4. Investigate root cause before retry

### Emergency Recovery (If Locked Out)

If you lose access during migration:

1. **Access via NetCup Console:**
   - Boot node into maintenance mode via NetCup dashboard
   - Apply rollback config with `--insecure` flag
2. **Rescue Mode (Last Resort):**
   - Boot into NetCup rescue system
   - Mount XFS partitions (need `xfsprogs`)
   - Manually edit configs (complex, avoid if possible)

## Key Talos Configuration References

### Multihoming Configuration

According to [Talos Multihoming Docs](https://docs.siderolabs.com/talos/v1.10/networking/multihoming):

```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Selects IP from VLAN subnet
```

### Kubelet node-ip Setting

From [Kubernetes Kubelet Docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/):

- `--node-ip`: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
- Controls which IP kubelet advertises to the API server
- Determines routing for pod-to-pod traffic

### Network Connectivity Requirements

Per [Talos Network Connectivity Docs](https://docs.siderolabs.com/talos/v1.10/learn-more/talos-network-connectivity/):

**Control Plane Nodes:**

- TCP 50000: apid (used by talosctl, control plane nodes)
- TCP 50001: trustd (used by worker nodes)

**Worker Nodes:**

- TCP 50000: apid (used by control plane nodes)

## Lessons Learned

### What Went Wrong

1. **Incremental migration without proper planning** - Migrated n1 first without considering Tailscale dependencies
2. **Inadequate firewall policies** - Talos API blocked externally, causing lockout
3. **API endpoint mismatch** - DNS resolution didn't match node-ip configuration
4. **Config file format confusion** - Multiple formats caused application errors

### What Went Right

1. ✅ **Global Talos API access** - Prevents future lockouts
2. ✅ **GitOps with Flux** - Automatic workload recovery after etcd bootstrap
3. ✅ **Maintenance mode recovery** - Reliable way to regain access
4. ✅ **External IP baseline** - Stable configuration to fall back to

### Best Practices Going Forward

1. **Test on workers first** - Validate VLAN setup before touching control plane
2. **Document all configs** - Keep clear record of working configurations
3. **Monitor traffic** - Use `talosctl read /proc/net/dev` to verify VLAN usage
4. **Backup etcd** - Regular etcd backups to avoid data loss
5. **Plan for downtime** - Maintenance windows for control plane changes

## Success Criteria

Migration is successful when:

1. ✅ All nodes showing VLAN IPs in `kubectl get nodes -o wide`
2. ✅ Inter-node traffic flowing over enp9s0 (VLAN interface)
3. ✅ All pods healthy and communicating
4. ✅ Longhorn replication working
5. ✅ External services (Mastodon, Pixelfed, etc.) operational
6. ✅ No performance degradation
7. ✅ 24-hour stability test passed

## Additional Resources

- [Talos Multihoming Documentation](https://docs.siderolabs.com/talos/v1.10/networking/multihoming)
- [Talos Production Notes](https://docs.siderolabs.com/talos/v1.10/getting-started/prodnotes)
- [Kubernetes Kubelet Reference](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
- [Cilium Documentation](https://docs.cilium.io/)

## Contact & Maintenance

**Last Updated:** 2025-11-20
**Cluster:** keyboardvagabond.com
**Status:** Nodes operational on external IPs, VLAN migration pending