# VLAN Node-IP Migration Plan
## Document Purpose
This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.
## Current State (2025-11-20)
### Cluster Status
- **n1** (control plane): `<NODE_1_EXTERNAL_IP>` - Ready ✅
- **n2** (worker): `<NODE_2_EXTERNAL_IP>` - Ready ✅
- **n3** (worker): `<NODE_3_EXTERNAL_IP>` - Ready ✅
### Current Configuration
All nodes are using **external IPs** for `node-ip`:
- n1: `node-ip: <NODE_1_EXTERNAL_IP>`
- n2: `node-ip: <NODE_2_EXTERNAL_IP>`
- n3: `node-ip: <NODE_3_EXTERNAL_IP>`
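A quick way to confirm what each node currently advertises (read-only, safe to run at any time):
```bash
# INTERNAL-IP column shows the active node-ip
kubectl get nodes -o wide

# Or query a single node's advertised addresses directly
kubectl get node n1 -o jsonpath='{.status.addresses[?(@.type=="InternalIP")].address}'
```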
### Issues with Current Setup
1. ❌ Inter-node pod traffic uses **public internet** (external IPs)
2. ❌ VLAN bandwidth (100Mbps dedicated) is **unused**
3. ❌ Less secure (traffic exposed on public network)
4. ❌ Potentially slower for inter-pod communication
### What's Working
1. ✅ All nodes joined and operational
2. ✅ Cilium CNI deployed and functional
3. ✅ Global Talos API access enabled (ports 50000, 50001)
4. ✅ GitOps with Flux operational
5. ✅ Core infrastructure recovering
## Goal: VLAN Migration
### Target Configuration
All nodes using **VLAN IPs** for `node-ip`:
- n1: `<NODE_1_IP>` (control plane)
- n2: `<NODE_2_IP>` (worker)
- n3: `<NODE_3_IP>` (worker)
### Benefits
1. ✅ 100Mbps dedicated bandwidth for inter-node traffic
2. ✅ Private network (more secure)
3. ✅ Lower latency for pod-to-pod communication
4. ✅ Production-ready architecture
## Issues Encountered During Initial Attempt
### Issue 1: API Server Endpoint Mismatch
**Problem:**
- `api.keyboardvagabond.com` resolves to n1's external IP (`<NODE_1_EXTERNAL_IP>`)
- Worker nodes with VLAN node-ip couldn't reach API server
- n3 failed to join cluster
**Solution:**
Must choose ONE of:
- **Option A:** Set `cluster.controlPlane.endpoint: https://<NODE_1_IP>:6443` in ALL machine configs
- **Option B:** Update DNS so `api.keyboardvagabond.com` resolves to `<NODE_1_IP>` (VLAN IP)
**Recommended:** Option A (simpler, no DNS changes needed)
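Before deciding, it helps to confirm what the endpoint actually resolves to today (a simple `dig` check, assuming it is available on the workstation):
```bash
# Should currently return n1's external IP (<NODE_1_EXTERNAL_IP>)
dig +short api.keyboardvagabond.com
```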
### Issue 2: Cluster Lockout After n1 Migration
**Problem:**
- When n1 was switched to a VLAN node-ip, all access was lost
- Tailscale pods couldn't start (needed API server access)
- Cilium policies blocked external Talos API access
- Complete lockout - no `kubectl` or `talosctl` access
**Root Cause:**
- Tailscale requires API server to be reachable from external network
- Once n1 switched to VLAN-only, Tailscale couldn't connect
- Without Tailscale, no VPN access to cluster
**Solution:**
- ✅ Enabled **global Talos API access** (ports 50000, 50001) in Cilium policies
- This prevents future lockouts during network migrations
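As a sketch of what such a policy can look like (the policy name and empty `nodeSelector` here are illustrative, not the exact manifest in this repo; Cilium's host firewall must be enabled for node-scoped policies to take effect):
```bash
kubectl apply -f - <<'EOF'
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-talos-api # hypothetical name
spec:
  nodeSelector: {} # selects all nodes
  ingress:
    - fromEntities:
        - world
        - cluster
      toPorts:
        - ports:
            - port: "50000" # apid
              protocol: TCP
            - port: "50001" # trustd
              protocol: TCP
EOF
```
Note that once any host policy selects a node, Cilium may default-deny other host traffic on that node, so a policy like this should be reviewed as part of the full policy set.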
### Issue 3: etcd Data Loss After Bootstrap
**Problem:**
- After multiple reboots/config changes, etcd lost its data
- `/var/lib/etcd/member` directory was empty
- etcd stuck waiting to join cluster
**Solution:**
- Ran `talosctl bootstrap` to reinitialize etcd
- GitOps (Flux) automatically redeployed all workloads from Git
- Longhorn has S3 backups for persistent data recovery
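The recovery sequence was roughly the following (bootstrapping is destructive on a healthy cluster, so this applies only when etcd has genuinely lost its member data):
```bash
# Reinitialize etcd on the control plane
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> bootstrap

# Watch Flux re-reconcile workloads from Git
flux get kustomizations --watch
```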
### Issue 4: Machine Config Format Issues
**Problem:**
- `machineconfigs/n1.yaml` was in resource dump format (with `spec: |` wrapper)
- YAML indentation errors in various config files
- SOPS encryption complications
**Solution:**
- Use `.decrypted~` files for direct manipulation
- Careful YAML indentation (list items with inline keys)
- Apply configs in maintenance mode with `--insecure` flag
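A minimal sketch of that workflow (assuming SOPS is already configured with the cluster's key):
```bash
# Decrypt to a working copy using the repo's .decrypted~ convention
sops -d machineconfigs/n1.yaml > machineconfigs/n1.yaml.decrypted~

# In maintenance mode the node has no client certs yet, hence --insecure
talosctl apply-config --insecure \
  -n <NODE_1_EXTERNAL_IP> \
  --file machineconfigs/n1.yaml.decrypted~
```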
## Migration Plan: Phased VLAN Rollout
### Prerequisites
1. ✅ All nodes in stable, working state (DONE)
2. ✅ Global Talos API access enabled (DONE)
3. ✅ GitOps with Flux operational (DONE)
4. ⏳ Verify Longhorn S3 backups are current
5. ⏳ Document current pod placement and workload state
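For prerequisites 4 and 5, a simple snapshot is enough to compare against after migration (the Longhorn resource names assume a standard Longhorn install):
```bash
# Record current workload placement and node state
kubectl get pods -A -o wide > pod-placement-$(date +%Y%m%d).txt
kubectl get nodes -o wide > node-state-$(date +%Y%m%d).txt

# Check that Longhorn S3 backups are recent
kubectl -n longhorn-system get backups.longhorn.io
```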
### Phase 1: Prepare Configurations
#### 1.1 Update Machine Configs for VLAN
For each node, update the machine config:
**n1 (control plane):**
```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```
**n2 & n3 (workers):**
```yaml
cluster:
  controlPlane:
    endpoint: https://<NODE_1_IP>:6443 # Use n1's VLAN IP
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```
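Before applying anything, the rendered configs can be linted with `talosctl` (mode `metal` matches these bare-metal/VPS installs):
```bash
for cfg in machineconfigs/n1-vlan.yaml machineconfigs/n2-vlan.yaml machineconfigs/n3-vlan.yaml; do
  talosctl validate --config "$cfg" --mode metal
done
```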
#### 1.2 Update Cilium Configuration
Verify Cilium is configured to use VLAN interface:
```yaml
# manifests/infrastructure/cilium/release.yaml
values:
  kubeProxyReplacement: strict
  # Ensure Cilium detects and uses VLAN interface
```
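One way to check which devices the running agent actually picked up (the exact output format varies across Cilium versions):
```bash
# Look for the detected devices in the agent status
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -i device

# Or inspect the rendered ConfigMap
kubectl -n kube-system get configmap cilium-config -o yaml | grep -i devices
```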
### Phase 2: Test with Worker Node First
#### 2.1 Migrate n3 (Worker Node)
Test VLAN migration on a worker node first:
```bash
# Apply updated config to n3
cd /Users/<USERNAME>/src/keyboard-vagabond
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> apply-config \
  --file machineconfigs/n3-vlan.yaml
# Wait for n3 to reboot
sleep 60
# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP: <NODE_3_IP>
```
#### 2.2 Validate n3 Connectivity
```bash
# Check Cilium status on n3
kubectl exec -n kube-system ds/cilium -- cilium status
# Verify pod-to-pod communication
kubectl run test-pod --image=nginx --rm -it -- curl <service-on-n3>
# Check inter-node traffic is using VLAN
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> read /proc/net/dev | grep enp9s0
```
#### 2.3 Decision Point
- ✅ If successful: Proceed to Phase 3
- ❌ If issues: Revert n3 to external IP (rollback plan)
### Phase 3: Migrate Second Worker (n2)
Repeat Phase 2 steps for n2:
```bash
talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> apply-config \
  --file machineconfigs/n2-vlan.yaml
```
Validate connectivity and inter-node traffic on VLAN.
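The same spot checks used for n3 apply here, for example:
```bash
# INTERNAL-IP should now show <NODE_2_IP>
kubectl get node n2 -o wide

# Confirm traffic counters are moving on the VLAN interface
talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> read /proc/net/dev | grep enp9s0
```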
### Phase 4: Migrate Control Plane (n1)
**CRITICAL:** This is the most sensitive step.
#### 4.1 Prepare for Downtime
- ⚠️ **Expected downtime:** 2-5 minutes
- Inform users of maintenance window
- Ensure workers (n2, n3) are stable
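Taking an etcd snapshot first gives the fastest path back if this phase goes wrong:
```bash
# Snapshot etcd to a local file before touching n1
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> etcd snapshot db.snapshot
```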
#### 4.2 Apply Config to n1
```bash
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> apply-config \
  --file machineconfigs/n1-vlan.yaml
```
#### 4.3 Monitor API Server Recovery
```bash
# Watch for API server to come back online
watch -n 2 "kubectl get nodes"
# Check etcd health
talosctl -e <NODE_1_IP> -n <NODE_1_IP> service etcd status
# Verify all nodes on VLAN
kubectl get nodes -o wide
```
### Phase 5: Validation & Verification
#### 5.1 Verify VLAN Traffic
```bash
# Check network traffic on VLAN interface (enp9s0)
for node in <NODE_1_IP> <NODE_2_IP> <NODE_3_IP>; do
  echo "=== $node ==="
  talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done
```
#### 5.2 Verify Pod Connectivity
```bash
# Deploy test pods across nodes
kubectl run test-n1 --image=nginx --overrides='{"spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx --overrides='{"spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx --overrides='{"spec":{"nodeName":"n3"}}'
# Test cross-node communication
kubectl exec test-n1 -- curl <test-n2-pod-ip>
kubectl exec test-n2 -- curl <test-n3-pod-ip>
```
#### 5.3 Monitor for 24 Hours
- Watch for network issues
- Monitor Longhorn replication
- Check application logs
- Verify external services (Mastodon, Pixelfed, etc.)
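Example spot checks for that window (the Longhorn resource names assume a standard install):
```bash
# Longhorn volume health
kubectl -n longhorn-system get volumes.longhorn.io

# Recent cluster events, most recent last
kubectl get events -A --sort-by=.lastTimestamp | tail -n 20
```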
## Rollback Plan
### If Issues Occur During Migration
#### Rollback Individual Node
```bash
# Create rollback config with external IP
# Apply to affected node
talosctl -e <node-external-ip> -n <node-external-ip> apply-config \
  --file machineconfigs/<node>-external.yaml
```
#### Complete Cluster Rollback
If systemic issues occur:
1. Revert n1 first (control plane is critical)
2. Revert n2 and n3
3. Verify all nodes back on external IPs
4. Investigate root cause before retry
### Emergency Recovery (If Locked Out)
If you lose access during migration:
1. **Access via NetCup Console:**
- Boot node into maintenance mode via NetCup dashboard
- Apply rollback config with `--insecure` flag
2. **Rescue Mode (Last Resort):**
- Boot into NetCup rescue system
- Mount XFS partitions (need `xfsprogs`)
- Manually edit configs (complex, avoid if possible); a rough sketch follows
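For the rescue-mode path, the rough shape is as follows (the device name is purely illustrative and must be confirmed with `lsblk` first; the Talos partition layout may differ per install):
```bash
# Rescue images often ship without XFS tools
apt-get install -y xfsprogs

# Identify the Talos partitions before mounting anything
lsblk -f

# Hypothetical device name; verify against the lsblk output above
mount -t xfs /dev/sda4 /mnt
```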
## Key Talos Configuration References
### Multihoming Configuration
According to [Talos Multihoming Docs](https://docs.siderolabs.com/talos/v1.10/networking/multihoming):
```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Selects IP from VLAN subnet
```
### Kubelet node-ip Setting
From [Kubernetes Kubelet Docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/):
- `--node-ip`: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
- Controls which IP kubelet advertises to API server
- Determines routing for pod-to-pod traffic
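To see what each kubelet actually advertised, the addresses can be listed per node (standard `kubectl` jsonpath, nothing cluster-specific):
```bash
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'
```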
### Network Connectivity Requirements
Per [Talos Network Connectivity Docs](https://docs.siderolabs.com/talos/v1.10/learn-more/talos-network-connectivity/):
**Control Plane Nodes:**
- TCP 50000: apid (used by talosctl, control plane nodes)
- TCP 50001: trustd (used by worker nodes)
**Worker Nodes:**
- TCP 50000: apid (used by control plane nodes)
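A quick reachability check for these ports from any machine on the VLAN (assumes `nc`/netcat is available):
```bash
# -z: scan without sending data, -v: verbose
nc -zv <NODE_1_IP> 50000
nc -zv <NODE_1_IP> 50001
```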
## Lessons Learned
### What Went Wrong
1. **Incremental migration without proper planning** - Migrated n1 first without considering Tailscale dependencies
2. **Inadequate firewall policies** - Talos API blocked externally, causing lockout
3. **API endpoint mismatch** - DNS resolution didn't match node-ip configuration
4. **Config file format confusion** - Multiple formats caused application errors
### What Went Right
1. **Global Talos API access** - Prevents future lockouts
2. **GitOps with Flux** - Automatic workload recovery after etcd bootstrap
3. **Maintenance mode recovery** - Reliable way to regain access
4. **External IP baseline** - Stable configuration to fall back to
### Best Practices Going Forward
1. **Test on workers first** - Validate VLAN setup before touching control plane
2. **Document all configs** - Keep clear record of working configurations
3. **Monitor traffic** - Use `talosctl read /proc/net/dev` to verify VLAN usage
4. **Backup etcd** - Regular etcd backups to avoid data loss
5. **Plan for downtime** - Maintenance windows for control plane changes
## Success Criteria
Migration is successful when:
1. ✅ All nodes showing VLAN IPs in `kubectl get nodes -o wide`
2. ✅ Inter-node traffic flowing over enp9s0 (VLAN interface)
3. ✅ All pods healthy and communicating
4. ✅ Longhorn replication working
5. ✅ External services (Mastodon, Pixelfed, etc.) operational
6. ✅ No performance degradation
7. ✅ 24-hour stability test passed
## Additional Resources
- [Talos Multihoming Documentation](https://docs.siderolabs.com/talos/v1.10/networking/multihoming)
- [Talos Production Notes](https://docs.siderolabs.com/talos/v1.10/getting-started/prodnotes)
- [Kubernetes Kubelet Reference](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
- [Cilium Documentation](https://docs.cilium.io/)
## Contact & Maintenance
**Last Updated:** 2025-11-20
**Cluster:** keyboardvagabond.com
**Status:** Nodes operational on external IPs, VLAN migration pending