# VLAN Node-IP Migration Plan

## Document Purpose

This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.

## Current State (2025-11-20)

### Cluster Status

- **n1** (control plane): `<NODE_1_EXTERNAL_IP>` - Ready ✅
- **n2** (worker): `<NODE_2_EXTERNAL_IP>` - Ready ✅
- **n3** (worker): `<NODE_3_EXTERNAL_IP>` - Ready ✅

### Current Configuration

All nodes are using **external IPs** for `node-ip`:

- n1: `node-ip: <NODE_1_EXTERNAL_IP>`
- n2: `node-ip: <NODE_2_EXTERNAL_IP>`
- n3: `node-ip: <NODE_3_EXTERNAL_IP>`

### Issues with Current Setup

1. ❌ Inter-node pod traffic uses the **public internet** (external IPs)
2. ❌ VLAN bandwidth (100Mbps dedicated) is **unused**
3. ❌ Less secure (traffic exposed on the public network)
4. ❌ Potentially slower for inter-pod communication

### What's Working

1. ✅ All nodes joined and operational
2. ✅ Cilium CNI deployed and functional
3. ✅ Global Talos API access enabled (ports 50000, 50001)
4. ✅ GitOps with Flux operational
5. ✅ Core infrastructure recovering

## Goal: VLAN Migration

### Target Configuration

All nodes using **VLAN IPs** for `node-ip`:

- n1: `<NODE_1_IP>` (control plane)
- n2: `<NODE_2_IP>` (worker)
- n3: `<NODE_3_IP>` (worker)

### Benefits

1. ✅ 100Mbps dedicated bandwidth for inter-node traffic
2. ✅ Private network (more secure)
3. ✅ Lower latency for pod-to-pod communication
4. ✅ Production-ready architecture

## Issues Encountered During Initial Attempt

### Issue 1: API Server Endpoint Mismatch

**Problem:**

- `api.keyboardvagabond.com` resolves to n1's external IP (`<NODE_1_EXTERNAL_IP>`)
- Worker nodes with a VLAN `node-ip` couldn't reach the API server
- n3 failed to join the cluster

**Solution:**

Must choose ONE of:

- **Option A:** Set `cluster.controlPlane.endpoint: https://<NODE_1_IP>:6443` in ALL machine configs
- **Option B:** Update DNS so `api.keyboardvagabond.com` resolves to `<NODE_1_IP>` (VLAN IP)

**Recommended:** Option A (simpler, no DNS changes needed)
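
For reference, Option A is the same endpoint override shown later in Phase 1.1; a minimal sketch of the relevant fragment (values are the placeholders used throughout this document, not real addresses):

```yaml
# Sketch only - the full per-node configs live in machineconfigs/
cluster:
  controlPlane:
    endpoint: https://<NODE_1_IP>:6443   # n1's VLAN IP instead of api.keyboardvagabond.com
```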

### Issue 2: Cluster Lockout After n1 Migration

**Problem:**

- When n1's `node-ip` was switched to the VLAN address, all access was lost
- Tailscale pods couldn't start (they needed API server access)
- Cilium policies blocked external Talos API access
- Complete lockout - no `kubectl` or `talosctl` access

**Root Cause:**

- Tailscale requires the API server to be reachable from the external network
- Once n1 switched to VLAN-only, Tailscale couldn't connect
- Without Tailscale, no VPN access to the cluster

**Solution:**

- ✅ Enabled **global Talos API access** (ports 50000, 50001) in the Cilium policies
- This prevents future lockouts during network migrations
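
The actual policy lives in the Cilium manifests in Git; a minimal sketch of what such a rule can look like (assumes Cilium's host firewall is enabled; the policy name is illustrative):

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-talos-api            # illustrative name
spec:
  description: Allow apid (50000) and trustd (50001) from anywhere
  nodeSelector: {}                 # host policy applied to every node
  ingress:
    - fromEntities:
        - world
        - cluster
      toPorts:
        - ports:
            - port: "50000"        # apid (talosctl)
              protocol: TCP
            - port: "50001"        # trustd (worker joins)
              protocol: TCP
```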

### Issue 3: etcd Data Loss After Bootstrap

**Problem:**

- After multiple reboots/config changes, etcd lost its data
- The `/var/lib/etcd/member` directory was empty
- etcd was stuck waiting to join a cluster

**Solution:**

- Ran `talosctl bootstrap` to reinitialize etcd
- GitOps (Flux) automatically redeployed all workloads from Git
- Longhorn has S3 backups for persistent data recovery
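
The recovery boiled down to re-bootstrapping etcd and letting Flux reconcile; roughly (endpoints are the placeholders used above and should point at whichever address currently has Talos API access):

```bash
# Re-initialize etcd on the control-plane node
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> bootstrap

# Confirm etcd comes up, then let Flux re-apply workloads from Git
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> service etcd status
kubectl get nodes
flux get kustomizations
```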

### Issue 4: Machine Config Format Issues

**Problem:**

- `machineconfigs/n1.yaml` was in resource dump format (with a `spec: |` wrapper)
- YAML indentation errors in various config files
- SOPS encryption complications

**Solution:**

- Use `.decrypted~` files for direct manipulation
- Careful YAML indentation (list items with inline keys)
- Apply configs in maintenance mode with the `--insecure` flag
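
A sketch of the workflow those bullets describe (file names follow the `.decrypted~` convention mentioned above; the node address is a placeholder):

```bash
# Decrypt the SOPS-encrypted machine config for direct editing
sops -d machineconfigs/n1.yaml > machineconfigs/n1.yaml.decrypted~

# ...edit the decrypted file, keeping indentation intact...

# With the node booted into maintenance mode, apply without client certs
talosctl -n <NODE_1_EXTERNAL_IP> apply-config --insecure \
  --file machineconfigs/n1.yaml.decrypted~
```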

## Migration Plan: Phased VLAN Rollout

### Prerequisites

1. ✅ All nodes in a stable, working state (DONE)
2. ✅ Global Talos API access enabled (DONE)
3. ✅ GitOps with Flux operational (DONE)
4. ⏳ Verify Longhorn S3 backups are current
5. ⏳ Document current pod placement and workload state (see the snapshot sketch below)
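
One possible way to capture that pre-migration snapshot (output paths are illustrative; the Longhorn checks assume the default `longhorn-system` namespace and installed CRDs):

```bash
# Record node and pod state before changing anything
kubectl get nodes -o wide > pre-migration-nodes.txt
kubectl get pods -A -o wide > pre-migration-pods.txt

# Confirm Longhorn volumes and recent backups
kubectl -n longhorn-system get volumes.longhorn.io -o wide > pre-migration-longhorn.txt
kubectl -n longhorn-system get backups.longhorn.io | tail -n 20
```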

### Phase 1: Prepare Configurations

#### 1.1 Update Machine Configs for VLAN

For each node, update the machine config:

**n1 (control plane):**

```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```

**n2 & n3 (workers):**

```yaml
cluster:
  controlPlane:
    endpoint: https://<NODE_1_IP>:6443 # Use n1's VLAN IP

machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```

#### 1.2 Update Cilium Configuration

Verify Cilium is configured to use the VLAN interface:

```yaml
# manifests/infrastructure/cilium/release.yaml
values:
  kubeProxyReplacement: strict
  # Ensure Cilium detects and uses the VLAN interface
```
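
If Cilium does not pick up the VLAN NIC on its own, the Helm values can pin it explicitly; a hedged sketch (the `devices` value format and the `enp9s0` interface name must be verified against the chart version and hardware in use):

```yaml
values:
  kubeProxyReplacement: strict
  devices: enp9s0   # restrict the datapath to the VLAN interface (verify the name first)
```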

### Phase 2: Test with Worker Node First

#### 2.1 Migrate n3 (Worker Node)

Test the VLAN migration on a worker node first:

```bash
# Apply updated config to n3
cd /Users/<USERNAME>/src/keyboard-vagabond
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> apply-config \
  --file machineconfigs/n3-vlan.yaml

# Wait for n3 to reboot
sleep 60

# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP: <NODE_3_IP>
```

#### 2.2 Validate n3 Connectivity

```bash
# Check Cilium status (ds/cilium execs into an arbitrary Cilium pod;
# target the pod scheduled on n3 if that specific node matters)
kubectl exec -n kube-system ds/cilium -- cilium status

# Verify pod-to-pod communication (the image must include curl,
# e.g. curlimages/curl; <service-on-n3> is a placeholder)
kubectl run test-pod --image=nginx --rm -it -- curl <service-on-n3>

# Check that inter-node traffic is using the VLAN interface
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> read /proc/net/dev | grep enp9s0
```

#### 2.3 Decision Point

- ✅ If successful: Proceed to Phase 3
- ❌ If issues: Revert n3 to its external IP (see the Rollback Plan)

### Phase 3: Migrate Second Worker (n2)

Repeat the Phase 2 steps for n2:

```bash
talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> apply-config \
  --file machineconfigs/n2-vlan.yaml
```

Validate connectivity and confirm inter-node traffic is on the VLAN, exactly as in 2.2.

### Phase 4: Migrate Control Plane (n1)

**CRITICAL:** This is the most sensitive step.

#### 4.1 Prepare for Downtime

- ⚠️ **Expected downtime:** 2-5 minutes
- Inform users of the maintenance window
- Ensure the workers (n2, n3) are stable - run the pre-flight checks sketched below
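
A few pre-flight checks worth running just before touching n1 (illustrative; adjust addresses as needed):

```bash
# n2/n3 should already report VLAN INTERNAL-IPs and be Ready
kubectl get nodes -o wide

# etcd on n1 should be healthy before the config change
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> service etcd status

# Core networking and the API server should be running cleanly
kubectl -n kube-system get pods -o wide | grep -E 'cilium|kube-apiserver'
```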

#### 4.2 Apply Config to n1

```bash
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> apply-config \
  --file machineconfigs/n1-vlan.yaml
```

#### 4.3 Monitor API Server Recovery

```bash
# Watch for the API server to come back online
watch -n 2 "kubectl get nodes"

# Check etcd health (use the external IP instead if your workstation
# cannot reach the VLAN / is not on the VPN)
talosctl -e <NODE_1_IP> -n <NODE_1_IP> service etcd status

# Verify all nodes are on VLAN IPs
kubectl get nodes -o wide
```

### Phase 5: Validation & Verification

#### 5.1 Verify VLAN Traffic

```bash
# Check traffic counters on the VLAN interface (enp9s0) of each node
# (VLAN IPs are only reachable from inside the VLAN/VPN; substitute the
# external IPs if running this from an outside workstation)
for node in <NODE_1_IP> <NODE_2_IP> <NODE_3_IP>; do
  echo "=== $node ==="
  talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done
```

#### 5.2 Verify Pod Connectivity

```bash
# Deploy test pods pinned to specific nodes
kubectl run test-n1 --image=nginx --overrides='{"spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx --overrides='{"spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx --overrides='{"spec":{"nodeName":"n3"}}'

# Test cross-node communication (pod IPs come from `kubectl get pods -o wide`)
kubectl exec test-n1 -- curl <test-n2-pod-ip>
kubectl exec test-n2 -- curl <test-n3-pod-ip>
```

#### 5.3 Monitor for 24 Hours

- Watch for network issues (a simple watch loop is sketched below)
- Monitor Longhorn replication
- Check application logs
- Verify external services (Mastodon, Pixelfed, etc.)
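
One lightweight way to keep an eye on things during the soak period (interval and filters are just a suggestion):

```bash
# Every 5 minutes: node status plus any pod that is not Running/Completed
watch -n 300 'kubectl get nodes -o wide; kubectl get pods -A | grep -vE "Running|Completed"'
```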

## Rollback Plan

### If Issues Occur During Migration

#### Rollback Individual Node

```bash
# Create a rollback config with the external IP
# Apply it to the affected node
talosctl -e <node-external-ip> -n <node-external-ip> apply-config \
  --file machineconfigs/<node>-external.yaml
```
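
For reference, the rollback config mainly needs to steer `validSubnets` back to the public network (workers may also need the original `controlPlane.endpoint` restored); a sketch, where the subnet value is illustrative and must match the provider's assignment:

```yaml
# machineconfigs/<node>-external.yaml (fragment)
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - <NODE_EXTERNAL_SUBNET>   # e.g. the node's public address range
```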

#### Complete Cluster Rollback

If systemic issues occur:

1. Revert n1 first (the control plane is critical)
2. Revert n2 and n3
3. Verify all nodes are back on external IPs
4. Investigate the root cause before retrying

### Emergency Recovery (If Locked Out)

If you lose access during migration:

1. **Access via NetCup Console:**
   - Boot the node into maintenance mode via the NetCup dashboard
   - Apply the rollback config with the `--insecure` flag (see the sketch after this list)

2. **Rescue Mode (Last Resort):**
   - Boot into the NetCup rescue system
   - Mount the XFS partitions (needs `xfsprogs`)
   - Manually edit configs (complex, avoid if possible)
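
A minimal sketch of the maintenance-mode apply referenced above (node address and file name are placeholders):

```bash
# Maintenance mode accepts an unauthenticated apply-config, hence --insecure
talosctl -n <node-external-ip> apply-config --insecure \
  --file machineconfigs/<node>-external.yaml
```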

## Key Talos Configuration References

### Multihoming Configuration

According to the [Talos Multihoming Docs](https://docs.siderolabs.com/talos/v1.10/networking/multihoming):

```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Selects IP from VLAN subnet
```

### Kubelet node-ip Setting

From the [Kubernetes Kubelet Docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/):

- `--node-ip`: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
- Controls which IP the kubelet advertises to the API server (see the quick check below)
- Determines routing for pod-to-pod traffic
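
A quick way to confirm which address a node actually registered (the node name is illustrative):

```bash
# Lists the addresses (InternalIP, Hostname, ...) the kubelet reported for n1
kubectl get node n1 -o jsonpath='{.status.addresses}'; echo
```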

### Network Connectivity Requirements

Per the [Talos Network Connectivity Docs](https://docs.siderolabs.com/talos/v1.10/learn-more/talos-network-connectivity/):

**Control Plane Nodes:**

- TCP 50000: apid (used by talosctl, control plane nodes)
- TCP 50001: trustd (used by worker nodes)

**Worker Nodes:**

- TCP 50000: apid (used by control plane nodes)

## Lessons Learned

### What Went Wrong

1. **Incremental migration without proper planning** - Migrated n1 first without considering Tailscale dependencies
2. **Inadequate firewall policies** - Talos API blocked externally, causing lockout
3. **API endpoint mismatch** - DNS resolution didn't match node-ip configuration
4. **Config file format confusion** - Multiple formats caused application errors

### What Went Right

1. ✅ **Global Talos API access** - Prevents future lockouts
2. ✅ **GitOps with Flux** - Automatic workload recovery after etcd bootstrap
3. ✅ **Maintenance mode recovery** - Reliable way to regain access
4. ✅ **External IP baseline** - Stable configuration to fall back to

### Best Practices Going Forward

1. **Test on workers first** - Validate the VLAN setup before touching the control plane
2. **Document all configs** - Keep a clear record of working configurations
3. **Monitor traffic** - Use `talosctl read /proc/net/dev` to verify VLAN usage
4. **Backup etcd** - Take regular etcd snapshots to avoid data loss (example below)
5. **Plan for downtime** - Schedule maintenance windows for control plane changes
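
For item 4, Talos can snapshot etcd directly; a sketch (the destination path is illustrative):

```bash
# Take an etcd snapshot from the control plane node
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> etcd snapshot ./etcd-backup-$(date +%F).db
```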

## Success Criteria

Migration is successful when:

1. ✅ All nodes show VLAN IPs in `kubectl get nodes -o wide`
2. ✅ Inter-node traffic flows over enp9s0 (the VLAN interface)
3. ✅ All pods are healthy and communicating
4. ✅ Longhorn replication is working
5. ✅ External services (Mastodon, Pixelfed, etc.) are operational
6. ✅ No performance degradation
7. ✅ 24-hour stability test passed

## Additional Resources

- [Talos Multihoming Documentation](https://docs.siderolabs.com/talos/v1.10/networking/multihoming)
- [Talos Production Notes](https://docs.siderolabs.com/talos/v1.10/getting-started/prodnotes)
- [Kubernetes Kubelet Reference](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
- [Cilium Documentation](https://docs.cilium.io/)

## Contact & Maintenance

**Last Updated:** 2025-11-20

**Cluster:** keyboardvagabond.com

**Status:** Nodes operational on external IPs, VLAN migration pending