redaction (#1)

Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>
Merged in pull request #1 on 2025-12-24 13:40:47 +00:00, committed by michael_dileo
Parent: 612235d52b
Commit: 7327d77dcd
333 changed files with 39286 additions and 1 deletions


@@ -0,0 +1,169 @@
# Cilium Host Firewall Policy Audit Mode Testing
## Overview
This guide explains how to test Cilium host firewall policies in audit mode before applying them in enforcement mode. This prevents accidentally locking yourself out of the cluster.
## Prerequisites
- `kubectl` configured and working
- Access to the cluster (via Tailscale or direct connection)
- Cilium installed and running
## Quick Start
Run the automated test script:
```bash
./tools/test-cilium-policy-audit.sh
```
This script will:
1. Find the Cilium pod
2. Locate the host endpoint (identity 1)
3. Enable PolicyAuditMode
4. Start monitoring policy verdicts
5. Test basic connectivity
6. Show audit log entries
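For orientation, the script's core amounts to roughly the following (a sketch assembled from the manual steps below, not the script's actual contents):

```bash
#!/usr/bin/env bash
# Sketch of the audit-mode workflow; mirrors manual steps 1-5 below.
set -euo pipefail

# 1. Find the Cilium pod
CILIUM_POD=$(kubectl -n kube-system get pods -l "k8s-app=cilium" \
  -o jsonpath='{.items[0].metadata.name}')

# 2. Locate the host endpoint (always identity 1)
HOST_EP_ID=$(kubectl exec -n kube-system "${CILIUM_POD}" -- \
  cilium endpoint list -o jsonpath='{[?(@.status.identity.id==1)].id}')

# 3. Enable PolicyAuditMode on the host endpoint
kubectl exec -n kube-system "${CILIUM_POD}" -- \
  cilium endpoint config "${HOST_EP_ID}" PolicyAuditMode=Enabled

# 4. Stream policy verdicts for the host endpoint (Ctrl-C to stop)
kubectl exec -n kube-system "${CILIUM_POD}" -- \
  cilium monitor -t policy-verdict --related-to "${HOST_EP_ID}"
```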
## Manual Testing Steps
### 1. Find Cilium Pod
```bash
kubectl -n kube-system get pods -l "k8s-app=cilium"
```
### 2. Find Host Endpoint
The host endpoint has identity `1`. Find its endpoint ID:
```bash
CILIUM_POD=$(kubectl -n kube-system get pods -l "k8s-app=cilium" -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n kube-system ${CILIUM_POD} -- \
cilium endpoint list -o jsonpath='{[?(@.status.identity.id==1)].id}'
```
### 3. Enable Audit Mode
```bash
kubectl exec -n kube-system ${CILIUM_POD} -- \
cilium endpoint config <ENDPOINT_ID> PolicyAuditMode=Enabled
```
### 4. Verify Audit Mode
```bash
kubectl exec -n kube-system ${CILIUM_POD} -- \
cilium endpoint config <ENDPOINT_ID> | grep PolicyAuditMode
```
Should show: `PolicyAuditMode : Enabled`
### 5. Start Monitoring
In a separate terminal, start monitoring policy verdicts:
```bash
kubectl exec -n kube-system ${CILIUM_POD} -- \
cilium monitor -t policy-verdict --related-to <ENDPOINT_ID>
```
### 6. Test Connectivity
While monitoring, test various connections:
**Kubernetes API:**
```bash
kubectl get nodes
kubectl get pods -A
```
**Talos API (if talosctl available):**
```bash
talosctl -n <NODE_IP> time
talosctl -n <NODE_IP> version
```
**Cluster Internal:**
```bash
kubectl get services -A
```
### 7. Review Audit Log
Look for entries in the monitor output:
- `action allow` - Traffic allowed by policy
- `action audit` - Traffic would be denied but is being audited (not dropped)
- `action deny` - Traffic denied (only in enforcement mode)
### 8. Disable Audit Mode (When Ready)
Once you've verified all necessary traffic is allowed:
```bash
kubectl exec -n kube-system ${CILIUM_POD} -- \
cilium endpoint config <ENDPOINT_ID> PolicyAuditMode=Disabled
```
## Expected Results
With the current policies, you should see `action allow` for:
1. **Kubernetes API (6443)** from:
- Tailscale network (100.64.0.0/10)
- VLAN subnet (10.132.0.0/24)
- VIP (<VIP_IP>)
- External IPs (152.53.x.x)
- Cluster entities
2. **Talos API (50000, 50001)** from:
- Tailscale network
- VLAN subnet
- VIP
- External IPs
- Cluster entities
3. **Cluster Internal Traffic** from:
- Cluster entities
- Remote nodes
- Host
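For reference, a rule that produces `action allow` for API traffic from those sources might look like the sketch below. This is illustrative only: the real policies live in the repo's cluster-policies manifests, and the policy name and node selector here are assumptions.

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: host-fw-control-plane-example   # hypothetical name
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
    - fromCIDR:
        - 100.64.0.0/10   # Tailscale network
        - 10.132.0.0/24   # VLAN subnet
      toPorts:
        - ports:
            - port: "6443"
              protocol: TCP
```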
## Troubleshooting
### No Policy Verdicts Appearing
- Ensure PolicyAuditMode is enabled
- Check that policies are actually applied: `kubectl get ciliumclusterwidenetworkpolicies`
- Generate more traffic to trigger policy evaluation
### Seeing `action audit` (Would Be Denied)
This means traffic would be blocked in enforcement mode. Review your policies and add appropriate rules.
### Locked Out After Disabling Audit Mode
If you lose access after disabling audit mode:
1. Use the Hetzner Robot firewall escape hatch (if configured)
2. Or access via Tailscale network (should still work)
3. Re-enable audit mode via direct node access if needed
## Policy Verification Checklist
Before disabling audit mode, verify:
- [ ] Kubernetes API accessible from Tailscale
- [ ] Kubernetes API accessible from VLAN
- [ ] Talos API accessible from Tailscale
- [ ] Talos API accessible from VLAN
- [ ] Cluster internal communication working
- [ ] Worker nodes can reach control plane
- [ ] No unexpected `action audit` entries for critical services
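The last checklist item can be spot-checked by capturing a window of monitor output and grepping for audit verdicts (a sketch; the capture duration and `<ENDPOINT_ID>` are placeholders):

```bash
# Capture 60s of policy verdicts and flag anything that would be denied
CILIUM_POD=$(kubectl -n kube-system get pods -l "k8s-app=cilium" \
  -o jsonpath='{.items[0].metadata.name}')

timeout 60 kubectl exec -n kube-system "${CILIUM_POD}" -- \
  cilium monitor -t policy-verdict --related-to <ENDPOINT_ID> \
  | tee /tmp/verdicts.log || true

if grep -q "action audit" /tmp/verdicts.log; then
  echo "WARNING: some traffic would be denied in enforcement mode"
fi
```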
## References
- [Cilium Host Firewall Documentation](https://docs.cilium.io/en/stable/policy/language/#host-firewall)
- [Policy Audit Mode Guide](https://datavirke.dk/posts/bare-metal-kubernetes-part-2-cilium-and-firewalls/#policy-audit-mode)
- [Cilium Network Policies](https://docs.cilium.io/en/stable/policy/language/)


@@ -0,0 +1,329 @@
# Cloudflare Tunnel to Nginx Ingress Migration
## Project Overview
**Goal**: Route Cloudflare Zero Trust tunnel traffic through nginx ingress controller to enable unified request metrics collection for all fediverse applications.
**Problem**: Currently only Harbor registry shows up in nginx ingress metrics because fediverse apps (PieFed, Mastodon, Pixelfed, BookWyrm) use Cloudflare tunnels that bypass nginx ingress entirely.
**Solution**: Reconfigure Cloudflare tunnels to route traffic through nginx ingress controller instead of directly to application services.
## Current vs Target Architecture
### Current Architecture
```
Internet → Cloudflare Tunnel → Direct to App Services → Fediverse Apps (NO METRICS)
Internet → External IPs → nginx ingress → Harbor (HAS METRICS)
```
### Target Architecture
```
Internet → Cloudflare Tunnel → nginx ingress → All Applications (UNIFIED METRICS)
```
## Migration Strategy
**Approach**: Gradual rollout per application to minimize risk and allow monitoring at each stage.
**Order**: BookWyrm → Pixelfed → PieFed → Mastodon (lowest to highest traffic/criticality)
## Application Migration Checklist
### Phase 1: BookWyrm (STARTING) ⏳
- [ ] **Pre-migration checks**
- [ ] Verify BookWyrm ingress configuration
- [ ] Baseline nginx ingress resource usage
- [ ] Test nginx ingress accessibility from within cluster
- [ ] Document current Cloudflare tunnel config for BookWyrm
- [ ] **Migration execution**
  - [ ] Update Cloudflare tunnel: `bookwyrm.keyboardvagabond.com` → `http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80`
- [ ] Test BookWyrm accessibility immediately after change
- [ ] Verify nginx metrics show BookWyrm requests
- [ ] **Post-migration monitoring (24-48 hours)**
- [ ] Monitor nginx ingress pod CPU/memory usage
- [ ] Check BookWyrm response times and error rates
- [ ] Verify BookWyrm appears in nginx metrics with expected traffic
- [ ] Confirm no nginx ingress errors in logs
### Phase 2: Pixelfed (PENDING) 📋
- [ ] **Pre-migration checks**
- [ ] Verify lessons learned from BookWyrm migration
- [ ] Check nginx resource usage after BookWyrm
- [ ] Baseline Pixelfed performance metrics
- [ ] **Migration execution**
- [ ] Update Cloudflare tunnel: `pixelfed.keyboardvagabond.com` → nginx ingress
- [ ] Test and monitor as per BookWyrm process
- [ ] **Post-migration monitoring**
- [ ] Monitor combined BookWyrm + Pixelfed traffic impact
### Phase 3: PieFed (PENDING) 📋
- [ ] **Pre-migration checks**
- [ ] PieFed has heaviest ActivityPub federation traffic
- [ ] Ensure nginx can handle federation bursts
- [ ] Review PieFed rate limiting configuration
- [ ] **Migration execution**
- [ ] Update Cloudflare tunnel: `piefed.keyboardvagabond.com` → nginx ingress
- [ ] Monitor federation traffic patterns closely
- [ ] **Post-migration monitoring**
- [ ] Watch for ActivityPub federation performance impact
- [ ] Verify rate limiting still works effectively
### Phase 4: Mastodon (PENDING) 📋
- [ ] **Pre-migration checks**
- [ ] Most critical application - proceed with extra caution
- [ ] Verify all previous migrations stable
- [ ] Review Mastodon streaming service impact
- [ ] **Migration execution**
- [ ] Update Cloudflare tunnel: `mastodon.keyboardvagabond.com` → nginx ingress
- [ ] Update streaming tunnel: `streamingmastodon.keyboardvagabond.com` → nginx ingress
- [ ] **Post-migration monitoring**
- [ ] Monitor Mastodon federation and streaming performance
- [ ] Verify WebSocket connections work correctly
## Current Configuration
### Nginx Ingress Service
```bash
# Main ingress controller service (internal)
kubectl get svc ingress-nginx-controller -n ingress-nginx
# ClusterIP: 10.101.136.40, Port: 80
# Public service (external IPs for Harbor)
kubectl get svc ingress-nginx-public -n ingress-nginx
# LoadBalancer: 10.107.187.45, ExternalIPs: <NODE_1_EXTERNAL_IP>,<NODE_2_EXTERNAL_IP>
```
### Current Cloudflare Tunnel Routes (TO BE CHANGED)
```
bookwyrm.keyboardvagabond.com → http://bookwyrm-web.bookwyrm-application.svc.cluster.local:80
pixelfed.keyboardvagabond.com → http://pixelfed-web.pixelfed-application.svc.cluster.local:80
piefed.keyboardvagabond.com → http://piefed-web.piefed-application.svc.cluster.local:80
mastodon.keyboardvagabond.com → http://mastodon-web.mastodon-application.svc.cluster.local:3000
streamingmastodon.keyboardvagabond.com → http://mastodon-streaming.mastodon-application.svc.cluster.local:4000
```
### Target Cloudflare Tunnel Routes
```
bookwyrm.keyboardvagabond.com → http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
pixelfed.keyboardvagabond.com → http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
piefed.keyboardvagabond.com → http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
mastodon.keyboardvagabond.com → http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
streamingmastodon.keyboardvagabond.com → http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
```
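If the tunnel were managed with a local `cloudflared` config file rather than the Zero Trust dashboard, the target routes would correspond to an ingress section along these lines (a sketch; the tunnel ID and credentials path are placeholders):

```yaml
# cloudflared config.yml (illustrative)
tunnel: <TUNNEL_ID>
credentials-file: /etc/cloudflared/<TUNNEL_ID>.json
ingress:
  - hostname: bookwyrm.keyboardvagabond.com
    service: http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
  - hostname: pixelfed.keyboardvagabond.com
    service: http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
  - hostname: piefed.keyboardvagabond.com
    service: http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
  - hostname: mastodon.keyboardvagabond.com
    service: http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
  - hostname: streamingmastodon.keyboardvagabond.com
    service: http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
  - service: http_status:404   # catch-all rule required by cloudflared
```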
## Monitoring Commands
### Pre-Migration Baseline
```bash
# Check nginx ingress resource usage
kubectl top pods -n ingress-nginx
# Check current request metrics (should only show Harbor)
# Your existing query:
# (sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (host) / sum(rate(nginx_ingress_controller_requests[5m])) by (host)) * 100
# Monitor nginx ingress logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50
```
### Post-Migration Verification
```bash
# Verify nginx metrics include new application
# Run your metrics query - should now show BookWyrm traffic
# Check nginx ingress is handling traffic
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=20 | grep bookwyrm
# Monitor resource impact
kubectl top pods -n ingress-nginx
```
## Rollback Procedures
### Quick Rollback (Per Application)
1. **Immediate**: Revert Cloudflare tunnel configuration in Zero Trust dashboard
2. **Verify**: Test application accessibility
3. **Monitor**: Confirm traffic flows correctly
### Full Rollback (All Applications)
1. Revert all Cloudflare tunnel configurations to direct service routing
2. Verify all applications accessible
3. Confirm metrics collection returns to Harbor-only state
## Risk Mitigation
### Resource Monitoring
- **nginx Pod Resources**: Watch CPU/memory usage after each migration
- **Response Times**: Monitor application response times for degradation
- **Error Rates**: Check for increased 5xx errors in nginx logs
### Traffic Impact Assessment
- **Federation Traffic**: Especially important for PieFed and Mastodon
- **Rate Limiting**: Verify existing rate limits still function correctly
- **WebSocket Connections**: Critical for Mastodon streaming
## Success Criteria
**Migration Complete When**:
- All fediverse applications route through nginx ingress
- Unified metrics show traffic for all applications
- No performance degradation observed
- All rate limiting and security policies functional
- nginx ingress resource usage within acceptable limits
## Notes & Lessons Learned
### Phase 1 (BookWyrm) - Status: COMPLETE ✅
**Pre-Migration Checks (2025-08-25)**:
- ✅ **BookWyrm Ingress**: Correctly configured with host `bookwyrm.keyboardvagabond.com`, nginx class, proper CORS settings
- ✅ **BookWyrm Service**: `bookwyrm-web.bookwyrm-application.svc.cluster.local:80` accessible (ClusterIP: 10.96.26.11)
- ✅ **Nginx Baseline Resources**:
- n1 (625nz): 9m CPU, 174Mi memory
- n2 (br8rg): 4m CPU, 169Mi memory
- n3 (rkddn): 14m CPU, 159Mi memory
- ✅ **Nginx Accessibility Test**: Successfully accessed BookWyrm through nginx ingress with correct Host header
- Response: HTTP 200, BookWyrm page served correctly
- CORS headers applied properly
- No nginx routing issues
**Current Cloudflare Tunnel Config**:
```
bookwyrm.keyboardvagabond.com → http://bookwyrm-web.bookwyrm-application.svc.cluster.local:80
```
**Ready for Migration**: All pre-checks passed. Nginx ingress can successfully route BookWyrm traffic.
**Migration Executed (2025-08-25 16:06 UTC)**: ✅ SUCCESS
- **Cloudflare Tunnel Updated**: `bookwyrm.keyboardvagabond.com` → `http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80`
- **Immediate Verification**: BookWyrm web UI accessible, no downtime
- **nginx Logs Confirmation**: BookWyrm traffic flowing through nginx ingress:
```
136.41.98.74 - "GET / HTTP/1.1" 200 [bookwyrm-application-bookwyrm-web-80]
143.110.147.80 - "POST /inbox HTTP/1.1" 200 [bookwyrm-application-bookwyrm-web-80]
```
- **Resource Impact**: Minimal increase in nginx CPU (9-15m cores), memory stable (~170Mi)
- **Next**: Monitor for 24-48 hours, verify metrics collection
**METRICS VERIFICATION**: ✅ SUCCESS!
- **BookWyrm now appears in nginx metrics query**: `bookwyrm.keyboardvagabond.com` visible alongside `<YOUR_REGISTRY_URL>`
- **Unified metrics collection achieved**: Both Harbor and BookWyrm traffic now measured through nginx ingress
- **Phase 1 COMPLETE**: Ready to monitor for stability before Phase 2
### Phase 2 (Pixelfed) - Status: COMPLETE ✅
**Lessons Learned from BookWyrm**:
- Migration process works flawlessly
- nginx ingress handles additional load without issues
- Metrics integration successful
- Zero downtime achieved
**Pre-Migration Checks (2025-08-25)**: ✅ COMPLETE
- ✅ **Pixelfed Ingress**: Correctly configured with host `pixelfed.keyboardvagabond.com`, nginx class, 20MB upload limit, rate limiting
- ✅ **Pixelfed Service**: `pixelfed-web.pixelfed-application.svc.cluster.local:80` accessible (ClusterIP: 10.97.130.244)
- ✅ **nginx Post-BookWyrm Resources**: Stable performance after BookWyrm migration
- n1 (625nz): 8m CPU, 173Mi memory
- n2 (br8rg): 10m CPU, 169Mi memory
- n3 (rkddn): 11m CPU, 159Mi memory
- ✅ **nginx Accessibility Test**: Successfully accessed Pixelfed through nginx ingress with correct Host header
- Response: HTTP 200, Pixelfed Laravel application served correctly
- Proper session cookies and security headers
- No nginx routing issues
**Current Cloudflare Tunnel Config**:
```
pixelfed.keyboardvagabond.com → http://pixelfed-web.pixelfed-application.svc.cluster.local:80
```
**Ready for Migration**: All pre-checks passed. nginx ingress can successfully route Pixelfed traffic.
**Migration Executed (2025-08-25 16:19 UTC)**: ✅ SUCCESS
- **Cloudflare Tunnel Updated**: `pixelfed.keyboardvagabond.com` → `http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80`
- **Immediate Verification**: Pixelfed web UI accessible, no downtime
- **nginx Logs Confirmation**: Pixelfed traffic flowing through nginx ingress:
```
136.41.98.74 - "HEAD / HTTP/1.1" 200 [pixelfed-application-pixelfed-web-80]
136.41.98.74 - "GET / HTTP/1.1" 302 [pixelfed-application-pixelfed-web-80]
136.41.98.74 - "GET /sw.js HTTP/1.1" 200 [pixelfed-application-pixelfed-web-80]
```
- **Resource Impact**: Stable nginx performance (3-10m CPU cores), memory unchanged
- **Multi-App Success**: Both BookWyrm AND Pixelfed now routing through nginx ingress
- **Metrics Fix**: Updated query to include 3xx redirects as success (`status=~"[23].."`)
- **PHASE 2 COMPLETE**: Pixelfed metrics now showing correctly in unified dashboard
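With the 3xx-as-success fix noted above, the unified per-host success-rate query becomes:

```promql
(sum(rate(nginx_ingress_controller_requests{status=~"[23].."}[5m])) by (host)
  / sum(rate(nginx_ingress_controller_requests[5m])) by (host)) * 100
```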
### Phase 3 (PieFed) - Status: COMPLETE ✅
**Lessons Learned from BookWyrm + Pixelfed**:
- Migration process consistently successful across different app types
- nginx ingress handles additional load without issues
- Metrics integration working with proper 2xx+3xx success criteria
- Zero downtime achieved for both migrations
- Traffic patterns clearly visible in nginx logs
**Pre-Migration Checks (2025-08-25)**: ✅ COMPLETE
- ✅ **PieFed Ingress**: Correctly configured with host `piefed.keyboardvagabond.com`, nginx class, 20MB upload limit, rate limiting (100/min)
- ✅ **PieFed Service**: `piefed-web.piefed-application.svc.cluster.local:80` accessible (ClusterIP: 10.104.62.239)
- ✅ **nginx Post-2-Apps Resources**: Stable performance after BookWyrm + Pixelfed migrations
- n1 (625nz): 10m CPU, 173Mi memory
- n2 (br8rg): 16m CPU, 169Mi memory
- n3 (rkddn): 3m CPU, 161Mi memory
- ✅ **nginx Accessibility Test**: Successfully accessed PieFed through nginx ingress with correct Host header
- Response: HTTP 200, PieFed application served correctly (343KB response)
- Proper security headers and CSP policies
- Flask session handling working correctly
- ✅ **Federation Traffic Assessment**: **HEAVY** ActivityPub load confirmed
- **58 federation requests** in last 30 Cloudflare tunnel logs
- Constant ActivityPub `/inbox` POST requests from multiple Lemmy instances
- Sources: lemmy.dbzer0.com, lemmy.world, and others
- This will significantly increase nginx ingress load
**Current Cloudflare Tunnel Config**:
```
piefed.keyboardvagabond.com → http://piefed-web.piefed-application.svc.cluster.local:80
```
**Ready for Migration**: All pre-checks passed. ⚠️ **CAUTION**: PieFed has the heaviest federation traffic - monitor nginx closely during/after migration.
**Migration Executed (2025-08-25 17:26 UTC)**: ✅ SUCCESS
- **Cloudflare Tunnel Updated**: `piefed.keyboardvagabond.com` → `http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80`
- **Immediate Verification**: PieFed web UI accessible, no downtime
- **nginx Logs Confirmation**: **HEAVY** federation traffic flowing through nginx ingress:
```
135.181.143.221 - "POST /inbox HTTP/1.1" 200 [piefed-application-piefed-web-80]
135.181.143.221 - "POST /inbox HTTP/1.1" 200 [piefed-application-piefed-web-80]
Multiple ActivityPub federation requests per second from lemmy.world
```
- **Resource Impact**: nginx ingress handling heavy load excellently
- CPU: 9-17m cores (slight increase, well within limits)
- Memory: 160-174Mi (stable)
- Response times: 0.045-0.066s (excellent performance)
- **Load Balancing**: Traffic properly distributed across multiple PieFed pods
- **Federation Success**: All ActivityPub requests returning HTTP 200
- **PHASE 3 COMPLETE**: PieFed successfully migrated with heaviest traffic load
### Phase 4 (Mastodon) - Status: COMPLETE ✅
**Migration Executed (2025-08-25 17:36 UTC)**: ✅ SUCCESS
- **Issue Encountered**: Complex nginx rate limiting configuration caused host header validation failures
- **Root Cause**: `server-snippet` and `configuration-snippet` annotations interfered with proper request routing
- **Solution**: Simplified ingress configuration by removing complex rate limiting annotations
- **Fix Process**:
1. Suspended Flux applications to prevent config reversion
2. Deleted and recreated ingress resources to clear nginx cache
3. Applied clean ingress configuration
- **Cloudflare Tunnel Updated**: Both Mastodon routes to nginx ingress:
- `mastodon.keyboardvagabond.com` → `http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80`
- `streamingmastodon.keyboardvagabond.com` → `http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80`
- **Immediate Verification**: Mastodon web UI accessible, HTTP 200 responses
- **nginx Logs Confirmation**: Mastodon traffic flowing through nginx ingress:
```
136.41.98.74 - "HEAD / HTTP/1.1" 200 [mastodon-application-mastodon-web-3000]
```
- **Performance**: Fast response times (0.100s), all security headers working correctly
- **🎉 MIGRATION COMPLETE**: All 4 fediverse applications successfully migrated to unified nginx ingress routing!
---
**Created**: 2025-08-25
**Last Updated**: 2025-08-25
**Status**: Migration complete (all 4 phases finished 2025-08-25)

docs/NODE-ADDITION-GUIDE.md

@@ -0,0 +1,174 @@
# Adding a New Node for Nginx Ingress Metrics Collection
This guide documents the steps required to add a new node to the cluster and ensure nginx ingress controller metrics are properly collected from it.
## Overview
The nginx ingress controller is deployed as a **DaemonSet**, which means it automatically deploys one pod per node. However, for metrics collection to work properly, additional configuration steps are required.
## Current Configuration
Currently, the cluster has 3 nodes with metrics collection configured for:
- **n1 (<NODE_1_EXTERNAL_IP>)**: Control plane + worker
- **n2 (<NODE_2_EXTERNAL_IP>)**: Worker
- **n3 (<NODE_3_EXTERNAL_IP>)**: Worker
## Steps to Add a New Node
### 1. Add the Node to Kubernetes Cluster
Follow your standard node addition process (this is outside the scope of this guide). Ensure the new node:
- Is properly joined to the cluster
- Has the nginx ingress controller pod deployed (should happen automatically due to DaemonSet)
- Is accessible on the cluster network
### 2. Verify Nginx Ingress Controller Deployment
Check that the nginx ingress controller pod is running on the new node:
```bash
kubectl get pods -n ingress-nginx -o wide
```
Look for a pod on your new node. The nginx ingress controller should automatically deploy due to the DaemonSet configuration.
### 3. Update OpenTelemetry Collector Configuration
**File to modify**: `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
**Current configuration** (lines 217-219):
```yaml
- job_name: 'nginx-ingress'
  static_configs:
    - targets: ['<NODE_1_EXTERNAL_IP>:10254', '<NODE_2_EXTERNAL_IP>:10254', '<NODE_3_EXTERNAL_IP>:10254']
```
**Add the new node IP** to the targets list:
```yaml
- job_name: 'nginx-ingress'
  static_configs:
    - targets: ['<NODE_1_EXTERNAL_IP>:10254', '<NODE_2_EXTERNAL_IP>:10254', '<NODE_3_EXTERNAL_IP>:10254', 'NEW_NODE_IP:10254']
```
Replace `NEW_NODE_IP` with the actual IP address of your new node.
### 4. Update Host Firewall Policies (if applicable)
**File to check**: `manifests/infrastructure/cluster-policies/host-fw-worker-nodes.yaml`
Ensure the firewall allows nginx metrics port access (should already be configured):
```yaml
# NGINX Ingress Controller metrics port
- fromEntities:
    - cluster
  toPorts:
    - ports:
        - port: "10254"
          protocol: "TCP" # NGINX Ingress metrics
```
### 5. Apply the Configuration Changes
```bash
# Apply the updated collector configuration
kubectl apply -f manifests/infrastructure/openobserve-collector/gateway-collector.yaml
# Restart the collector to pick up the new configuration
kubectl rollout restart statefulset/openobserve-collector-gateway-collector -n openobserve-collector
```
### 6. Verification Steps
1. **Check that the nginx pod is running on the new node**:
```bash
kubectl get pods -n ingress-nginx -o wide | grep NEW_NODE_NAME
```
2. **Verify metrics endpoint is accessible**:
```bash
curl -s http://NEW_NODE_IP:10254/metrics | grep nginx_ingress_controller_requests | head -3
```
3. **Check collector logs for the new target**:
```bash
kubectl logs -n openobserve-collector openobserve-collector-gateway-collector-0 --tail=50 | grep -i nginx
```
4. **Verify target discovery**:
Look for log entries like:
```
Scrape job added {"jobName": "nginx-ingress"}
```
5. **Test metrics in OpenObserve**:
Your dashboard query should now include metrics from the new node:
```promql
sum(increase(nginx_ingress_controller_requests[5m])) by (host)
```
## Important Notes
### Automatic vs Manual Configuration
- ✅ **Automatic**: Nginx ingress controller deployment (DaemonSet handles this)
- ✅ **Automatic**: ServiceMonitor discovery (target allocator handles this)
- ❌ **Manual**: Static scrape configuration (requires updating the targets list)
### Why Both ServiceMonitor and Static Config?
The current setup uses **both approaches** for redundancy:
1. **ServiceMonitor**: Automatically discovers nginx ingress services
2. **Static Configuration**: Ensures specific node IPs are always monitored
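For reference, the ServiceMonitor half of that redundancy looks roughly like the sketch below. The actual name, labels, and port name in the repo may differ; in particular, this assumes the nginx metrics service names its 10254 port `metrics`.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx-metrics   # hypothetical name
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  endpoints:
    - port: metrics   # assumed port name for 10254
      interval: 30s
```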
### Network Requirements
- Port **10254** must be accessible from the OpenTelemetry collector pods
- The new node should be on the same network as existing nodes
- Host firewall policies should allow metrics collection
### Monitoring Best Practices
- Always verify metrics are flowing after adding a node
- Test your dashboard queries to ensure the new node's metrics appear
- Monitor collector logs for any scraping errors
## Troubleshooting
### Common Issues
1. **Nginx pod not starting**: Check node labels and taints
2. **Metrics endpoint not accessible**: Verify network connectivity and firewall rules
3. **Collector not scraping**: Check collector logs and restart if needed
4. **Missing metrics in dashboard**: Wait 30-60 seconds for metrics to propagate
### Useful Commands
```bash
# Check nginx ingress pods
kubectl get pods -n ingress-nginx -o wide
# Test metrics endpoint
curl -s http://NODE_IP:10254/metrics | grep nginx_ingress_controller_requests
# Check collector status
kubectl get pods -n openobserve-collector
# View collector logs
kubectl logs -n openobserve-collector openobserve-collector-gateway-collector-0 --tail=50
# Check ServiceMonitor
kubectl get servicemonitor -n ingress-nginx -o yaml
```
## Configuration Files Summary
Files that may need updates when adding a node:
1. **Required**: `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
- Update static targets list (line ~219)
2. **Optional**: `manifests/infrastructure/cluster-policies/host-fw-worker-nodes.yaml`
- Usually already configured for port 10254
3. **Automatic**: `manifests/infrastructure/ingress-nginx/ingress-nginx.yaml`
- No changes needed (DaemonSet handles deployment)


@@ -0,0 +1,39 @@
# Signing up a user with the Authentik workflow
Copy the link from the `community-signup-invitation` entry on the Invitations page and send it to the user.
This will allow the user to create an account and go through email verification. From there, they can sign in to Write Freely.
## Email Template
The community signup email uses a professionally designed welcome template located at:
- **Template File**: `docs/email-templates/community-signup.html`
- **Documentation**: `docs/email-templates/README.md`
The email template includes:
- Keyboard Vagabond branding with horizontal logo
- Welcome message for digital nomads and remote workers
- Account activation button with `{AUTHENTIK_URL}` placeholder
- Overview of all available fediverse services
- Contact information and support links
## Setup Instructions
1. **Access Authentik Dashboard**: Navigate to your Authentik admin interface
2. **Create Invitation Flow**: Go to Flows → Invitations
3. **Upload Template**: Use the HTML template from `docs/email-templates/community-signup.html`
4. **Configure Settings**: Set up email delivery and SMTP credentials
5. **Test Flow**: Send test invitation to verify template rendering
## Services Accessible After Signup
Once users complete the Authentik signup process, they gain access to:
- **Write Freely**: `https://blog.keyboardvagabond.com`
Signup is handled within the applications themselves at:
- **Mastodon**: `https://mastodon.keyboardvagabond.com`
- **Pixelfed**: `https://pixelfed.keyboardvagabond.com`
- **BookWyrm**: `https://bookwyrm.keyboardvagabond.com`
- **Piefed**: `https://piefed.keyboardvagabond.com`
Manual account creation must be done for:
- **Picsur**: `https://picsur.keyboardvagabond.com`
Finally, send the community-signup email template to the new user.


@@ -0,0 +1,352 @@
# VLAN Node-IP Migration Plan
## Document Purpose
This document outlines the plan to migrate Kubernetes node-to-node communication from external IPs to the private VLAN (10.132.0.0/24) for improved performance and security.
## Current State (2025-11-20)
### Cluster Status
- **n1** (control plane): `<NODE_1_EXTERNAL_IP>` - Ready ✅
- **n2** (worker): `<NODE_2_EXTERNAL_IP>` - Ready ✅
- **n3** (worker): `<NODE_3_EXTERNAL_IP>` - Ready ✅
### Current Configuration
All nodes are using **external IPs** for `node-ip`:
- n1: `node-ip: <NODE_1_EXTERNAL_IP>`
- n2: `node-ip: <NODE_2_EXTERNAL_IP>`
- n3: `node-ip: <NODE_3_EXTERNAL_IP>`
### Issues with Current Setup
1. ❌ Inter-node pod traffic uses **public internet** (external IPs)
2. ❌ VLAN bandwidth (100Mbps dedicated) is **unused**
3. ❌ Less secure (traffic exposed on public network)
4. ❌ Potentially slower for inter-pod communication
### What's Working
1. ✅ All nodes joined and operational
2. ✅ Cilium CNI deployed and functional
3. ✅ Global Talos API access enabled (ports 50000, 50001)
4. ✅ GitOps with Flux operational
5. ✅ Core infrastructure recovering
## Goal: VLAN Migration
### Target Configuration
All nodes using **VLAN IPs** for `node-ip`:
- n1: `<NODE_1_IP>` (control plane)
- n2: `<NODE_2_IP>` (worker)
- n3: `<NODE_3_IP>` (worker)
### Benefits
1. ✅ 100Mbps dedicated bandwidth for inter-node traffic
2. ✅ Private network (more secure)
3. ✅ Lower latency for pod-to-pod communication
4. ✅ Production-ready architecture
## Issues Encountered During Initial Attempt
### Issue 1: API Server Endpoint Mismatch
**Problem:**
- `api.keyboardvagabond.com` resolves to n1's external IP (`<NODE_1_EXTERNAL_IP>`)
- Worker nodes with VLAN node-ip couldn't reach API server
- n3 failed to join cluster
**Solution:**
Must choose ONE of:
- **Option A:** Set `cluster.controlPlane.endpoint: https://<NODE_1_IP>:6443` in ALL machine configs
- **Option B:** Update DNS so `api.keyboardvagabond.com` resolves to `<NODE_1_IP>` (VLAN IP)
**Recommended:** Option A (simpler, no DNS changes needed)
### Issue 2: Cluster Lockout After n1 Migration
**Problem:**
- When n1 was changed to a VLAN node-ip, all access was lost
- Tailscale pods couldn't start (needed API server access)
- Cilium policies blocked external Talos API access
- Complete lockout - no `kubectl` or `talosctl` access
**Root Cause:**
- Tailscale requires API server to be reachable from external network
- Once n1 switched to VLAN-only, Tailscale couldn't connect
- Without Tailscale, no VPN access to cluster
**Solution:**
- ✅ Enabled **global Talos API access** (ports 50000, 50001) in Cilium policies
- This prevents future lockouts during network migrations
### Issue 3: etcd Data Loss After Bootstrap
**Problem:**
- After multiple reboots/config changes, etcd lost its data
- `/var/lib/etcd/member` directory was empty
- etcd stuck waiting to join cluster
**Solution:**
- Ran `talosctl bootstrap` to reinitialize etcd
- GitOps (Flux) automatically redeployed all workloads from Git
- Longhorn has S3 backups for persistent data recovery
### Issue 4: Machine Config Format Issues
**Problem:**
- `machineconfigs/n1.yaml` was in resource dump format (with `spec: |` wrapper)
- YAML indentation errors in various config files
- SOPS encryption complications
**Solution:**
- Use `.decrypted~` files for direct manipulation
- Careful YAML indentation (list items with inline keys)
- Apply configs in maintenance mode with `--insecure` flag
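The last two fixes combine into an apply command along these lines (a sketch; the node IP and file name are placeholders):

```bash
# Apply a plaintext (.decrypted~) config to a node booted into maintenance mode.
# --insecure is required because maintenance mode has no client certificates yet.
talosctl apply-config --insecure \
  -n <NODE_IP> \
  --file machineconfigs/n1.yaml.decrypted~
```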
## Migration Plan: Phased VLAN Rollout
### Prerequisites
1. ✅ All nodes in stable, working state (DONE)
2. ✅ Global Talos API access enabled (DONE)
3. ✅ GitOps with Flux operational (DONE)
4. ⏳ Verify Longhorn S3 backups are current
5. ⏳ Document current pod placement and workload state
### Phase 1: Prepare Configurations
#### 1.1 Update Machine Configs for VLAN
For each node, update the machine config:
**n1 (control plane):**
```yaml
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```
**n2 & n3 (workers):**
```yaml
cluster:
  controlPlane:
    endpoint: https://<NODE_1_IP>:6443 # Use n1's VLAN IP
machine:
  kubelet:
    nodeIP:
      validSubnets:
        - 10.132.0.0/24 # Force VLAN IP selection
```
#### 1.2 Update Cilium Configuration
Verify Cilium is configured to use VLAN interface:
```yaml
# manifests/infrastructure/cilium/release.yaml
values:
kubeProxyReplacement: strict
# Ensure Cilium detects and uses VLAN interface
```
### Phase 2: Test with Worker Node First
#### 2.1 Migrate n3 (Worker Node)
Test VLAN migration on a worker node first:
```bash
# Apply updated config to n3
cd /Users/<USERNAME>/src/keyboard-vagabond
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> apply-config \
--file machineconfigs/n3-vlan.yaml
# Wait for n3 to reboot
sleep 60
# Verify n3 joined with VLAN IP
kubectl get nodes -o wide
# Should show: n3 INTERNAL-IP: <NODE_3_IP>
```
#### 2.2 Validate n3 Connectivity
```bash
# Check Cilium status on n3
kubectl exec -n kube-system ds/cilium -- cilium status
# Verify pod-to-pod communication (curlimages/curl is used since the nginx image may not include curl)
kubectl run test-pod --image=curlimages/curl --rm -it --restart=Never -- curl <service-on-n3>
# Check inter-node traffic is using VLAN
talosctl -e <NODE_3_EXTERNAL_IP> -n <NODE_3_EXTERNAL_IP> read /proc/net/dev | grep enp9s0
```
#### 2.3 Decision Point
- ✅ If successful: Proceed to Phase 3
- ❌ If issues: Revert n3 to external IP (rollback plan)
### Phase 3: Migrate Second Worker (n2)
Repeat Phase 2 steps for n2:
```bash
talosctl -e <NODE_2_EXTERNAL_IP> -n <NODE_2_EXTERNAL_IP> apply-config \
--file machineconfigs/n2-vlan.yaml
```
Validate connectivity and inter-node traffic on VLAN.
### Phase 4: Migrate Control Plane (n1)
**CRITICAL:** This is the most sensitive step.
#### 4.1 Prepare for Downtime
- ⚠️ **Expected downtime:** 2-5 minutes
- Inform users of maintenance window
- Ensure workers (n2, n3) are stable
#### 4.2 Apply Config to n1
```bash
talosctl -e <NODE_1_EXTERNAL_IP> -n <NODE_1_EXTERNAL_IP> apply-config \
--file machineconfigs/n1-vlan.yaml
```
#### 4.3 Monitor API Server Recovery
```bash
# Watch for API server to come back online
watch -n 2 "kubectl get nodes"
# Check etcd health
talosctl -e <NODE_1_IP> -n <NODE_1_IP> service etcd
# Verify all nodes on VLAN
kubectl get nodes -o wide
```
### Phase 5: Validation & Verification
#### 5.1 Verify VLAN Traffic
```bash
# Check network traffic on VLAN interface (enp9s0)
for node in <NODE_1_IP> <NODE_2_IP> <NODE_3_IP>; do
echo "=== $node ==="
talosctl -e $node -n $node read /proc/net/dev | grep enp9s0
done
```
#### 5.2 Verify Pod Connectivity
```bash
# Deploy test pods across nodes (the override needs apiVersion to be valid for kubectl)
kubectl run test-n1 --image=nginx --overrides='{"apiVersion":"v1","spec":{"nodeName":"n1"}}'
kubectl run test-n2 --image=nginx --overrides='{"apiVersion":"v1","spec":{"nodeName":"n2"}}'
kubectl run test-n3 --image=nginx --overrides='{"apiVersion":"v1","spec":{"nodeName":"n3"}}'
# Test cross-node communication from a throwaway pod that has curl available
kubectl run curl-test --image=curlimages/curl --rm -it --restart=Never -- \
  sh -c 'curl -s <test-n2-pod-ip> && curl -s <test-n3-pod-ip>'
```
#### 5.3 Monitor for 24 Hours
- Watch for network issues
- Monitor Longhorn replication
- Check application logs
- Verify external services (Mastodon, Pixelfed, etc.)
## Rollback Plan
### If Issues Occur During Migration
#### Rollback Individual Node
```bash
# Create rollback config with external IP
# Apply to affected node
talosctl -e <node-external-ip> -n <node-external-ip> apply-config \
--file machineconfigs/<node>-external.yaml
```
#### Complete Cluster Rollback
If systemic issues occur:
1. Revert n1 first (control plane is critical)
2. Revert n2 and n3
3. Verify all nodes back on external IPs
4. Investigate root cause before retry
### Emergency Recovery (If Locked Out)
If you lose access during migration:
1. **Access via NetCup Console:**
- Boot node into maintenance mode via NetCup dashboard
- Apply rollback config with `--insecure` flag
2. **Rescue Mode (Last Resort):**
- Boot into NetCup rescue system
- Mount XFS partitions (need `xfsprogs`)
- Manually edit configs (complex, avoid if possible)
## Key Talos Configuration References
### Multihoming Configuration
According to [Talos Multihoming Docs](https://docs.siderolabs.com/talos/v1.10/networking/multihoming):
```yaml
machine:
kubelet:
nodeIP:
validSubnets:
- 10.132.0.0/24 # Selects IP from VLAN subnet
```
### Kubelet node-ip Setting
From [Kubernetes Kubelet Docs](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/):
- `--node-ip`: IP address of the node (can be comma-separated for IPv4/IPv6 dual-stack)
- Controls which IP kubelet advertises to API server
- Determines routing for pod-to-pod traffic
### Network Connectivity Requirements
Per [Talos Network Connectivity Docs](https://docs.siderolabs.com/talos/v1.10/learn-more/talos-network-connectivity/):
**Control Plane Nodes:**
- TCP 50000: apid (used by talosctl, control plane nodes)
- TCP 50001: trustd (used by worker nodes)
**Worker Nodes:**
- TCP 50000: apid (used by control plane nodes)
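Reachability of these ports can be spot-checked from an admin machine before and after each phase. A sketch using bash's built-in `/dev/tcp`, so no extra tools are needed; replace the placeholders with real node IPs:

```bash
# Verify Talos API reachability on every node
for node in <NODE_1_IP> <NODE_2_IP> <NODE_3_IP>; do
  for port in 50000 50001; do
    if timeout 3 bash -c "</dev/tcp/${node}/${port}" 2>/dev/null; then
      echo "${node}:${port} open"
    else
      echo "${node}:${port} closed/filtered"
    fi
  done
done
```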
## Lessons Learned
### What Went Wrong
1. **Incremental migration without proper planning** - Migrated n1 first without considering Tailscale dependencies
2. **Inadequate firewall policies** - Talos API blocked externally, causing lockout
3. **API endpoint mismatch** - DNS resolution didn't match node-ip configuration
4. **Config file format confusion** - Multiple formats caused application errors
### What Went Right
1. ✅ **Global Talos API access** - Prevents future lockouts
2. ✅ **GitOps with Flux** - Automatic workload recovery after etcd bootstrap
3. ✅ **Maintenance mode recovery** - Reliable way to regain access
4. ✅ **External IP baseline** - Stable configuration to fall back to
### Best Practices Going Forward
1. **Test on workers first** - Validate VLAN setup before touching control plane
2. **Document all configs** - Keep clear record of working configurations
3. **Monitor traffic** - Use `talosctl read /proc/net/dev` to verify VLAN usage
4. **Backup etcd** - Regular etcd backups to avoid data loss
5. **Plan for downtime** - Maintenance windows for control plane changes
## Success Criteria
Migration is successful when:
1. ✅ All nodes showing VLAN IPs in `kubectl get nodes -o wide`
2. ✅ Inter-node traffic flowing over enp9s0 (VLAN interface)
3. ✅ All pods healthy and communicating
4. ✅ Longhorn replication working
5. ✅ External services (Mastodon, Pixelfed, etc.) operational
6. ✅ No performance degradation
7. ✅ 24-hour stability test passed
## Additional Resources
- [Talos Multihoming Documentation](https://docs.siderolabs.com/talos/v1.10/networking/multihoming)
- [Talos Production Notes](https://docs.siderolabs.com/talos/v1.10/getting-started/prodnotes)
- [Kubernetes Kubelet Reference](https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/)
- [Cilium Documentation](https://docs.cilium.io/)
## Contact & Maintenance
**Last Updated:** 2025-11-20
**Cluster:** keyboardvagabond.com
**Status:** Nodes operational on external IPs, VLAN migration pending

---

<!-- docs/ZeroTrustMigration.md -->
# Migrating from External DNS to CF Zero Trust
Now that the CF domain is set up, it's time to move other apps and services to using it, then to potentially seal off
as much of the Talos and k8s ports as I can.
## Zero-Downtime Migration Process
### Step 1: Discover Service Configuration
```bash
# Find service name and port
kubectl get svc -n <namespace>
# Example output: service-name ClusterIP 10.x.x.x <none> 9898/TCP
```
### Step 2: Create Tunnel Route (FIRST!)
1. Go to **Cloudflare Zero Trust Dashboard** → **Networks** → **Tunnels**
2. Find your tunnel, click **Configure**
3. Add **Public Hostname**:
- **Subdomain**: `app`
- **Domain**: `keyboardvagabond.com`
- **Service**: `http://service-name.namespace.svc.cluster.local:port`
4. **Test** the tunnel URL works before proceeding!
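Step 4 can be verified from the command line before touching the app config (the hostname is illustrative):

```bash
# Expect an HTTP response served via Cloudflare (look for a cf-ray header)
curl -sI https://app.keyboardvagabond.com | grep -iE '^(HTTP|cf-ray|server)'
```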
### Step 3: Update Application Configuration
Clear external-DNS annotations and TLS configuration:
```yaml
# In Helm values or ingress manifest:
ingress:
annotations: {} # Explicitly empty - removes cert-manager and external-dns
tls: [] # Explicitly empty array - no certificates needed
```
### Step 4: Deploy Changes
```bash
# For Helm apps via Flux:
flux reconcile helmrelease <app-name> -n <namespace>
# For direct manifests:
kubectl apply -f <manifest-file>
```
### Step 5: Clean Up Certificates
```bash
# Delete certificate resources
kubectl delete certificate <cert-name> -n <namespace>
# Find and delete TLS secrets
kubectl get secrets -n <namespace> | grep tls
kubectl delete secret <tls-secret-name> -n <namespace>
```
### Step 6: Verify Clean State
```bash
# Check no new certificates are being created
kubectl get certificate,secret -n <namespace> | grep <app-name>
# Should only show Helm release secrets, no certificate or TLS secrets
```
### Step 7: DNS Record Management
**How it works:**
- **Tunnel automatically creates**: CNAME record → `tunnel-id.cfargotunnel.com`
- **External-DNS created**: A records → your cluster IPs
- **DNS Priority**: CNAME takes precedence over A records
**Cleanup options:**
```bash
# Option 1: Auto-cleanup (recommended) - wait 5 minutes after removing annotations
# External-DNS will automatically delete A records after TTL expires
# Option 2: Manual cleanup (immediate)
# Go to Cloudflare DNS dashboard and manually delete A records
# Keep the CNAME record (created by tunnel)
```
**Verification:**
```bash
# Check DNS resolution shows CNAME (not A records)
dig podinfo.keyboardvagabond.com
# Should show:
# podinfo.keyboardvagabond.com. CNAME tunnel-id.cfargotunnel.com.
```
## Rollback Plan
If tunnel doesn't work:
1. **Revert** Helm values/manifests (add back annotations and TLS)
2. **Redeploy**: `flux reconcile` or `kubectl apply`
3. **Wait** for cert-manager to recreate certificates
## Benefits After Migration
- ✅ **No exposed public IPs** - cluster nodes not directly accessible
- ✅ **Automatic DDoS protection** via Cloudflare
- ✅ **Centralized SSL management** - Cloudflare handles certificates
- ✅ **Better observability** - Cloudflare analytics and logs
**It should work!** 🚀 (And now we have a plan if it doesn't!)
## Advanced: Securing Administrative Access
### Securing Kubernetes & Talos APIs
Once application migration is complete, you can secure administrative access:
#### Option 1: TCP Proxy (Simpler)
```yaml
# Cloudflare Zero Trust → Tunnels → Configure
Public Hostname:
Subdomain: api
Domain: keyboardvagabond.com
Service: tcp://localhost:6443 # Kubernetes API
Public Hostname:
Subdomain: talos
Domain: keyboardvagabond.com
Service: tcp://<NODE_1_IP>:50000 # Talos API
```
**Client configuration:**
```bash
# Update kubectl config
kubectl config set-cluster keyboardvagabond \
--server=https://api.keyboardvagabond.com:443 # Note: 443, not 6443
# Update talosctl config
talosctl config endpoint talos.keyboardvagabond.com:443
```
#### Option 2: Private Network via WARP (Most Secure)
**Step 1: Configure Private Network**
```yaml
# Cloudflare Zero Trust → Tunnels → Configure → Private Networks
Private Network:
CIDR: 10.132.0.0/24 # Your NetCup vLAN network
Description: "Keyboard Vagabond Cluster Internal Network"
```
**Step 2: Configure Split Tunnels**
```yaml
# Zero Trust → Settings → WARP Client → Device settings → Split Tunnels
Mode: Exclude (recommended)
Remove: 10.0.0.0/8 # Remove broad private range from the exclude list
Add back:
  - 10.0.0.0/9 # 10.0.0.0 - 10.127.255.255
  - 10.133.0.0/16 # 10.133.0.0 - 10.133.255.255
  - 10.134.0.0/15 # 10.134.0.0 - 10.135.255.255
# This routes 10.132.0.0/24 through WARP. Note: neighboring ranges
# (e.g. 10.128.0.0/14, 10.136.0.0/13) also remain routed through WARP;
# add them back to the exclude list for a strict single-subnet scope.
```
**Step 3: Client Configuration**
```bash
# Install WARP client on admin machines
# macOS: brew install --cask cloudflare-warp
# Connect to Zero Trust organization
warp-cli registration new
# Configure kubectl to use internal IPs
kubectl config set-cluster keyboardvagabond \
--server=https://<NODE_1_IP>:6443 # Direct to internal node IP
# Configure talosctl to use internal IPs
talosctl config endpoint <NODE_1_IP>:50000,<NODE_2_IP>:50000
```
**Step 4: Access Policies (Recommended)**
```yaml
# Zero Trust → Access → Applications → Add application
Application Type: Private Network
Name: "Kubernetes Cluster Admin Access"
Application Domain: 10.132.0.0/24
Policies:
- Name: "Admin Team Only"
Action: Allow
Rules:
- Email domain: @yourdomain.com
- Device Posture: Managed device required
```
**Step 5: Device Enrollment**
```bash
# On admin device
# 1. Install WARP: https://1.1.1.1/
# 2. Login with Zero Trust organization
# 3. Verify private network access:
ping <NODE_1_IP> # Should work through WARP
# 4. Test API access
kubectl get nodes # Should connect to internal cluster
talosctl version # Should connect to internal Talos API
```
**Step 6: Lock Down External Access**
Once WARP is working, update Talos machine configs to block external access:
```yaml
# In machineconfigs/n1.yaml and n2.yaml (Talos 1.6+ ingress firewall;
# verify the NetworkRuleConfig schema against your Talos version's docs)
apiVersion: v1alpha1
kind: NetworkDefaultActionConfig
ingress: block
---
apiVersion: v1alpha1
kind: NetworkRuleConfig
name: kubernetes-api
portSelector:
  ports:
    - 6443
  protocol: tcp
ingress:
  - subnet: 10.132.0.0/24 # Allow the API only from the internal VLAN
```
#### WARP Benefits:
- ✅ **No public DNS entries** - Admin endpoints not discoverable
- ✅ **Device control** - Only managed devices can access cluster
- ✅ **Zero-trust policies** - Granular access control per user/device
- ✅ **Audit logs** - Full visibility into who accessed what when
- ✅ **Device posture** - Require encryption, OS updates, etc.
- ✅ **Split tunneling** - Only cluster traffic goes through tunnel
- ✅ **Automatic failover** - Multiple WARP data centers
## Testing WARP Implementation
### Before WARP (Current State)
```bash
# Current kubectl configuration
kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
# Output: https://api.keyboardvagabond.com:6443
# This goes through internet → external IPs
kubectl get nodes
```
### After WARP Setup
```bash
# 1. Test private network connectivity first
ping <NODE_1_IP> # Should work once WARP is connected
# 2. Create backup kubectl context
kubectl config set-context keyboardvagabond-external \
--cluster=keyboardvagabond.com \
--user=admin@keyboardvagabond.com
# 3. Update main context to use internal IP
kubectl config set-cluster keyboardvagabond.com \
--server=https://<NODE_1_IP>:6443
# 4. Test internal access
kubectl get nodes # Should work through WARP → private network
# 5. Verify traffic path
# WARP status should show "Connected" in system tray
warp-cli status # Should show connected to your Zero Trust org
```
### Rollback Plan
```bash
# If WARP doesn't work, quickly restore external access:
kubectl config set-cluster keyboardvagabond.com \
--server=https://api.keyboardvagabond.com:6443
# Test external access still works
kubectl get nodes
```
## Next Steps After WARP
Once WARP is proven working:
1. **Configure Talos firewall** to block external access to ports 6443 and 50000
2. **Remove public API DNS entry** (api.keyboardvagabond.com)
3. **Document emergency access procedure** (temporary firewall rule + external DNS)
4. **Set up additional WARP devices** for other administrators
This gives you a **zero-trust administrative access model** where cluster APIs are completely invisible from the internet! 🔒

---
# OpenObserve Dashboard PromQL Queries
This document provides PromQL queries for rebuilding OpenObserve dashboards after disaster recovery. The queries are organized by metric type and application.
## Metric Sources
Your cluster has multiple metric sources:
1. **OpenTelemetry spanmetrics** - Generates metrics from traces (`calls_total`, `latency`)
2. **Ingress-nginx** - HTTP request metrics at the ingress layer
3. **Application metrics** - Direct metrics from applications (Mastodon, BookWyrm, etc.)
## Applications
- **Mastodon** (`mastodon-application`)
- **Pixelfed** (`pixelfed-application`)
- **PieFed** (`piefed-application`)
- **BookWyrm** (`bookwyrm-application`)
- **Picsur** (`picsur`)
- **Write Freely** (`write-freely`)
---
## 1. Requests Per Second (RPS) by Application
### Using Ingress-Nginx Metrics (Recommended - Most Reliable)
```promql
# Total RPS by application (via ingress)
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
# RPS by application and status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)
# RPS by application and HTTP method
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, method)
# RPS for specific applications
sum(rate(nginx_ingress_controller_requests{namespace=~"mastodon-application|pixelfed-application|piefed-application|bookwyrm-application"}[5m])) by (ingress, namespace)
```
### Using OpenTelemetry spanmetrics
```promql
# RPS from spanmetrics (if service names are properly labeled)
sum(rate(calls_total[5m])) by (service_name)
# RPS by application namespace (if k8s attributes are present)
sum(rate(calls_total[5m])) by (k8s.namespace.name, service_name)
# RPS by application and HTTP method
sum(rate(calls_total[5m])) by (service_name, http.method)
# RPS by application and status code
sum(rate(calls_total[5m])) by (service_name, http.status_code)
```
### Combined View (All Applications)
```promql
# All applications RPS
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
```
---
## 2. Request Duration by Application
### Using Ingress-Nginx Metrics
```promql
# Average request duration by application
sum(rate(nginx_ingress_controller_request_duration_seconds_sum[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress, namespace)
# P50 (median) request duration
histogram_quantile(0.50,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
# P95 request duration
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
# P99 request duration
histogram_quantile(0.99,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
# P99.9 request duration (for tail latency)
histogram_quantile(0.999,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
# Max request duration: the metric is a histogram, so no true per-request max
# is exposed; approximate tail latency with a high quantile such as P99.9
```
### Using OpenTelemetry spanmetrics
```promql
# Average latency from spanmetrics
sum(rate(latency_sum[5m])) by (service_name)
/
sum(rate(latency_count[5m])) by (service_name)
# P50 latency
histogram_quantile(0.50,
sum(rate(latency_bucket[5m])) by (service_name, le)
)
# P95 latency
histogram_quantile(0.95,
sum(rate(latency_bucket[5m])) by (service_name, le)
)
# P99 latency
histogram_quantile(0.99,
sum(rate(latency_bucket[5m])) by (service_name, le)
)
# Latency by HTTP method
histogram_quantile(0.95,
sum(rate(latency_bucket[5m])) by (service_name, http.method, le)
)
```
### Response Duration (Backend Processing Time)
```promql
# Average backend response duration
sum(rate(nginx_ingress_controller_response_duration_seconds_sum[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_response_duration_seconds_count[5m])) by (ingress, namespace)
# P95 backend response duration
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```
---
## 3. Success Rate by Application
### Using Ingress-Nginx Metrics
```promql
# Success rate (2xx / total requests) by application
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
# Success rate as percentage
(
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100
# Error rate (4xx + 5xx) by application
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
# Error rate as percentage
(
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100
# Breakdown by status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)
# 5xx errors specifically
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace)
```
### Using OpenTelemetry spanmetrics
```promql
# Success rate from spanmetrics
sum(rate(calls_total{http.status_code=~"2.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)
# Error rate from spanmetrics
sum(rate(calls_total{http.status_code=~"4..|5.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)
# Breakdown by status code
sum(rate(calls_total[5m])) by (service_name, http.status_code)
```
---
## 4. Additional Best Practice Metrics
### Request Volume Trends
```promql
# Requests per minute (for trend analysis)
sum(rate(nginx_ingress_controller_requests[1m])) by (namespace) * 60
# Total requests in last hour
sum(increase(nginx_ingress_controller_requests[1h])) by (namespace)
```
### Top Endpoints
```promql
# Top endpoints by request volume
topk(10, sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, path))
# Top slowest endpoints (P95)
topk(10,
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, path, le)
)
)
```
### Error Analysis
```promql
# 4xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"4.."}[5m])) by (ingress, namespace, status)
# 5xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace, status)
# Error rate trend (detect spikes)
rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])
```
### Throughput Metrics
```promql
# Bytes sent per second
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)
# Bytes received per second
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)
# Total bandwidth usage
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)
+
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)
```
### Connection Metrics
```promql
# Active connections
sum(nginx_ingress_controller_connections) by (ingress, namespace, state)
# Connection rate
sum(rate(nginx_ingress_controller_connections[5m])) by (ingress, namespace, state)
```
### Application-Specific Metrics
#### Mastodon
```promql
# Mastodon-specific metrics (if exposed)
sum(rate(mastodon_http_requests_total[5m])) by (method, status)
sum(rate(mastodon_http_request_duration_seconds[5m])) by (method)
```
#### BookWyrm
```promql
# BookWyrm-specific metrics (if exposed)
sum(rate(bookwyrm_requests_total[5m])) by (method, status)
```
### Database Connection Metrics (PostgreSQL)
```promql
# Active database connections by application
pg_application_connections{state="active"}
# Total connections by application
sum(pg_application_connections) by (app_name)
# Connection pool utilization
sum(pg_application_connections) by (app_name) / 100 # Adjust divisor based on max connections
```
### Celery Queue Metrics
```promql
# Queue length by application
sum(celery_queue_length{queue_name!="_total"}) by (database)
# Queue processing rate
sum(rate(celery_queue_length{queue_name!="_total"}[5m])) by (database) * -60
# Stalled queues (no change in 15 minutes)
changes(celery_queue_length{queue_name="_total"}[15m]) == 0
and celery_queue_length{queue_name="_total"} > 100
```
#### Redis-Backed Queue Dashboard Panels
Use these two panel queries to rebuild the Redis/Celery queue dashboard after a wipe. Both panels assume metrics are flowing from the `celery-metrics-exporter` in the `celery-monitoring` namespace.
- **Queue Depth per Queue (stacked area or line)**
```promql
sum by (database, queue_name) (
celery_queue_length{
queue_name!~"_total|_staging",
database=~"piefed|bookwyrm|mastodon"
}
)
```
This shows the absolute number of pending items in every discovered queue. Filter the `database` regex if you only want a single app. Switch the panel legend to `{{database}}/{{queue_name}}` so per-queue trends stand out.
- **Processing Rate per Queue (tasks/minute)**
```promql
-60 * sum by (database, queue_name) (
rate(
celery_queue_length{
queue_name!~"_total|_staging",
database=~"piefed|bookwyrm|mastodon"
}[5m]
)
)
```
The queue length decreases when workers drain tasks, so multiply the `rate()` by `-60` to turn that negative slope into a positive “tasks per minute processed” number. Values that stay near zero for a busy queue are a red flag that workers are stuck.
> **Fallback**: If the custom exporter is down, you can build the same dashboards off the upstream Redis exporter metric `redis_list_length{alias="redis-ha",key=~"celery|.*_priority|high|low"}`. Replace `celery_queue_length` with `redis_list_length` in both queries and keep the rest of the panel configuration identical.
An import-ready OpenObserve dashboard that contains these two panels lives at `docs/dashboards/openobserve-redis-queue-dashboard.json`. Import it via *Dashboards → Import* to jump-start the rebuild after a disaster recovery.
### Redis Metrics
```promql
# Redis connection status
redis_connection_status
# Redis memory usage (if available)
redis_memory_used_bytes
```
### Pod/Container Metrics
```promql
# CPU usage by application
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
# Memory usage by application
sum(container_memory_working_set_bytes) by (namespace, pod)
# Pod restarts
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod)
```
---
## 5. Dashboard Panel Recommendations
### Panel 1: Overview
- **Total RPS** (all applications)
- **Total Error Rate** (all applications)
- **Average Response Time** (P95, all applications)
### Panel 2: Per-Application RPS
- Time series graph showing RPS for each application
- Use `sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)`
### Panel 3: Per-Application Latency
- P50, P95, P99 latency for each application
- Use histogram quantiles from ingress-nginx metrics
### Panel 4: Success/Error Rates
- Success rate (2xx) by application
- Error rate (4xx + 5xx) by application
- Status code breakdown
### Panel 5: Top Endpoints
- Top 10 endpoints by volume
- Top 10 slowest endpoints
### Panel 6: Database Health
- Active connections by application
- Connection pool utilization
### Panel 7: Queue Health (Celery)
- Queue lengths by application
- Processing rates
### Panel 8: Resource Usage
- CPU usage by application
- Memory usage by application
- Pod restart counts
---
## 6. Alerting Queries
### High Error Rate
```promql
# Alert if error rate > 5% for any application
(
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) > 0.05
```
### High Latency
```promql
# Alert if P95 latency > 2 seconds
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (namespace, le)
) > 2
```
### Low Success Rate
```promql
# Alert if success rate < 95%
(
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) < 0.95
```
### High Request Volume (Spike Detection)
```promql
# Alert if RPS increases by 3x in 5 minutes
rate(nginx_ingress_controller_requests[5m])
>
3 * rate(nginx_ingress_controller_requests[5m] offset 5m)
```
---
## 7. Notes on Metric Naming
- **Ingress-nginx metrics** are the most reliable for HTTP request metrics
- **spanmetrics** may have different label names depending on k8s attribute processor configuration
- Check actual metric names in OpenObserve using: `{__name__=~".*request.*|.*http.*|.*latency.*"}`
- Service names from spanmetrics may need to be mapped to application names
## 8. Troubleshooting
If metrics don't appear:
1. **Check ServiceMonitors are active:**
```bash
kubectl get servicemonitors -A
```
2. **Verify Prometheus receiver is scraping:**
Check OpenTelemetry collector logs for scraping errors
3. **Verify metric names:**
Query OpenObserve for available metrics:
```promql
{__name__=~".*"}
```
4. **Check label names:**
The actual label names may vary. Common variations:
- `namespace` vs `k8s.namespace.name`
- `service_name` vs `service.name`
- `ingress` vs `ingress_name`
---
## Quick Reference: Application Namespaces
- Mastodon: `mastodon-application`
- Pixelfed: `pixelfed-application`
- PieFed: `piefed-application`
- BookWyrm: `bookwyrm-application`
- Picsur: `picsur`
- Write Freely: `write-freely`

---

<!-- docs/theme-digest.md -->
# Keyboard Vagabond
A collection of fediverse applications for the nomad and travel niche given as a donation for a better internet.
The applications are Mastodon (Twitter), Pixelfed (Instagram), PieFed / Lemmy (Reddit), Write Freely (blogging), Bookwyrm (book reviews), Matrix (chat / slack), (some wiki, possibly).
Right now I'm still setting up these services, so it's not ready for launch. I do want to include a general landing page at some point with basic information about the site and fediverse.
I'll likely handle that myself, as it should be a basic static website of 2-3 pages with the ability to sign in.
I would like to create a mascot and background banners with a common theme. The base websites tend to choose an animal as a theme, so I think a similar, cute animal for a mascot that's themed for each site would be fun. The current apps use Lemmings and a Mastodon, so I'm thinking a similar animal that would work for travel and adventure.
## The Fediverse
The fediverse is the online world of federated services that all speak the same protocol and can interact with each other, like email.
There is no corporation in charge, just servers that talk with each other by people, for people. Like email, there are different servers or "instances" that you can sign up with.
Unlike regular social media, users on different applications can interact with each other, so someone can make a post on Mastodon and mention a community on Lemmy, to which they can reply.
This video is a great explanation of the Fediverse: https://videos.elenarossini.com/w/64VuNCccZNrP4u9MfgbhkN.
## The Feeling
I'd like to have a more fun feeling that leans toward adventurous while avoiding feeling too serious, though the topics may also be serious.
I could use help picking tones and palettes, the visual style, and the direction for the animal mascot.
## The Goal of Keyboard Vagabond
To create a welcoming space in the fediverse for people to share and connect with the niche of travel, but without the corporate manipulations that come with sites like Reddit and X.
Here is the latest about page for the keyboard mastodon instance: https://mastodon.keyboardvagabond.com/about.
Here are some other reference sites from bigger instances:
* The About: https://mastodon.social/about, Main Page: https://mastodon.world/explore
* https://pixelfed.social (click About and Explore)
* https://piefed.social
* https://bookwyrm.social
* My personal blog: https://blog.<DOMAIN> for Write Freely
These services generally support custom mascot icons and background banners. Theming and custom CSS have varying degrees of support; since I have full access to the servers, I could override the built-in CSS, though that would likely be an endeavor, and I'm not sure it would be worth the effort.
I think one of the more fun things would be to have a mascot character themed for the different applications, maybe something like "with a camera" for Pixelfed, or a book for bookwyrm.
## Main Goals:
- Have a mascot with variations for the site. The fediverse apps often favor some kind of animal. Lemmy uses a Lemming, Mastodon a Mastodon. Some similar kind of animal would be fun.
- A background banner, themed for each website.
- An icon for the "no profile picture" default
This would likely result in something that looks like:
* Mastodon - mascot icon, mascot "empty image", background banner
* PieFed - mascot icon, mascot "empty image", background banner
* Pixelfed - mascot icon, mascot "empty image", background banner
* Write Freely - Limited customization, but an icon with either the WriteFreely "W" or something like a pen should be something I could work in
* Bookwyrm - I haven't even looked at this app yet, I just like the idea, but a mascot with glasses or book
## What we may need to work out
- The mascot character (fun and adventurous feeling)
- Pallettes and tones. Customization across the apps may be limited, so the colors might mainly apply to just the banner and icons.
- How to get the theme and feel to create a fun character/theme.
**Bonus**
- 404 (not found) and 500 (Server Error) page assets. I'm only just thinking of this, but it's low priority.
What may be in the final deliverable:
- 1 main mascot design (base character)
- 5 mascot variations (themed for each app)
- 3-4 background banners (adapted for different apps)
- 3-5 default profile images total (one for the main apps of Mastodon, Pixelfed, and Piefed)
- 1 main logo/wordmark for Keyboard Vagabond
- (possibly something for the landing website)
Ideal formats would be SVG, PNG, JPG. I can handle resizing and all that fun stuff.
Sizes likely used:
- Favicon: 32x32, 16x16
- App icons: 512x512, 256x256, 128x128
- Profile defaults: 200x200, 400x400
- Background banners: 1500x500, 1920x600