add source code and readme
58
.cursor/rules/00-project-overview.mdc
Normal file
@@ -0,0 +1,58 @@
---
description: Keyboard Vagabond project overview and core infrastructure context
globs: []
alwaysApply: true
---

# Keyboard Vagabond - Project Overview

## System Overview
This is a **Talos-based Kubernetes cluster** designed to host **fediverse applications** for <200 MAU (Monthly Active Users):
- **Mastodon** (Twitter-like microblogging) ✅ OPERATIONAL
- **Pixelfed** (Instagram-like photo sharing) ✅ OPERATIONAL
- **PieFed** (Reddit-like forum) ✅ OPERATIONAL
- **BookWyrm** (Social reading platform) ✅ OPERATIONAL
- **Matrix** (Chat/messaging) - Future deployment

## Architecture Summary ✅ OPERATIONAL
- **Three ARM64 Nodes**: n1, n2, n3 (all control plane nodes with VIP 10.132.0.5)
- **Zero Trust Security**: Cloudflare tunnels + Tailscale mesh VPN
- **Storage**: Longhorn distributed with S3 backup to Backblaze B2
- **Database**: PostgreSQL HA cluster with CloudNativePG operator
- **Cache**: Redis HA cluster with HAProxy (redis-ha-haproxy.redis-system.svc.cluster.local)
- **Monitoring**: OpenTelemetry + OpenObserve (O2)
- **Registry**: Harbor container registry
- **CDN**: Per-application Cloudflare CDN with dedicated S3 buckets

## Project Structure
```
keyboard-vagabond/
├── .cursor/rules/              # Cursor rules (this directory)
├── docs/                       # Operational documentation and guides
├── manifests/                  # Kubernetes manifests
│   ├── infrastructure/         # Core infrastructure components
│   ├── applications/           # Fediverse applications
│   └── cluster/flux-system/    # GitOps configuration
├── build/                      # Custom container builds
├── machineconfigs/             # Talos node configurations
└── tools/                      # Development utilities
```

## Rule Organization
The `.cursor/rules/` directory contains specialized rules:
- **00-project-overview.mdc** (this file): Always applied project context
- **infrastructure.mdc**: Auto-attached when working in `manifests/infrastructure/`
- **applications.mdc**: Auto-attached when working in `manifests/applications/`
- **security.mdc**: SOPS and Zero Trust patterns (auto-attached for YAML files)
- **development.mdc**: Development patterns and operational guidelines
- **troubleshooting-history.mdc**: Historical issues, migrations, and lessons learned
- **templates/**: Common configuration templates (*.yaml files)

## Key Operational Facts
- **Domain**: `keyboardvagabond.com`
- **API Endpoint**: `api.keyboardvagabond.com:6443` (Tailscale-only access)
- **Control Plane VIP**: `10.132.0.5:6443` (nodes elect primary, VIP provides HA)
- **Zero Trust**: All external services via Cloudflare tunnels (no port exposure)
- **Network**: NetCup Cloud vLAN 1004963 (10.132.0.0/24)
- **Security**: Enterprise-grade with SOPS encryption, mesh VPN, host firewall
- **Status**: Fully operational, production-ready cluster
||||
124
.cursor/rules/applications.mdc
Normal file
@@ -0,0 +1,124 @@
---
description: Fediverse applications deployment patterns and configurations
globs: ["manifests/applications/**/*", "build/**/*"]
alwaysApply: false
---

# Fediverse Applications ✅ OPERATIONAL

## Application Overview
All applications use **Zero Trust architecture** via Cloudflare tunnels with dedicated S3 buckets for media storage:

### Currently Deployed Applications
- **Mastodon**: `https://mastodon.keyboardvagabond.com` - Microblogging platform ✅ OPERATIONAL
- **Pixelfed**: `https://pixelfed.keyboardvagabond.com` - Photo sharing platform ✅ OPERATIONAL
- **PieFed**: `https://piefed.keyboardvagabond.com` - Forum/Reddit-like platform ✅ OPERATIONAL
- **BookWyrm**: `https://bookwyrm.keyboardvagabond.com` - Social reading platform ✅ OPERATIONAL
- **Picsur**: `https://picsur.keyboardvagabond.com` - Image storage ✅ OPERATIONAL

## Application Architecture Patterns

### Multi-Container Design
Most fediverse applications use **multi-container architecture**:
- **Web Container**: HTTP requests, API, web UI (Nginx + app server)
- **Worker Container**: Background jobs, federation, media processing
- **Beat Container**: (Django apps only) Celery Beat scheduler for periodic tasks

### Storage Strategy ✅ OPERATIONAL
**Per-Application CDN Strategy**: Each application uses a dedicated Backblaze B2 bucket with Cloudflare CDN:
- **Pixelfed CDN**: `pm.keyboardvagabond.com` → `pixelfed-bucket`
- **PieFed CDN**: `pfm.keyboardvagabond.com` → `piefed-bucket`
- **Mastodon CDN**: `mm.keyboardvagabond.com` → `mastodon-bucket`
- **BookWyrm CDN**: `bm.keyboardvagabond.com` → `bookwyrm-bucket`
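
The one-hostname-per-bucket mapping above can be sketched as a small lookup table. This is a minimal illustration, not code from the repository: the `public_media_url` helper is hypothetical, and it assumes an object key maps directly onto the URL path of the CDN hostname, which depends on how the Cloudflare and B2 side is configured.

```python
# CDN hostname and bucket per application, copied from the list above.
CDN_BUCKETS = {
    "pixelfed": ("pm.keyboardvagabond.com", "pixelfed-bucket"),
    "piefed": ("pfm.keyboardvagabond.com", "piefed-bucket"),
    "mastodon": ("mm.keyboardvagabond.com", "mastodon-bucket"),
    "bookwyrm": ("bm.keyboardvagabond.com", "bookwyrm-bucket"),
}


def public_media_url(app: str, object_key: str) -> str:
    """Build the public URL for an object (hypothetical helper; path mapping is an assumption)."""
    host, _bucket = CDN_BUCKETS[app]
    return f"https://{host}/{object_key.lstrip('/')}"


# Every application has its own bucket, so object keys cannot collide across apps.
distinct_buckets = {bucket for _host, bucket in CDN_BUCKETS.values()}
```

Because each application owns its bucket, no shared key prefix is needed to keep media apart.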

### Database Integration
All applications use the shared **PostgreSQL HA cluster**:
- **Connection**: `postgresql-shared-rw.postgresql-system.svc.cluster.local:5432`
- **Dedicated Databases**: Each app has its own database (e.g., `mastodon`, `pixelfed`, `piefed`, `bookwyrm`)
- **High Availability**: 3-instance cluster with automatic failover
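
As a rough illustration of the per-application database pattern, the sketch below assembles a connection URL against the shared read-write service. The `database_url` helper and the example credentials are hypothetical; only the host and port come from the list above. Percent-encoding the user and password keeps characters such as `@` or `/` from breaking the URL.

```python
from urllib.parse import quote

# Shared read-write endpoint as listed above.
PG_HOST = "postgresql-shared-rw.postgresql-system.svc.cluster.local"
PG_PORT = 5432


def database_url(database: str, user: str, password: str) -> str:
    """Return a PostgreSQL URL for one application's dedicated database (hypothetical helper)."""
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"postgresql://{safe_user}:{safe_password}@{PG_HOST}:{PG_PORT}/{database}"


# Placeholder credentials for illustration only; real values belong in SOPS-encrypted secrets.
example = database_url("mastodon", "mastodon", "p@ss/word")
```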

## Framework-Specific Patterns

### Laravel Applications (Pixelfed)
```yaml
# Critical Laravel S3 Configuration
FILESYSTEM_DRIVER=s3
PF_ENABLE_CLOUD=true
FILESYSTEM_CLOUD=s3
AWS_BUCKET=pixelfed-bucket # Dedicated bucket approach
AWS_URL=https://pm.keyboardvagabond.com/ # CDN URL
```

### Flask Applications (PieFed)
```yaml
# Flask Configuration with Redis and S3
FLASK_APP=pyfedi.py
DATABASE_URL=
CACHE_REDIS_URL=
S3_BUCKET=
S3_PUBLIC_URL=https://pfm.keyboardvagabond.com
```

### Django Applications (BookWyrm)
```yaml
# Django S3 Configuration
USE_S3=true
AWS_STORAGE_BUCKET_NAME=bookwyrm-bucket
AWS_S3_CUSTOM_DOMAIN=bm.keyboardvagabond.com
AWS_DEFAULT_ACL="" # Backblaze B2 doesn't support ACLs
```

### Ruby Applications (Mastodon)
```yaml
# Mastodon Dual Ingress Pattern
# Web: mastodon.keyboardvagabond.com
# Streaming: streamingmastodon.keyboardvagabond.com (WebSocket)
STREAMING_API_BASE_URL: wss://streamingmastodon.keyboardvagabond.com
```

## Container Build Patterns

### Multi-Stage Docker Strategy ✅ WORKING
Optimized builds reduce image size by ~75%:
- **Base Image**: Shared foundation with dependencies and source code
- **Web Container**: Production web server configuration
- **Worker Container**: Background processing optimizations
- **Size Reduction**: From 1.3GB single-stage to ~350MB multi-stage

### Harbor Registry Integration
- **Registry**: `<YOUR_REGISTRY_URL>`
- **Image Pattern**: `<YOUR_REGISTRY_URL>/library/app-name:tag`
- **Build Process**: `./build-all.sh` in project root

## ActivityPub Inbox Rate Limiting ✅ OPERATIONAL

### Nginx Burst Configuration Pattern
Implemented across all fediverse applications to handle federation traffic spikes:
```nginx
# Rate limiting zone: 100 MB of shared memory for per-client state, 10 requests/second per client address
limit_req_zone $binary_remote_addr zone=inbox:100m rate=10r/s;

# ActivityPub inbox location block
location /inbox {
    limit_req zone=inbox burst=300; # Queue up to 300 excess requests
    # Extended timeouts for ActivityPub processing
}
```

### Rate Limiting Behavior
- **Normal Operation**: Up to 10 requests/second per client address are processed without queuing
- **Burst Handling**: Up to 300 additional requests are queued and released at the configured rate
- **Overflow Response**: HTTP 503 (the nginx default rejection status) once the 300-request queue is full
- **Federation Impact**: Protects the backend from overwhelming traffic spikes
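
The behavior listed above can be approximated with a small leaky-bucket model. This is a simplified sketch for reasoning about the numbers, not nginx's implementation (nginx tracks excess in integer milli-requests per key); the class and method names are hypothetical, and it models a single client key with `rate=10r/s`, `burst=300`, and no `nodelay`.

```python
class InboxLimiter:
    """Simplified single-client model of `limit_req zone=inbox burst=300` at 10 r/s."""

    def __init__(self, rate: float = 10.0, burst: int = 300) -> None:
        self.rate = rate      # requests drained from the queue per second
        self.burst = burst    # maximum number of queued (excess) requests
        self.excess = None    # queue depth after the last accepted request
        self.last = 0.0       # arrival time of the last accepted request

    def check(self, now: float) -> tuple[bool, float]:
        """Return (accepted, delay_seconds) for a request arriving at time `now`."""
        if self.excess is None:
            excess = 0.0      # first request seen for this client
        else:
            drained = (now - self.last) * self.rate
            excess = max(0.0, self.excess - drained + 1.0)
        if excess > self.burst:
            return False, 0.0  # queue full: nginx would answer 503 here
        self.excess, self.last = excess, now
        return True, excess / self.rate


# A spike of 400 requests arriving at the same instant:
limiter = InboxLimiter()
results = [limiter.check(0.0) for _ in range(400)]
accepted = sum(1 for ok, _delay in results if ok)   # 1 immediate + 300 queued = 301
rejected = len(results) - accepted                  # the remaining 99 get 503
```

In this model the last queued request waits 30 seconds (300 queued requests drained at 10 per second), which is worth keeping in mind when setting upstream timeouts for the inbox location.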

## Application Deployment Standards
- **Zero Trust Ingress**: All applications use Cloudflare tunnel pattern
- **Container Registry**: Harbor for all custom images
- **Multi-Stage Builds**: Required for Python/Node.js applications
- **Storage**: Longhorn with 2-replica redundancy
- **Monitoring**: ServiceMonitor integration with OpenObserve
- **Rate Limiting**: ActivityPub inbox protection for all fediverse apps

@fediverse-app-template.yaml
@s3-storage-config-template.yaml
@activitypub-rate-limiting-template.yaml
|
||||
140
.cursor/rules/development.mdc
Normal file
@@ -0,0 +1,140 @@
---
description: Development patterns, operational guidelines, and troubleshooting
globs: ["build/**/*", "tools/**/*", "justfile", "*.md"]
alwaysApply: false
---

# Development Patterns & Operational Guidelines

## Configuration Management
- **Kustomize**: Used for resource composition and patching via `patches/` directory
- **Helm**: Complex applications deployed via HelmRelease CRDs
- **GitOps**: All applications deployed via Flux from Git repository (`k8s-fleet` branch)
- **Staging**: Use separate branches/overlays for staging vs production environments

## Application Deployment Standards
- **Container Registry**: Use Harbor (`<YOUR_REGISTRY_URL>`) for all custom images
- **Multi-Stage Builds**: Implement for Python/Node.js applications to reduce image size by ~75%
- **Storage**: Use Longhorn with 2-replica redundancy, label volumes for S3 backup selection
- **Database**: Leverage shared PostgreSQL cluster with dedicated databases per application
- **Monitoring**: Implement ServiceMonitor for OpenObserve integration
- **Rate Limiting**: Implement ActivityPub inbox burst protection for all fediverse applications

## Email Templates & User Onboarding
- **Community Signup**: Professional welcome email template at `docs/email-templates/community-signup.html`
- **Authentik Integration**: Uses `{AUTHENTIK_URL}` placeholder for account activation links
- **Documentation**: Complete setup guide in `docs/email-templates/README.md`
- **Services Overview**: Template showcases all fediverse services with direct links
- **Branding**: Features horizontal Keyboard Vagabond logo from Picsur CDN

## Container Build Patterns

### Multi-Stage Docker Strategy ✅ WORKING
**Key Lessons Learned**:
- **Framework Identification**: Critical to identify Flask vs Django early (different command structures)
- **Python Virtual Environment**: uWSGI must use same Python version as venv
- **Static File Paths**: Flask apps with application factory have nested structure (`/app/app/static/`)
- **Database Initialization**: Flask requires explicit `flask init-db` command
- **Log File Permissions**: Non-root users need explicit ownership of log files

### Build Process
```bash
# Build all containers
./build-all.sh

# Build specific application
cd build/app-name
docker build -t <YOUR_REGISTRY_URL>/library/app-name:tag .
docker push <YOUR_REGISTRY_URL>/library/app-name:tag
```

## Key Framework Patterns

### Flask Applications (PieFed)
- **Environment Variables**: URL-based configuration (DATABASE_URL, REDIS_URL)
- **uWSGI Integration**: Install via pip in venv, not Alpine packages
- **Static Files**: Careful nginx configuration for nested structure
- **Multi-stage Builds**: Essential to remove build dependencies

### Django Applications (BookWyrm)
- **S3 Static Files**: Theme compilation before static collection
- **Celery Beat**: Single instance only (prevents duplicate scheduling)
- **ACL Configuration**: Backblaze B2 requires empty `AWS_DEFAULT_ACL`

### Laravel Applications (Pixelfed)
- **S3 Default Disk**: `DANGEROUSLY_SET_FILESYSTEM_DRIVER=s3` required
- **Cache Invalidation**: `php artisan config:cache` after S3 changes
- **Dedicated Buckets**: Avoid prefix conflicts with dedicated bucket approach

## Operational Tools & Management

### Administrative Access ✅ SECURED
- **kubectl Context**: `admin@keyboardvagabond-tailscale` (internal VLAN IP)
- **Tailscale Client**: CGNAT range 100.64.0.0/10 access only
- **Harbor Registry**: Direct HTTPS access (Zero Trust incompatible)

### Essential Commands
```bash
# Talos cluster management (Tailscale VPN required)
talosctl config endpoint 10.132.0.10 10.132.0.20 10.132.0.30
talosctl health

# Kubernetes cluster access
kubectl config use-context admin@keyboardvagabond-tailscale
kubectl get nodes

# SOPS secret management
sops -e -i secrets.yaml
sops -d secrets.yaml | kubectl apply -f -

# Flux GitOps management
flux get sources all
flux reconcile source git flux-system
```

### Terminal Environment Notes
- **PowerShell on macOS**: PSReadLine may display errors but commands execute successfully
- **Terminal Preference**: Use default OS terminal over PowerShell (except Windows)
- **Command Output**: Despite display issues, outputs remain readable and functional

## Scaling Preparation
- **Node Addition**: NetCup Cloud vLAN 1004963 with sequential IPs (10.132.0.x/24)
- **Storage Scaling**: Longhorn distributed across nodes with S3 backup integration
- **Load Balancing**: MetalLB or cloud load balancer integration ready
- **High Availability**: Additional control plane nodes can be added

## Troubleshooting Patterns

### Zero Trust Issues
- **Corporate VPN Blocking**: SSL handshake failures - test from different networks
- **Service Discovery**: Check label mismatch between service selector and pod labels
- **StatefulSet Issues**: Use manual Helm deployment for immutable field changes

### Common Application Issues
- **PHP Applications**: Clear Laravel config cache after environment changes
- **Flask Applications**: Verify uWSGI Python version matches venv
- **Django Applications**: Ensure theme compilation before static file collection
- **Container Builds**: Multi-stage builds reduce size but require careful dependency management

### Network & Storage Issues
- **Longhorn**: Check replica distribution across nodes
- **S3 Backup**: Verify volume labels for backup inclusion
- **Database**: Use read replicas for read-heavy operations
- **CDN**: Dedicated buckets eliminate prefix conflicts

## Performance Optimizations
- **CDN Caching**: Cloudflare cache rules for static assets (1 year cache)
- **Image Processing**: Background workers handle optimization and federation
- **Database Optimization**: Read replicas and proper indexing
- **ActivityPub Rate Limiting**: 10r/s with 300 request burst buffer

## Future Development Guidelines
- **New Services**: Zero Trust ingress pattern mandatory (no cert-manager/external-dns)
- **Security**: Never expose external ingress ports - all traffic via Cloudflare tunnels
- **CDN Strategy**: Use dedicated S3 buckets per application
- **Subdomains**: Cloudflare Free plan supports only one level (`app.domain.com`)
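
The one-level subdomain rule can be checked mechanically before a new hostname is introduced. The following is a minimal sketch with a hypothetical helper name; it only counts labels below the base domain and does not query Cloudflare.

```python
BASE_DOMAIN = "keyboardvagabond.com"


def is_single_level_subdomain(hostname: str, base: str = BASE_DOMAIN) -> bool:
    """Return True when `hostname` is exactly one label below `base` (hypothetical helper)."""
    hostname = hostname.lower().rstrip(".")
    suffix = "." + base
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]
    return bool(label) and "." not in label


# Hostnames to vet; the second one is a hypothetical two-level name.
candidates = [
    "streamingmastodon.keyboardvagabond.com",
    "streaming.mastodon.keyboardvagabond.com",
]
allowed = [name for name in candidates if is_single_level_subdomain(name)]
```

This is consistent with the Mastodon streaming endpoint being named `streamingmastodon.keyboardvagabond.com` rather than nesting it under the `mastodon` subdomain.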

@development-workflow-template.yaml
@container-build-template.dockerfile
@troubleshooting-history.mdc
@talos-config-template.yaml
|
||||
124
.cursor/rules/fediverse-app-template.yaml
Normal file
@@ -0,0 +1,124 @@
# Fediverse Application Deployment Template
# Multi-container architecture with web, worker, and optional beat containers

apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-web
  namespace: app-namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app-name
      component: web
  template:
    metadata:
      labels:
        app: app-name
        component: web
    spec:
      containers:
        - name: web
          image: <YOUR_REGISTRY_URL>/library/app-name:latest
          ports:
            - containerPort: 8080
          env:
            - name: DATABASE_URL
              value: "postgresql://user:password@postgresql-shared-rw.postgresql-system.svc.cluster.local:5432/app_db"
            - name: REDIS_URL
              value: "redis://:password@redis-ha-haproxy.redis-system.svc.cluster.local:6379/0"
            - name: S3_BUCKET
              value: "app-bucket"
            - name: S3_CDN_URL
              value: "https://cdn.keyboardvagabond.com"
          envFrom:
            - secretRef:
                name: app-secret
            - configMapRef:
                name: app-config
          volumeMounts:
            - name: app-storage
              mountPath: /app/storage
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "1Gi"
              cpu: "500m"
      volumes:
        - name: app-storage
          persistentVolumeClaim:
            claimName: app-storage-pvc

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-worker
  namespace: app-namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-name
      component: worker
  template:
    metadata:
      labels:
        app: app-name
        component: worker
    spec:
      containers:
        - name: worker
          image: <YOUR_REGISTRY_URL>/library/app-worker:latest
          command: ["worker-command"] # Framework-specific worker command
          env:
            - name: DATABASE_URL
              value: "postgresql://user:password@postgresql-shared-rw.postgresql-system.svc.cluster.local:5432/app_db"
            - name: REDIS_URL
              value: "redis://:password@redis-ha-haproxy.redis-system.svc.cluster.local:6379/0"
          envFrom:
            - secretRef:
                name: app-secret
            - configMapRef:
                name: app-config
          resources:
            requests:
              memory: "128Mi"
              cpu: "50m"
            limits:
              memory: "512Mi"
              cpu: "200m"

---
# Optional: Celery Beat for Django applications (single replica only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-beat
  namespace: app-namespace
spec:
  replicas: 1 # CRITICAL: Never scale beyond 1 replica
  strategy:
    type: Recreate # Ensures only one scheduler runs
  selector:
    matchLabels:
      app: app-name
      component: beat
  template:
    metadata:
      labels:
        app: app-name
        component: beat
    spec:
      containers:
        - name: beat
          image: <YOUR_REGISTRY_URL>/library/app-worker:latest
          command: ["celery", "-A", "app", "beat", "-l", "info", "--scheduler", "django_celery_beat.schedulers:DatabaseScheduler"]
          envFrom:
            - secretRef:
                name: app-secret
            - configMapRef:
                name: app-config
|
||||
157
.cursor/rules/infrastructure.mdc
Normal file
@@ -0,0 +1,157 @@
---
description: Infrastructure components configuration and deployment patterns
globs: ["manifests/infrastructure/**/*", "manifests/cluster/**/*"]
alwaysApply: false
---

# Infrastructure Components ✅ OPERATIONAL

## Core Infrastructure Stack
Located in `manifests/infrastructure/`:
- **Networking**: Cilium CNI with host firewall and Hubble UI ✅ **OPERATIONAL**
- **Storage**: Longhorn distributed storage (2-replica configuration) ✅ **OPERATIONAL**
- **Ingress**: NGINX Ingress Controller with hostNetwork enabled (Zero Trust mode) ✅ **OPERATIONAL**
- **Zero Trust Tunnels**: Cloudflared deployment in `cloudflared-system` namespace ✅ **OPERATIONAL**
- **Registry**: Harbor container registry (`<YOUR_REGISTRY_URL>`) ✅ **OPERATIONAL**
- **Monitoring**: OpenTelemetry Operator + OpenObserve (O2) ✅ **OPERATIONAL**
- **Database**: PostgreSQL with CloudNativePG operator ✅ **OPERATIONAL**
- **Identity**: Authentik open-source IAM ✅ **OPERATIONAL**
- **VPN**: Tailscale mesh VPN for administrative access ✅ **OPERATIONAL**

## Component Status Matrix
### Active Components ✅ OPERATIONAL
- **Cilium**: CNI with kube-proxy replacement, host firewall
- **Longhorn**: Distributed storage with S3 backup to Backblaze B2
- **PostgreSQL**: 3-instance HA cluster with comprehensive monitoring
- **Harbor**: Container registry (direct HTTPS - Zero Trust incompatible)
- **OpenObserve**: Monitoring and observability platform
- **Authentik**: Open-source identity and access management
- **Renovate**: Automated dependency updates ✅ **ACTIVE**

### Disabled/Deprecated Components
- **external-dns**: ❌ **REMOVED** (replaced by Zero Trust tunnels)
- **cert-manager**: ❌ **REMOVED** (replaced by Cloudflare edge TLS)
- **Rook-Ceph**: ⏸️ **DISABLED** (complexity - using Longhorn instead)
- **Flux GitOps**: ⏸️ **DISABLED** (manual deployment - ready for re-activation)

### Development/Optional Components
- **Elasticsearch**: ✅ **OPERATIONAL** (log aggregation)
- **Kibana**: ✅ **OPERATIONAL** (log analytics via Zero Trust tunnel)

## Network Configuration ✅ OPERATIONAL
- **NetCup Cloud vLAN**: VLAN ID 1004963 for internal cluster communication
- **Control Plane VIP**: `10.132.0.5` (shared VIP, nodes elect primary for HA)
- **Node IPs** (all control plane nodes):
  - n1 (152.53.107.24): Public + 10.132.0.10/24 (VLAN)
  - n2 (152.53.105.81): Public + 10.132.0.20/24 (VLAN)
  - n3 (152.53.200.111): Public + 10.132.0.30/24 (VLAN)
- **DNS Domain**: Uses standard `cluster.local` for maximum compatibility
- **CNI**: Cilium with kube-proxy replacement
- **Service Mesh**: Cilium with Hubble for observability
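
The addressing plan above is easy to sanity-check with Python's standard `ipaddress` module, for example before assigning an address to a new node. This is a minimal sketch; the function name is hypothetical, and the addresses are the VLAN values listed above.

```python
import ipaddress

VLAN = ipaddress.ip_network("10.132.0.0/24")
CONTROL_PLANE_VIP = ipaddress.ip_address("10.132.0.5")
NODE_VLAN_IPS = {"n1": "10.132.0.10", "n2": "10.132.0.20", "n3": "10.132.0.30"}


def addressing_problems(nodes=NODE_VLAN_IPS, vip=CONTROL_PLANE_VIP, vlan=VLAN):
    """Return human-readable problems; an empty list means the plan is consistent."""
    problems = []
    if vip not in vlan:
        problems.append(f"VIP {vip} is outside {vlan}")
    seen = {}
    for name, raw in nodes.items():
        addr = ipaddress.ip_address(raw)
        if addr not in vlan:
            problems.append(f"{name} ({addr}) is outside {vlan}")
        if addr == vip:
            problems.append(f"{name} reuses the control plane VIP {vip}")
        if addr in seen:
            problems.append(f"{name} and {seen[addr]} share {addr}")
        seen[addr] = name
    return problems


current_plan_problems = addressing_problems()
```

Passing a candidate node, such as a hypothetical `{"n4": "10.132.0.40"}`, through the same function catches collisions with the VIP or addresses outside the VLAN before the Talos machine config is written.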

## Storage Configuration ✅ OPERATIONAL
### Longhorn Storage
- **Default Path**: `/var/lib/longhorn`
- **Replica Count**: 2 (distributed across nodes)
- **Storage Class**: `longhorn-retain` for data preservation
- **S3 Backup**: Backblaze B2 integration with label-based volume selection

### S3 Backup Configuration
- **Provider**: Backblaze B2 Cloud Storage
- **Cost**: $6/TB storage with $0 egress fees via Cloudflare partnership
- **Volume Selection**: Label-based tagging system for selective backup
- **Disaster Recovery**: Automated backup scheduling and restore capabilities

## Database Configuration ✅ OPERATIONAL
### PostgreSQL with CloudNativePG
- **Cluster Name**: `postgres-shared` in `postgresql-system` namespace
- **High Availability**: 3-instance cluster with automatic failover
- **Instances**: `postgres-shared-2` (primary), `postgres-shared-4`, `postgres-shared-5`
- **Monitoring**: Port 9187 for comprehensive metrics export
- **Backup Strategy**: Integrated with S3 backup system via Longhorn volume labels

## Cache Configuration ✅ OPERATIONAL
### Redis HA Cluster
- **Helm Chart**: `redis-ha` from `dandydeveloper/charts` (replaced deprecated Bitnami chart)
- **Namespace**: `redis-system`
- **Architecture**: 3 Redis replicas with Sentinel for HA, 3 HAProxy pods for load balancing
- **Connection String**: `redis-ha-haproxy.redis-system.svc.cluster.local:6379`
- **HAProxy**: Provides unified read/write endpoint managed by 3 HAProxy pods
- **Storage**: Longhorn persistent volumes (20Gi per Redis instance)
- **Authentication**: SOPS-encrypted credentials in `redis-credentials` secret
- **Monitoring**: Redis exporter and HAProxy metrics via ServiceMonitor

### PostgreSQL Comprehensive Metrics ✅ OPERATIONAL
- **Connection Metrics**: `cnpg_backends_total`, `cnpg_pg_settings_setting{name="max_connections"}`
- **Performance Metrics**: `cnpg_pg_stat_database_xact_commit`, `cnpg_pg_stat_database_xact_rollback`
- **Storage Metrics**: `cnpg_pg_database_size_bytes`, `cnpg_pg_stat_database_blks_hit`
- **Cluster Health**: `cnpg_collector_up`, `cnpg_collector_postgres_version`
- **Security**: Role-based access control with `pg_monitor` role for metrics collection
- **Backup Integration**: Native support for WAL archiving and point-in-time recovery
- **Custom Queries**: ConfigMap-based custom query system with proper RBAC permissions
- **Dashboard Integration**: Native OpenObserve integration with predefined monitoring queries

## Security & Access Control ✅ ZERO TRUST ARCHITECTURE
### Zero Trust Migration ✅ COMPLETED
- **Migration Status**: 10 of 11 external services migrated to Cloudflare Zero Trust tunnels
- **Harbor Exception**: Direct port exposure (80/443) due to header modification issues
- **Dependencies Removed**: external-dns and cert-manager no longer needed
- **Security Improvement**: No external ingress ports exposed for the migrated services (Harbor is the only exception)

### Tailscale Administrative Access ✅ IMPLEMENTED
- **Deployment Model**: Tailscale Operator Helm Chart (v1.90.x)
- **Operator**: Deployed in `tailscale-system` namespace with 2 replicas
- **Subnet Router**: Connector resource advertising internal networks (Pod: 10.244.0.0/16, Service: 10.96.0.0/12, VLAN: 10.132.0.0/24)
- **Magic DNS**: Services can be exposed via Tailscale operator with meta attributes for DNS resolution
- **OAuth Integration**: Device authentication and tagging with `tag:k8s-operator`
- **Hostname**: `keyboardvagabond-operator` for operator, `keyboardvagabond-cluster` for subnet router
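
Routes advertised by a subnet router should not overlap one another, otherwise traffic for one range can be captured by another. A minimal sketch of that check for the three ranges listed above, using the standard `ipaddress` module; the function name is hypothetical.

```python
import ipaddress
from itertools import combinations

# Ranges advertised by the subnet router, as listed above.
ADVERTISED_ROUTES = {
    "pod": "10.244.0.0/16",
    "service": "10.96.0.0/12",
    "vlan": "10.132.0.0/24",
}


def overlapping_routes(routes):
    """Return the name pairs whose CIDR ranges overlap (empty list when all are disjoint)."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in routes.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


conflicts = overlapping_routes(ADVERTISED_ROUTES)
```

The same function can be rerun whenever a new range is added to the Connector resource.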

## Infrastructure Deployment Patterns
### Kustomize Configuration
```yaml
# Standard kustomization.yaml structure
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: component-namespace
resources:
  - namespace.yaml
  - component.yaml
  - monitoring.yaml
```

### Helm Integration
```yaml
# HelmRelease for complex applications
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: component-name
  namespace: component-namespace
spec:
  chart:
    spec:
      chart: chart-name
      sourceRef:
        kind: HelmRepository
        name: repo-name
```

## Operational Procedures

### Node Addition and Scaling
When adding new nodes to the cluster, specific steps are required to ensure monitoring and metrics collection continue working properly:

- **Nginx Ingress Metrics**: See `docs/NODE-ADDITION-GUIDE.md` for complete procedures
  - Nginx ingress controller deploys automatically (DaemonSet)
  - OpenTelemetry collector static scrape configuration requires manual update
  - Must add new node IP to targets list in `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
  - Verification steps include checking metrics endpoints and collector logs
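
The manual update described above amounts to keeping one `host:port` entry per node in the collector's static scrape targets. The sketch below generates such a list. The helper name, the use of VLAN addresses, and port 10254 (the ingress-nginx controller's usual metrics port) are assumptions rather than values taken from `gateway-collector.yaml`, and `10.132.0.40` stands in for a hypothetical fourth node.

```python
import ipaddress

# VLAN addresses of the current nodes, from the network configuration in this file.
CURRENT_NODES = ["10.132.0.10", "10.132.0.20", "10.132.0.30"]


def scrape_targets(node_ips, port=10254):
    """Return sorted, de-duplicated host:port targets (the port is an assumption)."""
    unique = sorted({ipaddress.ip_address(ip) for ip in node_ips})
    return [f"{ip}:{port}" for ip in unique]


# Hypothetical new node joining the cluster.
targets = scrape_targets(CURRENT_NODES + ["10.132.0.40"])
```

Sorting numerically keeps the list stable across edits, which makes the diff for a node addition a single added line.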

### Key Files for Node Operations
- **Monitoring Configuration**: `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
- **Network Policies**: `manifests/infrastructure/cluster-policies/host-fw-*.yaml`
- **Node Addition Guide**: `docs/NODE-ADDITION-GUIDE.md`

@zero-trust-ingress-template.yaml
@longhorn-storage-template.yaml
@postgresql-database-template.yaml
|
||||
128
.cursor/rules/longhorn-storage-template.yaml
Normal file
@@ -0,0 +1,128 @@
# Longhorn Storage Templates
# Persistent volume configurations with backup labels

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-storage-pvc
  namespace: app-namespace
  labels:
    # S3 backup inclusion labels
    recurring-job.longhorn.io/backup: enabled
    recurring-job-group.longhorn.io/backup: enabled
spec:
  accessModes:
    - ReadWriteMany # Default for applications that may scale horizontally
    # Use ReadWriteOnce for:
    # - Single-instance applications (databases, stateful apps)
    # - CloudNativePG (manages its own storage replication)
    # - Applications with file locking requirements
  storageClassName: longhorn-retain # Data preservation on deletion
  resources:
    requests:
      storage: 10Gi

---
# Longhorn StorageClass with retain policy
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-retain
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain # Preserves data on PVC deletion
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2" # 2-replica redundancy
  staleReplicaTimeout: "2880" # 48 hours
  fromBackup: ""
  fsType: "xfs"
  dataLocality: "disabled" # Allow cross-node placement

---
# Longhorn Backup Target Configuration
apiVersion: v1
kind: Secret
metadata:
  name: longhorn-backup-target
  namespace: longhorn-system
type: Opaque
data:
  # Backblaze B2 credentials (base64 encoded, encrypted by SOPS)
  AWS_ACCESS_KEY_ID: base64-encoded-key-id
  AWS_SECRET_ACCESS_KEY: base64-encoded-secret-key
  AWS_ENDPOINTS: aHR0cHM6Ly9zMy5ldS1jZW50cmFsLTAwMy5iYWNrYmxhemViMi5jb20= # Base64: https://s3.eu-central-003.backblazeb2.com

---
# Longhorn RecurringJob for S3 Backup
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-to-s3
  namespace: longhorn-system
spec:
  cron: "0 2 * * *" # Daily at 2 AM
  task: "backup"
  groups:
    - backup
  retain: 7 # Keep 7 daily backups
  concurrency: 2 # Concurrent backup jobs
  labels:
    recurring-job: backup-to-s3

---
# Volume labeling example for backup inclusion
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv
  labels:
    # These labels ensure volume is included in S3 backup jobs
    recurring-job.longhorn.io/backup: enabled
    recurring-job-group.longhorn.io/backup: enabled
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    volumeHandle: example-volume-id

# Example: Database storage (ReadWriteOnce required)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-storage-pvc
  namespace: postgresql-system
  labels:
    recurring-job.longhorn.io/backup: enabled
    recurring-job-group.longhorn.io/backup: enabled
spec:
  accessModes:
    - ReadWriteOnce # Required for databases - single writer only
  storageClassName: longhorn-retain
  resources:
    requests:
      storage: 50Gi
|
||||
|
||||
# Access Mode Guidelines:
|
||||
# - ReadWriteMany (RWX): Default for horizontally scalable applications
|
||||
# * Web applications that can run multiple pods
|
||||
# * Shared file storage for multiple containers
|
||||
# * Applications without file locking conflicts
|
||||
#
|
||||
# - ReadWriteOnce (RWO): Required for specific use cases
|
||||
# * Database storage (PostgreSQL, Redis) - single writer required
|
||||
# * Applications with file locking (SQLite, local file databases)
|
||||
# * StatefulSets that manage their own replication
|
||||
# * Single-instance applications by design
|
||||
|
||||
# Backup Strategy Notes:
|
||||
# - Cost: $6/TB per month for storage, with $0 egress fees via the Cloudflare partnership
|
||||
# - Selection: Label-based tagging system for selective volume backup
|
||||
# - Recovery: Automated backup scheduling and restore capabilities
|
||||
# - Target: @/longhorn backup location in Backblaze B2
|
||||
202
.cursor/rules/postgresql-database-template.yaml
Normal file
@@ -0,0 +1,202 @@
|
||||
# PostgreSQL Database Templates
|
||||
# CloudNativePG cluster configuration and application integration
|
||||
|
||||
# Main PostgreSQL Cluster (already deployed as postgres-shared)
|
||||
---
|
||||
apiVersion: postgresql.cnpg.io/v1
|
||||
kind: Cluster
|
||||
metadata:
|
||||
name: postgres-shared
|
||||
namespace: postgresql-system
|
||||
spec:
|
||||
instances: 3 # High availability with automatic failover
|
||||
|
||||
postgresql:
|
||||
parameters:
|
||||
max_connections: "200"
|
||||
shared_buffers: "256MB"
|
||||
effective_cache_size: "1GB"
|
||||
|
||||
bootstrap:
|
||||
initdb:
|
||||
database: postgres
|
||||
owner: postgres
|
||||
|
||||
storage:
|
||||
storageClass: longhorn-retain
|
||||
size: 50Gi
|
||||
|
||||
monitoring:
|
||||
enabled: true
|
||||
|
||||
# Application-specific database and user creation
|
||||
---
|
||||
apiVersion: postgresql.cnpg.io/v1
|
||||
kind: Database
|
||||
metadata:
|
||||
name: app-database
|
||||
namespace: postgresql-system
|
||||
spec:
|
||||
name: app_db
|
||||
owner: app_user
|
||||
cluster:
|
||||
name: postgres-shared
|
||||
|
||||
---
|
||||
# Application database user secret
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: app-postgresql-secret
|
||||
namespace: app-namespace
|
||||
type: Opaque
|
||||
data:
|
||||
# Base64 encoded credentials (encrypted by SOPS)
|
||||
# Replace with actual base64-encoded values before encryption
|
||||
username: <REPLACE_WITH_BASE64_ENCODED_USERNAME>
|
||||
password: <REPLACE_WITH_BASE64_ENCODED_PASSWORD>
|
||||
database: <REPLACE_WITH_BASE64_ENCODED_DATABASE_NAME>
|
||||
|
||||
---
|
||||
# Connection examples for different frameworks
|
||||
|
||||
# Laravel/Pixelfed connection
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: laravel-db-config
|
||||
data:
|
||||
DB_CONNECTION: "pgsql"
|
||||
DB_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
|
||||
DB_PORT: "5432"
|
||||
DB_DATABASE: "pixelfed"
|
||||
|
||||
---
# Flask/PieFed connection
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: flask-db-config
|
||||
data:
|
||||
DATABASE_URL: "postgresql://piefed_user:<REPLACE_WITH_PASSWORD>@postgresql-shared-rw.postgresql-system.svc.cluster.local:5432/piefed"
|
||||
|
||||
---
# Django/BookWyrm connection
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: django-db-config
|
||||
data:
|
||||
POSTGRES_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
|
||||
PGPORT: "5432"
|
||||
POSTGRES_DB: "bookwyrm"
|
||||
POSTGRES_USER: "bookwyrm_user"
|
||||
|
||||
---
# Ruby/Mastodon connection
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: mastodon-db-config
|
||||
data:
|
||||
DB_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
|
||||
DB_PORT: "5432"
|
||||
DB_NAME: "mastodon"
|
||||
DB_USER: "mastodon_user"
|
||||
|
||||
---
|
||||
# Database monitoring ServiceMonitor
|
||||
apiVersion: monitoring.coreos.com/v1
|
||||
kind: ServiceMonitor
|
||||
metadata:
|
||||
name: postgresql-metrics
|
||||
namespace: postgresql-system
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
cnpg.io/cluster: postgres-shared
|
||||
endpoints:
|
||||
- port: metrics
|
||||
interval: 30s
|
||||
path: /metrics
|
||||
|
||||
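# Connection Patterns (CloudNativePG creates <cluster-name>-rw, -ro, and -r services):
# - Read/Write (primary): postgresql-shared-rw.postgresql-system.svc.cluster.local:5432
# - Read Only (replicas only): postgresql-shared-ro.postgresql-system.svc.cluster.local:5432
# - Read (any instance, primary included): postgresql-shared-r.postgresql-system.svc.cluster.local:5432
# - Note: these hostnames assume a cluster named postgresql-shared, while the Cluster
#   manifest above is named postgres-shared; confirm the deployed service names with
#   `kubectl get svc -n postgresql-system` before copying them
# - Monitoring: Port 9187 for comprehensive PostgreSQL metrics
# - Backup: Integrated with S3 backup system via Longhorn volume labels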
|
||||
|
||||
# Read Replica Usage Examples:
|
||||
|
||||
---
# Mastodon - Read replicas for timeline queries and caching
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: mastodon-db-replica-config
|
||||
data:
|
||||
DB_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local" # Primary for writes
|
||||
REPLICA_DB_HOST: "postgresql-shared-ro.postgresql-system.svc.cluster.local" # Read replica for queries (Mastodon reads REPLICA_DB_* variables)
|
||||
DB_PORT: "5432"
|
||||
DB_NAME: "mastodon"
|
||||
# Mastodon automatically uses read replicas for timeline and cache queries
|
||||
|
||||
---
# PieFed - Flask app with read/write splitting
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: piefed-db-replica-config
|
||||
data:
|
||||
# Primary database for writes
|
||||
DATABASE_URL: "postgresql://piefed_user:<REPLACE_WITH_PASSWORD>@postgresql-shared-rw.postgresql-system.svc.cluster.local:5432/piefed"
|
||||
# Read replica for heavy queries (feeds, search, analytics)
|
||||
DATABASE_REPLICA_URL: "postgresql://piefed_user:<REPLACE_WITH_PASSWORD>@postgresql-shared-ro.postgresql-system.svc.cluster.local:5432/piefed"
|
||||
|
||||
---
# Authentik - Optimized performance with primary and replica load balancing
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: authentik-db-replica-config
|
||||
data:
|
||||
AUTHENTIK_POSTGRESQL__HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
|
||||
AUTHENTIK_POSTGRESQL__PORT: "5432"
|
||||
AUTHENTIK_POSTGRESQL__NAME: "authentik"
|
||||
# Authentik can use read replicas for user lookups and session validation
|
||||
AUTHENTIK_POSTGRESQL__READ_REPLICAS__0__HOST: "postgresql-shared-ro.postgresql-system.svc.cluster.local"
|
||||
|
||||
---
# BookWyrm - Django with database routing for read replicas
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: bookwyrm-db-replica-config
|
||||
data:
|
||||
POSTGRES_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local" # Primary
|
||||
POSTGRES_REPLICA_HOST: "postgresql-shared-ro.postgresql-system.svc.cluster.local" # Read replica
|
||||
PGPORT: "5432"
|
||||
POSTGRES_DB: "bookwyrm"
|
||||
# Django database routing can direct read queries to replica automatically
|
||||
|
||||
# Available Metrics:
|
||||
# - Connection: cnpg_backends_total, cnpg_pg_settings_setting{name="max_connections"}
|
||||
# - Performance: cnpg_pg_stat_database_xact_commit, cnpg_pg_stat_database_xact_rollback
|
||||
# - Storage: cnpg_pg_database_size_bytes, cnpg_pg_stat_database_blks_hit
|
||||
# - Health: cnpg_collector_up, cnpg_collector_postgres_version
|
||||
|
||||
# CRITICAL PostgreSQL Pod Management Safety ⚠️
|
||||
# Source: https://cloudnative-pg.io/documentation/1.20/failure_modes/
|
||||
|
||||
# ✅ SAFE: Proper pod deletion for failover testing
|
||||
# kubectl delete pod [primary-pod] --grace-period=1
|
||||
|
||||
# ❌ DANGEROUS: Never use grace-period=0
|
||||
# kubectl delete pod [primary-pod] --grace-period=0 # NEVER DO THIS!
|
||||
#
|
||||
# Why grace-period=0 is dangerous:
|
||||
# - Immediately removes pod from Kubernetes API without proper shutdown
|
||||
# - Doesn't ensure PID 1 process (instance manager) is shut down
|
||||
# - Operator triggers failover without guarantee primary was properly stopped
|
||||
# - Can cause misleading results in failover simulation tests
|
||||
# - Does not reflect real failure scenarios (power loss, network partition)
|
||||
|
||||
# Proper PostgreSQL Pod Operations:
|
||||
# - Use --grace-period=1 for failover simulation tests
|
||||
# - Allow CloudNativePG operator to handle automatic failover
|
||||
# - Use cnpg.io/reconciliationLoop: "disabled" annotation only for emergency manual intervention
|
||||
# - Always remove reconciliation disable annotation after emergency operations
|
||||
132
.cursor/rules/s3-storage-config-template.yaml
Normal file
@@ -0,0 +1,132 @@
|
||||
# S3 Storage Configuration Templates
|
||||
# Framework-specific S3 integration patterns with dedicated bucket approach
|
||||
|
||||
# Laravel/Pixelfed S3 Configuration
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: pixelfed-s3-config
|
||||
data:
|
||||
# Critical Laravel S3 Configuration
|
||||
FILESYSTEM_DRIVER: "s3"
|
||||
DANGEROUSLY_SET_FILESYSTEM_DRIVER: "s3" # Required for S3 default disk
|
||||
PF_ENABLE_CLOUD: "true"
|
||||
FILESYSTEM_CLOUD: "s3"
|
||||
FILESYSTEM_DISK: "s3"
|
||||
|
||||
# Backblaze B2 S3-Compatible Storage
|
||||
AWS_BUCKET: "pixelfed-bucket" # Dedicated bucket approach
|
||||
AWS_URL: "<REPLACE_WITH_CDN_URL>" # CDN URL
|
||||
AWS_ENDPOINT: "<REPLACE_WITH_S3_ENDPOINT>"
|
||||
AWS_ROOT: "" # Empty - no prefix needed with dedicated bucket
|
||||
AWS_USE_PATH_STYLE_ENDPOINT: "false"
|
||||
AWS_VISIBILITY: "public"
|
||||
|
||||
# Flask/PieFed S3 Configuration
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: piefed-s3-config
|
||||
data:
|
||||
# S3 Storage (Backblaze B2)
|
||||
S3_BUCKET: "piefed-bucket"
|
||||
S3_REGION: "<REPLACE_WITH_S3_REGION>"
|
||||
S3_ENDPOINT_URL: "<REPLACE_WITH_S3_ENDPOINT>"
|
||||
S3_PUBLIC_URL: "<REPLACE_WITH_CDN_URL>"
|
||||
|
||||
# Django/BookWyrm S3 Configuration
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: bookwyrm-s3-config
|
||||
data:
|
||||
# S3 Storage (Backblaze B2)
|
||||
USE_S3: "true"
|
||||
AWS_STORAGE_BUCKET_NAME: "bookwyrm-bucket"
|
||||
AWS_S3_REGION_NAME: "<REPLACE_WITH_S3_REGION>"
|
||||
AWS_S3_ENDPOINT_URL: "<REPLACE_WITH_S3_ENDPOINT>"
|
||||
AWS_S3_CUSTOM_DOMAIN: "<REPLACE_WITH_CDN_DOMAIN>"
|
||||
AWS_DEFAULT_ACL: "" # Backblaze B2 doesn't support ACLs
|
||||
|
||||
# Ruby/Mastodon S3 Configuration
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: mastodon-s3-config
|
||||
data:
|
||||
# S3 Object Storage
|
||||
S3_ENABLED: "true"
|
||||
S3_BUCKET: "mastodon-bucket"
|
||||
S3_REGION: "<REPLACE_WITH_S3_REGION>"
|
||||
S3_ENDPOINT: "<REPLACE_WITH_S3_ENDPOINT>"
|
||||
S3_HOSTNAME: "<REPLACE_WITH_S3_HOSTNAME>"
|
||||
S3_ALIAS_HOST: "<REPLACE_WITH_CDN_DOMAIN>"
|
||||
|
||||
# Generic S3 Secret Template
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: s3-credentials
|
||||
type: Opaque
|
||||
data:
|
||||
# Base64 encoded values (will be encrypted by SOPS)
|
||||
# Replace with actual base64-encoded values before encryption
|
||||
AWS_ACCESS_KEY_ID: <REPLACE_WITH_BASE64_ENCODED_KEY_ID>
|
||||
AWS_SECRET_ACCESS_KEY: <REPLACE_WITH_BASE64_ENCODED_SECRET_KEY>
|
||||
S3_KEY: <REPLACE_WITH_BASE64_ENCODED_KEY_ID> # Flask apps use this naming
|
||||
S3_SECRET: <REPLACE_WITH_BASE64_ENCODED_SECRET_KEY> # Flask apps use this naming
|
||||
|
||||
# CDN Mapping Reference
# | Application | CDN Subdomain | S3 Bucket | Purpose |
# |-------------|---------------|-----------|---------|
# | Pixelfed | pm.keyboardvagabond.com | pixelfed-bucket | Photo/media sharing |
# | PieFed | pfm.keyboardvagabond.com | piefed-bucket | Forum content/uploads |
# | Mastodon | mm.keyboardvagabond.com | mastodon-bucket | Social media/attachments |
# | BookWyrm | bm.keyboardvagabond.com | bookwyrm-bucket | Book covers/user uploads |
|
||||
|
||||
# Redis Connection Pattern (HAProxy-based):
|
||||
# - HAProxy (Read/Write): redis-ha-haproxy.redis-system.svc.cluster.local:6379
|
||||
# - Managed by 3 HAProxy pods providing unified endpoint
|
||||
# - Redis HA cluster: 3 Redis replicas with Sentinel for HA
|
||||
# - Helm Chart: redis-ha from dandydeveloper/charts (replaced deprecated Bitnami)
|
||||
|
||||
# Redis Usage Examples:
|
||||
|
||||
# Mastodon - Redis for caching and Sidekiq job queue
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: mastodon-redis-config
|
||||
data:
|
||||
REDIS_HOST: "redis-ha-haproxy.redis-system.svc.cluster.local" # HAProxy endpoint
|
||||
REDIS_PORT: "6379"
|
||||
|
||||
# PieFed - Flask with Redis for cache and Celery broker
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: piefed-redis-config
|
||||
data:
|
||||
# All Redis connections use HAProxy endpoint
|
||||
CACHE_REDIS_URL: "redis://:<REPLACE_WITH_REDIS_PASSWORD>@redis-ha-haproxy.redis-system.svc.cluster.local:6379/1"
|
||||
CELERY_BROKER_URL: "redis://:<REPLACE_WITH_REDIS_PASSWORD>@redis-ha-haproxy.redis-system.svc.cluster.local:6379/2"
|
||||
|
||||
# BookWyrm - Django with Redis for broker and activity streams
|
||||
---
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: bookwyrm-redis-config
|
||||
data:
  # All Redis connections use HAProxy endpoint; BookWyrm reads host and port separately
  REDIS_BROKER_HOST: "redis-ha-haproxy.redis-system.svc.cluster.local"
  REDIS_BROKER_PORT: "6379"
  REDIS_ACTIVITY_HOST: "redis-ha-haproxy.redis-system.svc.cluster.local"
  REDIS_ACTIVITY_PORT: "6379"
  REDIS_BROKER_DB_INDEX: "3"
  REDIS_ACTIVITY_DB_INDEX: "4"
|
||||
176
.cursor/rules/security.mdc
Normal file
@@ -0,0 +1,176 @@
|
||||
---
|
||||
description: Security patterns including SOPS encryption, Zero Trust, and access control
|
||||
globs: ["**/*.yaml", "machineconfigs/**/*", "secrets.yaml", "*.conf"]
|
||||
alwaysApply: false
|
||||
---
|
||||
|
||||
# Security & Encryption ✅ OPERATIONAL
|
||||
|
||||
## 🛡️ Maximum Security Architecture Achieved
|
||||
- **🚫 Zero External Port Exposure**: No direct internet access to cluster services (the Harbor registry is the one documented exception)
|
||||
- **🔐 Dual Security Layers**: Cloudflare Zero Trust (public apps) + Tailscale Mesh VPN (admin access)
|
||||
- **🌐 CGNAT-Only API Access**: Kubernetes/Talos APIs restricted to Tailscale network (100.64.0.0/10)
|
||||
- **🔒 Encrypted Everything**: SOPS secrets, Zero Trust tunnels, mesh VPN connections
|
||||
- **🛡️ Host Firewall**: Cilium policies blocking world access to HTTP/HTTPS ports
|
||||
|
||||
## SOPS Configuration ✅ OPERATIONAL
|
||||
### Encryption Scope
|
||||
- **Files Covered**: All YAML files in `manifests/` directory, Talos configs, machine configurations
|
||||
- **Fields Encrypted**: `data` and `stringData` fields in manifests, plus specific credential fields
|
||||
- **Key Management**: Multiple PGP keys configured for different components
|
||||
- **Workflow**: All secrets encrypted with SOPS before Git commit
|
||||
|
||||
### SOPS Usage Patterns
|
||||
```bash
|
||||
# Encrypt new secret
|
||||
sops -e -i secrets.yaml
|
||||
|
||||
# Edit encrypted secret
|
||||
sops secrets.yaml
|
||||
|
||||
# Decrypt for viewing
|
||||
sops -d secrets.yaml
|
||||
|
||||
# Decrypt in place
|
||||
sops -d -i secrets.yaml
|
||||
|
||||
# Apply encrypted manifest
|
||||
sops -d secrets.yaml | kubectl apply -f -
|
||||
```
|
||||
SOPS-encrypted files must be decrypted before they are applied with kubectl (for example by piping `sops -d` into `kubectl apply -f -`), and must be encrypted again before they are committed to source control.
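
A quick way to catch a decrypted file before it is committed is to look for the top-level `sops:` metadata block that SOPS appends when it encrypts a YAML file. The sketch below is only an illustration: the function names `is_sops_encrypted` and `check_secrets_encrypted` are hypothetical, not part of SOPS or of this repository, and the check assumes YAML output with the metadata key at column 0.

```bash
# Return 0 when the YAML file carries the top-level `sops:` metadata
# block that SOPS appends on encryption, non-zero otherwise.
is_sops_encrypted() {
  grep -q '^sops:' "$1"
}

# Print every file given as an argument that is NOT encrypted and
# return non-zero if at least one such file was found.
check_secrets_encrypted() {
  rc=0
  for file in "$@"; do
    if ! is_sops_encrypted "$file"; then
      echo "unencrypted: $file"
      rc=1
    fi
  done
  return "$rc"
}
```

It could be called from a pre-commit hook with the staged secret files as arguments, for example `check_secrets_encrypted secrets.yaml`.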
|
||||
|
||||
## Zero Trust Architecture ✅ MIGRATED
|
||||
|
||||
### Zero Trust Tunnels ✅ OPERATIONAL
|
||||
- **Cloudflared Deployment**: `cloudflared-system` namespace
|
||||
- **Tunnel Architecture**: Secure connectivity without exposing ingress ports
|
||||
- **TLS Termination**: Cloudflare edge handles SSL/TLS
|
||||
- **DNS Management**: Manual DNS record creation (external-dns removed)
|
||||
|
||||
### Standard Zero Trust Ingress Pattern
|
||||
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: app-namespace
  annotations:
    # Basic NGINX Configuration only - no cert-manager or external-dns
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
  ingressClassName: nginx
  tls: [] # Empty - TLS handled by Cloudflare edge
  rules:
    - host: app.keyboardvagabond.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
```
|
||||
|
||||
### Migration Steps for Zero Trust
|
||||
1. **Remove cert-manager annotations**: `cert-manager.io/cluster-issuer`, `cert-manager.io/issuer`
|
||||
2. **Remove external-dns annotations**: `external-dns.alpha.kubernetes.io/hostname`, `external-dns.alpha.kubernetes.io/target`
|
||||
3. **Empty TLS sections**: Set `tls: []` to disable certificate generation
|
||||
4. **Configure Cloudflare tunnel**: Add hostname in Zero Trust dashboard
|
||||
5. **Test connectivity**: Use `kubectl run curl-test` to verify internal service health
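
The connectivity test in step 5 can be sketched as a throwaway curl pod. This is a minimal example under stated assumptions: the helper names `service_url` and `curl_test` are hypothetical, the `curlimages/curl` image is an assumed choice, and the service name, namespace, and port are the placeholders from the ingress pattern above.

```bash
# Build the in-cluster URL of a Service from its name, namespace, and port.
service_url() {
  printf 'http://%s.%s.svc.cluster.local:%s/\n' "$1" "$2" "$3"
}

# Run a temporary curl pod against the Service and print the HTTP status
# code. The pod is removed again when the command finishes (--rm).
curl_test() {
  kubectl run curl-test --rm -i --restart=Never \
    --image=curlimages/curl --command -- \
    curl -sS -o /dev/null -w '%{http_code}\n' "$(service_url "$@")"
}
```

Usage: `curl_test app-service app-namespace 80`. A 2xx or 3xx status suggests the Service is healthy inside the cluster, so any remaining failure is more likely in the tunnel hostname configuration.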
|
||||
|
||||
## Access Control Matrix
|
||||
| **Resource** | **Public Access** | **Administrative Access** | **Security Method** |
|--------------|-------------------|---------------------------|---------------------|
| **Applications** | ✅ Cloudflare Zero Trust | ❌ Not Applicable | Authenticated tunnels |
| **Kubernetes API** | ❌ Blocked | ✅ Tailscale Mesh VPN | CGNAT + OAuth |
| **Talos API** | ❌ Blocked | ✅ Tailscale Mesh VPN | CGNAT + OAuth |
| **HTTP/HTTPS Services** | ❌ Blocked | ✅ Cluster Internal Only | Host firewall |
| **Media CDN** | ✅ Cloudflare CDN | ❌ Not Applicable | Public S3 + Edge caching |
|
||||
|
||||
## Tailscale Mesh VPN ✅ OPERATIONAL
|
||||
|
||||
### Administrative Access Configuration
|
||||
- **kubectl Context**: `admin@keyboardvagabond-tailscale` using internal VLAN IP (10.132.0.10:6443)
|
||||
- **Public Context**: `admin@keyboardvagabond.com` (blocked by firewall)
|
||||
- **Tailscale Client**: Current IP range 100.64.0.0/10 (CGNAT)
|
||||
- **Firewall Rules**: Cilium host firewall restricts API access to Tailscale network only
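
To see what "restricted to 100.64.0.0/10" means for a concrete client address, the range test can be reproduced with plain shell arithmetic. This is an illustrative sketch only; `ip_to_int` and `ip_in_cidr` are hypothetical helper names, and the code handles IPv4 dotted-quad input without validation.

```bash
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Return 0 when the address (first argument) lies inside the CIDR block
# (second argument, e.g. 100.64.0.0/10), non-zero otherwise.
ip_in_cidr() (
  ip_int=$(ip_to_int "$1")
  net_int=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
  [ $(( $ip_int & $mask )) -eq $(( $net_int & $mask )) ]
)
```

For example, `ip_in_cidr 100.127.255.255 100.64.0.0/10` succeeds because that is the last address of the range, while `ip_in_cidr 100.128.0.1 100.64.0.0/10` fails, as does the VLAN address `10.132.0.10`.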
|
||||
|
||||
### Tailscale Subnet Router Configuration ✅ OPERATIONAL
|
||||
- **Device Name**: `keyboardvagabond-cluster`
|
||||
- **Deployment Model**: Direct deployment (not Kubernetes Operator) for simplicity
|
||||
- **Advertised Networks**:
|
||||
- **Pod Network**: 10.244.0.0/16 (Kubernetes pods)
|
||||
- **Service Network**: 10.96.0.0/12 (Kubernetes services)
|
||||
- **VLAN Network**: 10.132.0.0/24 (NetCup Cloud private network)
|
||||
- **OAuth Integration**: Client credentials for device authentication and tagging
|
||||
- **Device Tagging**: `tag:k8s-operator` for proper ACL management and identification
|
||||
- **Network Mode**: Kernel mode (`TS_USERSPACE=false`) with privileged security context
|
||||
- **State Persistence**: Kubernetes secret-based storage (`TS_KUBE_SECRET=tailscale-auth`)
|
||||
- **RBAC**: Split permissions (ClusterRole for cluster resources, Role for namespace secrets)
|
||||
|
||||
### Tailscale Deployment Pattern
|
||||
```yaml
# Direct deployment (not Kubernetes Operator)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tailscale-subnet-router
spec:
  template:
    spec:
      containers:
        - name: tailscale
          env:
            - name: TS_KUBE_SECRET
              value: tailscale-auth
            - name: TS_USERSPACE
              value: "false"
            - name: TS_ROUTES
              value: "10.244.0.0/16,10.96.0.0/12,10.132.0.0/24"
          securityContext:
            privileged: true
```
|
||||
|
||||
## Network Security ✅ OPERATIONAL
|
||||
|
||||
### Cilium Host Firewall
|
||||
```yaml
# Host firewall: once this policy selects the control plane nodes, ingress that is
# not explicitly allowed is dropped; the Kubernetes API (6443) is allowed only
# from the Tailscale range
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: host-fw-control-plane
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
    - fromCIDR:
        - "100.64.0.0/10" # Tailscale CGNAT range only
      toPorts:
        - ports:
            - port: "6443"
              protocol: TCP
```
|
||||
|
||||
## Security Best Practices
|
||||
- **New Services**: All applications must use Zero Trust ingress pattern
|
||||
- **Harbor Exception**: Harbor registry requires direct port exposure (header modification issues)
|
||||
- **Secret Management**: All secrets SOPS-encrypted before Git commit
|
||||
- **Network Policies**: Cilium host firewall with CGNAT-only access
|
||||
- **Administrative Access**: Tailscale mesh VPN required for kubectl/talosctl
|
||||
|
||||
## 🏆 Security Achievements
|
||||
1. **🎯 Zero Trust Network**: No implicit trust, all access authenticated and authorized
|
||||
2. **🔐 Defense in Depth**: Multiple security layers prevent single points of failure
|
||||
3. **📊 Comprehensive Monitoring**: All traffic flows monitored via OpenObserve and Cilium Hubble
|
||||
4. **🔄 Secure GitOps**: SOPS-encrypted secrets with PGP key management
|
||||
5. **🛡️ Hardened Infrastructure**: Minimal attack surface with production-grade security controls
|
||||
|
||||
@sops-secret-template.yaml
|
||||
@zero-trust-ingress-template.yaml
|
||||
@tailscale-config-template.yaml
|
||||
48
.cursor/rules/sops-secret-template.yaml
Normal file
@@ -0,0 +1,48 @@
|
||||
# SOPS Secret Template
|
||||
# Use this template for creating encrypted secrets
|
||||
|
||||
apiVersion: v1
|
||||
kind: Secret
|
||||
metadata:
|
||||
name: app-secret
|
||||
namespace: app-namespace
|
||||
type: Opaque
|
||||
data:
|
||||
# These fields will be encrypted by SOPS
|
||||
# Replace with actual base64-encoded values before encryption
|
||||
DATABASE_PASSWORD: <REPLACE_WITH_BASE64_ENCODED_PASSWORD>
|
||||
S3_ACCESS_KEY: <REPLACE_WITH_BASE64_ENCODED_KEY>
|
||||
S3_SECRET_KEY: <REPLACE_WITH_BASE64_ENCODED_SECRET>
|
||||
REDIS_PASSWORD: <REPLACE_WITH_BASE64_ENCODED_PASSWORD>
|
||||
|
||||
---
|
||||
# ConfigMap for non-sensitive configuration
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: app-config
|
||||
namespace: app-namespace
|
||||
data:
|
||||
# Database connection
|
||||
DATABASE_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
|
||||
DATABASE_PORT: "5432"
|
||||
DATABASE_NAME: "app_database"
|
||||
|
||||
# Redis connection
|
||||
REDIS_HOST: "redis-ha-haproxy.redis-system.svc.cluster.local"
|
||||
REDIS_PORT: "6379"
|
||||
|
||||
# S3 storage configuration
|
||||
S3_BUCKET: "app-bucket"
|
||||
S3_REGION: "<REPLACE_WITH_S3_REGION>"
|
||||
S3_ENDPOINT: "<REPLACE_WITH_S3_ENDPOINT>"
|
||||
S3_CDN_URL: "<REPLACE_WITH_CDN_URL>"
|
||||
|
||||
# Application settings
|
||||
APP_ENV: "production"
|
||||
APP_DEBUG: "false"
|
||||
|
||||
# SOPS encryption commands:
|
||||
# sops -e -i this-file.yaml
|
||||
# sops this-file.yaml # to edit
|
||||
# sops -d this-file.yaml | kubectl apply -f - # to apply
|
||||
96
.cursor/rules/talos-config-template.yaml
Normal file
@@ -0,0 +1,96 @@
|
||||
# Talos Configuration Templates
|
||||
# Machine configurations and Talos-specific patterns
|
||||
|
||||
# Custom Talos Factory Image
|
||||
# Uses factory image with Longhorn extension pre-installed
|
||||
TALOS_FACTORY_IMAGE: "613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4"
|
||||
|
||||
# Network Interface Configuration (Talos machine config fragment; Talos configs use
# top-level `machine:` and `cluster:` keys rather than a Kubernetes-style kind/metadata/spec wrapper)
---
machine:
  network:
    interfaces:
      # Public interface (DHCP + static configuration)
      - interface: enp7s0
        dhcp: true
        addresses:
          - 152.53.107.24/24 # Example for n1
        routes:
          - network: 0.0.0.0/0
            gateway: 152.53.107.1

      # Private VLAN interface (static configuration)
      - interface: enp9s0
        addresses:
          - 10.132.0.10/24 # Example for n1 (VLAN 1004963)
        vip:
          ip: 10.132.0.5 # Shared VIP for control plane HA

  # Node IP Configuration
  kubelet:
    extraArgs:
      node-ip: 152.53.107.24 # Use public IP for node reporting
|
||||
|
||||
# Node IP Mappings (NetCup Cloud vLAN 1004963)
|
||||
# All nodes are control plane nodes with shared VIP for HA
|
||||
# n1: Public 152.53.107.24 + Private 10.132.0.10/24 (Control plane)
|
||||
# n2: Public 152.53.105.81 + Private 10.132.0.20/24 (Control plane)
|
||||
# n3: Public 152.53.200.111 + Private 10.132.0.30/24 (Control plane)
|
||||
# VIP: 10.132.0.5 (shared VIP, nodes elect primary)
|
||||
|
||||
# Cluster Configuration (Talos machine config fragment, top-level `cluster:` key)
---
cluster:
  clusterName: keyboardvagabond.com
  controlPlane:
    endpoint: https://10.132.0.5:6443 # VIP endpoint for HA

  # Allow workloads on control plane
  allowSchedulingOnControlPlanes: true

  # CNI Configuration (Cilium)
  network:
    cni:
      name: none # Cilium installed via Helm
    dnsDomain: cluster.local # Standard domain for compatibility

  # API Server Configuration
  apiServer:
    extraArgs:
      # Enable aggregation layer for metrics
      enable-aggregator-routing: "true"
|
||||
|
||||
# Volume Configuration
|
||||
# System disk: /dev/vda with 2-50GB ephemeral storage
|
||||
# Longhorn storage: 400GB minimum on system disk at /var/lib/longhorn
|
||||
|
||||
# Administrative Access Commands
# Recommended for talosctl: individual node endpoints. The VIP is tied to etcd and
# kube-apiserver health, so it can be unreachable exactly when recovery is needed.
# talosctl config endpoint 10.132.0.10 10.132.0.20 10.132.0.30
# talosctl config node 10.132.0.10
# talosctl health
# talosctl dashboard (via Tailscale VPN only)

# Alternative: VIP endpoint (works only while the control plane is healthy)
# talosctl config endpoint 10.132.0.5 # VIP endpoint
# talosctl config node 10.132.0.5
|
||||
|
||||
# kubectl Contexts:
|
||||
# - admin@keyboardvagabond-tailscale (VIP: 10.132.0.5:6443 or node IPs) - ACTIVE
|
||||
# - admin@keyboardvagabond.com (blocked by firewall, Tailscale-only access)
|
||||
|
||||
# Security Notes:
|
||||
# - API access restricted to Tailscale CGNAT range (100.64.0.0/10)
|
||||
# - Cilium host firewall blocks world access to ports 6443, 50000-50010
|
||||
# - All administrative access requires Tailscale mesh VPN connection
|
||||
# - Backup kubeconfig available as SOPS-encrypted portable configuration
|
||||
189
.cursor/rules/technical-specifications.mdc
Normal file
@@ -0,0 +1,189 @@
|
||||
---
|
||||
description: Detailed technical specifications for nodes, network, and Talos configuration
|
||||
globs: ["machineconfigs/**/*", "patches/**/*", "talosconfig", "kubeconfig*"]
|
||||
alwaysApply: false
|
||||
---
|
||||
|
||||
# Technical Specifications & Low-Level Configuration
|
||||
|
||||
## Talos Configuration ✅ OPERATIONAL
|
||||
|
||||
### Custom Talos Image
|
||||
- **Factory Image**: `613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4`, which includes the two system extensions Longhorn needs
|
||||
- **Extensions**: Longhorn extension included for distributed storage
|
||||
- **Version**: Talos v1.10.4 with custom factory build
|
||||
- **Architecture**: ARM64 optimized for NetCup Cloud infrastructure
|
||||
|
||||
### Patch Configuration
|
||||
Applied via `patches/` directory for cluster customization:
|
||||
- **allow-controlplane-workloads.yaml**: Enables workload scheduling on control plane
|
||||
- **cluster-name.yaml**: Sets cluster name to `keyboardvagabond.com`
|
||||
- **disable-kube-proxy-and-cni.yaml**: Disables built-in networking for Cilium
|
||||
- **etcd-patch.yaml**: etcd optimization and configuration
|
||||
- **registry-patch.yaml**: Container registry configuration
|
||||
- **worker-discovery-patch.yaml**: Worker node discovery settings
|
||||
|
||||
## Network Configuration ✅ OPERATIONAL
|
||||
|
||||
### NetCup Cloud Infrastructure
|
||||
- **vLAN ID**: 1004963 for internal cluster communication
|
||||
- **Network Range**: 10.132.0.0/24 (private VLAN)
|
||||
- **DNS Domain**: `cluster.local` (standard Kubernetes domain)
|
||||
- **Cluster Name**: `keyboardvagabond.com`
|
||||
|
||||
### Node Network Configuration
|
||||
| Node | Public IP | VLAN IP | Role | Status |
|------|-----------|---------|------|--------|
| **n1** | 152.53.107.24 | 10.132.0.10/24 | Control Plane | ✅ Schedulable |
| **n2** | 152.53.105.81 | 10.132.0.20/24 | Control Plane | ✅ Schedulable |
| **n3** | 152.53.200.111 | 10.132.0.30/24 | Control Plane | ✅ Schedulable |
|
||||
- **Control Plane VIP**: `10.132.0.5` (shared VIP, nodes elect primary for HA)
|
||||
- **All nodes are control plane**: High availability with etcd quorum (2 of 3 required)
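
The "2 of 3 required" figure follows from the usual majority rule for etcd: a cluster of `n` members needs `floor(n/2) + 1` of them to agree. The sketch below only illustrates that arithmetic; `quorum` and `fault_tolerance` are hypothetical helper names.

```bash
# Number of etcd members that must be healthy for the cluster to accept writes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can fail before quorum is lost.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

With three control plane nodes, `quorum 3` gives 2 and `fault_tolerance 3` gives 1, so the cluster survives the loss of one node. A fourth node would not help: `fault_tolerance 4` is still 1.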
|
||||
|
||||
### Network Interface Configuration
|
||||
- **`enp7s0`**: Public interface (DHCP + static configuration)
|
||||
- **`enp9s0`**: Private VLAN interface (static configuration)
|
||||
- **Internal Traffic**: Uses private VLAN for pod-to-pod and storage replication
|
||||
- **External Access**: Cloudflare Zero Trust tunnels (no direct port exposure)
|
||||
|
||||
## Administrative Access Configuration ✅ SECURED
|
||||
|
||||
### Kubernetes API Access
|
||||
- **Internal Context**: `admin@keyboardvagabond-tailscale`
|
||||
- **VIP Endpoint**: `10.132.0.5:6443` (shared VIP, recommended for HA)
|
||||
- **Node Endpoints**: `10.132.0.10:6443`, `10.132.0.20:6443`, `10.132.0.30:6443` (individual nodes)
|
||||
- **Public Context**: `admin@keyboardvagabond.com` (blocked by firewall)
|
||||
- **Public Endpoint**: `api.keyboardvagabond.com:6443` (Tailscale-only)
|
||||
- **Access Method**: Tailscale mesh VPN required (CGNAT 100.64.0.0/10)
|
||||
|
||||
### Talos API Access
|
||||
```bash
# Talos configuration: individual node endpoints are recommended for talosctl,
# because the VIP depends on etcd and kube-apiserver health
talosctl config endpoint 10.132.0.10 10.132.0.20 10.132.0.30
talosctl config node 10.132.0.10 # Default node for commands

# Alternative: VIP endpoint (works only while the control plane is healthy)
talosctl config endpoint 10.132.0.5 # VIP endpoint
talosctl config node 10.132.0.5 # VIP node
```
|
||||
|
||||
### Essential Management Commands
|
||||
```bash
|
||||
# Cluster health check
|
||||
talosctl health --nodes 10.132.0.10,10.132.0.20,10.132.0.30
|
||||
|
||||
# Node status
|
||||
talosctl get members
|
||||
|
||||
# Kubernetes context switching
|
||||
kubectl config use-context admin@keyboardvagabond-tailscale
|
||||
|
||||
# Node status verification
|
||||
kubectl get nodes -o wide
|
||||
```
|
||||
|
||||
## Storage Configuration Details ✅ OPERATIONAL
|
||||
|
||||
### Longhorn Distributed Storage
|
||||
- **Installation Path**: `/var/lib/longhorn` on each node
|
||||
- **Replica Policy**: 2-replica configuration across nodes
|
||||
- **Storage Class**: `longhorn-retain` for data preservation
|
||||
- **Node Allocation**: 400GB+ per node on system disk
|
||||
- **Auto-balance**: Enabled for optimal distribution
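
As a rough capacity estimate, every volume is stored twice under the 2-replica policy, so usable space is about half of the raw space. The helper below is a back-of-the-envelope sketch: `usable_gb` is a hypothetical name, it assumes each node contributes exactly the stated allocation, and it ignores Longhorn's reserved-space and over-provisioning settings.

```bash
# Approximate usable capacity in GB.
# Arguments: number of nodes, GB allocated per node, replicas per volume.
usable_gb() {
  echo $(( $1 * $2 / $3 ))
}
```

For this cluster, `usable_gb 3 400 2` yields 600, meaning roughly 600 GB of volume data fits in 1200 GB of raw Longhorn storage.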
|
||||
|
||||
### Volume Configuration
|
||||
- **System Disk**: `/dev/vda` with ephemeral storage
|
||||
- **Longhorn Volume**: 400GB minimum allocation per node
|
||||
- **Backup Strategy**: Label-based S3 backup selection
|
||||
- **Reclaim Policy**: Retain (prevents data loss)
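
As an illustration of the replica and reclaim settings above, a `longhorn-retain` StorageClass might look like the following. This is a sketch rather than the cluster's actual manifest: `allowVolumeExpansion`, `volumeBindingMode`, and `staleReplicaTimeout` are assumed values that do not come from this document.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-retain
provisioner: driver.longhorn.io
reclaimPolicy: Retain              # keeps the volume when the PVC is deleted
allowVolumeExpansion: true         # assumption
volumeBindingMode: Immediate       # assumption
parameters:
  numberOfReplicas: "2"            # 2-replica policy across nodes
  staleReplicaTimeout: "2880"      # assumption (minutes)
```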
|
||||
|
||||
## Tailscale Mesh VPN Configuration ✅ OPERATIONAL
|
||||
|
||||
### Tailscale Operator Deployment
|
||||
- **Helm Chart**: `tailscale-operator` from Tailscale Helm repository
|
||||
- **Version**: v1.90.x (operator v1.90.8)
|
||||
- **Namespace**: `tailscale-system`
|
||||
- **Replicas**: 2 operator pods with anti-affinity
|
||||
- **Hostname**: `keyboardvagabond-operator`
|
||||
|
||||
### Subnet Router Configuration (Connector Resource)
|
||||
- **Resource Type**: `Connector` (tailscale.com/v1alpha1)
|
||||
- **Device Name**: `keyboardvagabond-cluster`
|
||||
- **Advertised Networks**:
  - **Pod Network**: 10.244.0.0/16
  - **Service Network**: 10.96.0.0/12
  - **VLAN Network**: 10.132.0.0/24
|
||||
- **OAuth Integration**: Client credentials for device authentication
|
||||
- **Device Tagging**: `tag:k8s-operator` for ACL management
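
A `Connector` matching the settings listed above could be sketched as follows. The field layout follows the Tailscale operator's `Connector` CRD; the `metadata.name` value is an assumption, and the OAuth client credentials are configured on the operator itself, not in this resource.

```yaml
apiVersion: tailscale.com/v1alpha1
kind: Connector
metadata:
  name: keyboardvagabond-cluster   # assumption: resource name mirrors the device name
spec:
  hostname: keyboardvagabond-cluster
  tags:
    - tag:k8s-operator
  subnetRouter:
    advertiseRoutes:
      - 10.244.0.0/16   # Pod network
      - 10.96.0.0/12    # Service network
      - 10.132.0.0/24   # VLAN network
```

Advertised routes typically still need to be approved in the Tailscale admin console or auto-approved through the tailnet ACL policy.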
|
||||
|
||||
### Service Exposure via Magic DNS
|
||||
- **Capability**: Services can be exposed via Tailscale operator with meta attributes
|
||||
- **Magic DNS**: Automatic DNS resolution for exposed services
|
||||
- **Meta Attributes**: Can be used to configure service exposure and routing
|
||||
- **Access Control**: Cilium host firewall restricts to Tailscale only
|
||||
- **Current CGNAT Range**: 100.64.0.0/10 (Tailscale assigned)
|
||||
|
||||
## Component Status Matrix ✅ CURRENT STATE
|
||||
|
||||
### Active Components
|
||||
| Component | Status | Access Method | Notes |
|-----------|--------|---------------|-------|
| **Cilium CNI** | ✅ Operational | Internal | Host firewall + Hubble UI |
| **Longhorn Storage** | ✅ Operational | Internal | 2-replica with S3 backup |
| **PostgreSQL HA** | ✅ Operational | Internal | 3-instance CloudNativePG |
| **Harbor Registry** | ✅ Operational | Direct HTTPS | Zero Trust incompatible |
| **OpenObserve** | ✅ Operational | Zero Trust | Monitoring platform |
| **Tailscale VPN** | ✅ Operational | Mesh Network | Administrative access |
|
||||
|
||||
### Disabled/Deprecated Components
|
||||
| Component | Status | Reason | Alternative |
|-----------|--------|--------|-------------|
| **external-dns** | ❌ Removed | Zero Trust migration | Manual DNS in Cloudflare |
| **cert-manager** | ❌ Removed | Zero Trust migration | Cloudflare edge TLS |
| **Rook-Ceph** | ❌ Disabled | Complexity and lack of support for partitioning a single drive | Longhorn storage |
| **Flux GitOps** | ⏸️ Disabled | Manual deployment | Ready for re-activation |
|
||||
|
||||
### Development Components
|
||||
| Component | Status | Purpose | Access |
|-----------|--------|---------|--------|
| **Renovate** | ✅ Operational | Dependency updates | Automated |
| **Elasticsearch** | ✅ Operational | Log aggregation | Internal |
| **Kibana** | ✅ Operational | Log analytics | Zero Trust |
|
||||
|
||||
## Network Security Configuration ✅ HARDENED
|
||||
|
||||
### Cilium Host Firewall Rules
|
||||
```yaml
# Control plane API access (Tailscale only)
- fromCIDR: ["100.64.0.0/10"]  # Tailscale CGNAT
  toPorts:
    - ports:
        - port: "6443"
          protocol: TCP

# World access to HTTP/HTTPS:
# - HTTP/HTTPS ports blocked from 0.0.0.0/0
# - Only cluster-internal and Tailscale access permitted
```
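
For context, a rule like the one above usually lives inside a `CiliumClusterwideNetworkPolicy` that selects host endpoints through `nodeSelector`. The following is a minimal sketch, not the cluster's actual policy: the policy name and the node label are assumptions. Once a host policy selects a node, ingress that is not explicitly allowed is dropped, so a real policy also needs rules for etcd, kubelet, and intra-cluster traffic before it is applied.

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: host-fw-control-plane        # hypothetical name
spec:
  description: Allow the Kubernetes API only from the Tailscale CGNAT range
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""   # assumption: standard control plane label
  ingress:
    - fromCIDR:
        - 100.64.0.0/10              # Tailscale CGNAT
      toPorts:
        - ports:
            - port: "6443"
              protocol: TCP
```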
|
||||
|
||||
### Zero Trust Architecture
|
||||
- **External Applications**: All via Cloudflare tunnels
|
||||
- **Administrative APIs**: Tailscale mesh VPN only
|
||||
- **Harbor Exception**: Direct ports 80/443 (header modification issues)
|
||||
- **Internal Services**: Cluster-local communication only
|
||||
|
||||
## Future Scaling Specifications
|
||||
|
||||
### Node Addition Process
|
||||
1. **Network**: Add to NetCup Cloud vLAN 1004963
|
||||
2. **IP Assignment**: Sequential (10.132.0.40/24, 10.132.0.50/24, etc.)
|
||||
3. **Talos Config**: Apply machine config with proper networking
|
||||
4. **Longhorn**: Automatic storage distribution across new nodes
|
||||
5. **Workload**: Immediate scheduling capability
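
The IP assignment and Talos config steps above could translate into a machine config fragment along these lines. This is a sketch under assumptions: the hostname `n4` is hypothetical, and `enp9s0` is assumed to be the private VLAN interface name on the new node as it is on the existing ones.

```yaml
machine:
  network:
    hostname: n4                     # hypothetical name for the fourth node
    interfaces:
      - interface: enp9s0            # private VLAN interface (assumed same naming as n1-n3)
        dhcp: false
        addresses:
          - 10.132.0.40/24           # next sequential address on the VLAN
```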
|
||||
|
||||
### High Availability Expansion
|
||||
- **Additional Control Planes**: Can add for true HA setup
|
||||
- **Load Balancing**: MetalLB or cloud LB integration ready
|
||||
- **Database Scaling**: PostgreSQL can expand to more replicas
|
||||
- **Storage Scaling**: Longhorn distributed across all nodes
|
||||
|
||||
@talos-machine-config-template.yaml
|
||||
@cilium-network-policy-template.yaml
|
||||
@longhorn-volume-template.yaml
|
||||
149
.cursor/rules/troubleshooting-history.mdc
Normal file
@@ -0,0 +1,149 @@
|
||||
---
|
||||
description: Historical issues, lessons learned, and troubleshooting knowledge from cluster evolution
|
||||
globs: []
|
||||
alwaysApply: false
|
||||
---
|
||||
|
||||
# Troubleshooting History & Lessons Learned
|
||||
|
||||
This rule captures critical historical knowledge from the cluster's evolution, including resolved issues, migration challenges, and lessons learned that inform future decisions.
|
||||
|
||||
## 🔄 Major Architecture Migrations
|
||||
|
||||
### DNS Domain Evolution ✅ **RESOLVED**
|
||||
- **Previous Issue**: Used custom `local.keyboardvagabond.com` domain causing compatibility problems
|
||||
- **Resolution**: Reverted to standard `cluster.local` domain
|
||||
- **Benefits**: Full compatibility with monitoring dashboards, service discovery, and all Kubernetes tooling
|
||||
- **Lesson**: Always use standard Kubernetes domains unless absolutely necessary
|
||||
|
||||
### Zero Trust Migration ✅ **COMPLETED**
|
||||
- **Migration Scope**: 10 of 11 external services migrated from external-dns/cert-manager to Cloudflare Zero Trust tunnels
|
||||
- **Services Migrated**: Mastodon, Mastodon Streaming, Pixelfed, PieFed, Picsur, BookWyrm, Authentik, OpenObserve, Kibana, WriteFreely
|
||||
- **Harbor Exception**: Harbor registry reverted to direct port exposure (80/443) due to Cloudflare header modification breaking container image layer writes
|
||||
- **Dependencies Removed**: external-dns and cert-manager components no longer needed
|
||||
- **Key Challenges Resolved**: Mastodon streaming subdomain compatibility, StatefulSet immutable fields, service discovery issues
|
||||
|
||||
## 🛠️ Historical Technical Issues
|
||||
|
||||
### DNS and External-DNS Resolution ✅ **RESOLVED & DEPRECATED**
|
||||
- **Previous Issue**: External-DNS creating records with private VLAN IPs (10.132.0.x) which Cloudflare rejected
|
||||
- **Temporary Solution**: Used `external-dns.alpha.kubernetes.io/target` annotations with public IPs
|
||||
- **Target Annotations**: `152.53.107.24,152.53.105.81` were used for all ingress resources
|
||||
- **Final Resolution**: **External-DNS completely removed in favor of Cloudflare Zero Trust tunnels**
|
||||
- **Current Status**: Manual DNS record creation via Cloudflare Dashboard (external-dns no longer needed)
|
||||
|
||||
### SSL Certificate Issues ✅ **RESOLVED**
|
||||
- **Previous Issue**: Let's Encrypt certificates stuck in "False/Not Ready" state due to DNS resolution failures
|
||||
- **Resolution**: DNS records now resolve correctly, enabling HTTP-01 challenge completion
|
||||
- **Migration**: Eventually replaced by Zero Trust architecture eliminating certificate management
|
||||
|
||||
### Node IP Configuration ✅ **IMPLEMENTED**
|
||||
- **Approach**: Using kubelet `extraArgs` with `node-ip` parameter
|
||||
- **n2 Status**: ✅ Successfully reporting public IP (152.53.105.81)
|
||||
- **Backup Strategy**: Target annotations provide reliable DNS record creation regardless of node IP status
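
The kubelet approach described above corresponds to a Talos machine config fragment such as the one below. It is a sketch only: the address shown is the n2 public IP quoted in this section, and each node would carry its own value.

```yaml
machine:
  kubelet:
    extraArgs:
      node-ip: 152.53.105.81         # n2 public IP from this section; per-node value
```

Talos also offers `machine.kubelet.nodeIP.validSubnets` as a subnet-based way to select the node IP, which avoids hard-coding an address per node.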
|
||||
|
||||
## 🔍 Framework-Specific Lessons Learned
|
||||
|
||||
### CDN Storage Evolution: Shared vs Dedicated Buckets
|
||||
**Original Plan**: Single bucket with prefixes (`/pixelfed`, `/piefed`, `/mastodon`)
|
||||
**Issue Discovered**: Pixelfed demonstrated inconsistent prefix handling, sometimes failing to return URLs with correct subdirectory
|
||||
**Solution**: Dedicated buckets eliminate compatibility issues entirely
|
||||
|
||||
**Benefits of Dedicated Bucket Approach**:
|
||||
- **Application Compatibility**: Some applications don't fully support S3 prefixes
|
||||
- **No Prefix Conflicts**: Eliminates S3 path prefix issues with shared buckets
|
||||
- **Simplified Configuration**: Clean S3 endpoints without complex path rewriting
|
||||
- **Independent Scaling**: Each application can optimize caching independently
|
||||
|
||||
### Mastodon Streaming Subdomain Challenge ✅ **FIXED**
|
||||
- **Original**: `streaming.mastodon.keyboardvagabond.com`
|
||||
- **Issue**: Multi-level subdomains are not covered by the Universal SSL certificate on the Cloudflare Free plan
|
||||
- **Solution**: Changed to `streamingmastodon.keyboardvagabond.com` ✅ **WORKING**
|
||||
- **Lesson**: The Cloudflare Free plan's Universal SSL certificate covers only one subdomain level (`app.domain.com`, not `sub.app.domain.com`)
|
||||
|
||||
### Flask Application Discovery Patterns
|
||||
**Critical Framework Identification**: Must identify Flask vs Django early in development
|
||||
- **Flask**: Uses `flask` command, URL-based config (DATABASE_URL), application factory pattern
|
||||
- **Django**: Uses `python manage.py` commands, separate host/port variables, standard project structure
|
||||
- **uWSGI Integration**: Must use same Python version as venv; install via pip, not Alpine packages
|
||||
- **Static Files**: Flask with application factory has nested structure (`/app/app/static/`)
|
||||
|
||||
### Laravel S3 Configuration Discoveries
|
||||
**Critical Laravel S3 Settings**:
|
||||
- **`DANGEROUSLY_SET_FILESYSTEM_DRIVER=s3`**: Essential to make S3 the default filesystem
|
||||
- **Cache Invalidation**: Must run `php artisan config:cache` after S3 (or any) configuration changes
|
||||
- **Dedicated Buckets**: Prevents double-prefix issues that occur with shared buckets
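
As a rough illustration of these settings for Pixelfed, the environment could be supplied through a ConfigMap like the one below. The ConfigMap name, namespace, bucket name, and endpoint are placeholders and not values from this cluster; credentials belong in a Secret and are omitted here.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pixelfed-s3-env                      # hypothetical name
  namespace: pixelfed                        # hypothetical namespace
data:
  DANGEROUSLY_SET_FILESYSTEM_DRIVER: "s3"    # makes S3 the default filesystem
  FILESYSTEM_CLOUD: "s3"
  PF_ENABLE_CLOUD: "true"
  AWS_BUCKET: "example-pixelfed-bucket"      # placeholder: dedicated bucket, no prefix
  AWS_ENDPOINT: "https://s3.example.com"     # placeholder endpoint
  AWS_USE_PATH_STYLE_ENDPOINT: "false"       # assumption; depends on the provider
```

After changing any of these values, `php artisan config:cache` has to be re-run inside the application container for the change to take effect.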
|
||||
|
||||
### Django Static File Pipeline
|
||||
**Theme Compilation Order**: Must compile themes **before** static file collection to S3
|
||||
- **Correct Pipeline**: `compile_themes` → `collectstatic` → S3 upload
|
||||
- **Backblaze B2**: Requires empty `AWS_DEFAULT_ACL` due to no ACL support
|
||||
- **Container Builds**: Theme compilation requires database access, so it runs at container runtime rather than at image build time
|
||||
|
||||
## 🚨 Zero Trust Migration Issues Resolved
|
||||
|
||||
### Common Migration Problems
|
||||
- **Mastodon Streaming**: Fixed subdomain compatibility for Cloudflare Free plan
|
||||
- **OpenObserve StatefulSet**: Used manual Helm deployment to bypass immutable field restrictions
|
||||
- **Picsur Service Discovery**: Fixed label mismatch between service selector and pod labels
|
||||
- **Corporate VPN Blocking**: SSL handshake failures resolved by testing from different networks
|
||||
|
||||
### Harbor Registry Exception
|
||||
**Why Harbor Can't Use Zero Trust**:
|
||||
- **Issue**: Cloudflare header modification breaks container image layer writes
|
||||
- **Solution**: Direct port exposure (80/443) for Harbor only
|
||||
- **Security**: All other services use Zero Trust tunnels
|
||||
|
||||
## 🔧 Infrastructure Evolution Context
|
||||
|
||||
### Talos Configuration
|
||||
- **Custom Image**: `613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4` with Longhorn extension
|
||||
- **Network Interfaces**:
  - `enp7s0`: Public interface (DHCP + static configuration)
  - `enp9s0`: Private VLAN interface (static configuration)
|
||||
|
||||
### Storage Evolution
|
||||
- **Original**: Basic Longhorn setup
|
||||
- **Current**: 2-replica configuration with S3 backup integration
|
||||
- **Backup Strategy**: Label-based volume selection system
|
||||
- **Cost Optimization**: $6/TB with $0 egress via Cloudflare partnership
|
||||
|
||||
### Administrative Access Evolution
|
||||
- **Original**: Direct public API access
|
||||
- **Migration**: Tailscale mesh VPN implementation
|
||||
- **Current**: CGNAT-only access (100.64.0.0/10) via mesh network
|
||||
- **Security**: Zero external API exposure
|
||||
|
||||
## 📊 Operational Patterns Discovered
|
||||
|
||||
### Multi-Stage Docker Benefits
|
||||
- **Size Reduction**: From 1.3GB single-stage to ~350MB multi-stage builds (~75% reduction)
|
||||
- **Essential for**: Python/Node.js applications to remove build dependencies
|
||||
- **Pattern**: Base image → Web container → Worker container specialization
|
||||
|
||||
### ActivityPub Rate Limiting Implementation
|
||||
**Based on**: [PieFed blog recommendations](https://join.piefed.social/2024/04/17/handling-large-bursts-of-post-requests-to-your-activitypub-inbox-using-a-buffer-in-nginx/)
|
||||
- **Rate**: 10 requests/second with 300 request burst buffer
|
||||
- **Memory**: 100MB zone sufficient for large-scale instances
|
||||
- **Federation Impact**: Graceful handling of viral content spikes
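
With ingress-nginx, the shared memory zone for these limits has to be declared in the nginx `http` context, which is normally done through the controller ConfigMap's `http-snippet` key. The fragment below is a sketch: the ConfigMap name and namespace depend on how the controller was installed and are assumptions, and the zone name `app_inbox` is illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller     # assumption: default name from the Helm chart
  namespace: ingress-nginx           # assumption
data:
  http-snippet: |
    # 100MB zone keyed by client address, 10 requests/second sustained rate
    limit_req_zone $binary_remote_addr zone=app_inbox:100m rate=10r/s;
```

Individual Ingress resources can then reference the zone with `limit_req zone=app_inbox burst=300;` to get the 300-request burst buffer.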
|
||||
|
||||
### Terminal Environment Discovery
|
||||
- **PowerShell on macOS**: PSReadLine displays errors but commands execute successfully
|
||||
- **Recommendation**: Use default OS terminal over PowerShell (except Windows)
|
||||
- **Functionality**: Command outputs remain readable despite display issues
|
||||
|
||||
## 🎯 Critical Success Factors
|
||||
|
||||
### What Made Migrations Successful
|
||||
1. **Gradual Migration**: One service at a time instead of big-bang approach
|
||||
2. **Testing Pattern**: `kubectl run curl-test` to verify internal service health
|
||||
3. **Backup Strategies**: Target annotations as fallback for DNS issues
|
||||
4. **Documentation**: Detailed tracking of each migration step and issue resolution
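
The `curl-test` pattern from item 2 can also be written declaratively as a one-shot Pod. This is a sketch, assuming the public `curlimages/curl` image and a placeholder service URL; substitute the service and namespace being verified.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: curl-test
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl         # assumption: any image that ships curl works
      # -f fails on HTTP errors, -sS is silent but still reports errors
      args: ["-fsS", "http://app-service.app-namespace.svc.cluster.local"]   # placeholder URL
```

The Pod ends in `Completed` when the service answers and in `Error` otherwise; `kubectl logs curl-test` shows the response body, and the Pod can be deleted afterwards.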
|
||||
|
||||
### Patterns to Avoid
|
||||
1. **Custom DNS Domains**: Stick to `cluster.local` for compatibility
|
||||
2. **Shared S3 Buckets**: Use dedicated buckets to avoid prefix conflicts
|
||||
3. **Complex Subdomains**: Cloudflare Free plan limitations require simple patterns
|
||||
4. **Single-Stage Containers**: Multi-stage builds essential for production efficiency
|
||||
|
||||
This historical knowledge should inform all future architectural decisions and troubleshooting approaches.
|
||||
54
.cursor/rules/zero-trust-ingress-template.yaml
Normal file
@@ -0,0 +1,54 @@
|
||||
# Zero Trust Ingress Template
# Use this template for all new applications deployed via Cloudflare tunnels

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: app-namespace
  annotations:
    # Basic NGINX Configuration only - no cert-manager or external-dns
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"

    # Optional: Extended timeouts for long-running requests
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"

    # Optional: ActivityPub rate limiting for fediverse applications
    # Note: nginx accepts limit_req_zone only in the http context, so this zone
    # may need to move to the controller ConfigMap (http-snippet) to take effect
    nginx.ingress.kubernetes.io/server-snippet: |
      limit_req_zone $binary_remote_addr zone=app_inbox:100m rate=10r/s;
    nginx.ingress.kubernetes.io/configuration-snippet: |
      location ~* ^/(inbox|users/.*/inbox) {
        limit_req zone=app_inbox burst=300;
      }

spec:
  ingressClassName: nginx
  tls: []  # Empty - TLS handled by Cloudflare edge
  rules:
    - host: app.keyboardvagabond.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80

---
# Service template
apiVersion: v1
kind: Service
metadata:
  name: app-service
  namespace: app-namespace
spec:
  selector:
    app: app-name
  ports:
    - name: http
      port: 80
      targetPort: 8080