add source code and readme

2025-12-24 14:35:17 +01:00
parent 7c92e1e610
commit 74324d5a1b
331 changed files with 39272 additions and 1 deletion

@@ -0,0 +1,58 @@
---
description: Keyboard Vagabond project overview and core infrastructure context
globs: []
alwaysApply: true
---
# Keyboard Vagabond - Project Overview
## System Overview
This is a **Talos-based Kubernetes cluster** designed to host **fediverse applications** for <200 MAU (Monthly Active Users):
- **Mastodon** (Twitter-like microblogging) ✅ OPERATIONAL
- **Pixelfed** (Instagram-like photo sharing) ✅ OPERATIONAL
- **PieFed** (Reddit-like forum) ✅ OPERATIONAL
- **BookWyrm** (Social reading platform) ✅ OPERATIONAL
- **Matrix** (Chat/messaging) - Future deployment
## Architecture Summary ✅ OPERATIONAL
- **Three ARM64 Nodes**: n1, n2, n3 (all control plane nodes with VIP 10.132.0.5)
- **Zero Trust Security**: Cloudflare tunnels + Tailscale mesh VPN
- **Storage**: Longhorn distributed with S3 backup to Backblaze B2
- **Database**: PostgreSQL HA cluster with CloudNativePG operator
- **Cache**: Redis HA cluster with HAProxy (redis-ha-haproxy.redis-system.svc.cluster.local)
- **Monitoring**: OpenTelemetry + OpenObserve (O2)
- **Registry**: Harbor container registry
- **CDN**: Per-application Cloudflare CDN with dedicated S3 buckets
## Project Structure
```
keyboard-vagabond/
├── .cursor/rules/ # Cursor rules (this directory)
├── docs/ # Operational documentation and guides
├── manifests/ # Kubernetes manifests
│ ├── infrastructure/ # Core infrastructure components
│ ├── applications/ # Fediverse applications
│ └── cluster/flux-system/ # GitOps configuration
├── build/ # Custom container builds
├── machineconfigs/ # Talos node configurations
└── tools/ # Development utilities
```
## Rule Organization
The `.cursor/rules/` directory contains specialized rules:
- **00-project-overview.mdc** (this file) - Always applied project context
- **infrastructure.mdc**: Auto-attached when working in `manifests/infrastructure/`
- **applications.mdc**: Auto-attached when working in `manifests/applications/`
- **security.mdc**: SOPS and Zero Trust patterns (auto-attached for YAML files)
- **development.mdc**: Development patterns and operational guidelines
- **troubleshooting-history.mdc**: Historical issues, migrations, and lessons learned
- **templates/**: Common configuration templates (*.yaml files)
## Key Operational Facts
- **Domain**: `keyboardvagabond.com`
- **API Endpoint**: `api.keyboardvagabond.com:6443` (Tailscale-only access)
- **Control Plane VIP**: `10.132.0.5:6443` (nodes elect primary, VIP provides HA)
- **Zero Trust**: All external services via Cloudflare tunnels (no port exposure)
- **Network**: NetCup Cloud vLAN 1004963 (10.132.0.0/24)
- **Security**: Enterprise-grade with SOPS encryption, mesh VPN, host firewall
- **Status**: Fully operational, production-ready cluster

@@ -0,0 +1,124 @@
---
description: Fediverse applications deployment patterns and configurations
globs: ["manifests/applications/**/*", "build/**/*"]
alwaysApply: false
---
# Fediverse Applications ✅ OPERATIONAL
## Application Overview
All applications use **Zero Trust architecture** via Cloudflare tunnels with dedicated S3 buckets for media storage:
### Currently Deployed Applications
- **Mastodon**: `https://mastodon.keyboardvagabond.com` - Microblogging platform ✅ OPERATIONAL
- **Pixelfed**: `https://pixelfed.keyboardvagabond.com` - Photo sharing platform ✅ OPERATIONAL
- **PieFed**: `https://piefed.keyboardvagabond.com` - Forum/Reddit-like platform ✅ OPERATIONAL
- **BookWyrm**: `https://bookwyrm.keyboardvagabond.com` - Social reading platform ✅ OPERATIONAL
- **Picsur**: `https://picsur.keyboardvagabond.com` - Image storage ✅ OPERATIONAL
## Application Architecture Patterns
### Multi-Container Design
Most fediverse applications use **multi-container architecture**:
- **Web Container**: HTTP requests, API, web UI (Nginx + app server)
- **Worker Container**: Background jobs, federation, media processing
- **Beat Container**: (Django apps only) Celery Beat scheduler for periodic tasks
### Storage Strategy ✅ OPERATIONAL
**Per-Application CDN Strategy**: Each application uses dedicated Backblaze B2 bucket with Cloudflare CDN:
- **Pixelfed CDN**: `pm.keyboardvagabond.com` → `pixelfed-bucket`
- **PieFed CDN**: `pfm.keyboardvagabond.com` → `piefed-bucket`
- **Mastodon CDN**: `mm.keyboardvagabond.com` → `mastodon-bucket`
- **BookWyrm CDN**: `bm.keyboardvagabond.com` → `bookwyrm-bucket`
### Database Integration
All applications use the shared **PostgreSQL HA cluster**:
- **Connection**: `postgresql-shared-rw.postgresql-system.svc.cluster.local:5432`
- **Dedicated Databases**: Each app has its own database (e.g., `mastodon`, `pixelfed`, `piefed`, `bookwyrm`)
- **High Availability**: 3-instance cluster with automatic failover
## Framework-Specific Patterns
### Laravel Applications (Pixelfed)
```bash
# Critical Laravel S3 Configuration
FILESYSTEM_DRIVER=s3
PF_ENABLE_CLOUD=true
FILESYSTEM_CLOUD=s3
AWS_BUCKET=pixelfed-bucket # Dedicated bucket approach
AWS_URL=https://pm.keyboardvagabond.com/ # CDN URL
```
### Flask Applications (PieFed)
```bash
# Flask Configuration with Redis and S3
FLASK_APP=pyfedi.py
DATABASE_URL=
CACHE_REDIS_URL=
S3_BUCKET=
S3_PUBLIC_URL=https://pfm.keyboardvagabond.com
```
### Django Applications (BookWyrm)
```bash
# Django S3 Configuration
USE_S3=true
AWS_STORAGE_BUCKET_NAME=bookwyrm-bucket
AWS_S3_CUSTOM_DOMAIN=bm.keyboardvagabond.com
AWS_DEFAULT_ACL="" # Backblaze B2 doesn't support ACLs
```
### Ruby Applications (Mastodon)
```yaml
# Mastodon Dual Ingress Pattern
# Web: mastodon.keyboardvagabond.com
# Streaming: streamingmastodon.keyboardvagabond.com (WebSocket)
STREAMING_API_BASE_URL: wss://streamingmastodon.keyboardvagabond.com
```
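As a sketch, the dual-ingress pattern maps the two hostnames to separate services. The namespace, service names, and ports below are assumptions based on Mastodon's defaults, not the deployed manifests:
```yaml
# Sketch only — namespace, service names, and ports are assumed defaults
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mastodon-web
  namespace: mastodon
spec:
  ingressClassName: nginx
  rules:
    - host: mastodon.keyboardvagabond.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mastodon-web
                port:
                  number: 3000
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: mastodon-streaming
  namespace: mastodon
spec:
  ingressClassName: nginx
  rules:
    - host: streamingmastodon.keyboardvagabond.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: mastodon-streaming
                port:
                  number: 4000
```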
## Container Build Patterns
### Multi-Stage Docker Strategy ✅ WORKING
Optimized builds reduce image size by ~75%:
- **Base Image**: Shared foundation with dependencies and source code
- **Web Container**: Production web server configuration
- **Worker Container**: Background processing optimizations
- **Size Reduction**: From 1.3GB single-stage to ~350MB multi-stage
### Harbor Registry Integration
- **Registry**: `<YOUR_REGISTRY_URL>`
- **Image Pattern**: `<YOUR_REGISTRY_URL>/library/app-name:tag`
- **Build Process**: `./build-all.sh` in project root
## ActivityPub Inbox Rate Limiting ✅ OPERATIONAL
### Nginx Burst Configuration Pattern
Implemented across all fediverse applications to handle federation traffic spikes:
```nginx
# Rate limiting zone - 100MB of shared memory for client state, 10 requests/second
limit_req_zone $binary_remote_addr zone=inbox:100m rate=10r/s;
# ActivityPub inbox location block
location /inbox {
limit_req zone=inbox burst=300; # 300 request buffer
# Extended timeouts for ActivityPub processing
}
```
### Rate Limiting Behavior
- **Normal Operation**: 10 requests/second processed immediately
- **Burst Handling**: Up to 300 additional requests queued
- **Overflow Response**: HTTP 503 once the 300-request burst queue is full
- **Federation Impact**: Protects backend from overwhelming traffic spikes
## Application Deployment Standards
- **Zero Trust Ingress**: All applications use Cloudflare tunnel pattern
- **Container Registry**: Harbor for all custom images
- **Multi-Stage Builds**: Required for Python/Node.js applications
- **Storage**: Longhorn with 2-replica redundancy
- **Monitoring**: ServiceMonitor integration with OpenObserve
- **Rate Limiting**: ActivityPub inbox protection for all fediverse apps
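For the monitoring standard above, a minimal ServiceMonitor sketch (label and port names are illustrative):
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: app-namespace
spec:
  selector:
    matchLabels:
      app: app-name      # must match the Service's labels
  endpoints:
    - port: metrics      # named port on the Service
      interval: 30s
      path: /metrics
```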
@fediverse-app-template.yaml
@s3-storage-config-template.yaml
@activitypub-rate-limiting-template.yaml

@@ -0,0 +1,140 @@
---
description: Development patterns, operational guidelines, and troubleshooting
globs: ["build/**/*", "tools/**/*", "justfile", "*.md"]
alwaysApply: false
---
# Development Patterns & Operational Guidelines
## Configuration Management
- **Kustomize**: Used for resource composition and patching via `patches/` directory
- **Helm**: Complex applications deployed via HelmRelease CRDs
- **GitOps**: All applications deployed via Flux from Git repository (`k8s-fleet` branch)
- **Staging**: Use separate branches/overlays for staging vs production environments
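A minimal sketch of that GitOps wiring; the repository URL and decryption secret name are placeholders (the actual configuration lives in `manifests/cluster/flux-system/`):
```yaml
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: flux-system
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/<YOUR_ORG>/keyboard-vagabond  # placeholder
  ref:
    branch: k8s-fleet
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: applications
  namespace: flux-system
spec:
  interval: 10m
  path: ./manifests/applications
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  decryption:
    provider: sops
    secretRef:
      name: sops-gpg  # placeholder secret holding the PGP private key
```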
## Application Deployment Standards
- **Container Registry**: Use Harbor (`<YOUR_REGISTRY_URL>`) for all custom images
- **Multi-Stage Builds**: Implement for Python/Node.js applications to reduce image size by ~75%
- **Storage**: Use Longhorn with 2-replica redundancy, label volumes for S3 backup selection
- **Database**: Leverage shared PostgreSQL cluster with dedicated databases per application
- **Monitoring**: Implement ServiceMonitor for OpenObserve integration
- **Rate Limiting**: Implement ActivityPub inbox burst protection for all fediverse applications
## Email Templates & User Onboarding
- **Community Signup**: Professional welcome email template at `docs/email-templates/community-signup.html`
- **Authentik Integration**: Uses `{AUTHENTIK_URL}` placeholder for account activation links
- **Documentation**: Complete setup guide in `docs/email-templates/README.md`
- **Services Overview**: Template showcases all fediverse services with direct links
- **Branding**: Features horizontal Keyboard Vagabond logo from Picsur CDN
## Container Build Patterns
### Multi-Stage Docker Strategy ✅ WORKING
**Key Lessons Learned**:
- **Framework Identification**: Critical to identify Flask vs Django early (different command structures)
- **Python Virtual Environment**: uWSGI must use same Python version as venv
- **Static File Paths**: Flask apps with application factory have nested structure (`/app/app/static/`)
- **Database Initialization**: Flask requires explicit `flask init-db` command
- **Log File Permissions**: Non-root users need explicit ownership of log files
### Build Process
```bash
# Build all containers
./build-all.sh
# Build specific application
cd build/app-name
docker build -t <YOUR_REGISTRY_URL>/library/app-name:tag .
docker push <YOUR_REGISTRY_URL>/library/app-name:tag
```
## Key Framework Patterns
### Flask Applications (PieFed)
- **Environment Variables**: URL-based configuration (DATABASE_URL, REDIS_URL)
- **uWSGI Integration**: Install via pip in venv, not Alpine packages
- **Static Files**: Careful nginx configuration for nested structure
- **Multi-stage Builds**: Essential to remove build dependencies
### Django Applications (BookWyrm)
- **S3 Static Files**: Theme compilation before static collection
- **Celery Beat**: Single instance only (prevents duplicate scheduling)
- **ACL Configuration**: Backblaze B2 requires empty `AWS_DEFAULT_ACL`
### Laravel Applications (Pixelfed)
- **S3 Default Disk**: `DANGEROUSLY_SET_FILESYSTEM_DRIVER=s3` required
- **Cache Invalidation**: `php artisan config:cache` after S3 changes
- **Dedicated Buckets**: Avoid prefix conflicts with dedicated bucket approach
## Operational Tools & Management
### Administrative Access ✅ SECURED
- **kubectl Context**: `admin@keyboardvagabond-tailscale` (internal VLAN IP)
- **Tailscale Client**: CGNAT range 100.64.0.0/10 access only
- **Harbor Registry**: Direct HTTPS access (Zero Trust incompatible)
### Essential Commands
```bash
# Talos cluster management (Tailscale VPN required)
talosctl config endpoint 10.132.0.10 10.132.0.20 10.132.0.30
talosctl health
# Kubernetes cluster access
kubectl config use-context admin@keyboardvagabond-tailscale
kubectl get nodes
# SOPS secret management
sops -e -i secrets.yaml
sops -d secrets.yaml | kubectl apply -f -
# Flux GitOps management
flux get sources all
flux reconcile source git flux-system
```
### Terminal Environment Notes
- **PowerShell on macOS**: PSReadLine may display errors but commands execute successfully
- **Terminal Preference**: Use default OS terminal over PowerShell (except Windows)
- **Command Output**: Despite display issues, outputs remain readable and functional
## Scaling Preparation
- **Node Addition**: NetCup Cloud vLAN 1004963 with sequential IPs (10.132.0.x/24)
- **Storage Scaling**: Longhorn distributed across nodes with S3 backup integration
- **Load Balancing**: MetalLB or cloud load balancer integration ready
- **High Availability**: Additional control plane nodes can be added
## Troubleshooting Patterns
### Zero Trust Issues
- **Corporate VPN Blocking**: SSL handshake failures - test from different networks
- **Service Discovery**: Check label mismatch between service selector and pod labels
- **StatefulSet Issues**: Use manual Helm deployment for immutable field changes
### Common Application Issues
- **PHP Applications**: Clear Laravel config cache after environment changes
- **Flask Applications**: Verify uWSGI Python version matches venv
- **Django Applications**: Ensure theme compilation before static file collection
- **Container Builds**: Multi-stage builds reduce size but require careful dependency management
### Network & Storage Issues
- **Longhorn**: Check replica distribution across nodes
- **S3 Backup**: Verify volume labels for backup inclusion
- **Database**: Use read replicas for read-heavy operations
- **CDN**: Dedicated buckets eliminate prefix conflicts
## Performance Optimizations
- **CDN Caching**: Cloudflare cache rules for static assets (1 year cache)
- **Image Processing**: Background workers handle optimization and federation
- **Database Optimization**: Read replicas and proper indexing
- **ActivityPub Rate Limiting**: 10r/s with 300 request burst buffer
## Future Development Guidelines
- **New Services**: Zero Trust ingress pattern mandatory (no cert-manager/external-dns)
- **Security**: Never expose external ingress ports - all traffic via Cloudflare tunnels
- **CDN Strategy**: Use dedicated S3 buckets per application
- **Subdomains**: Cloudflare Free plan supports only one level (`app.domain.com`)
@development-workflow-template.yaml
@container-build-template.dockerfile
@troubleshooting-history.mdc
@talos-config-template.yaml

@@ -0,0 +1,124 @@
# Fediverse Application Deployment Template
# Multi-container architecture with web, worker, and optional beat containers
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-web
namespace: app-namespace
spec:
replicas: 2
selector:
matchLabels:
app: app-name
component: web
template:
metadata:
labels:
app: app-name
component: web
spec:
containers:
- name: web
image: <YOUR_REGISTRY_URL>/library/app-name:latest
ports:
- containerPort: 8080
env:
- name: DATABASE_URL
value: "postgresql://user:password@postgresql-shared-rw.postgresql-system.svc.cluster.local:5432/app_db"
- name: REDIS_URL
value: "redis://:password@redis-ha-haproxy.redis-system.svc.cluster.local:6379/0"
- name: S3_BUCKET
value: "app-bucket"
- name: S3_CDN_URL
value: "https://cdn.keyboardvagabond.com"
envFrom:
- secretRef:
name: app-secret
- configMapRef:
name: app-config
volumeMounts:
- name: app-storage
mountPath: /app/storage
resources:
requests:
memory: "256Mi"
cpu: "100m"
limits:
memory: "1Gi"
cpu: "500m"
volumes:
- name: app-storage
persistentVolumeClaim:
claimName: app-storage-pvc
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-worker
namespace: app-namespace
spec:
replicas: 1
selector:
matchLabels:
app: app-name
component: worker
template:
metadata:
labels:
app: app-name
component: worker
spec:
containers:
- name: worker
image: <YOUR_REGISTRY_URL>/library/app-worker:latest
command: ["worker-command"] # Framework-specific worker command
env:
- name: DATABASE_URL
value: "postgresql://user:password@postgresql-shared-rw.postgresql-system.svc.cluster.local:5432/app_db"
- name: REDIS_URL
value: "redis://:password@redis-ha-haproxy.redis-system.svc.cluster.local:6379/0"
envFrom:
- secretRef:
name: app-secret
- configMapRef:
name: app-config
resources:
requests:
memory: "128Mi"
cpu: "50m"
limits:
memory: "512Mi"
cpu: "200m"
---
# Optional: Celery Beat for Django applications (single replica only)
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-beat
namespace: app-namespace
spec:
replicas: 1 # CRITICAL: Never scale beyond 1 replica
strategy:
type: Recreate # Ensures only one scheduler runs
selector:
matchLabels:
app: app-name
component: beat
template:
metadata:
labels:
app: app-name
component: beat
spec:
containers:
- name: beat
image: <YOUR_REGISTRY_URL>/library/app-worker:latest
command: ["celery", "-A", "app", "beat", "-l", "info", "--scheduler", "django_celery_beat.schedulers:DatabaseScheduler"]
envFrom:
- secretRef:
name: app-secret
- configMapRef:
name: app-config

@@ -0,0 +1,157 @@
---
description: Infrastructure components configuration and deployment patterns
globs: ["manifests/infrastructure/**/*", "manifests/cluster/**/*"]
alwaysApply: false
---
# Infrastructure Components ✅ OPERATIONAL
## Core Infrastructure Stack
Located in `manifests/infrastructure/`:
- **Networking**: Cilium CNI with host firewall and Hubble UI ✅ **OPERATIONAL**
- **Storage**: Longhorn distributed storage (2-replica configuration) ✅ **OPERATIONAL**
- **Ingress**: NGINX Ingress Controller with hostNetwork enabled (Zero Trust mode) ✅ **OPERATIONAL**
- **Zero Trust Tunnels**: Cloudflared deployment in `cloudflared-system` namespace ✅ **OPERATIONAL**
- **Registry**: Harbor container registry (`<YOUR_REGISTRY_URL>`) ✅ **OPERATIONAL**
- **Monitoring**: OpenTelemetry Operator + OpenObserve (O2) ✅ **OPERATIONAL**
- **Database**: PostgreSQL with CloudNativePG operator ✅ **OPERATIONAL**
- **Identity**: Authentik open-source IAM ✅ **OPERATIONAL**
- **VPN**: Tailscale mesh VPN for administrative access ✅ **OPERATIONAL**
## Component Status Matrix
### Active Components ✅ OPERATIONAL
- **Cilium**: CNI with kube-proxy replacement, host firewall
- **Longhorn**: Distributed storage with S3 backup to Backblaze B2
- **PostgreSQL**: 3-instance HA cluster with comprehensive monitoring
- **Harbor**: Container registry (direct HTTPS - Zero Trust incompatible)
- **OpenObserve**: Monitoring and observability platform
- **Authentik**: Open-source identity and access management
- **Renovate**: Automated dependency updates ✅ **ACTIVE**
### Disabled/Deprecated Components
- **external-dns**: ❌ **REMOVED** (replaced by Zero Trust tunnels)
- **cert-manager**: ❌ **REMOVED** (replaced by Cloudflare edge TLS)
- **Rook-Ceph**: ⏸️ **DISABLED** (complexity - using Longhorn instead)
- **Flux GitOps**: ⏸️ **DISABLED** (manual deployment - ready for re-activation)
### Development/Optional Components
- **Elasticsearch**: ✅ **OPERATIONAL** (log aggregation)
- **Kibana**: ✅ **OPERATIONAL** (log analytics via Zero Trust tunnel)
## Network Configuration ✅ OPERATIONAL
- **NetCup Cloud vLAN**: VLAN ID 1004963 for internal cluster communication
- **Control Plane VIP**: `10.132.0.5` (shared VIP, nodes elect primary for HA)
- **Node IPs** (all control plane nodes):
- n1 (152.53.107.24): Public + 10.132.0.10/24 (VLAN)
- n2 (152.53.105.81): Public + 10.132.0.20/24 (VLAN)
- n3 (152.53.200.111): Public + 10.132.0.30/24 (VLAN)
- **DNS Domain**: Uses standard `cluster.local` for maximum compatibility
- **CNI**: Cilium with kube-proxy replacement
- **Service Mesh**: Cilium with Hubble for observability
## Storage Configuration ✅ OPERATIONAL
### Longhorn Storage
- **Default Path**: `/var/lib/longhorn`
- **Replica Count**: 2 (distributed across nodes)
- **Storage Class**: `longhorn-retain` for data preservation
- **S3 Backup**: Backblaze B2 integration with label-based volume selection
### S3 Backup Configuration
- **Provider**: Backblaze B2 Cloud Storage
- **Cost**: $6/TB/month storage with $0 egress fees via Cloudflare partnership
- **Volume Selection**: Label-based tagging system for selective backup
- **Disaster Recovery**: Automated backup scheduling and restore capabilities
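As a sketch, one way to declare the backup target is via Longhorn `Setting` resources pointing at B2's S3-compatible endpoint; bucket and region values are placeholders:
```yaml
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: backup-target
  namespace: longhorn-system
value: "s3://<BUCKET_NAME>@<REGION>/"
---
apiVersion: longhorn.io/v1beta2
kind: Setting
metadata:
  name: backup-target-credential-secret
  namespace: longhorn-system
value: "longhorn-backup-target"  # the credentials secret shown in the storage template
```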
## Database Configuration ✅ OPERATIONAL
### PostgreSQL with CloudNativePG
- **Cluster Name**: `postgres-shared` in `postgresql-system` namespace
- **High Availability**: 3-instance cluster with automatic failover
- **Instances**: `postgres-shared-2` (primary), `postgres-shared-4`, `postgres-shared-5`
- **Monitoring**: Port 9187 for comprehensive metrics export
- **Backup Strategy**: Integrated with S3 backup system via Longhorn volume labels
## Cache Configuration ✅ OPERATIONAL
### Redis HA Cluster
- **Helm Chart**: `redis-ha` from `dandydeveloper/charts` (replaced deprecated Bitnami chart)
- **Namespace**: `redis-system`
- **Architecture**: 3 Redis replicas with Sentinel for HA, 3 HAProxy pods for load balancing
- **Connection String**: `redis-ha-haproxy.redis-system.svc.cluster.local:6379`
- **HAProxy**: Provides unified read/write endpoint managed by 3 HAProxy pods
- **Storage**: Longhorn persistent volumes (20Gi per Redis instance)
- **Authentication**: SOPS-encrypted credentials in `redis-credentials` secret
- **Monitoring**: Redis exporter and HAProxy metrics via ServiceMonitor
### PostgreSQL Comprehensive Metrics ✅ OPERATIONAL
- **Connection Metrics**: `cnpg_backends_total`, `cnpg_pg_settings_setting{name="max_connections"}`
- **Performance Metrics**: `cnpg_pg_stat_database_xact_commit`, `cnpg_pg_stat_database_xact_rollback`
- **Storage Metrics**: `cnpg_pg_database_size_bytes`, `cnpg_pg_stat_database_blks_hit`
- **Cluster Health**: `cnpg_collector_up`, `cnpg_collector_postgres_version`
- **Security**: Role-based access control with `pg_monitor` role for metrics collection
- **Backup Integration**: Native support for WAL archiving and point-in-time recovery
- **Custom Queries**: ConfigMap-based custom query system with proper RBAC permissions (format sketched below)
- **Dashboard Integration**: Native OpenObserve integration with predefined monitoring queries
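A sketch of the custom-queries ConfigMap format CloudNativePG consumes; the query and metric names are illustrative:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgresql-custom-queries
  namespace: postgresql-system
data:
  custom-queries: |
    pg_database_size:
      query: "SELECT datname, pg_database_size(datname) AS size_bytes FROM pg_database"
      metrics:
        - datname:
            usage: "LABEL"
            description: "Name of the database"
        - size_bytes:
            usage: "GAUGE"
            description: "Database size in bytes"
```
The ConfigMap would then be referenced from the Cluster spec via `monitoring.customQueriesConfigMap`.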
## Security & Access Control ✅ ZERO TRUST ARCHITECTURE
### Zero Trust Migration ✅ COMPLETED
- **Migration Status**: 10 of 11 external services migrated to Cloudflare Zero Trust tunnels
- **Harbor Exception**: Direct port exposure (80/443) due to header modification issues
- **Dependencies Removed**: external-dns and cert-manager no longer needed
- **Security Improvement**: No external ingress ports exposed
### Tailscale Administrative Access ✅ IMPLEMENTED
- **Deployment Model**: Tailscale Operator Helm Chart (v1.90.x)
- **Operator**: Deployed in `tailscale-system` namespace with 2 replicas
- **Subnet Router**: Connector resource advertising internal networks (Pod: 10.244.0.0/16, Service: 10.96.0.0/12, VLAN: 10.132.0.0/24)
- **Magic DNS**: Services can be exposed via Tailscale operator with meta attributes for DNS resolution
- **OAuth Integration**: Device authentication and tagging with `tag:k8s-operator`
- **Hostname**: `keyboardvagabond-operator` for operator, `keyboardvagabond-cluster` for subnet router
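A sketch of that Connector resource, reconstructed from the values above using the Tailscale operator's v1alpha1 API:
```yaml
apiVersion: tailscale.com/v1alpha1
kind: Connector
metadata:
  name: keyboardvagabond-cluster
spec:
  hostname: keyboardvagabond-cluster
  tags:
    - "tag:k8s-operator"
  subnetRouter:
    advertiseRoutes:
      - "10.244.0.0/16"  # Pod network
      - "10.96.0.0/12"   # Service network
      - "10.132.0.0/24"  # NetCup Cloud VLAN
```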
## Infrastructure Deployment Patterns
### Kustomize Configuration
```yaml
# Standard kustomization.yaml structure
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: component-namespace
resources:
- namespace.yaml
- component.yaml
- monitoring.yaml
```
### Helm Integration
```yaml
# HelmRelease for complex applications
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: component-name
  namespace: component-namespace
spec:
  interval: 30m            # required reconciliation interval
  chart:
    spec:
      chart: chart-name
      version: "x.y.z"     # pin the chart version
      sourceRef:
        kind: HelmRepository
        name: repo-name
```
## Operational Procedures
### Node Addition and Scaling
When adding new nodes to the cluster, specific steps are required to ensure monitoring and metrics collection continue working properly:
- **Nginx Ingress Metrics**: See `docs/NODE-ADDITION-GUIDE.md` for complete procedures
- Nginx ingress controller deploys automatically (DaemonSet)
- OpenTelemetry collector static scrape configuration requires manual update
- Must add the new node's IP to the targets list in `manifests/infrastructure/openobserve-collector/gateway-collector.yaml` (sketched below)
- Verification steps include checking metrics endpoints and collector logs
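A hypothetical excerpt of that static scrape configuration, assuming the default nginx ingress controller metrics port (10254) and an illustrative job name:
```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nginx-ingress
          static_configs:
            - targets:
                - "10.132.0.10:10254"
                - "10.132.0.20:10254"
                - "10.132.0.30:10254"
                # Append each new node's VLAN IP here, e.g. "10.132.0.40:10254"
```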
### Key Files for Node Operations
- **Monitoring Configuration**: `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
- **Network Policies**: `manifests/infrastructure/cluster-policies/host-fw-*.yaml`
- **Node Addition Guide**: `docs/NODE-ADDITION-GUIDE.md`
@zero-trust-ingress-template.yaml
@longhorn-storage-template.yaml
@postgresql-database-template.yaml

@@ -0,0 +1,128 @@
# Longhorn Storage Templates
# Persistent volume configurations with backup labels
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: app-storage-pvc
namespace: app-namespace
labels:
# S3 backup inclusion labels
recurring-job.longhorn.io/backup: enabled
recurring-job-group.longhorn.io/backup: enabled
spec:
accessModes:
- ReadWriteMany # Default for applications that may scale horizontally
# Use ReadWriteOnce for:
# - Single-instance applications (databases, stateful apps)
# - CloudNativePG (manages its own storage replication)
# - Applications with file locking requirements
storageClassName: longhorn-retain # Data preservation on deletion
resources:
requests:
storage: 10Gi
---
# Longhorn StorageClass with retain policy
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-retain
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain # Preserves data on PVC deletion
volumeBindingMode: Immediate
parameters:
numberOfReplicas: "2" # 2-replica redundancy
staleReplicaTimeout: "2880" # 48 hours
fromBackup: ""
fsType: "xfs"
dataLocality: "disabled" # Allow cross-node placement
---
# Longhorn Backup Target Configuration
apiVersion: v1
kind: Secret
metadata:
name: longhorn-backup-target
namespace: longhorn-system
type: Opaque
data:
# Backblaze B2 credentials (base64 encoded, encrypted by SOPS)
AWS_ACCESS_KEY_ID: base64-encoded-key-id
AWS_SECRET_ACCESS_KEY: base64-encoded-secret-key
AWS_ENDPOINTS: aHR0cHM6Ly9zMy5ldS1jZW50cmFsLTAwMy5iYWNrYmxhemViMi5jb20= # Base64: https://s3.eu-central-003.backblazeb2.com
---
# Longhorn RecurringJob for S3 Backup
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: backup-to-s3
namespace: longhorn-system
spec:
cron: "0 2 * * *" # Daily at 2 AM
task: "backup"
groups:
- backup
retain: 7 # Keep 7 daily backups
concurrency: 2 # Concurrent backup jobs
labels:
recurring-job: backup-to-s3
---
# Volume labeling example for backup inclusion
apiVersion: v1
kind: PersistentVolume
metadata:
name: example-pv
labels:
# These labels ensure volume is included in S3 backup jobs
recurring-job.longhorn.io/backup: enabled
recurring-job-group.longhorn.io/backup: enabled
spec:
capacity:
storage: 10Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn-retain
csi:
driver: driver.longhorn.io
volumeHandle: example-volume-id
---
# Example: Database storage (ReadWriteOnce required)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-storage-pvc
namespace: postgresql-system
labels:
recurring-job.longhorn.io/backup: enabled
recurring-job-group.longhorn.io/backup: enabled
spec:
accessModes:
- ReadWriteOnce # Required for databases - single writer only
storageClassName: longhorn-retain
resources:
requests:
storage: 50Gi
# Access Mode Guidelines:
# - ReadWriteMany (RWX): Default for horizontally scalable applications
# * Web applications that can run multiple pods
# * Shared file storage for multiple containers
# * Applications without file locking conflicts
#
# - ReadWriteOnce (RWO): Required for specific use cases
# * Database storage (PostgreSQL, Redis) - single writer required
# * Applications with file locking (SQLite, local file databases)
# * StatefulSets that manage their own replication
# * Single-instance applications by design
# Backup Strategy Notes:
# - Cost: $6/TB/month storage with $0 egress fees via Cloudflare partnership
# - Selection: Label-based tagging system for selective volume backup
# - Recovery: Automated backup scheduling and restore capabilities
# - Target: @/longhorn backup location in Backblaze B2

@@ -0,0 +1,202 @@
# PostgreSQL Database Templates
# CloudNativePG cluster configuration and application integration
# Main PostgreSQL Cluster (already deployed as postgres-shared)
---
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: postgres-shared
namespace: postgresql-system
spec:
instances: 3 # High availability with automatic failover
postgresql:
parameters:
max_connections: "200"
shared_buffers: "256MB"
effective_cache_size: "1GB"
bootstrap:
initdb:
database: postgres
owner: postgres
storage:
storageClass: longhorn-retain
size: 50Gi
monitoring:
enablePodMonitor: true
# Application-specific database and user creation
---
apiVersion: postgresql.cnpg.io/v1
kind: Database
metadata:
name: app-database
namespace: postgresql-system
spec:
name: app_db
owner: app_user
cluster:
name: postgres-shared
---
# Application database user secret
apiVersion: v1
kind: Secret
metadata:
name: app-postgresql-secret
namespace: app-namespace
type: Opaque
data:
# Base64 encoded credentials (encrypted by SOPS)
# Replace with actual base64-encoded values before encryption
username: <REPLACE_WITH_BASE64_ENCODED_USERNAME>
password: <REPLACE_WITH_BASE64_ENCODED_PASSWORD>
database: <REPLACE_WITH_BASE64_ENCODED_DATABASE_NAME>
---
# Connection examples for different frameworks
# Laravel/Pixelfed connection
apiVersion: v1
kind: ConfigMap
metadata:
name: laravel-db-config
data:
DB_CONNECTION: "pgsql"
DB_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
DB_PORT: "5432"
DB_DATABASE: "pixelfed"
---
# Flask/PieFed connection
apiVersion: v1
kind: ConfigMap
metadata:
name: flask-db-config
data:
DATABASE_URL: "postgresql://piefed_user:<REPLACE_WITH_PASSWORD>@postgresql-shared-rw.postgresql-system.svc.cluster.local:5432/piefed"
---
# Django/BookWyrm connection
apiVersion: v1
kind: ConfigMap
metadata:
name: django-db-config
data:
POSTGRES_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
PGPORT: "5432"
POSTGRES_DB: "bookwyrm"
POSTGRES_USER: "bookwyrm_user"
---
# Ruby/Mastodon connection
apiVersion: v1
kind: ConfigMap
metadata:
name: mastodon-db-config
data:
DB_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
DB_PORT: "5432"
DB_NAME: "mastodon"
DB_USER: "mastodon_user"
---
# Database monitoring ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: postgresql-metrics
namespace: postgresql-system
spec:
selector:
matchLabels:
cnpg.io/cluster: postgres-shared
endpoints:
- port: metrics
interval: 30s
path: /metrics
# Connection Patterns:
# - Read/Write: postgresql-shared-rw.postgresql-system.svc.cluster.local:5432
# - Read Only: postgresql-shared-ro.postgresql-system.svc.cluster.local:5432
# - Read Replica: postgresql-shared-r.postgresql-system.svc.cluster.local:5432
# - Monitoring: Port 9187 for comprehensive PostgreSQL metrics
# - Backup: Integrated with S3 backup system via Longhorn volume labels
# Read Replica Usage Examples:
# Mastodon - Read replicas for timeline queries and caching
---
apiVersion: v1
kind: ConfigMap
metadata:
name: mastodon-db-replica-config
data:
DB_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local" # Primary for writes
DB_REPLICA_HOST: "postgresql-shared-ro.postgresql-system.svc.cluster.local" # Read replica for queries
DB_PORT: "5432"
DB_NAME: "mastodon"
# Mastodon automatically uses read replicas for timeline and cache queries
# PieFed - Flask app with read/write splitting
---
apiVersion: v1
kind: ConfigMap
metadata:
name: piefed-db-replica-config
data:
# Primary database for writes
DATABASE_URL: "postgresql://piefed_user:<REPLACE_WITH_PASSWORD>@postgresql-shared-rw.postgresql-system.svc.cluster.local:5432/piefed"
# Read replica for heavy queries (feeds, search, analytics)
DATABASE_REPLICA_URL: "postgresql://piefed_user:<REPLACE_WITH_PASSWORD>@postgresql-shared-ro.postgresql-system.svc.cluster.local:5432/piefed"
# Authentik - Optimized performance with primary and replica load balancing
---
apiVersion: v1
kind: ConfigMap
metadata:
name: authentik-db-replica-config
data:
AUTHENTIK_POSTGRESQL__HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
AUTHENTIK_POSTGRESQL__PORT: "5432"
AUTHENTIK_POSTGRESQL__NAME: "authentik"
# Authentik can use read replicas for user lookups and session validation
AUTHENTIK_POSTGRESQL_REPLICA__HOST: "postgresql-shared-ro.postgresql-system.svc.cluster.local"
# BookWyrm - Django with database routing for read replicas
---
apiVersion: v1
kind: ConfigMap
metadata:
name: bookwyrm-db-replica-config
data:
POSTGRES_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local" # Primary
POSTGRES_REPLICA_HOST: "postgresql-shared-ro.postgresql-system.svc.cluster.local" # Read replica
PGPORT: "5432"
POSTGRES_DB: "bookwyrm"
# Django database routing can direct read queries to replica automatically
# Available Metrics:
# - Connection: cnpg_backends_total, cnpg_pg_settings_setting{name="max_connections"}
# - Performance: cnpg_pg_stat_database_xact_commit, cnpg_pg_stat_database_xact_rollback
# - Storage: cnpg_pg_database_size_bytes, cnpg_pg_stat_database_blks_hit
# - Health: cnpg_collector_up, cnpg_collector_postgres_version
# CRITICAL PostgreSQL Pod Management Safety ⚠️
# Source: https://cloudnative-pg.io/documentation/1.20/failure_modes/
# ✅ SAFE: Proper pod deletion for failover testing
# kubectl delete pod [primary-pod] --grace-period=1
# ❌ DANGEROUS: Never use grace-period=0
# kubectl delete pod [primary-pod] --grace-period=0 # NEVER DO THIS!
#
# Why grace-period=0 is dangerous:
# - Immediately removes pod from Kubernetes API without proper shutdown
# - Doesn't ensure PID 1 process (instance manager) is shut down
# - Operator triggers failover without guarantee primary was properly stopped
# - Can cause misleading results in failover simulation tests
# - Does not reflect real failure scenarios (power loss, network partition)
# Proper PostgreSQL Pod Operations:
# - Use --grace-period=1 for failover simulation tests
# - Allow CloudNativePG operator to handle automatic failover
# - Use cnpg.io/reconciliationLoop: "disabled" annotation only for emergency manual intervention
# - Always remove reconciliation disable annotation after emergency operations

@@ -0,0 +1,132 @@
# S3 Storage Configuration Templates
# Framework-specific S3 integration patterns with dedicated bucket approach
# Laravel/Pixelfed S3 Configuration
---
apiVersion: v1
kind: ConfigMap
metadata:
name: pixelfed-s3-config
data:
# Critical Laravel S3 Configuration
FILESYSTEM_DRIVER: "s3"
DANGEROUSLY_SET_FILESYSTEM_DRIVER: "s3" # Required for S3 default disk
PF_ENABLE_CLOUD: "true"
FILESYSTEM_CLOUD: "s3"
FILESYSTEM_DISK: "s3"
# Backblaze B2 S3-Compatible Storage
AWS_BUCKET: "pixelfed-bucket" # Dedicated bucket approach
AWS_URL: "<REPLACE_WITH_CDN_URL>" # CDN URL
AWS_ENDPOINT: "<REPLACE_WITH_S3_ENDPOINT>"
AWS_ROOT: "" # Empty - no prefix needed with dedicated bucket
AWS_USE_PATH_STYLE_ENDPOINT: "false"
AWS_VISIBILITY: "public"
# Flask/PieFed S3 Configuration
---
apiVersion: v1
kind: ConfigMap
metadata:
name: piefed-s3-config
data:
# S3 Storage (Backblaze B2)
S3_BUCKET: "piefed-bucket"
S3_REGION: "<REPLACE_WITH_S3_REGION>"
S3_ENDPOINT_URL: "<REPLACE_WITH_S3_ENDPOINT>"
S3_PUBLIC_URL: "<REPLACE_WITH_CDN_URL>"
# Django/BookWyrm S3 Configuration
---
apiVersion: v1
kind: ConfigMap
metadata:
name: bookwyrm-s3-config
data:
# S3 Storage (Backblaze B2)
USE_S3: "true"
AWS_STORAGE_BUCKET_NAME: "bookwyrm-bucket"
AWS_S3_REGION_NAME: "<REPLACE_WITH_S3_REGION>"
AWS_S3_ENDPOINT_URL: "<REPLACE_WITH_S3_ENDPOINT>"
AWS_S3_CUSTOM_DOMAIN: "<REPLACE_WITH_CDN_DOMAIN>"
AWS_DEFAULT_ACL: "" # Backblaze B2 doesn't support ACLs
# Ruby/Mastodon S3 Configuration
---
apiVersion: v1
kind: ConfigMap
metadata:
name: mastodon-s3-config
data:
# S3 Object Storage
S3_ENABLED: "true"
S3_BUCKET: "mastodon-bucket"
S3_REGION: "<REPLACE_WITH_S3_REGION>"
S3_ENDPOINT: "<REPLACE_WITH_S3_ENDPOINT>"
S3_HOSTNAME: "<REPLACE_WITH_S3_HOSTNAME>"
S3_ALIAS_HOST: "<REPLACE_WITH_CDN_DOMAIN>"
# Generic S3 Secret Template
---
apiVersion: v1
kind: Secret
metadata:
name: s3-credentials
type: Opaque
data:
# Base64 encoded values (will be encrypted by SOPS)
# Replace with actual base64-encoded values before encryption
AWS_ACCESS_KEY_ID: <REPLACE_WITH_BASE64_ENCODED_KEY_ID>
AWS_SECRET_ACCESS_KEY: <REPLACE_WITH_BASE64_ENCODED_SECRET_KEY>
S3_KEY: <REPLACE_WITH_BASE64_ENCODED_KEY_ID> # Flask apps use this naming
S3_SECRET: <REPLACE_WITH_BASE64_ENCODED_SECRET_KEY> # Flask apps use this naming
# CDN Mapping Reference
# | Application | CDN Subdomain | S3 Bucket | Purpose |
# |------------|---------------|-----------|---------|
# | Pixelfed | pm.keyboardvagabond.com | pixelfed-bucket | Photo/media sharing |
# | PieFed | pfm.keyboardvagabond.com | piefed-bucket | Forum content/uploads |
# | Mastodon | mm.keyboardvagabond.com | mastodon-bucket | Social media/attachments |
# | BookWyrm | bm.keyboardvagabond.com | bookwyrm-bucket | Book covers/user uploads |
# Redis Connection Pattern (HAProxy-based):
# - HAProxy (Read/Write): redis-ha-haproxy.redis-system.svc.cluster.local:6379
# - Managed by 3 HAProxy pods providing unified endpoint
# - Redis HA cluster: 3 Redis replicas with Sentinel for HA
# - Helm Chart: redis-ha from dandydeveloper/charts (replaced deprecated Bitnami)
# Redis Usage Examples:
# Mastodon - Redis for caching and Sidekiq job queue
---
apiVersion: v1
kind: ConfigMap
metadata:
name: mastodon-redis-config
data:
REDIS_HOST: "redis-ha-haproxy.redis-system.svc.cluster.local" # HAProxy endpoint
REDIS_PORT: "6379"
# PieFed - Flask with Redis for cache and Celery broker
---
apiVersion: v1
kind: ConfigMap
metadata:
name: piefed-redis-config
data:
# All Redis connections use HAProxy endpoint
CACHE_REDIS_URL: "redis://:<REPLACE_WITH_REDIS_PASSWORD>@redis-ha-haproxy.redis-system.svc.cluster.local:6379/1"
CELERY_BROKER_URL: "redis://:<REPLACE_WITH_REDIS_PASSWORD>@redis-ha-haproxy.redis-system.svc.cluster.local:6379/2"
# BookWyrm - Django with Redis for broker and activity streams
---
apiVersion: v1
kind: ConfigMap
metadata:
name: bookwyrm-redis-config
data:
# All Redis connections use HAProxy endpoint
REDIS_BROKER_HOST: "redis-ha-haproxy.redis-system.svc.cluster.local:6379"
REDIS_ACTIVITY_HOST: "redis-ha-haproxy.redis-system.svc.cluster.local:6379"
REDIS_BROKER_DB_INDEX: "3"
REDIS_ACTIVITY_DB: "4"

.cursor/rules/security.mdc

@@ -0,0 +1,176 @@
---
description: Security patterns including SOPS encryption, Zero Trust, and access control
globs: ["**/*.yaml", "machineconfigs/**/*", "secrets.yaml", "*.conf"]
alwaysApply: false
---
# Security & Encryption ✅ OPERATIONAL
## 🛡️ Maximum Security Architecture Achieved
- **🚫 Zero External Port Exposure**: No direct internet access to any cluster services
- **🔐 Dual Security Layers**: Cloudflare Zero Trust (public apps) + Tailscale Mesh VPN (admin access)
- **🌐 CGNAT-Only API Access**: Kubernetes/Talos APIs restricted to Tailscale network (100.64.0.0/10)
- **🔒 Encrypted Everything**: SOPS secrets, Zero Trust tunnels, mesh VPN connections
- **🛡️ Host Firewall**: Cilium policies blocking world access to HTTP/HTTPS ports
## SOPS Configuration ✅ OPERATIONAL
### Encryption Scope
- **Files Covered**: All YAML files in `manifests/` directory, Talos configs, machine configurations
- **Fields Encrypted**: `data` and `stringData` fields in manifests, plus specific credential fields
- **Key Management**: Multiple PGP keys configured for different components
- **Workflow**: All secrets encrypted with SOPS before Git commit
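As a sketch, a `.sops.yaml` implementing that scope might look like the following; the PGP fingerprints are placeholders:
```yaml
creation_rules:
  - path_regex: manifests/.*\.yaml$
    encrypted_regex: ^(data|stringData)$
    pgp: "<REPLACE_WITH_PGP_FINGERPRINT>"
  - path_regex: machineconfigs/.*
    pgp: "<REPLACE_WITH_PGP_FINGERPRINT>"
```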
### SOPS Usage Patterns
```bash
# Encrypt new secret
sops -e -i secrets.yaml
# Edit encrypted secret
sops secrets.yaml
# Decrypt for viewing
sops -d secrets.yaml
# Decrypt in place
sops -d -i secrets.yaml
# Apply encrypted manifest
sops -d secrets.yaml | kubectl apply -f -
```
SOPS-encrypted files are applied with kubectl in decrypted form, and must be re-encrypted before
merging into source control.
## Zero Trust Architecture ✅ MIGRATED
### Zero Trust Tunnels ✅ OPERATIONAL
- **Cloudflared Deployment**: `cloudflared-system` namespace
- **Tunnel Architecture**: Secure connectivity without exposing ingress ports
- **TLS Termination**: Cloudflare edge handles SSL/TLS
- **DNS Management**: Manual DNS record creation (external-dns removed)
### Standard Zero Trust Ingress Pattern
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: app-ingress
namespace: app-namespace
annotations:
# Basic NGINX Configuration only - no cert-manager or external-dns
kubernetes.io/ingress.class: nginx
nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
spec:
ingressClassName: nginx
tls: [] # Empty - TLS handled by Cloudflare edge
rules:
- host: app.keyboardvagabond.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: app-service
port:
number: 80
```
### Migration Steps for Zero Trust
1. **Remove cert-manager annotations**: `cert-manager.io/cluster-issuer`, `cert-manager.io/issuer`
2. **Remove external-dns annotations**: `external-dns.alpha.kubernetes.io/hostname`, `external-dns.alpha.kubernetes.io/target`
3. **Empty TLS sections**: Set `tls: []` to disable certificate generation
4. **Configure Cloudflare tunnel**: Add hostname in Zero Trust dashboard
5. **Test connectivity**: Use `kubectl run curl-test` to verify internal service health
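Step 4 corresponds to a tunnel ingress rule. A sketch of the cloudflared configuration, where the tunnel ID and internal service address are placeholders:
```yaml
tunnel: <TUNNEL_ID>
credentials-file: /etc/cloudflared/creds/credentials.json
ingress:
  - hostname: app.keyboardvagabond.com
    service: http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
  - service: http_status:404  # required catch-all rule
```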
## Access Control Matrix
| **Resource** | **Public Access** | **Administrative Access** | **Security Method** |
|--------------|-------------------|---------------------------|---------------------|
| **Applications** | ✅ Cloudflare Zero Trust | ❌ Not Applicable | Authenticated tunnels |
| **Kubernetes API** | ❌ Blocked | ✅ Tailscale Mesh VPN | CGNAT + OAuth |
| **Talos API** | ❌ Blocked | ✅ Tailscale Mesh VPN | CGNAT + OAuth |
| **HTTP/HTTPS Services** | ❌ Blocked | ✅ Cluster Internal Only | Host firewall |
| **Media CDN** | ✅ Cloudflare CDN | ❌ Not Applicable | Public S3 + Edge caching |
## Tailscale Mesh VPN ✅ OPERATIONAL
### Administrative Access Configuration
- **kubectl Context**: `admin@keyboardvagabond-tailscale` using internal VLAN IP (10.132.0.10:6443)
- **Public Context**: `admin@keyboardvagabond.com` (blocked by firewall)
- **Tailscale Client**: Current IP range 100.64.0.0/10 (CGNAT)
- **Firewall Rules**: Cilium host firewall restricts API access to Tailscale network only
### Tailscale Subnet Router Configuration ✅ OPERATIONAL
- **Device Name**: `keyboardvagabond-cluster`
- **Deployment Model**: Direct deployment (not Kubernetes Operator) for simplicity
- **Advertised Networks**:
- **Pod Network**: 10.244.0.0/16 (Kubernetes pods)
- **Service Network**: 10.96.0.0/12 (Kubernetes services)
- **VLAN Network**: 10.132.0.0/24 (NetCup Cloud private network)
- **OAuth Integration**: Client credentials for device authentication and tagging
- **Device Tagging**: `tag:k8s-operator` for proper ACL management and identification
- **Network Mode**: Kernel mode (`TS_USERSPACE=false`) with privileged security context
- **State Persistence**: Kubernetes secret-based storage (`TS_KUBE_SECRET=tailscale-auth`)
- **RBAC**: Split permissions (ClusterRole for cluster resources, Role for namespace secrets)
### Tailscale Deployment Pattern
```yaml
# Direct deployment (not Kubernetes Operator)
apiVersion: apps/v1
kind: Deployment
metadata:
name: tailscale-subnet-router
spec:
template:
spec:
containers:
- name: tailscale
env:
- name: TS_KUBE_SECRET
value: tailscale-auth
- name: TS_USERSPACE
value: "false"
- name: TS_ROUTES
value: "10.244.0.0/16,10.96.0.0/12,10.132.0.0/24"
securityContext:
privileged: true
```
## Network Security ✅ OPERATIONAL
### Cilium Host Firewall
```yaml
# Host firewall blocking external access to HTTP/HTTPS
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
name: host-fw-control-plane
spec:
nodeSelector:
matchLabels:
node-role.kubernetes.io/control-plane: ""
ingress:
- fromCIDR:
- "100.64.0.0/10" # Tailscale CGNAT range only
toPorts:
- ports:
- port: "6443"
protocol: TCP
```
## Security Best Practices
- **New Services**: All applications must use Zero Trust ingress pattern
- **Harbor Exception**: Harbor registry requires direct port exposure (header modification issues)
- **Secret Management**: All secrets SOPS-encrypted before Git commit
- **Network Policies**: Cilium host firewall with CGNAT-only access
- **Administrative Access**: Tailscale mesh VPN required for kubectl/talosctl
## 🏆 Security Achievements
1. **🎯 Zero Trust Network**: No implicit trust, all access authenticated and authorized
2. **🔐 Defense in Depth**: Multiple security layers prevent single points of failure
3. **📊 Comprehensive Monitoring**: All traffic flows monitored via OpenObserve and Cilium Hubble
4. **🔄 Secure GitOps**: SOPS-encrypted secrets with PGP key management
5. **🛡️ Hardened Infrastructure**: Minimal attack surface with production-grade security controls
@sops-secret-template.yaml
@zero-trust-ingress-template.yaml
@tailscale-config-template.yaml

@@ -0,0 +1,48 @@
# SOPS Secret Template
# Use this template for creating encrypted secrets
apiVersion: v1
kind: Secret
metadata:
name: app-secret
namespace: app-namespace
type: Opaque
data:
# These fields will be encrypted by SOPS
# Replace with actual base64-encoded values before encryption
DATABASE_PASSWORD: <REPLACE_WITH_BASE64_ENCODED_PASSWORD>
S3_ACCESS_KEY: <REPLACE_WITH_BASE64_ENCODED_KEY>
S3_SECRET_KEY: <REPLACE_WITH_BASE64_ENCODED_SECRET>
REDIS_PASSWORD: <REPLACE_WITH_BASE64_ENCODED_PASSWORD>
---
# ConfigMap for non-sensitive configuration
apiVersion: v1
kind: ConfigMap
metadata:
name: app-config
namespace: app-namespace
data:
# Database connection
DATABASE_HOST: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
DATABASE_PORT: "5432"
DATABASE_NAME: "app_database"
# Redis connection
REDIS_HOST: "redis-ha-haproxy.redis-system.svc.cluster.local"
REDIS_PORT: "6379"
# S3 storage configuration
S3_BUCKET: "app-bucket"
S3_REGION: "<REPLACE_WITH_S3_REGION>"
S3_ENDPOINT: "<REPLACE_WITH_S3_ENDPOINT>"
S3_CDN_URL: "<REPLACE_WITH_CDN_URL>"
# Application settings
APP_ENV: "production"
APP_DEBUG: "false"
# SOPS encryption commands:
# sops -e -i this-file.yaml
# sops this-file.yaml # to edit
# sops -d this-file.yaml | kubectl apply -f - # to apply

@@ -0,0 +1,96 @@
# Talos Configuration Templates
# Machine configurations and Talos-specific patterns
# Custom Talos Factory Image
# Uses factory image with Longhorn extension pre-installed
TALOS_FACTORY_IMAGE: "613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4"
# Network Interface Configuration
---
apiVersion: v1alpha1
kind: MachineConfig
metadata:
name: node-config
spec:
machine:
network:
interfaces:
# Public interface (DHCP + static configuration)
- interface: enp7s0
dhcp: true
addresses:
- 152.53.107.24/24 # Example for n1
routes:
- network: 0.0.0.0/0
gateway: 152.53.107.1
# Private VLAN interface (static configuration)
- interface: enp9s0
addresses:
- 10.132.0.10/24 # Example for n1 (VLAN 1004963)
vip:
ip: 10.132.0.5 # Shared VIP for control plane HA
# Node IP Configuration
machine:
kubelet:
extraArgs:
node-ip: 152.53.107.24 # Use public IP for node reporting
# Node IP Mappings (NetCup Cloud vLAN 1004963)
# All nodes are control plane nodes with shared VIP for HA
# n1: Public 152.53.107.24 + Private 10.132.0.10/24 (Control plane)
# n2: Public 152.53.105.81 + Private 10.132.0.20/24 (Control plane)
# n3: Public 152.53.200.111 + Private 10.132.0.30/24 (Control plane)
# VIP: 10.132.0.5 (shared VIP, nodes elect primary)
# Cluster Configuration
---
apiVersion: v1alpha1
kind: ClusterConfig
metadata:
name: keyboardvagabond
spec:
clusterName: keyboardvagabond.com
controlPlane:
endpoint: https://10.132.0.5:6443 # VIP endpoint for HA
# Allow workloads on control plane
allowSchedulingOnControlPlanes: true
# CNI Configuration (Cilium)
network:
cni:
name: none # Cilium installed via Helm
dnsDomain: cluster.local # Standard domain for compatibility
# API Server Configuration
apiServer:
extraArgs:
# Enable aggregation layer for metrics
enable-aggregator-routing: "true"
# Volume Configuration
# System disk: /dev/vda with 2-50GB ephemeral storage
# Longhorn storage: 400GB minimum on system disk at /var/lib/longhorn
# Administrative Access Commands
# Recommended: Use VIP endpoint for HA
# talosctl config endpoint 10.132.0.5 # VIP endpoint
# talosctl config node 10.132.0.5
# talosctl health
# talosctl dashboard (via Tailscale VPN only)
# Alternative: Individual node endpoints
# talosctl config endpoint 10.132.0.10 10.132.0.20 10.132.0.30
# talosctl config node 10.132.0.10
# kubectl Contexts:
# - admin@keyboardvagabond-tailscale (VIP: 10.132.0.5:6443 or node IPs) - ACTIVE
# - admin@keyboardvagabond.com (blocked by firewall, Tailscale-only access)
# Security Notes:
# - API access restricted to Tailscale CGNAT range (100.64.0.0/10)
# - Cilium host firewall blocks world access to ports 6443, 50000-50010
# - All administrative access requires Tailscale mesh VPN connection
# - Backup kubeconfig available as SOPS-encrypted portable configuration

@@ -0,0 +1,189 @@
---
description: Detailed technical specifications for nodes, network, and Talos configuration
globs: ["machineconfigs/**/*", "patches/**/*", "talosconfig", "kubeconfig*"]
alwaysApply: false
---
# Technical Specifications & Low-Level Configuration
## Talos Configuration ✅ OPERATIONAL
### Custom Talos Image
- **Factory Image**: `613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4`, a custom build that includes the two system extensions Longhorn requires
- **Extensions**: Longhorn extension included for distributed storage
- **Version**: Talos v1.10.4 with custom factory build
- **Architecture**: ARM64 optimized for NetCup Cloud infrastructure
### Patch Configuration
Applied via `patches/` directory for cluster customization:
- **allow-controlplane-workloads.yaml**: Enables workload scheduling on control plane
- **cluster-name.yaml**: Sets cluster name to `keyboardvagabond.com`
- **disable-kube-proxy-and-cni.yaml**: Disables built-in networking for Cilium
- **etcd-patch.yaml**: etcd optimization and configuration
- **registry-patch.yaml**: Container registry configuration
- **worker-discovery-patch.yaml**: Worker node discovery settings
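For illustration, minimal sketches of two of these patches, reconstructed from cluster settings shown elsewhere in this document (the actual patch files may differ):
```yaml
# allow-controlplane-workloads.yaml
cluster:
  allowSchedulingOnControlPlanes: true
---
# disable-kube-proxy-and-cni.yaml
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
```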
## Network Configuration ✅ OPERATIONAL
### NetCup Cloud Infrastructure
- **vLAN ID**: 1004963 for internal cluster communication
- **Network Range**: 10.132.0.0/24 (private VLAN)
- **DNS Domain**: `cluster.local` (standard Kubernetes domain)
- **Cluster Name**: `keyboardvagabond.com`
### Node Network Configuration
| Node | Public IP | VLAN IP | Role | Status |
|------|-----------|---------|------|--------|
| **n1** | 152.53.107.24 | 10.132.0.10/24 | Control Plane | ✅ Schedulable |
| **n2** | 152.53.105.81 | 10.132.0.20/24 | Control Plane | ✅ Schedulable |
| **n3** | 152.53.200.111 | 10.132.0.30/24 | Control Plane | ✅ Schedulable |
- **Control Plane VIP**: `10.132.0.5` (shared VIP, nodes elect primary for HA)
- **All nodes are control plane**: High availability with etcd quorum (2 of 3 required)
### Network Interface Configuration
- **`enp7s0`**: Public interface (DHCP + static configuration)
- **`enp9s0`**: Private VLAN interface (static configuration)
- **Internal Traffic**: Uses private VLAN for pod-to-pod and storage replication
- **External Access**: Cloudflare Zero Trust tunnels (no direct port exposure)
## Administrative Access Configuration ✅ SECURED
### Kubernetes API Access
- **Internal Context**: `admin@keyboardvagabond-tailscale`
- **VIP Endpoint**: `10.132.0.5:6443` (shared VIP, recommended for HA)
- **Node Endpoints**: `10.132.0.10:6443`, `10.132.0.20:6443`, `10.132.0.30:6443` (individual nodes)
- **Public Context**: `admin@keyboardvagabond.com` (blocked by firewall)
- **Public Endpoint**: `api.keyboardvagabond.com:6443` (Tailscale-only)
- **Access Method**: Tailscale mesh VPN required (CGNAT 100.64.0.0/10)
### Talos API Access
```bash
# Talos configuration (VIP recommended for HA)
talosctl config endpoint 10.132.0.5 # VIP endpoint
talosctl config node 10.132.0.5 # VIP node
# Alternative: Individual node endpoints
talosctl config endpoint 10.132.0.10 10.132.0.20 10.132.0.30
talosctl config node 10.132.0.10 # Primary endpoint
```
### Essential Management Commands
```bash
# Cluster health check
talosctl health --nodes 10.132.0.10,10.132.0.20,10.132.0.30
# Node status
talosctl get members
# Kubernetes context switching
kubectl config use-context admin@keyboardvagabond-tailscale
# Node status verification
kubectl get nodes -o wide
```
## Storage Configuration Details ✅ OPERATIONAL
### Longhorn Distributed Storage
- **Installation Path**: `/var/lib/longhorn` on each node
- **Replica Policy**: 2-replica configuration across nodes
- **Storage Class**: `longhorn-retain` for data preservation
- **Node Allocation**: 400GB+ per node on system disk
- **Auto-balance**: Enabled for optimal distribution
### Volume Configuration
- **System Disk**: `/dev/vda` with ephemeral storage
- **Longhorn Volume**: 400GB minimum allocation per node
- **Backup Strategy**: Label-based S3 backup selection
- **Reclaim Policy**: Retain (prevents data loss)
## Tailscale Mesh VPN Configuration ✅ OPERATIONAL
### Tailscale Operator Deployment
- **Helm Chart**: `tailscale-operator` from Tailscale Helm repository
- **Version**: v1.90.x (operator v1.90.8)
- **Namespace**: `tailscale-system`
- **Replicas**: 2 operator pods with anti-affinity
- **Hostname**: `keyboardvagabond-operator`
### Subnet Router Configuration (Connector Resource)
- **Resource Type**: `Connector` (tailscale.com/v1alpha1)
- **Device Name**: `keyboardvagabond-cluster`
- **Advertised Networks**:
- **Pod Network**: 10.244.0.0/16
- **Service Network**: 10.96.0.0/12
- **VLAN Network**: 10.132.0.0/24
- **OAuth Integration**: Client credentials for device authentication
- **Device Tagging**: `tag:k8s-operator` for ACL management
### Service Exposure via Magic DNS
- **Capability**: Services can be exposed via Tailscale operator with meta attributes
- **Magic DNS**: Automatic DNS resolution for exposed services
- **Meta Attributes**: Can be used to configure service exposure and routing
- **Access Control**: Cilium host firewall restricts to Tailscale only
- **Current CGNAT Range**: 100.64.0.0/10 (Tailscale assigned)
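A sketch of exposing an internal Service on the tailnet via operator annotations; the service name, hostname, and ports are placeholders:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: internal-dashboard
  namespace: app-namespace
  annotations:
    tailscale.com/expose: "true"          # ask the operator to proxy this Service
    tailscale.com/hostname: "internal-dashboard"  # Magic DNS name on the tailnet
spec:
  selector:
    app: internal-dashboard
  ports:
    - port: 80
      targetPort: 8080
```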
## Component Status Matrix ✅ CURRENT STATE
### Active Components
| Component | Status | Access Method | Notes |
|-----------|--------|---------------|-------|
| **Cilium CNI** | ✅ Operational | Internal | Host firewall + Hubble UI |
| **Longhorn Storage** | ✅ Operational | Internal | 2-replica with S3 backup |
| **PostgreSQL HA** | ✅ Operational | Internal | 3-instance CloudNativePG |
| **Harbor Registry** | ✅ Operational | Direct HTTPS | Zero Trust incompatible |
| **OpenObserve** | ✅ Operational | Zero Trust | Monitoring platform |
| **Tailscale VPN** | ✅ Operational | Mesh Network | Administrative access |
### Disabled/Deprecated Components
| Component | Status | Reason | Alternative |
|-----------|--------|--------|-------------|
| **external-dns** | ❌ Removed | Zero Trust migration | Manual DNS in Cloudflare |
| **cert-manager** | ❌ Removed | Zero Trust migration | Cloudflare edge TLS |
| **Rook-Ceph** | ❌ Disabled | Complexity and lack of support for partitioning a single drive | Longhorn storage |
| **Flux GitOps** | ⏸️ Disabled | Manual deployment | Ready for re-activation |
### Development Components
| Component | Status | Purpose | Access |
|-----------|--------|---------|--------|
| **Renovate** | ✅ Operational | Dependency updates | Automated |
| **Elasticsearch** | ✅ Operational | Log aggregation | Internal |
| **Kibana** | ✅ Operational | Log analytics | Zero Trust |
## Network Security Configuration ✅ HARDENED
### Cilium Host Firewall Rules
```yaml
# Control plane API access (Tailscale only)
- fromCIDR: ["100.64.0.0/10"] # Tailscale CGNAT
toPorts: [{"port": "6443", "protocol": "TCP"}]
# Block world access to HTTP/HTTPS
# - HTTP/HTTPS ports blocked from 0.0.0.0/0
# - Only cluster-internal and Tailscale access permitted
```
### Zero Trust Architecture
- **External Applications**: All via Cloudflare tunnels
- **Administrative APIs**: Tailscale mesh VPN only
- **Harbor Exception**: Direct ports 80/443 (header modification issues)
- **Internal Services**: Cluster-local communication only
## Future Scaling Specifications
### Node Addition Process
1. **Network**: Add to NetCup Cloud vLAN 1004963
2. **IP Assignment**: Sequential (10.132.0.40/24, 10.132.0.50/24, etc.)
3. **Talos Config**: Apply machine config with proper networking
4. **Longhorn**: Automatic storage distribution across new nodes
5. **Workload**: Immediate scheduling capability
### High Availability Expansion
- **Additional Control Planes**: Can add for true HA setup
- **Load Balancing**: MetalLB or cloud LB integration ready
- **Database Scaling**: PostgreSQL can expand to more replicas
- **Storage Scaling**: Longhorn distributed across all nodes
@talos-machine-config-template.yaml
@cilium-network-policy-template.yaml
@longhorn-volume-template.yaml

@@ -0,0 +1,149 @@
---
description: Historical issues, lessons learned, and troubleshooting knowledge from cluster evolution
globs: []
alwaysApply: false
---
# Troubleshooting History & Lessons Learned
This rule captures critical historical knowledge from the cluster's evolution, including resolved issues, migration challenges, and lessons learned that inform future decisions.
## 🔄 Major Architecture Migrations
### DNS Domain Evolution ✅ **RESOLVED**
- **Previous Issue**: Used custom `local.keyboardvagabond.com` domain causing compatibility problems
- **Resolution**: Reverted to standard `cluster.local` domain
- **Benefits**: Full compatibility with monitoring dashboards, service discovery, and all Kubernetes tooling
- **Lesson**: Always use standard Kubernetes domains unless absolutely necessary
### Zero Trust Migration ✅ **COMPLETED**
- **Migration Scope**: 10 of 11 external services migrated from external-dns/cert-manager to Cloudflare Zero Trust tunnels
- **Services Migrated**: Mastodon, Mastodon Streaming, Pixelfed, PieFed, Picsur, BookWyrm, Authentik, OpenObserve, Kibana, WriteFreely
- **Harbor Exception**: Harbor registry reverted to direct port exposure (80/443) due to Cloudflare header modification breaking container image layer writes
- **Dependencies Removed**: external-dns and cert-manager components no longer needed
- **Key Challenges Resolved**: Mastodon streaming subdomain compatibility, StatefulSet immutable fields, service discovery issues
## 🛠️ Historical Technical Issues
### DNS and External-DNS Resolution ✅ **RESOLVED & DEPRECATED**
- **Previous Issue**: External-DNS creating records with private VLAN IPs (10.132.0.x) which Cloudflare rejected
- **Temporary Solution**: Used `external-dns.alpha.kubernetes.io/target` annotations with public IPs
- **Target Annotations**: `152.53.107.24,152.53.105.81` were used for all ingress resources
- **Final Resolution**: **External-DNS completely removed in favor of Cloudflare Zero Trust tunnels**
- **Current Status**: Manual DNS record creation via Cloudflare Dashboard (external-dns no longer needed)
### SSL Certificate Issues ✅ **RESOLVED**
- **Previous Issue**: Let's Encrypt certificates stuck in "False/Not Ready" state due to DNS resolution failures
- **Resolution**: DNS records now resolve correctly, enabling HTTP-01 challenge completion
- **Migration**: Eventually replaced by Zero Trust architecture eliminating certificate management
### Node IP Configuration ✅ **IMPLEMENTED**
- **Approach**: Using kubelet `extraArgs` with `node-ip` parameter
- **n2 Status**: ✅ Successfully reporting public IP (152.53.105.81)
- **Backup Strategy**: Target annotations provide reliable DNS record creation regardless of node IP status
## 🔍 Framework-Specific Lessons Learned
### CDN Storage Evolution: Shared vs Dedicated Buckets
- **Original Plan**: Single bucket with prefixes (`/pixelfed`, `/piefed`, `/mastodon`)
- **Issue Discovered**: Pixelfed handled prefixes inconsistently, sometimes returning URLs without the correct subdirectory
- **Solution**: Dedicated buckets eliminate compatibility issues entirely
**Benefits of Dedicated Bucket Approach**:
- **Application Compatibility**: Some applications don't fully support S3 prefixes
- **No Prefix Conflicts**: Eliminates S3 path prefix issues with shared buckets
- **Simplified Configuration**: Clean S3 endpoints without complex path rewriting
- **Independent Scaling**: Each application can optimize caching independently
### Mastodon Streaming Subdomain Challenge ✅ **FIXED**
- **Original**: `streaming.mastodon.keyboardvagabond.com`
- **Issue**: Cloudflare Free plan subdomain limitation (not supported)
- **Solution**: Changed to `streamingmastodon.keyboardvagabond.com` ✅ **WORKING**
- **Lesson**: Cloudflare Free plan supports only one subdomain level (`app.domain.com` not `sub.app.domain.com`)
### Flask Application Discovery Patterns
**Critical Framework Identification**: Must identify Flask vs Django early in development
- **Flask**: Uses `flask` command, URL-based config (DATABASE_URL), application factory pattern
- **Django**: Uses `python manage.py` commands, separate host/port variables, standard project structure
- **uWSGI Integration**: Must use the same Python version as the venv; install via pip, not Alpine packages
- **Static Files**: Flask with application factory has nested structure (`/app/app/static/`)
### Laravel S3 Configuration Discoveries
**Critical Laravel S3 Settings**:
- **`DANGEROUSLY_SET_FILESYSTEM_DRIVER=s3`**: Essential to make S3 the default filesystem
- **Cache Invalidation**: Must run `php artisan config:cache` after S3 (or any) configuration changes
- **Dedicated Buckets**: Prevent double-prefix issues that occur with shared buckets (see the sketch after this list)
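A hedged sketch of this configuration as a Pixelfed ConfigMap; bucket, endpoint, and CDN values are placeholders, and only `DANGEROUSLY_SET_FILESYSTEM_DRIVER` is confirmed above (the `AWS_*` names are standard Laravel S3 variables):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pixelfed-s3-config             # illustrative name
  namespace: pixelfed
data:
  DANGEROUSLY_SET_FILESYSTEM_DRIVER: "s3"   # makes S3 the default filesystem
  AWS_BUCKET: "pixelfed-media"              # placeholder dedicated bucket
  AWS_ENDPOINT: "https://s3.example.backblazeb2.com"      # placeholder endpoint
  AWS_URL: "https://media.pixelfed.keyboardvagabond.com"  # placeholder CDN URL
# After applying: run `php artisan config:cache` inside the pod
```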
### Django Static File Pipeline
**Theme Compilation Order**: Must compile themes **before** static file collection to S3
- **Correct Pipeline**: `compile_themes` → `collectstatic` → S3 upload (see the Job sketch after this list)
- **Backblaze B2**: Requires an empty `AWS_DEFAULT_ACL` because B2 does not support ACLs
- **Container Builds**: Theme compilation at runtime (not build time) requires database access
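A sketch of the pipeline as a Kubernetes Job, assuming BookWyrm's `compile_themes` management command (the image reference and namespace are placeholders):
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: bookwyrm-static-pipeline       # illustrative name
  namespace: bookwyrm
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: static-pipeline
          image: registry.example.com/bookwyrm:latest   # placeholder image
          command: ["sh", "-c"]
          # Themes must compile before collectstatic uploads to S3
          args:
            - python manage.py compile_themes && python manage.py collectstatic --noinput
          env:
            - name: AWS_DEFAULT_ACL
              value: ""                # Backblaze B2 does not support ACLs
```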
## 🚨 Zero Trust Migration Issues Resolved
### Common Migration Problems
- **Mastodon Streaming**: Fixed subdomain compatibility for Cloudflare Free plan
- **OpenObserve StatefulSet**: Used manual Helm deployment to bypass immutable field restrictions
- **Picsur Service Discovery**: Fixed label mismatch between service selector and pod labels
- **Corporate VPN Blocking**: SSL handshake failures were traced to a corporate VPN by testing from different networks
### Harbor Registry Exception
**Why Harbor Can't Use Zero Trust**:
- **Issue**: Cloudflare header modification breaks container image layer writes
- **Solution**: Direct port exposure (80/443) for Harbor only
- **Security**: All other services use Zero Trust tunnels
## 🔧 Infrastructure Evolution Context
### Talos Configuration
- **Custom Image**: `613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4` with Longhorn extension
- **Network Interfaces**:
- `enp7s0`: Public interface (DHCP + static configuration)
- `enp9s0`: Private VLAN interface (static configuration)
### Storage Evolution
- **Original**: Basic Longhorn setup
- **Current**: 2-replica configuration with S3 backup integration
- **Backup Strategy**: Label-based volume selection system
- **Cost Optimization**: $6/TB with $0 egress via Cloudflare partnership
### Administrative Access Evolution
- **Original**: Direct public API access
- **Migration**: Tailscale mesh VPN implementation
- **Current**: CGNAT-only access (100.64.0.0/10) via mesh network
- **Security**: Zero external API exposure
## 📊 Operational Patterns Discovered
### Multi-Stage Docker Benefits
- **Size Reduction**: From 1.3GB single-stage to ~350MB multi-stage builds (~75% reduction)
- **Essential for**: Python/Node.js applications to remove build dependencies
- **Pattern**: Base image → Web container → Worker container specialization
### ActivityPub Rate Limiting Implementation
**Based on**: [PieFed blog recommendations](https://join.piefed.social/2024/04/17/handling-large-bursts-of-post-requests-to-your-activitypub-inbox-using-a-buffer-in-nginx/)
- **Rate**: 10 requests/second with 300 request burst buffer
- **Memory**: 100MB zone sufficient for large-scale instances (zone definition sketched after this list)
- **Federation Impact**: Graceful handling of viral content spikes
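Note that `limit_req_zone` is only valid in the nginx `http` context, so the zone belongs in the ingress-nginx controller ConfigMap rather than a per-Ingress annotation; a sketch (ConfigMap name and namespace depend on the install):
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller       # name depends on the install
  namespace: ingress-nginx
data:
  http-snippet: |
    limit_req_zone $binary_remote_addr zone=app_inbox:100m rate=10r/s;
```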
### Terminal Environment Discovery
- **PowerShell on macOS**: PSReadLine displays errors but commands execute successfully
- **Recommendation**: Use the default OS terminal rather than PowerShell (except on Windows)
- **Functionality**: Command outputs remain readable despite display issues
## 🎯 Critical Success Factors
### What Made Migrations Successful
1. **Gradual Migration**: One service at a time instead of big-bang approach
2. **Testing Pattern**: `kubectl run curl-test` to verify internal service health (see the Pod sketch after this list)
3. **Backup Strategies**: Target annotations as fallback for DNS issues
4. **Documentation**: Detailed tracking of each migration step and issue resolution
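The `curl-test` pattern from step 2, written out as a throwaway Pod (the target URL is a placeholder for the Service under test):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: curl-test
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl:latest
      # Image entrypoint is curl; args below are curl arguments
      args: ["-sf", "http://app-service.app-namespace.svc.cluster.local/health"]
```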
### Patterns to Avoid
1. **Custom DNS Domains**: Stick to `cluster.local` for compatibility
2. **Shared S3 Buckets**: Use dedicated buckets to avoid prefix conflicts
3. **Complex Subdomains**: Cloudflare Free plan limitations require simple patterns
4. **Single-Stage Containers**: Multi-stage builds essential for production efficiency
This historical knowledge should inform all future architectural decisions and troubleshooting approaches.

View File

@@ -0,0 +1,54 @@
# Zero Trust Ingress Template
# Use this template for all new applications deployed via Cloudflare tunnels
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
  namespace: app-namespace
  annotations:
    # Basic NGINX configuration only - no cert-manager or external-dns
    kubernetes.io/ingress.class: nginx   # legacy annotation; ingressClassName below is authoritative
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
    # Optional: extended timeouts for long-running requests
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    # Optional: ActivityPub rate limiting for fediverse applications.
    # NOTE: limit_req_zone is only valid in the nginx http context, so define
    # the zone in the ingress-nginx controller ConfigMap instead of a
    # server-snippet annotation:
    #   http-snippet: |
    #     limit_req_zone $binary_remote_addr zone=app_inbox:100m rate=10r/s;
    nginx.ingress.kubernetes.io/configuration-snippet: |
      location ~* ^/(inbox|users/.*/inbox) {
        limit_req zone=app_inbox burst=300;
      }
spec:
  ingressClassName: nginx
  tls: []  # Empty - TLS handled at the Cloudflare edge
  rules:
    - host: app.keyboardvagabond.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-service
                port:
                  number: 80
---
# Service template
apiVersion: v1
kind: Service
metadata:
  name: app-service
  namespace: app-namespace
spec:
  selector:
    app: app-name
  ports:
    - name: http
      port: 80
      targetPort: 8080