---
description: Historical issues, lessons learned, and troubleshooting knowledge from cluster evolution
globs: []
alwaysApply: false
---

# Troubleshooting History & Lessons Learned

This rule captures critical historical knowledge from the cluster's evolution, including resolved issues, migration challenges, and lessons learned that inform future decisions.

## 🔄 Major Architecture Migrations

### DNS Domain Evolution ✅ **RESOLVED**

- **Previous Issue**: Used custom `local.keyboardvagabond.com` domain causing compatibility problems
- **Resolution**: Reverted to standard `cluster.local` domain
- **Benefits**: Full compatibility with monitoring dashboards, service discovery, and all Kubernetes tooling
- **Lesson**: Always use standard Kubernetes domains unless absolutely necessary

### Zero Trust Migration ✅ **COMPLETED**

- **Migration Scope**: 10 of 11 external services migrated from external-dns/cert-manager to Cloudflare Zero Trust tunnels
- **Services Migrated**: Mastodon, Mastodon Streaming, Pixelfed, PieFed, Picsur, BookWyrm, Authentik, OpenObserve, Kibana, WriteFreely
- **Harbor Exception**: Harbor registry reverted to direct port exposure (80/443) due to Cloudflare header modification breaking container image layer writes
- **Dependencies Removed**: external-dns and cert-manager components no longer needed
- **Key Challenges Resolved**: Mastodon streaming subdomain compatibility, StatefulSet immutable fields, service discovery issues

## 🛠️ Historical Technical Issues

### DNS and External-DNS Resolution ✅ **RESOLVED & DEPRECATED**

- **Previous Issue**: External-DNS creating records with private VLAN IPs (10.132.0.x) which Cloudflare rejected
- **Temporary Solution**: Used `external-dns.alpha.kubernetes.io/target` annotations with public IPs
- **Target Annotations**: `152.53.107.24,152.53.105.81` were used for all ingress resources
- **Final Resolution**: **External-DNS completely removed in favor of Cloudflare Zero Trust tunnels**
- **Current Status**: Manual
  DNS record creation via Cloudflare Dashboard (external-dns no longer needed)

### SSL Certificate Issues ✅ **RESOLVED**

- **Previous Issue**: Let's Encrypt certificates stuck in "False/Not Ready" state due to DNS resolution failures
- **Resolution**: DNS records now resolve correctly, enabling HTTP-01 challenge completion
- **Migration**: Eventually replaced by Zero Trust architecture eliminating certificate management

### Node IP Configuration ✅ **IMPLEMENTED**

- **Approach**: Using kubelet `extraArgs` with `node-ip` parameter
- **n2 Status**: ✅ Successfully reporting public IP (152.53.105.81)
- **Backup Strategy**: Target annotations provide reliable DNS record creation regardless of node IP status

## 🔍 Framework-Specific Lessons Learned

### CDN Storage Evolution: Shared vs Dedicated Buckets

**Original Plan**: Single bucket with prefixes (`/pixelfed`, `/piefed`, `/mastodon`)

**Issue Discovered**: Pixelfed demonstrated inconsistent prefix handling, sometimes failing to return URLs with correct subdirectory

**Solution**: Dedicated buckets eliminate compatibility issues entirely

**Benefits of Dedicated Bucket Approach**:

- **Application Compatibility**: Some applications don't fully support S3 prefixes
- **No Prefix Conflicts**: Eliminates S3 path prefix issues with shared buckets
- **Simplified Configuration**: Clean S3 endpoints without complex path rewriting
- **Independent Scaling**: Each application can optimize caching independently

### Mastodon Streaming Subdomain Challenge ✅ **FIXED**

- **Original**: `streaming.mastodon.keyboardvagabond.com`
- **Issue**: Cloudflare Free plan subdomain limitation (not supported)
- **Solution**: Changed to `streamingmastodon.keyboardvagabond.com` ✅ **WORKING**
- **Lesson**: Cloudflare Free plan supports only one subdomain level (`app.domain.com` not `sub.app.domain.com`)

### Flask Application Discovery Patterns

**Critical Framework Identification**: Must identify Flask vs Django early in development

- **Flask**: Uses `flask`
  command, URL-based config (DATABASE_URL), application factory pattern
- **Django**: Uses `python manage.py` commands, separate host/port variables, standard project structure
- **uWSGI Integration**: Must use same Python version as venv; install via pip, not Alpine packages
- **Static Files**: Flask with application factory has nested structure (`/app/app/static/`)

### Laravel S3 Configuration Discoveries

**Critical Laravel S3 Settings**:

- **`DANGEROUSLY_SET_FILESYSTEM_DRIVER=s3`**: Essential to make S3 the default filesystem
- **Cache Invalidation**: Must run `php artisan config:cache` after S3 (or any) configuration changes
- **Dedicated Buckets**: Prevents double-prefix issues that occur with shared buckets

### Django Static File Pipeline

**Theme Compilation Order**: Must compile themes **before** static file collection to S3

- **Correct Pipeline**: `compile_themes` → `collectstatic` → S3 upload
- **Backblaze B2**: Requires empty `AWS_DEFAULT_ACL` due to no ACL support
- **Container Builds**: Theme compilation at runtime (not build time) requires database access

## 🚨 Zero Trust Migration Issues Resolved

### Common Migration Problems

- **Mastodon Streaming**: Fixed subdomain compatibility for Cloudflare Free plan
- **OpenObserve StatefulSet**: Used manual Helm deployment to bypass immutable field restrictions
- **Picsur Service Discovery**: Fixed label mismatch between service selector and pod labels
- **Corporate VPN Blocking**: SSL handshake failures resolved by testing from different networks

### Harbor Registry Exception

**Why Harbor Can't Use Zero Trust**:

- **Issue**: Cloudflare header modification breaks container image layer writes
- **Solution**: Direct port exposure (80/443) for Harbor only
- **Security**: All other services use Zero Trust tunnels

## 🔧 Infrastructure Evolution Context

### Talos Configuration

- **Custom Image**: `613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4` with Longhorn extension
- **Network Interfaces**:
  -
    `enp7s0`: Public interface (DHCP + static configuration)
  - `enp9s0`: Private VLAN interface (static configuration)

### Storage Evolution

- **Original**: Basic Longhorn setup
- **Current**: 2-replica configuration with S3 backup integration
- **Backup Strategy**: Label-based volume selection system
- **Cost Optimization**: $6/TB with $0 egress via Cloudflare partnership

### Administrative Access Evolution

- **Original**: Direct public API access
- **Migration**: Tailscale mesh VPN implementation
- **Current**: CGNAT-only access (100.64.0.0/10) via mesh network
- **Security**: Zero external API exposure

## 📊 Operational Patterns Discovered

### Multi-Stage Docker Benefits

- **Size Reduction**: From 1.3GB single-stage to ~350MB multi-stage builds (~75% reduction)
- **Essential for**: Python/Node.js applications to remove build dependencies
- **Pattern**: Base image → Web container → Worker container specialization

### ActivityPub Rate Limiting Implementation

**Based on**: [PieFed blog recommendations](https://join.piefed.social/2024/04/17/handling-large-bursts-of-post-requests-to-your-activitypub-inbox-using-a-buffer-in-nginx/)

- **Rate**: 10 requests/second with 300 request burst buffer
- **Memory**: 100MB zone sufficient for large-scale instances
- **Federation Impact**: Graceful handling of viral content spikes

### Terminal Environment Discovery

- **PowerShell on macOS**: PSReadLine displays errors but commands execute successfully
- **Recommendation**: Use default OS terminal over PowerShell (except Windows)
- **Functionality**: Command outputs remain readable despite display issues

## 🎯 Critical Success Factors

### What Made Migrations Successful

1. **Gradual Migration**: One service at a time instead of big-bang approach
2. **Testing Pattern**: `kubectl run curl-test` to verify internal service health
3. **Backup Strategies**: Target annotations as fallback for DNS issues
4.
   **Documentation**: Detailed tracking of each migration step and issue resolution

### Patterns to Avoid

1. **Custom DNS Domains**: Stick to `cluster.local` for compatibility
2. **Shared S3 Buckets**: Use dedicated buckets to avoid prefix conflicts
3. **Complex Subdomains**: Cloudflare Free plan limitations require simple patterns
4. **Single-Stage Containers**: Multi-stage builds essential for production efficiency

This historical knowledge should inform all future architectural decisions and troubleshooting approaches.
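
The one-subdomain-level rule from the Mastodon streaming lesson can be checked before a hostname is chosen for a new service. The following is a minimal Python sketch and not part of the cluster tooling: the function name `is_single_level_subdomain` and the idea of validating hostnames in a script are assumptions made for illustration only.

```python
def is_single_level_subdomain(hostname: str, zone: str) -> bool:
    """Return True if hostname sits exactly one label below zone.

    Hypothetical helper: it mirrors the lesson that `app.domain.com`
    works on the Cloudflare Free plan while `sub.app.domain.com` does not.
    """
    hostname = hostname.lower().rstrip(".")
    zone = zone.lower().rstrip(".")
    suffix = "." + zone
    if not hostname.endswith(suffix):
        # Either outside the zone or the zone apex itself.
        return False
    prefix = hostname[: -len(suffix)]
    # Exactly one non-empty label means no further dots in the prefix.
    return bool(prefix) and "." not in prefix


if __name__ == "__main__":
    zone = "keyboardvagabond.com"
    for host in (
        "streamingmastodon.keyboardvagabond.com",
        "streaming.mastodon.keyboardvagabond.com",
    ):
        print(host, is_single_level_subdomain(host, zone))
```

With the two hostnames from the streaming migration, the flattened name passes the check and the nested name fails it, which matches the behavior recorded above.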