---
description: Historical issues, lessons learned, and troubleshooting knowledge from cluster evolution
globs: []
alwaysApply: false
---
# Troubleshooting History & Lessons Learned
This rule captures critical historical knowledge from the cluster's evolution, including resolved issues, migration challenges, and lessons learned that inform future decisions.
## 🔄 Major Architecture Migrations
### DNS Domain Evolution ✅ **RESOLVED**
- **Previous Issue**: Used custom `local.keyboardvagabond.com` domain causing compatibility problems
- **Resolution**: Reverted to standard `cluster.local` domain
- **Benefits**: Full compatibility with monitoring dashboards, service discovery, and all Kubernetes tooling
- **Lesson**: Use the standard Kubernetes cluster domain unless a custom one is strictly required
### Zero Trust Migration ✅ **COMPLETED**
- **Migration Scope**: 10 of 11 external services migrated from external-dns/cert-manager to Cloudflare Zero Trust tunnels
- **Services Migrated**: Mastodon, Mastodon Streaming, Pixelfed, PieFed, Picsur, BookWyrm, Authentik, OpenObserve, Kibana, WriteFreely
- **Harbor Exception**: Harbor registry reverted to direct port exposure (80/443) due to Cloudflare header modification breaking container image layer writes
- **Dependencies Removed**: external-dns and cert-manager components no longer needed
- **Key Challenges Resolved**: Mastodon streaming subdomain compatibility, StatefulSet immutable fields, service discovery issues
## 🛠️ Historical Technical Issues
### DNS and External-DNS Resolution ✅ **RESOLVED & DEPRECATED**
- **Previous Issue**: External-DNS creating records with private VLAN IPs (10.132.0.x) which Cloudflare rejected
- **Temporary Solution**: Used `external-dns.alpha.kubernetes.io/target` annotations with public IPs
- **Target Annotations**: `152.53.107.24,152.53.105.81` were used for all ingress resources
- **Final Resolution**: **External-DNS completely removed in favor of Cloudflare Zero Trust tunnels**
- **Current Status**: Manual DNS record creation via Cloudflare Dashboard (external-dns no longer needed)
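
The private-address problem described above can be caught before any record is published. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `publishable_targets` and the sample address `10.132.0.5` are illustrative assumptions, not part of the cluster configuration.

```python
import ipaddress


def publishable_targets(addresses):
    """Return only the addresses that are safe to publish in public DNS."""
    public = []
    for raw in addresses:
        ip = ipaddress.ip_address(raw)
        # is_global is False for private ranges such as 10.0.0.0/8,
        # which is where the private VLAN addresses (10.132.0.x) live.
        if ip.is_global:
            public.append(raw)
    return public


# 10.132.0.5 is a hypothetical private VLAN address used for illustration.
candidates = ["10.132.0.5", "152.53.107.24", "152.53.105.81"]
print(publishable_targets(candidates))
```

A check like this could run in CI against ingress annotations, so a private address is flagged before Cloudflare rejects it.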
### SSL Certificate Issues ✅ **RESOLVED**
- **Previous Issue**: Let's Encrypt certificates stuck in "False/Not Ready" state due to DNS resolution failures
- **Resolution**: DNS records now resolve correctly, enabling HTTP-01 challenge completion
- **Migration**: Eventually replaced by Zero Trust architecture eliminating certificate management
### Node IP Configuration ✅ **IMPLEMENTED**
- **Approach**: Using kubelet `extraArgs` with `node-ip` parameter
- **n2 Status**: ✅ Successfully reporting public IP (152.53.105.81)
- **Backup Strategy**: Target annotations provide reliable DNS record creation regardless of node IP status
## 🔍 Framework-Specific Lessons Learned
### CDN Storage Evolution: Shared vs Dedicated Buckets
- **Original Plan**: Single bucket with prefixes (`/pixelfed`, `/piefed`, `/mastodon`)
- **Issue Discovered**: Pixelfed demonstrated inconsistent prefix handling, sometimes failing to return URLs with correct subdirectory
- **Solution**: Dedicated buckets eliminate compatibility issues entirely

**Benefits of Dedicated Bucket Approach**:
- **Application Compatibility**: Some applications don't fully support S3 prefixes
- **No Prefix Conflicts**: Eliminates S3 path prefix issues with shared buckets
- **Simplified Configuration**: Clean S3 endpoints without complex path rewriting
- **Independent Scaling**: Each application can optimize caching independently
### Mastodon Streaming Subdomain Challenge ✅ **FIXED**
- **Original**: `streaming.mastodon.keyboardvagabond.com`
- **Issue**: Two-level subdomains such as `streaming.mastodon.keyboardvagabond.com` are not supported on the Cloudflare Free plan
- **Solution**: Changed to `streamingmastodon.keyboardvagabond.com` ✅ **WORKING**
- **Lesson**: Cloudflare Free plan supports only one subdomain level (`app.domain.com`, not `sub.app.domain.com`)
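
The one-level rule above is easy to check mechanically before a hostname is assigned to a service. A minimal sketch, assuming the zone apex is `keyboardvagabond.com`; the helper names `subdomain_depth` and `fits_single_level` are hypothetical.

```python
def subdomain_depth(hostname, zone):
    """Count the DNS labels that sit in front of the zone apex."""
    hostname = hostname.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    if hostname == zone:
        return 0
    if not hostname.endswith("." + zone):
        raise ValueError(f"{hostname} is not inside {zone}")
    # Strip ".<zone>" from the end and count what is left.
    prefix = hostname[: -(len(zone) + 1)]
    return len(prefix.split("."))


def fits_single_level(hostname, zone):
    """True when the hostname is the apex or a first-level subdomain."""
    return subdomain_depth(hostname, zone) <= 1


zone = "keyboardvagabond.com"
for host in ("streaming.mastodon.keyboardvagabond.com",
             "streamingmastodon.keyboardvagabond.com"):
    print(host, fits_single_level(host, zone))
```

The original streaming hostname has a depth of two and fails the check; the replacement has a depth of one and passes.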
### Flask Application Discovery Patterns
**Critical Framework Identification**: Must identify Flask vs Django early in development
- **Flask**: Uses `flask` command, URL-based config (DATABASE_URL), application factory pattern
- **Django**: Uses `python manage.py` commands, separate host/port variables, standard project structure
- **uWSGI Integration**: Must use same Python version as venv; install via pip, not Alpine packages
- **Static Files**: Flask with application factory has nested structure (`/app/app/static/`)
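
The difference between URL-based and split host/port configuration matters when one database secret has to feed both styles of application. The sketch below splits a `DATABASE_URL` into separate values with the standard library. The output key names (`DB_HOST`, `DB_PORT`, and so on), the PostgreSQL default port, and the sample URL are assumptions for illustration; each application defines its own variable names.

```python
from urllib.parse import urlsplit


def split_database_url(url):
    """Break a DATABASE_URL into the separate values a host/port config expects."""
    parts = urlsplit(url)
    return {
        "DB_HOST": parts.hostname,
        # Assumption: fall back to the PostgreSQL default port when none is given.
        "DB_PORT": parts.port or 5432,
        "DB_NAME": parts.path.lstrip("/"),
        "DB_USER": parts.username,
        # Note: percent-encoded characters in the password are not decoded here.
        "DB_PASSWORD": parts.password,
    }


# Hypothetical connection string, not a real cluster credential.
example = "postgresql://piefed:secret@db.example.internal:5432/piefed"
for key, value in split_database_url(example).items():
    print(f"{key}={value}")
```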
### Laravel S3 Configuration Discoveries
**Critical Laravel S3 Settings**:
- **`DANGEROUSLY_SET_FILESYSTEM_DRIVER=s3`**: Essential to make S3 the default filesystem
- **Cache Invalidation**: Must run `php artisan config:cache` after S3 (or any) configuration changes
- **Dedicated Buckets**: Prevents double-prefix issues that occur with shared buckets
### Django Static File Pipeline
**Theme Compilation Order**: Must compile themes **before** static file collection to S3
- **Correct Pipeline**: `compile_themes` → `collectstatic` → S3 upload
- **Backblaze B2**: Requires empty `AWS_DEFAULT_ACL` due to no ACL support
- **Container Builds**: Theme compilation requires database access, so it must run at runtime rather than at image build time
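
The ordering constraint above can be enforced by a small startup script that stops as soon as a step fails. This is a sketch only: it assumes `manage.py` is in the working directory and that `collectstatic` writes to S3 through the configured storage backend, so no separate upload step is shown. The `run_pipeline` helper is hypothetical.

```python
import subprocess

# Order matters: themes must exist before static files are collected.
PIPELINE = [
    ["python", "manage.py", "compile_themes"],
    ["python", "manage.py", "collectstatic", "--no-input"],
]


def run_pipeline(run=subprocess.run):
    """Run each step in order; check=True aborts on the first failure."""
    for command in PIPELINE:
        run(command, check=True)


if __name__ == "__main__":
    run_pipeline()
```

Because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit code, a failed theme compilation prevents stale or incomplete assets from being collected.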
## 🚨 Zero Trust Migration Issues Resolved
### Common Migration Problems
- **Mastodon Streaming**: Fixed subdomain compatibility for Cloudflare Free plan
- **OpenObserve StatefulSet**: Used manual Helm deployment to bypass immutable field restrictions
- **Picsur Service Discovery**: Fixed label mismatch between service selector and pod labels
- **Corporate VPN Blocking**: SSL handshake failures were traced to a corporate VPN by testing from a different network
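
A selector and label mismatch like the Picsur one is quick to confirm once both maps are in hand. The sketch below mirrors the equality-based matching rule that Services use: every selector pair must be present on the pod. The label values shown are hypothetical examples, not the actual Picsur manifests.

```python
def selector_matches(selector, pod_labels):
    """True only if every selector key/value pair is present on the pod."""
    # An empty selector is treated as "selects nothing" in this sketch.
    return bool(selector) and all(
        pod_labels.get(key) == value for key, value in selector.items()
    )


def mismatched_keys(selector, pod_labels):
    """List the selector keys whose values the pod does not carry."""
    return sorted(
        key for key, value in selector.items() if pod_labels.get(key) != value
    )


# Hypothetical labels illustrating a typical mismatch.
service_selector = {"app": "picsur"}
pod_labels = {"app.kubernetes.io/name": "picsur"}

print(selector_matches(service_selector, pod_labels))
print(mismatched_keys(service_selector, pod_labels))
```

When the function returns `False`, the Service has no endpoints for that pod, which shows up as connection failures through the tunnel even though the pod itself is healthy.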
### Harbor Registry Exception
**Why Harbor Can't Use Zero Trust**:
- **Issue**: Cloudflare header modification breaks container image layer writes
- **Solution**: Direct port exposure (80/443) for Harbor only
- **Security**: All other services use Zero Trust tunnels
## 🔧 Infrastructure Evolution Context
### Talos Configuration
- **Custom Image**: `613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4` with Longhorn extension
- **Network Interfaces**:
  - `enp7s0`: Public interface (DHCP + static configuration)
  - `enp9s0`: Private VLAN interface (static configuration)
### Storage Evolution
- **Original**: Basic Longhorn setup
- **Current**: 2-replica configuration with S3 backup integration
- **Backup Strategy**: Label-based volume selection system
- **Cost Optimization**: $6/TB with $0 egress via Cloudflare partnership
### Administrative Access Evolution
- **Original**: Direct public API access
- **Migration**: Tailscale mesh VPN implementation
- **Current**: CGNAT-only access (100.64.0.0/10) via mesh network
- **Security**: Zero external API exposure
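
To verify that an API client is arriving over the mesh network rather than a public address, the source IP can be tested against the CGNAT range. A minimal sketch with the standard `ipaddress` module; `is_mesh_address` is a hypothetical helper and `100.101.102.103` is an illustrative address.

```python
import ipaddress

# The CGNAT range used for mesh addresses, as noted above.
CGNAT_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_mesh_address(raw):
    """True when the address falls inside 100.64.0.0/10."""
    return ipaddress.ip_address(raw) in CGNAT_RANGE


for address in ("100.101.102.103", "152.53.105.81", "100.128.0.1"):
    print(address, is_mesh_address(address))
```

The range ends at `100.127.255.255`, so `100.128.0.1` is outside it even though it shares the same first octet.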
## 📊 Operational Patterns Discovered
### Multi-Stage Docker Benefits
- **Size Reduction**: From 1.3GB single-stage to ~350MB multi-stage builds (~73% reduction)
- **Essential for**: Python/Node.js applications to remove build dependencies
- **Pattern**: Base image → Web container → Worker container specialization
### ActivityPub Rate Limiting Implementation
**Based on**: [PieFed blog recommendations](https://join.piefed.social/2024/04/17/handling-large-bursts-of-post-requests-to-your-activitypub-inbox-using-a-buffer-in-nginx/)
- **Rate**: 10 requests/second with 300 request burst buffer
- **Memory**: 100MB zone sufficient for large-scale instances
- **Federation Impact**: Graceful handling of viral content spikes
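
To build intuition for what a 10 requests/second rate with a 300-request burst buffer does during a spike, the following simplified leaky-bucket model may help. It is an illustration of the general technique, not a reimplementation of nginx's rate limiter, whose accounting differs in detail; the class name `LeakyBucket` is hypothetical.

```python
class LeakyBucket:
    """Simplified model: a queue that drains at a fixed rate and holds `burst` items."""

    def __init__(self, rate_per_sec=10, burst=300):
        self.rate = rate_per_sec
        self.burst = burst
        self.queued = 0.0
        self.last = 0.0

    def offer(self, now):
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        # Drain the queue at the configured rate since the previous arrival.
        self.queued = max(0.0, self.queued - (now - self.last) * self.rate)
        self.last = now
        if self.queued + 1 > self.burst:
            return False  # buffer full: the request would be rejected
        self.queued += 1
        return True


bucket = LeakyBucket()
spike = sum(bucket.offer(0.0) for _ in range(500))      # 500 requests at once
one_second_later = sum(bucket.offer(1.0) for _ in range(50))
print(spike, one_second_later)
```

In this model a burst of 500 simultaneous inbox POSTs has 300 buffered and 200 rejected, and one second later only 10 more slots have opened up, which is the behavior that lets a viral post be absorbed gradually instead of overwhelming the workers.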
### Terminal Environment Discovery
- **PowerShell on macOS**: PSReadLine displays errors but commands execute successfully
- **Recommendation**: Use the default OS terminal instead of PowerShell (except on Windows)
- **Functionality**: Command outputs remain readable despite display issues
## 🎯 Critical Success Factors
### What Made Migrations Successful
1. **Gradual Migration**: One service at a time instead of big-bang approach
2. **Testing Pattern**: `kubectl run curl-test` to verify internal service health
3. **Backup Strategies**: Target annotations as fallback for DNS issues
4. **Documentation**: Detailed tracking of each migration step and issue resolution
### Patterns to Avoid
1. **Custom DNS Domains**: Stick to `cluster.local` for compatibility
2. **Shared S3 Buckets**: Use dedicated buckets to avoid prefix conflicts
3. **Complex Subdomains**: Cloudflare Free plan limitations require simple patterns
4. **Single-Stage Containers**: Multi-stage builds essential for production efficiency
This historical knowledge should inform all future architectural decisions and troubleshooting approaches.