---
description: Historical issues, lessons learned, and troubleshooting knowledge from cluster evolution
globs: []
alwaysApply: false
---

# Troubleshooting History & Lessons Learned

This rule captures critical historical knowledge from the cluster's evolution, including resolved issues, migration challenges, and lessons learned that inform future decisions.

## 🔄 Major Architecture Migrations

### DNS Domain Evolution ✅ **RESOLVED**
- **Previous Issue**: The cluster used a custom `local.keyboardvagabond.com` cluster domain, which caused compatibility problems
- **Resolution**: Reverted to the standard `cluster.local` domain
- **Benefits**: Full compatibility with monitoring dashboards, service discovery, and all Kubernetes tooling
- **Lesson**: Use the standard Kubernetes cluster domain unless a custom one is strictly required

### Zero Trust Migration ✅ **COMPLETED**
- **Migration Scope**: 10 of 11 external services migrated from external-dns/cert-manager to Cloudflare Zero Trust tunnels
- **Services Migrated**: Mastodon, Mastodon Streaming, Pixelfed, PieFed, Picsur, BookWyrm, Authentik, OpenObserve, Kibana, WriteFreely
- **Harbor Exception**: Harbor registry reverted to direct port exposure (80/443) due to Cloudflare header modification breaking container image layer writes
- **Dependencies Removed**: external-dns and cert-manager components no longer needed
- **Key Challenges Resolved**: Mastodon streaming subdomain compatibility, StatefulSet immutable fields, service discovery issues

## 🛠️ Historical Technical Issues

### DNS and External-DNS Resolution ✅ **RESOLVED & DEPRECATED**
- **Previous Issue**: External-DNS created records with private VLAN IPs (10.132.0.x), which Cloudflare rejected
- **Temporary Solution**: Used `external-dns.alpha.kubernetes.io/target` annotations with public IPs
- **Target Annotations**: `152.53.107.24,152.53.105.81` were used for all ingress resources
- **Final Resolution**: **External-DNS completely removed in favor of Cloudflare Zero Trust tunnels**
- **Current Status**: DNS records are created manually via the Cloudflare Dashboard (external-dns no longer needed)
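
The rejected records above come down to whether an address is publicly routable. A minimal sketch, assuming only Python's standard `ipaddress` module, of the check that the target annotations effectively performed by hand. The helper name `publishable_targets` and the address `10.132.0.10` are hypothetical and chosen for illustration; external-dns itself exposes no such function.

```python
import ipaddress


def publishable_targets(candidates):
    """Return only the addresses that are globally routable.

    Hypothetical helper for illustration; not part of external-dns.
    """
    return [ip for ip in candidates if ipaddress.ip_address(ip).is_global]


# 10.132.0.10 is an illustrative address inside the private VLAN range.
node_ips = ["10.132.0.10", "152.53.107.24", "152.53.105.81"]
print(publishable_targets(node_ips))
```

Filtering on `is_global` rather than on a hard-coded `10.0.0.0/8` check also excludes other non-routable ranges, which may or may not be what a given setup wants.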

### SSL Certificate Issues ✅ **RESOLVED**
- **Previous Issue**: Let's Encrypt certificates stuck in "False/Not Ready" state due to DNS resolution failures
- **Resolution**: Once DNS records resolved correctly, the HTTP-01 challenges could complete
- **Migration**: Later superseded by the Zero Trust architecture, which eliminated certificate management

### Node IP Configuration ✅ **IMPLEMENTED**
- **Approach**: Using kubelet `extraArgs` with `node-ip` parameter
- **n2 Status**: ✅ Successfully reporting public IP (152.53.105.81)
- **Backup Strategy**: Target annotations provide reliable DNS record creation regardless of node IP status

## 🔍 Framework-Specific Lessons Learned

### CDN Storage Evolution: Shared vs Dedicated Buckets
- **Original Plan**: Single bucket with prefixes (`/pixelfed`, `/piefed`, `/mastodon`)
- **Issue Discovered**: Pixelfed handled prefixes inconsistently and sometimes returned URLs without the correct subdirectory
- **Solution**: Dedicated buckets eliminate compatibility issues entirely

**Benefits of Dedicated Bucket Approach**:
- **Application Compatibility**: Some applications don't fully support S3 prefixes
- **No Prefix Conflicts**: Eliminates S3 path prefix issues with shared buckets
- **Simplified Configuration**: Clean S3 endpoints without complex path rewriting
- **Independent Scaling**: Each application can optimize caching independently

### Mastodon Streaming Subdomain Challenge ✅ **FIXED**
- **Original**: `streaming.mastodon.keyboardvagabond.com`
- **Issue**: Two-level subdomains are not supported on the Cloudflare Free plan
- **Solution**: Changed to `streamingmastodon.keyboardvagabond.com` ✅ **WORKING**
- **Lesson**: Cloudflare Free plan supports only one subdomain level (`app.domain.com`, not `sub.app.domain.com`)
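
The one-level rule above is easy to check before a hostname is rolled out. A minimal sketch, using hypothetical helper names (`subdomain_depth`, `fits_one_level_rule`) and assuming the zone is `keyboardvagabond.com`:

```python
def subdomain_depth(hostname, zone):
    """Count the DNS labels that `hostname` adds in front of `zone`."""
    hostname = hostname.rstrip(".").lower()
    zone = zone.rstrip(".").lower()
    if hostname == zone:
        return 0
    if not hostname.endswith("." + zone):
        raise ValueError(f"{hostname!r} is not inside zone {zone!r}")
    prefix = hostname[: -(len(zone) + 1)]
    return len(prefix.split("."))


def fits_one_level_rule(hostname, zone="keyboardvagabond.com"):
    """True when the hostname is the zone apex or a single-level subdomain."""
    return subdomain_depth(hostname, zone) <= 1


for name in (
    "streaming.mastodon.keyboardvagabond.com",
    "streamingmastodon.keyboardvagabond.com",
):
    print(name, fits_one_level_rule(name))
```

The first hostname has two labels in front of the zone and fails the check; the renamed one has a single label and passes.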

### Flask Application Discovery Patterns
**Critical Framework Identification**: Must identify Flask vs Django early in development
- **Flask**: Uses `flask` command, URL-based config (DATABASE_URL), application factory pattern
- **Django**: Uses `python manage.py` commands, separate host/port variables, standard project structure
- **uWSGI Integration**: uWSGI must use the same Python version as the venv; install it via pip, not from Alpine packages
- **Static Files**: Flask with application factory has nested structure (`/app/app/static/`)
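
The two configuration styles above carry the same information in different shapes. A minimal sketch, assuming a PostgreSQL-style URL, of splitting a `DATABASE_URL` into the separate fields a host/port-style configuration expects. The function name, the dictionary keys, and the example URL are all hypothetical.

```python
from urllib.parse import urlsplit


def split_database_url(url):
    """Break a URL-style database setting into separate connection fields."""
    parts = urlsplit(url)
    return {
        "engine": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "name": parts.path.lstrip("/"),
        "user": parts.username,
        "password": parts.password,
    }


# Illustrative URL only; no real credentials or hosts.
example = "postgresql://app:secret@postgres.example.internal:5432/appdb"
for key, value in split_database_url(example).items():
    print(f"{key}={value}")
```

Going the other way (assembling a URL from separate variables) needs the password percent-encoded, which is a common source of startup failures when the password contains characters such as `@` or `/`.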

### Laravel S3 Configuration Discoveries
**Critical Laravel S3 Settings**:
- **`DANGEROUSLY_SET_FILESYSTEM_DRIVER=s3`**: Essential to make S3 the default filesystem
- **Cache Invalidation**: Must run `php artisan config:cache` after S3 (or any) configuration changes
- **Dedicated Buckets**: Prevents double-prefix issues that occur with shared buckets

### Django Static File Pipeline
**Theme Compilation Order**: Must compile themes **before** static file collection to S3
- **Correct Pipeline**: `compile_themes` → `collectstatic` → S3 upload
- **Backblaze B2**: Requires empty `AWS_DEFAULT_ACL` due to no ACL support
- **Container Builds**: Theme compilation requires database access, so it must run at runtime rather than at build time
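
The ordering constraint above can be enforced by a small startup script. A minimal sketch, assuming both steps are Django management commands run through `python manage.py` and that the S3 upload happens as part of `collectstatic` via the configured storage backend. The `--no-input` flag and the `run_pipeline` helper are assumptions added for illustration, not taken from the original setup.

```python
import subprocess

# Ordered steps: themes must be compiled before static files are collected.
PIPELINE = (
    ("python", "manage.py", "compile_themes"),
    ("python", "manage.py", "collectstatic", "--no-input"),
)


def run_pipeline(steps=None, runner=subprocess.run):
    """Run each step in order and stop at the first failure.

    `runner` is injectable so the ordering can be exercised without Django.
    """
    completed = []
    for argv in steps or PIPELINE:
        result = runner(list(argv))
        if result.returncode != 0:
            raise RuntimeError(f"step failed: {' '.join(argv)}")
        completed.append(argv[2])
    return completed
```

Calling `run_pipeline()` from the container entrypoint, after the database is reachable, keeps a failed theme build from silently uploading stale assets.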

## 🚨 Zero Trust Migration Issues Resolved

### Common Migration Problems
- **Mastodon Streaming**: Fixed subdomain compatibility for Cloudflare Free plan
- **OpenObserve StatefulSet**: Used manual Helm deployment to bypass immutable field restrictions
- **Picsur Service Discovery**: Fixed label mismatch between service selector and pod labels
- **Corporate VPN Blocking**: SSL handshake failures were traced to a corporate VPN by testing from different networks
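
The Picsur fix above is an instance of a general rule: a Service routes only to pods whose labels contain every key/value pair in its selector. A minimal sketch of that subset check follows; the label values shown are hypothetical and are not the actual Picsur manifests.

```python
def selector_matches(selector, pod_labels):
    """True when every selector key/value pair is present on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def mismatched_keys(selector, pod_labels):
    """List the selector keys that the pod is missing or sets differently."""
    return sorted(
        key for key, value in selector.items() if pod_labels.get(key) != value
    )


# Hypothetical labels illustrating a typical mismatch.
service_selector = {"app": "picsur"}
pod_labels = {"app.kubernetes.io/name": "picsur", "tier": "web"}

print(selector_matches(service_selector, pod_labels))
print(mismatched_keys(service_selector, pod_labels))
```

A Service with a mismatched selector has no endpoints, so checking the selector against the pod's labels is usually the fastest way to explain a tunnel that connects but returns errors.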

### Harbor Registry Exception
**Why Harbor Can't Use Zero Trust**:
- **Issue**: Cloudflare header modification breaks container image layer writes
- **Solution**: Direct port exposure (80/443) for Harbor only
- **Security**: All other services use Zero Trust tunnels

## 🔧 Infrastructure Evolution Context

### Talos Configuration
- **Custom Image**: `613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4` with Longhorn extension
- **Network Interfaces**:
  - `enp7s0`: Public interface (DHCP + static configuration)
  - `enp9s0`: Private VLAN interface (static configuration)

### Storage Evolution
- **Original**: Basic Longhorn setup
- **Current**: 2-replica configuration with S3 backup integration
- **Backup Strategy**: Label-based volume selection system
- **Cost Optimization**: $6/TB with $0 egress via Cloudflare partnership

### Administrative Access Evolution
- **Original**: Direct public API access
- **Migration**: Tailscale mesh VPN implementation
- **Current**: CGNAT-only access (100.64.0.0/10) via mesh network
- **Security**: Zero external API exposure

## 📊 Operational Patterns Discovered

### Multi-Stage Docker Benefits
- **Size Reduction**: From 1.3 GB single-stage to ~350 MB multi-stage builds (~75% reduction)
- **Essential for**: Python/Node.js applications to remove build dependencies
- **Pattern**: Base image → Web container → Worker container specialization

### ActivityPub Rate Limiting Implementation
**Based on**: [PieFed blog recommendations](https://join.piefed.social/2024/04/17/handling-large-bursts-of-post-requests-to-your-activitypub-inbox-using-a-buffer-in-nginx/)
- **Rate**: 10 requests/second with a 300-request burst buffer
- **Memory**: 100 MB zone sufficient for large-scale instances
- **Federation Impact**: Graceful handling of viral content spikes
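
To see what these numbers mean for a traffic spike, here is a simplified model of leaky-bucket accounting in the style of nginx's `limit_req`, which is assumed to be the mechanism behind the limits above. It models only whether a request is accepted or rejected, not the delay applied to queued requests, and the class name `LeakyBucket` is hypothetical.

```python
class LeakyBucket:
    """Simplified leaky-bucket limiter: drains `rate` requests per second
    and tolerates up to `burst` requests above that rate."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.excess = 0.0
        self.last = None

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is accepted."""
        if self.last is None:
            self.last = now
            return True
        elapsed = now - self.last
        excess = max(self.excess - self.rate * elapsed, 0.0) + 1.0
        if excess > self.burst:
            return False  # buffer full: state is left unchanged on rejection
        self.excess = excess
        self.last = now
        return True


limiter = LeakyBucket(rate=10, burst=300)
spike = sum(limiter.allow(0.0) for _ in range(400))
print(f"accepted {spike} of 400 in an instantaneous spike")
```

In this model an instantaneous spike of 400 requests has 301 accepted (one served immediately plus the 300-request buffer), and one second later there is room for 10 more, matching the configured drain rate.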

### Terminal Environment Discovery
- **PowerShell on macOS**: PSReadLine displays errors but commands execute successfully
- **Recommendation**: Use the default OS terminal instead of PowerShell, except on Windows
- **Functionality**: Command outputs remain readable despite display issues

## 🎯 Critical Success Factors

### What Made Migrations Successful
1. **Gradual Migration**: One service at a time instead of big-bang approach
2. **Testing Pattern**: `kubectl run curl-test` to verify internal service health
3. **Backup Strategies**: Target annotations as fallback for DNS issues
4. **Documentation**: Detailed tracking of each migration step and issue resolution
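
The `curl-test` pattern in item 2 can be scripted so every service is probed the same way. A minimal sketch that only builds the command line: the `curlimages/curl` image, the service name `web`, the namespace `example-app`, the port, and the `/health` path are all assumptions chosen for illustration.

```python
import shlex


def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """In-cluster DNS name of a Service under the standard cluster domain."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def curl_test_command(service, namespace, port=80, path="/"):
    """Build a one-shot `kubectl run curl-test` probe that prints the HTTP status."""
    url = f"http://{service_fqdn(service, namespace)}:{port}{path}"
    return [
        "kubectl", "run", "curl-test",
        "--image=curlimages/curl",  # assumed image; any image with curl works
        "--restart=Never", "--rm", "-i",
        "--", "curl", "-sS", "-o", "/dev/null", "-w", "%{http_code}", url,
    ]


# Hypothetical service and namespace names.
print(shlex.join(curl_test_command("web", "example-app", port=8080, path="/health")))
```

Probing the Service's in-cluster name first separates application problems from tunnel or DNS problems before anything external is involved.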

### Patterns to Avoid
1. **Custom DNS Domains**: Stick to `cluster.local` for compatibility
2. **Shared S3 Buckets**: Use dedicated buckets to avoid prefix conflicts
3. **Complex Subdomains**: Cloudflare Free plan limitations require simple patterns
4. **Single-Stage Containers**: Multi-stage builds essential for production efficiency

This historical knowledge should inform all future architectural decisions and troubleshooting approaches.