# Longhorn S3 API Call Reduction - Final Solution

## Problem Summary

Longhorn was making **145,000+ Class C API calls/day** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.

## Root Cause

Longhorn's `backupstore-poll-interval` setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.

## Solution History

### Attempt 1: NetworkPolicy-Based Access Control ❌

**Approach**: Use NetworkPolicies, dynamically managed by CronJobs, to block S3 access outside backup windows (12:55 AM - 4:00 AM).

**Why It Failed**:

- NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
- Longhorn manager pods couldn't perform leader election or webhook operations
- Pods entered 1/2 Ready state with errors: `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout`
- Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
- Cilium/NetworkPolicy interaction complexity made it unreliable

**Files Created** (kept for reference):

- `network-policy-s3-block.yaml` - CronJobs and NetworkPolicy definitions
- Removed from `kustomization.yaml` but retained in repository

## Final Solution: Increased Poll Interval ✅

### Implementation

**Change**: Set `backupstore-poll-interval` to `86400` seconds (24 hours) instead of `0`.

**Location**: `manifests/infrastructure/longhorn/config-map.yaml`

```yaml
data:
  default-resource.yaml: |-
    "backup-target": "s3://@/longhorn-backup"
    "backup-target-credential-secret": "backblaze-credentials"
    "backupstore-poll-interval": "86400" # 24 hours
    "virtual-hosted-style": "true"
```

### Why This Works
1. **Dramatic Reduction**: Polling happens once per day instead of continuously
2. **No Breakage**: Kubernetes API, webhooks, and leader election work normally
3. **Simple**: No complex NetworkPolicies or CronJobs to manage
4. **Reliable**: Well-tested Longhorn configuration option
5. **Sufficient**: Backups don't require frequent polling, since we use scheduled recurring jobs

### Expected Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Poll Frequency** | Every ~5 seconds | Every 24 hours | **99.99% reduction** |
| **Daily S3 API Calls** | 145,000+ | ~300-1,000 | **99% reduction** 📉 |
| **Backblaze Costs** | Exceeds free tier | Within free tier | ✅ |
| **System Stability** | Affected by NetworkPolicy | Stable | ✅ |

## Current Status

- ✅ **Applied**: ConfigMap updated with `backupstore-poll-interval: 86400`
- ✅ **Verified**: Longhorn manager pods are 2/2 Ready
- ✅ **Backups**: Continue working normally via recurring jobs
- ✅ **Monitoring**: Backblaze API usage should drop to <1,000 calls/day

## Monitoring

### Check Longhorn Manager Health

```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods
```

### Check Poll Interval Setting

```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"
```

### Check Backups Continue Working

```bash
kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status
```

### Monitor Backblaze API Usage

1. Log into the Backblaze B2 dashboard
2. Navigate to "Caps and Alerts"
3. Check "Class C Transactions" (includes `s3_list_objects`)
4. **Expected**: API usage should drop from 145,000/day to ~300-1,000/day within 24-48 hours

## Backup Schedule (Unchanged)

| Type | Schedule | Retention |
|------|----------|-----------|
| **Daily** | 2:00 AM | 7 days |
| **Weekly** | 1:00 AM Sundays | 4 weeks |

Backups are triggered by `RecurringJob` resources, not by polling.

## Why Polling Isn't Critical

**Longhorn's backupstore polling is primarily for**:

- Disaster Recovery (DR) volumes that need continuous sync
- Detecting backups created outside the cluster

**We don't use DR volumes**, and all backups are created by recurring jobs within the cluster, so:

- ✅ Once-daily polling is more than sufficient
- ✅ Backups work independently of polling frequency
- ✅ Manual backups via the Longhorn UI still work immediately

## Troubleshooting

### If Pods Show 1/2 Ready

**Symptom**: Longhorn manager pods stuck at 1/2 Ready

**Cause**: A NetworkPolicy may have been accidentally applied

**Solution**:

```bash
# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system

# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# Wait 30 seconds
sleep 30

# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager
```

### If S3 API Calls Remain High

**Check that the poll interval is applied**:

```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml
```

**Restart Longhorn managers to pick up changes**:

```bash
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
```

### If Backups Fail

Backups should continue working normally, since they're triggered by recurring jobs rather than polling.
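For reference, a minimal sketch of what one of these `RecurringJob` resources might look like. The cron schedule and retention match the daily backup schedule documented above, but the resource name and group here are assumptions, not the actual manifests in this repository:

```yaml
# Hypothetical daily backup RecurringJob (sketch - actual names may differ)
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"   # 2:00 AM daily, per the backup schedule
  task: backup
  groups:
    - default
  retain: 7           # keep 7 days of backups
  concurrency: 1
```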
If issues occur:

```bash
# Check recurring jobs
kubectl get recurringjobs -n longhorn-system

# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup

# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://
```

## References

- [Longhorn Issue #1547](https://github.com/longhorn/longhorn/issues/1547) - Original excessive S3 calls issue
- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)
- Longhorn version: v1.9.0

## Files Modified

1. ✅ `config-map.yaml` - Updated `backupstore-poll-interval` to 86400
2. ✅ `kustomization.yaml` - Removed the `network-policy-s3-block.yaml` reference
3. ✅ `network-policy-s3-block.yaml` - Retained for reference (not applied)
4. ✅ `S3-API-SOLUTION-FINAL.md` - This document

## Lessons Learned

1. **NetworkPolicies are tricky**: Blocking external traffic can inadvertently block internal cluster communication
2. **Start simple**: Configuration-based solutions are often more reliable than complex automation
3. **Test thoroughly**: Always verify pods remain healthy after applying NetworkPolicies
4. **Understand the feature**: Longhorn's polling is for DR volumes, which we don't use
5. **24-hour polling is sufficient**: For non-DR use cases, frequent polling isn't necessary

## Success Metrics

Monitor these over the next week:

- ✅ Longhorn manager pods: 2/2 Ready
- ✅ Daily backups: completing successfully
- ✅ S3 API calls: <1,000/day (down from 145,000)
- ✅ Backblaze costs: within free tier
- ✅ No manual intervention required
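As a back-of-envelope sanity check on the reduction figures above, here is a sketch assuming 3 manager pods and one poll per interval per manager (the per-poll cost in `s3_list_objects` requests is an assumption and varies in practice):

```shell
#!/bin/sh
# Rough estimate of Longhorn backupstore polls per day across all managers.
managers=3
seconds_per_day=86400

# Before: effective poll interval of ~5 seconds
echo "before: $(( managers * seconds_per_day / 5 )) polls/day"     # 51840

# After: poll interval of 86400 seconds (24 hours)
echo "after:  $(( managers * seconds_per_day / 86400 )) polls/day" # 3
```

Each poll can issue several `s3_list_objects` requests, which is consistent with ~52,000 polls/day producing the observed 145,000+ daily Class C calls, and with a handful of daily polls landing well inside the free tier.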