Add the redacted source file for demo purposes Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1 Co-authored-by: Michael DiLeo <michael_dileo@proton.me> Co-committed-by: Michael DiLeo <michael_dileo@proton.me>
7.1 KiB
Longhorn S3 API Call Reduction - Final Solution
Problem Summary
Longhorn was making 145,000+ Class C API calls/day to Backblaze B2, primarily s3_list_objects operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.
Root Cause
Longhorn's backupstore-poll-interval setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.
Solution History
Attempt 1: NetworkPolicy-Based Access Control ❌
Approach: Use NetworkPolicies dynamically managed by CronJobs to block S3 access outside backup windows (12:55 AM - 4:00 AM).
Why It Failed:
- NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
- Longhorn manager pods couldn't perform leader election or webhook operations
- Pods entered 1/2 Ready state with errors:
error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout - Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
- Cilium/NetworkPolicy interaction complexity made it unreliable
Files Created (kept for reference):
network-policy-s3-block.yaml- CronJobs and NetworkPolicy definitions- Removed from
kustomization.yamlbut retained in repository
Final Solution: Increased Poll Interval ✅
Implementation
Change: Set backupstore-poll-interval to 86400 seconds (24 hours) instead of 0.
Location: manifests/infrastructure/longhorn/config-map.yaml
data:
default-resource.yaml: |-
"backup-target": "s3://<BUCKET_NAME>@<B2_ENDPOINT>/longhorn-backup"
"backup-target-credential-secret": "backblaze-credentials"
"backupstore-poll-interval": "86400" # 24 hours
"virtual-hosted-style": "true"
Why This Works
- Dramatic Reduction: Polling happens once per day instead of continuously
- No Breakage: Kubernetes API, webhooks, and leader election work normally
- Simple: No complex NetworkPolicies or CronJobs to manage
- Reliable: Well-tested Longhorn configuration option
- Sufficient: Backups don't require frequent polling since we use scheduled recurring jobs
Expected Results
| Metric | Before | After | Improvement |
|---|---|---|---|
| Poll Frequency | Every ~5 seconds | Every 24 hours | 99.99% reduction |
| Daily S3 API Calls | 145,000+ | ~300-1,000 | 99% reduction 📉 |
| Backblaze Costs | Exceeds free tier | Within free tier | ✅ |
| System Stability | Affected by NetworkPolicy | Stable | ✅ |
Current Status
✅ Applied: ConfigMap updated with backupstore-poll-interval: 86400
✅ Verified: Longhorn manager pods are 2/2 Ready
✅ Backups: Continue working normally via recurring jobs
✅ Monitoring: Backblaze API usage should drop to <1,000 calls/day
Monitoring
Check Longhorn Manager Health
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods
Check Poll Interval Setting
kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"
Check Backups Continue Working
kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status
Monitor Backblaze API Usage
- Log into Backblaze B2 dashboard
- Navigate to "Caps and Alerts"
- Check "Class C Transactions" (includes
s3_list_objects) - Expected: Should drop from 145,000/day to ~300-1,000/day within 24-48 hours
Backup Schedule (Unchanged)
| Type | Schedule | Retention |
|---|---|---|
| Daily | 2:00 AM | 7 days |
| Weekly | 1:00 AM Sundays | 4 weeks |
Backups are triggered by RecurringJob resources, not by polling.
Why Polling Isn't Critical
Longhorn's backupstore polling is primarily for:
- Disaster Recovery (DR) volumes that need continuous sync
- Detecting backups created outside the cluster
We don't use DR volumes, and all backups are created by recurring jobs within the cluster, so:
- ✅ Once-daily polling is more than sufficient
- ✅ Backups work independently of polling frequency
- ✅ Manual backups via Longhorn UI still work immediately
Troubleshooting
If Pods Show 1/2 Ready
Symptom: Longhorn manager pods stuck at 1/2 Ready
Cause: NetworkPolicy may have been accidentally applied
Solution:
# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system
# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access
# Wait 30 seconds
sleep 30
# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager
If S3 API Calls Remain High
Check poll interval is applied:
kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml
Restart Longhorn managers to pick up changes:
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
If Backups Fail
Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:
# Check recurring jobs
kubectl get recurringjobs -n longhorn-system
# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup
# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://<B2_ENDPOINT>
References
- Longhorn Issue #1547 - Original excessive S3 calls issue
- Longhorn Backup Target Documentation
- Longhorn version: v1.9.0
Files Modified
- ✅
config-map.yaml- Updatedbackupstore-poll-intervalto 86400 - ✅
kustomization.yaml- Removed network-policy-s3-block.yaml reference - ✅
network-policy-s3-block.yaml- Retained for reference (not applied) - ✅
S3-API-SOLUTION-FINAL.md- This document
Lessons Learned
- NetworkPolicies are tricky: Blocking external traffic can inadvertently block internal cluster communication
- Start simple: Configuration-based solutions are often more reliable than complex automation
- Test thoroughly: Always verify pods remain healthy after applying NetworkPolicies
- Understand the feature: Longhorn's polling is for DR volumes, which we don't use
- 24-hour polling is sufficient: For non-DR use cases, frequent polling isn't necessary
Success Metrics
Monitor these over the next week:
- ✅ Longhorn manager pods: 2/2 Ready
- ✅ Daily backups: Completing successfully
- ✅ S3 API calls: <1,000/day (down from 145,000)
- ✅ Backblaze costs: Within free tier
- ✅ No manual intervention required