# Longhorn S3 API Call Optimization - Implementation Summary
## Problem Statement

Longhorn was making 145,000+ Class C API calls per day to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) and incurred significant costs.
## Root Cause

Even with `backupstore-poll-interval` set to 0, Longhorn manager pods continuously poll the S3 backup target to check for new backups. With 3 manager pods (one per node) polling independently, this resulted in excessive API calls.

Reference: Longhorn GitHub Issue #1547
## Solution: NetworkPolicy-Based Access Control

Inspired by a community solution (see References), we implemented time-based network access control using Kubernetes NetworkPolicies and CronJobs.
## Architecture

```
┌─────────────────────────────────────────────────┐
│ Normal State (21 hours/day)                     │
│ NetworkPolicy BLOCKS S3 access                  │
│ → Longhorn polls fail at network layer          │
│ → S3 API calls: 0                               │
└─────────────────────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────┐
│ Backup Window (3 hours/day: 1-4 AM)             │
│ CronJob REMOVES NetworkPolicy at 12:55 AM       │
│ → S3 access enabled                             │
│ → Recurring backups run automatically           │
│ → CronJob RESTORES NetworkPolicy at 4:00 AM     │
│ → S3 API calls: ~5,000-10,000/day               │
└─────────────────────────────────────────────────┘
```
## Components

1. **NetworkPolicy (`longhorn-block-s3-access`) - Dynamically Managed**
   - Targets: `app=longhorn-manager` pods
   - Blocks: all egress except DNS and intra-cluster traffic
   - Effect: prevents S3 API calls at the network layer
   - Important: NOT managed by Flux - only the CronJobs control it (Flux manages the CronJobs/RBAC, but NOT the NetworkPolicy itself)

2. **CronJob: Enable S3 Access (`longhorn-enable-s3-access`)**
   - Schedule: `55 0 * * *` (12:55 AM daily)
   - Action: deletes the NetworkPolicy
   - Result: S3 access enabled 5 minutes before the earliest backup

3. **CronJob: Disable S3 Access (`longhorn-disable-s3-access`)**
   - Schedule: `0 4 * * *` (4:00 AM daily)
   - Action: re-creates the NetworkPolicy
   - Result: S3 access blocked after the 3-hour backup window

4. **RBAC Resources**
   - ServiceAccount: `longhorn-netpol-manager`
   - Role: permissions to manage NetworkPolicies
   - RoleBinding: binds the Role to the ServiceAccount
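For concreteness, here is a minimal sketch of what the first three components might look like. The resource names and schedules come from the list above; everything else (the kubectl image and tag, the intra-cluster CIDR, and the exact RBAC verbs) is an illustrative assumption, not the deployed manifest:

```yaml
# Sketch only - adjust image, CIDR, and RBAC details to your cluster.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: longhorn-block-s3-access
  namespace: longhorn-system
spec:
  podSelector:
    matchLabels:
      app: longhorn-manager
  policyTypes:
    - Egress
  egress:
    # Allow DNS lookups
    - ports:
        - protocol: UDP
          port: 53
    # Allow intra-cluster traffic (CIDR is an assumption)
    - to:
        - ipBlock:
            cidr: 10.0.0.0/8
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: longhorn-netpol-manager
  namespace: longhorn-system
rules:
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["get", "create", "delete"]
---
# ServiceAccount and RoleBinding for longhorn-netpol-manager omitted for brevity.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: longhorn-enable-s3-access
  namespace: longhorn-system
spec:
  schedule: "55 0 * * *"        # 12:55 AM daily
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: longhorn-netpol-manager
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.30   # assumed image
              command:
                - kubectl
                - delete
                - networkpolicy
                - longhorn-block-s3-access
                - -n
                - longhorn-system
                - --ignore-not-found
```

The disable CronJob (`longhorn-disable-s3-access`, schedule `0 4 * * *`) is the mirror image: it applies the NetworkPolicy manifest instead of deleting it.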
## Benefits
| Metric | Before | After | Improvement |
|---|---|---|---|
| Daily S3 API Calls | 145,000+ | 5,000-10,000 | 93% reduction |
| Cost Impact | Exceeds free tier | Within free tier | $X/month savings |
| Automation | Manual intervention | Fully automated | Zero manual work |
| Backup Reliability | Compromised | Maintained | No impact |
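The improvement figure in the table can be sanity-checked with a quick calculation (numbers taken from the table; nothing here is Longhorn-specific):

```python
def reduction_pct(before: int, after: int) -> float:
    """Percentage reduction from `before` to `after` daily API calls."""
    return round(100 * (before - after) / before, 1)

# Worst case from the table: 145,000 calls/day down to 10,000
print(reduction_pct(145_000, 10_000))  # → 93.1
```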
## Backup Schedule
| Type | Schedule | Retention | Window |
|---|---|---|---|
| Daily | 2:00 AM | 7 days | 12:55 AM - 4:00 AM |
| Weekly | 1:00 AM Sundays | 4 weeks | Same window |
## FluxCD Integration

**Critical Design Decision**: The NetworkPolicy is dynamically managed by CronJobs, NOT by Flux.
### Why This Matters

Flux continuously reconciles resources to match the Git repository state. If the NetworkPolicy were managed by Flux:

1. CronJob deletes the NetworkPolicy at 12:55 AM → Flux recreates it within minutes
2. S3 remains blocked during the backup window → Backups fail ❌
### How We Solved It

1. **NetworkPolicy is NOT in Git** - only the CronJobs and RBAC are in `network-policy-s3-block.yaml`
2. **CronJobs are managed by Flux** - Flux ensures they exist and run on schedule
3. **NetworkPolicy is created by the CronJob** - without Flux labels/ownership
4. **Flux ignores the NetworkPolicy** - it is not in Flux's inventory, so Flux won't touch it
### Verification

```shell
# Check Flux inventory (NetworkPolicy should NOT be listed)
kubectl get kustomization -n flux-system longhorn -o jsonpath='{.status.inventory.entries[*].id}' | grep -i network
# (Should return nothing)

# Check NetworkPolicy exists (managed by CronJobs)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# (Should exist)
```
## Deployment

### Files Modified/Created

- ✅ `network-policy-s3-block.yaml` - NEW: CronJobs and RBAC (NOT the NetworkPolicy itself)
- ✅ `kustomization.yaml` - Added the new file to resources
- ✅ `BACKUP-GUIDE.md` - Updated with the new solution documentation
- ✅ `S3-API-OPTIMIZATION.md` - NEW: this implementation summary
- ✅ `config-map.yaml` - Kept the backup target configured (no changes needed)
- ✅ `longhorn.yaml` - Reverted `backupstorePollInterval` (not needed)
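The `kustomization.yaml` change amounts to one added entry. The file names below come from the list above, but the exact layout and ordering of the other resources are assumptions:

```yaml
# manifests/infrastructure/longhorn/kustomization.yaml (excerpt, assumed layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - longhorn.yaml
  - config-map.yaml
  - network-policy-s3-block.yaml   # NEW: CronJobs + RBAC (not the NetworkPolicy)
```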
### Deployment Steps

1. Commit and push changes to your k8s-fleet branch
2. FluxCD will automatically apply the new CronJobs and RBAC (the NetworkPolicy itself is created by the disable CronJob, not by Flux)
3. Monitor for one backup cycle:

```shell
# Watch CronJobs
kubectl get cronjobs -n longhorn-system -w

# Check NetworkPolicy status
kubectl get networkpolicy -n longhorn-system

# Verify backups complete
kubectl get backups -n longhorn-system
```
## Verification Steps

### Day 1: Initial Deployment

```shell
# 1. Verify NetworkPolicy is active (should exist immediately)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access

# 2. Verify CronJobs are scheduled
kubectl get cronjobs -n longhorn-system | grep longhorn-.*-s3-access

# 3. Test: S3 access should be blocked. Run the test from a pod the policy
# actually targets (the app=longhorn-manager selector does not cover longhorn-ui)
kubectl exec -n longhorn-system ds/longhorn-manager -- curl -I https://<B2_ENDPOINT>
# Expected: connection timeout or network error
```
### Day 2: After First Backup Window

```shell
# 1. Check the enable CronJob ran (should show a completed job from 12:55 AM)
kubectl get jobs -n longhorn-system | grep enable-s3-access

# 2. Verify backups completed (check after 4:00 AM)
kubectl get backups -n longhorn-system
# Should see new backups with recent timestamps

# 3. Confirm NetworkPolicy was re-applied (after 4:00 AM)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# Should exist again

# 4. Check CronJob logs
kubectl logs -n longhorn-system job/longhorn-enable-s3-access-<timestamp>
kubectl logs -n longhorn-system job/longhorn-disable-s3-access-<timestamp>
```
### Week 1: Monitor S3 API Usage

On the Backblaze B2 dashboard:

- Daily Class C transactions should drop from 145,000+ to 5,000-10,000
- Calls should occur only during the 1-4 AM window
## Manual Backup Outside Window

If you need to create a backup outside the scheduled window:

```shell
# 1. Temporarily remove the NetworkPolicy
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# 2. Create a backup via the Longhorn UI, or:
kubectl create -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  name: manual-backup-$(date +%s)
  namespace: longhorn-system
spec:
  snapshotName: <snapshot-name>
  labels:
    backup-type: manual
EOF

# 3. Wait for the backup to complete (kubectl does not glob resource
# names, so watch the full list)
kubectl get backups -n longhorn-system -w

# 4. Restore the NetworkPolicy by triggering the disable CronJob
# (network-policy-s3-block.yaml contains only the CronJobs/RBAC,
# not the NetworkPolicy itself)
kubectl create job -n longhorn-system manual-disable --from=cronjob/longhorn-disable-s3-access
```

Or simply wait until the next automatic re-application at 4:00 AM.
## Troubleshooting

### NetworkPolicy Not Blocking S3

**Symptom**: S3 calls continue despite the NetworkPolicy being active

**Check**:

```shell
# Verify the NetworkPolicy is applied
kubectl describe networkpolicy -n longhorn-system longhorn-block-s3-access

# Check that the CNI supports NetworkPolicies (Cilium does)
kubectl get pods -n kube-system | grep cilium
```
### Backups Failing

**Symptom**: Backups fail during the scheduled window

**Check**:

```shell
# Verify the NetworkPolicy was removed during the backup window
kubectl get networkpolicy -n longhorn-system
# Should NOT exist between 12:55 AM - 4:00 AM

# Check the enable-s3-access CronJob ran
kubectl get jobs -n longhorn-system | grep enable

# Check Longhorn manager logs
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100
```
### CronJobs Not Running

**Symptom**: CronJobs never execute

**Check**:

```shell
# Verify CronJobs exist and are scheduled
kubectl get cronjobs -n longhorn-system -o wide

# Check events
kubectl get events -n longhorn-system --sort-by='.lastTimestamp' | grep CronJob

# Manually trigger a job
kubectl create job -n longhorn-system test-enable --from=cronjob/longhorn-enable-s3-access
```
## Future Enhancements

- **Adjust Window Size**: If backups consistently complete in well under 3 hours, shrink the window to 2 hours (change the disable CronJob schedule to `0 3 * * *`)
- **Alerting**: Add Prometheus alerts for:
  - Backup failures during the window
  - CronJob execution failures
  - NetworkPolicy re-creation failures
- **Metrics**: Track actual S3 API call counts via the Backblaze B2 API and alert if a threshold is exceeded
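The metrics idea could start as simple as a threshold check. How the daily Class C count is obtained (B2 dashboard export or API) is left open, so the function below is a hypothetical sketch, with the threshold taken from this document's post-fix figures:

```python
# Hypothetical alert check; the 10,000 ceiling comes from this document's
# expected post-optimization range (5,000-10,000 calls/day).
EXPECTED_DAILY_MAX = 10_000

def should_alert(daily_class_c_calls: int) -> bool:
    """True if the count suggests Longhorn is polling outside the window."""
    return daily_class_c_calls > EXPECTED_DAILY_MAX

print(should_alert(145_000))  # → True  (pre-fix traffic would trip the alert)
print(should_alert(6_000))    # → False (normal post-fix traffic)
```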
## References

- Longhorn Issue #1547 - Excessive S3 Calls
- Community NetworkPolicy Solution
- Longhorn Backup Target Documentation
- Kubernetes NetworkPolicy Documentation
## Success Metrics
After 1 week of operation, you should observe:
- ✅ S3 API calls reduced by 85-93%
- ✅ Backblaze costs within free tier
- ✅ All scheduled backups completing successfully
- ✅ Zero manual intervention required
- ✅ Longhorn polls fail silently (network errors) outside backup window