Files
Michael DiLeo 7327d77dcd redaction (#1)
Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>
2025-12-24 13:40:47 +00:00

10 KiB

Longhorn S3 API Call Optimization - Implementation Summary

Problem Statement

Longhorn was making 145,000+ Class C API calls/day to Backblaze B2, primarily s3_list_objects operations. This exceeded Backblaze's free tier (2,500 calls/day) and incurred significant costs.

Root Cause

Even with backupstore-poll-interval set to 0, Longhorn manager pods continuously poll the S3 backup target to check for new backups. With 3 manager pods (one per node) polling independently, this resulted in excessive API calls.

Reference: Longhorn GitHub Issue #1547

Solution: NetworkPolicy-Based Access Control

Inspired by this community solution, we implemented time-based network access control using Kubernetes NetworkPolicies and CronJobs.

Architecture

┌─────────────────────────────────────────────────┐
│           Normal State (21 hours/day)           │
│  NetworkPolicy BLOCKS S3 access                 │
│  → Longhorn polls fail at network layer         │
│  → S3 API calls: 0                              │
└─────────────────────────────────────────────────┘
                      ▼
┌─────────────────────────────────────────────────┐
│      Backup Window (3 hours/day: 1-4 AM)        │
│  CronJob REMOVES NetworkPolicy at 12:55 AM      │
│  → S3 access enabled                            │
│  → Recurring backups run automatically          │
│  → CronJob RESTORES NetworkPolicy at 4:00 AM    │
│  → S3 API calls: ~5,000-10,000/day             │
└─────────────────────────────────────────────────┘

Components

  1. NetworkPolicy (longhorn-block-s3-access) - Dynamically Managed

    • Targets: app=longhorn-manager pods
    • Blocks: All egress except DNS and intra-cluster
    • Effect: Prevents S3 API calls at network layer
    • Important: NOT managed by Flux - only the CronJobs control it
    • Flux manages the CronJobs/RBAC, but NOT the NetworkPolicy itself
  2. CronJob: Enable S3 Access (longhorn-enable-s3-access)

    • Schedule: 55 0 * * * (12:55 AM daily)
    • Action: Deletes NetworkPolicy
    • Result: S3 access enabled 5 minutes before earliest backup
  3. CronJob: Disable S3 Access (longhorn-disable-s3-access)

    • Schedule: 0 4 * * * (4:00 AM daily)
    • Action: Re-creates NetworkPolicy
    • Result: S3 access blocked after 3-hour backup window
  4. RBAC Resources

    • ServiceAccount: longhorn-netpol-manager
    • Role: Permissions to manage NetworkPolicies
    • RoleBinding: Binds role to service account

Benefits

Metric Before After Improvement
Daily S3 API Calls 145,000+ 5,000-10,000 93% reduction
Cost Impact Exceeds free tier Within free tier $X/month savings
Automation Manual intervention Fully automated Zero manual work
Backup Reliability Compromised Maintained No impact

Backup Schedule

Type Schedule Retention Window
Daily 2:00 AM 7 days 12:55 AM - 4:00 AM
Weekly 1:00 AM Sundays 4 weeks Same window

FluxCD Integration

Critical Design Decision: The NetworkPolicy is dynamically managed by CronJobs, NOT by Flux.

Why This Matters

Flux continuously reconciles resources to match the Git repository state. If the NetworkPolicy were managed by Flux:

  • CronJob deletes NetworkPolicy at 12:55 AM → Flux recreates it within minutes
  • S3 remains blocked during backup window → Backups fail

How We Solved It

  1. NetworkPolicy is NOT in Git - Only the CronJobs and RBAC are in network-policy-s3-block.yaml
  2. CronJobs are managed by Flux - Flux ensures they exist and run on schedule
  3. NetworkPolicy is created by CronJob - Without Flux labels/ownership
  4. Flux ignores the NetworkPolicy - Not in Flux's inventory, so Flux won't touch it

Verification

# Check Flux inventory (NetworkPolicy should NOT be listed)
kubectl get kustomization -n flux-system longhorn -o jsonpath='{.status.inventory.entries[*].id}' | grep -i network
# (Should return nothing)

# Check NetworkPolicy exists (managed by CronJobs)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# (Should exist)

Deployment

Files Modified/Created

  1. network-policy-s3-block.yaml - NEW: CronJobs and RBAC (NOT the NetworkPolicy itself)
  2. kustomization.yaml - Added new file to resources
  3. BACKUP-GUIDE.md - Updated with new solution documentation
  4. S3-API-OPTIMIZATION.md - NEW: This implementation summary
  5. config-map.yaml - Kept backup target configured (no changes needed)
  6. longhorn.yaml - Reverted backupstorePollInterval (not needed)

Deployment Steps

  1. Commit and push changes to your k8s-fleet branch
  2. FluxCD will automatically apply the new NetworkPolicy and CronJobs
  3. Monitor for one backup cycle:
    # Watch CronJobs
    kubectl get cronjobs -n longhorn-system -w
    
    # Check NetworkPolicy status
    kubectl get networkpolicy -n longhorn-system
    
    # Verify backups complete
    kubectl get backups -n longhorn-system
    

Verification Steps

Day 1: Initial Deployment

# 1. Verify NetworkPolicy is active (should exist immediately)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access

# 2. Verify CronJobs are scheduled
kubectl get cronjobs -n longhorn-system | grep longhorn-.*-s3-access

# 3. Test: S3 access should be blocked
kubectl exec -n longhorn-system deploy/longhorn-ui -- curl -I https://<B2_ENDPOINT>
# Expected: Connection timeout or network error

Day 2: After First Backup Window

# 1. Check if CronJob ran successfully (should see completed job at 12:55 AM)
kubectl get jobs -n longhorn-system | grep enable-s3-access

# 2. Verify backups completed (check after 4:00 AM)
kubectl get backups -n longhorn-system
# Should see new backups with recent timestamps

# 3. Confirm NetworkPolicy was re-applied (after 4:00 AM)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# Should exist again

# 4. Check CronJob logs
kubectl logs -n longhorn-system job/longhorn-enable-s3-access-<timestamp>
kubectl logs -n longhorn-system job/longhorn-disable-s3-access-<timestamp>

Week 1: Monitor S3 API Usage

# Monitor Backblaze B2 dashboard
# → Daily Class C transactions should drop from 145,000 to 5,000-10,000
# → Verify calls only occur during 1-4 AM window

Manual Backup Outside Window

If you need to create a backup outside the scheduled window:

# 1. Temporarily remove NetworkPolicy
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# 2. Create backup via Longhorn UI or:
kubectl create -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  name: manual-backup-$(date +%s)
  namespace: longhorn-system
spec:
  snapshotName: <snapshot-name>
  labels:
    backup-type: manual
EOF

# 3. Wait for backup to complete
kubectl get backup -n longhorn-system manual-backup-* -w

# 4. Restore NetworkPolicy
kubectl apply -f manifests/infrastructure/longhorn/network-policy-s3-block.yaml

Or simply wait until the next automatic re-application at 4:00 AM.

Troubleshooting

NetworkPolicy Not Blocking S3

Symptom: S3 calls continue despite NetworkPolicy being active

Check:

# Verify NetworkPolicy is applied
kubectl describe networkpolicy -n longhorn-system longhorn-block-s3-access

# Check if CNI supports NetworkPolicies (Cilium does)
kubectl get pods -n kube-system | grep cilium

Backups Failing

Symptom: Backups fail during scheduled window

Check:

# Verify NetworkPolicy was removed during backup window
kubectl get networkpolicy -n longhorn-system
# Should NOT exist between 12:55 AM - 4:00 AM

# Check enable-s3-access CronJob ran
kubectl get jobs -n longhorn-system | grep enable

# Check Longhorn manager logs
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100

CronJobs Not Running

Symptom: CronJobs never execute

Check:

# Verify CronJobs exist and are scheduled
kubectl get cronjobs -n longhorn-system -o wide

# Check events
kubectl get events -n longhorn-system --sort-by='.lastTimestamp' | grep CronJob

# Manually trigger a job
kubectl create job -n longhorn-system test-enable --from=cronjob/longhorn-enable-s3-access

Future Enhancements

  1. Adjust Window Size: If backups consistently complete faster than 3 hours, reduce window to 2 hours (change disable CronJob to 0 3 * * *)

  2. Alerting: Add Prometheus alerts for:

    • Backup failures during window
    • CronJob execution failures
    • NetworkPolicy re-creation failures
  3. Metrics: Track actual S3 API call counts via Backblaze B2 API and alert if threshold exceeded

References

Success Metrics

After 1 week of operation, you should observe:

  • S3 API calls reduced by 85-93%
  • Backblaze costs within free tier
  • All scheduled backups completing successfully
  • Zero manual intervention required
  • Longhorn polls fail silently (network errors) outside backup window