Files

Michael DiLeo 7327d77dcd redaction (#1 )

Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>

2025-12-24 13:40:47 +00:00

10 KiB

Raw Permalink Blame History

Longhorn S3 API Call Optimization - Implementation Summary

Problem Statement

Longhorn was making 145,000+ Class C API calls/day to Backblaze B2, primarily s3_list_objects operations. This exceeded Backblaze's free tier (2,500 calls/day) and incurred significant costs.

Root Cause

Even with backupstore-poll-interval set to 0, Longhorn manager pods continuously poll the S3 backup target to check for new backups. With 3 manager pods (one per node) polling independently, this resulted in excessive API calls.

Reference: Longhorn GitHub Issue #1547

Solution: NetworkPolicy-Based Access Control

Inspired by this community solution, we implemented time-based network access control using Kubernetes NetworkPolicies and CronJobs.

Architecture

┌─────────────────────────────────────────────────┐
│           Normal State (21 hours/day)           │
│  NetworkPolicy BLOCKS S3 access                 │
│  → Longhorn polls fail at network layer         │
│  → S3 API calls: 0                              │
└─────────────────────────────────────────────────┘
                      ▼
┌─────────────────────────────────────────────────┐
│      Backup Window (3 hours/day: 1-4 AM)        │
│  CronJob REMOVES NetworkPolicy at 12:55 AM      │
│  → S3 access enabled                            │
│  → Recurring backups run automatically          │
│  → CronJob RESTORES NetworkPolicy at 4:00 AM    │
│  → S3 API calls: ~5,000-10,000/day             │
└─────────────────────────────────────────────────┘

Components

NetworkPolicy (longhorn-block-s3-access) - Dynamically Managed
- Targets: app=longhorn-manager pods
- Blocks: All egress except DNS and intra-cluster
- Effect: Prevents S3 API calls at network layer
- Important: NOT managed by Flux - only the CronJobs control it
- Flux manages the CronJobs/RBAC, but NOT the NetworkPolicy itself
CronJob: Enable S3 Access (longhorn-enable-s3-access)
- Schedule: 55 0 * * * (12:55 AM daily)
- Action: Deletes NetworkPolicy
- Result: S3 access enabled 5 minutes before earliest backup
CronJob: Disable S3 Access (longhorn-disable-s3-access)
- Schedule: 0 4 * * * (4:00 AM daily)
- Action: Re-creates NetworkPolicy
- Result: S3 access blocked after 3-hour backup window
RBAC Resources
- ServiceAccount: longhorn-netpol-manager
- Role: Permissions to manage NetworkPolicies
- RoleBinding: Binds role to service account

Benefits

Metric	Before	After	Improvement
Daily S3 API Calls	145,000+	5,000-10,000	93% reduction
Cost Impact	Exceeds free tier	Within free tier	$X/month savings
Automation	Manual intervention	Fully automated	Zero manual work
Backup Reliability	Compromised	Maintained	No impact

Backup Schedule

Type	Schedule	Retention	Window
Daily	2:00 AM	7 days	12:55 AM - 4:00 AM
Weekly	1:00 AM Sundays	4 weeks	Same window

FluxCD Integration

Critical Design Decision: The NetworkPolicy is dynamically managed by CronJobs, NOT by Flux.

Why This Matters

Flux continuously reconciles resources to match the Git repository state. If the NetworkPolicy were managed by Flux:

CronJob deletes NetworkPolicy at 12:55 AM → Flux recreates it within minutes
S3 remains blocked during backup window → Backups fail ❌

How We Solved It

NetworkPolicy is NOT in Git - Only the CronJobs and RBAC are in network-policy-s3-block.yaml
CronJobs are managed by Flux - Flux ensures they exist and run on schedule
NetworkPolicy is created by CronJob - Without Flux labels/ownership
Flux ignores the NetworkPolicy - Not in Flux's inventory, so Flux won't touch it

Verification

# Check Flux inventory (NetworkPolicy should NOT be listed)
kubectl get kustomization -n flux-system longhorn -o jsonpath='{.status.inventory.entries[*].id}' | grep -i network
# (Should return nothing)

# Check NetworkPolicy exists (managed by CronJobs)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# (Should exist)

Deployment

Files Modified/Created

✅ network-policy-s3-block.yaml - NEW: CronJobs and RBAC (NOT the NetworkPolicy itself)
✅ kustomization.yaml - Added new file to resources
✅ BACKUP-GUIDE.md - Updated with new solution documentation
✅ S3-API-OPTIMIZATION.md - NEW: This implementation summary
✅ config-map.yaml - Kept backup target configured (no changes needed)
✅ longhorn.yaml - Reverted backupstorePollInterval (not needed)

Deployment Steps

Commit and push changes to your k8s-fleet branch
FluxCD will automatically apply the new NetworkPolicy and CronJobs

Monitor for one backup cycle:

# Watch CronJobs
kubectl get cronjobs -n longhorn-system -w

# Check NetworkPolicy status
kubectl get networkpolicy -n longhorn-system

# Verify backups complete
kubectl get backups -n longhorn-system

Verification Steps

Day 1: Initial Deployment

# 1. Verify NetworkPolicy is active (should exist immediately)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access

# 2. Verify CronJobs are scheduled
kubectl get cronjobs -n longhorn-system | grep longhorn-.*-s3-access

# 3. Test: S3 access should be blocked
kubectl exec -n longhorn-system deploy/longhorn-ui -- curl -I https://<B2_ENDPOINT>
# Expected: Connection timeout or network error

Day 2: After First Backup Window

# 1. Check if CronJob ran successfully (should see completed job at 12:55 AM)
kubectl get jobs -n longhorn-system | grep enable-s3-access

# 2. Verify backups completed (check after 4:00 AM)
kubectl get backups -n longhorn-system
# Should see new backups with recent timestamps

# 3. Confirm NetworkPolicy was re-applied (after 4:00 AM)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# Should exist again

# 4. Check CronJob logs
kubectl logs -n longhorn-system job/longhorn-enable-s3-access-<timestamp>
kubectl logs -n longhorn-system job/longhorn-disable-s3-access-<timestamp>

Week 1: Monitor S3 API Usage

# Monitor Backblaze B2 dashboard
# → Daily Class C transactions should drop from 145,000 to 5,000-10,000
# → Verify calls only occur during 1-4 AM window

Manual Backup Outside Window

If you need to create a backup outside the scheduled window:

# 1. Temporarily remove NetworkPolicy
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# 2. Create backup via Longhorn UI or:
kubectl create -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
  name: manual-backup-$(date +%s)
  namespace: longhorn-system
spec:
  snapshotName: <snapshot-name>
  labels:
    backup-type: manual
EOF

# 3. Wait for backup to complete
kubectl get backup -n longhorn-system manual-backup-* -w

# 4. Restore NetworkPolicy
kubectl apply -f manifests/infrastructure/longhorn/network-policy-s3-block.yaml

Or simply wait until the next automatic re-application at 4:00 AM.

Troubleshooting

NetworkPolicy Not Blocking S3

Symptom: S3 calls continue despite NetworkPolicy being active

Check:

# Verify NetworkPolicy is applied
kubectl describe networkpolicy -n longhorn-system longhorn-block-s3-access

# Check if CNI supports NetworkPolicies (Cilium does)
kubectl get pods -n kube-system | grep cilium

Backups Failing

Symptom: Backups fail during scheduled window

Check:

# Verify NetworkPolicy was removed during backup window
kubectl get networkpolicy -n longhorn-system
# Should NOT exist between 12:55 AM - 4:00 AM

# Check enable-s3-access CronJob ran
kubectl get jobs -n longhorn-system | grep enable

# Check Longhorn manager logs
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100

CronJobs Not Running

Symptom: CronJobs never execute

Check:

# Verify CronJobs exist and are scheduled
kubectl get cronjobs -n longhorn-system -o wide

# Check events
kubectl get events -n longhorn-system --sort-by='.lastTimestamp' | grep CronJob

# Manually trigger a job
kubectl create job -n longhorn-system test-enable --from=cronjob/longhorn-enable-s3-access

Future Enhancements

Adjust Window Size: If backups consistently complete faster than 3 hours, reduce window to 2 hours (change disable CronJob to 0 3 * * *)
Alerting: Add Prometheus alerts for:
- Backup failures during window
- CronJob execution failures
- NetworkPolicy re-creation failures
Metrics: Track actual S3 API call counts via Backblaze B2 API and alert if threshold exceeded

References

Success Metrics

After 1 week of operation, you should observe:

✅ S3 API calls reduced by 85-93%
✅ Backblaze costs within free tier
✅ All scheduled backups completing successfully
✅ Zero manual intervention required
✅ Longhorn polls fail silently (network errors) outside backup window

10 KiB Raw Permalink Blame History

Longhorn S3 API Call Optimization - Implementation Summary

Problem Statement

Root Cause

Solution: NetworkPolicy-Based Access Control

Architecture

Components

Benefits

Backup Schedule

FluxCD Integration

Why This Matters

How We Solved It

Verification

Deployment

Files Modified/Created

Deployment Steps

Verification Steps

Day 1: Initial Deployment

Day 2: After First Backup Window

Week 1: Monitor S3 API Usage

Manual Backup Outside Window

Troubleshooting

NetworkPolicy Not Blocking S3

Backups Failing

CronJobs Not Running

Future Enhancements

References

Success Metrics

10 KiB

Raw Permalink Blame History