Files

Michael DiLeo 7327d77dcd redaction (#1 )

Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>

2025-12-24 13:40:47 +00:00

7.1 KiB

Raw Permalink Blame History

Longhorn S3 API Call Reduction - Final Solution

Problem Summary

Longhorn was making 145,000+ Class C API calls/day to Backblaze B2, primarily s3_list_objects operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.

Root Cause

Longhorn's backupstore-poll-interval setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.

Solution History

Attempt 1: NetworkPolicy-Based Access Control ❌

Approach: Use NetworkPolicies dynamically managed by CronJobs to block S3 access outside backup windows (12:55 AM - 4:00 AM).

Why It Failed:

NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
Longhorn manager pods couldn't perform leader election or webhook operations
Pods entered 1/2 Ready state with errors: error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout
Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
Cilium/NetworkPolicy interaction complexity made it unreliable

Files Created (kept for reference):

network-policy-s3-block.yaml - CronJobs and NetworkPolicy definitions
Removed from kustomization.yaml but retained in repository

Final Solution: Increased Poll Interval ✅

Implementation

Change: Set backupstore-poll-interval to 86400 seconds (24 hours) instead of 0.

Location: manifests/infrastructure/longhorn/config-map.yaml

data:
  default-resource.yaml: |-
    "backup-target": "s3://<BUCKET_NAME>@<B2_ENDPOINT>/longhorn-backup"
    "backup-target-credential-secret": "backblaze-credentials"
    "backupstore-poll-interval": "86400"  # 24 hours
    "virtual-hosted-style": "true"

Why This Works

Dramatic Reduction: Polling happens once per day instead of continuously
No Breakage: Kubernetes API, webhooks, and leader election work normally
Simple: No complex NetworkPolicies or CronJobs to manage
Reliable: Well-tested Longhorn configuration option
Sufficient: Backups don't require frequent polling since we use scheduled recurring jobs

Expected Results

Metric	Before	After	Improvement
Poll Frequency	Every ~5 seconds	Every 24 hours	99.99% reduction
Daily S3 API Calls	145,000+	~300-1,000	99% reduction 📉
Backblaze Costs	Exceeds free tier	Within free tier	✅
System Stability	Affected by NetworkPolicy	Stable	✅

Current Status

✅ Applied: ConfigMap updated with backupstore-poll-interval: 86400
✅ Verified: Longhorn manager pods are 2/2 Ready
✅ Backups: Continue working normally via recurring jobs
✅ Monitoring: Backblaze API usage should drop to <1,000 calls/day

Monitoring

Check Longhorn Manager Health

kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods

Check Poll Interval Setting

kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"

Check Backups Continue Working

kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status

Monitor Backblaze API Usage

Log into Backblaze B2 dashboard
Navigate to "Caps and Alerts"
Check "Class C Transactions" (includes s3_list_objects)
Expected: Should drop from 145,000/day to ~300-1,000/day within 24-48 hours

Backup Schedule (Unchanged)

Type	Schedule	Retention
Daily	2:00 AM	7 days
Weekly	1:00 AM Sundays	4 weeks

Backups are triggered by RecurringJob resources, not by polling.

Why Polling Isn't Critical

Longhorn's backupstore polling is primarily for:

Disaster Recovery (DR) volumes that need continuous sync
Detecting backups created outside the cluster

We don't use DR volumes, and all backups are created by recurring jobs within the cluster, so:

✅ Once-daily polling is more than sufficient
✅ Backups work independently of polling frequency
✅ Manual backups via Longhorn UI still work immediately

Troubleshooting

If Pods Show 1/2 Ready

Symptom: Longhorn manager pods stuck at 1/2 Ready

Cause: NetworkPolicy may have been accidentally applied

Solution:

# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system

# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# Wait 30 seconds
sleep 30

# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager

If S3 API Calls Remain High

Check poll interval is applied:

kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml

Restart Longhorn managers to pick up changes:

kubectl rollout restart daemonset -n longhorn-system longhorn-manager

If Backups Fail

Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:

# Check recurring jobs
kubectl get recurringjobs -n longhorn-system

# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup

# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://<B2_ENDPOINT>

References

Longhorn Issue #1547 - Original excessive S3 calls issue
Longhorn Backup Target Documentation
Longhorn version: v1.9.0

Files Modified

✅ config-map.yaml - Updated backupstore-poll-interval to 86400
✅ kustomization.yaml - Removed network-policy-s3-block.yaml reference
✅ network-policy-s3-block.yaml - Retained for reference (not applied)
✅ S3-API-SOLUTION-FINAL.md - This document

Lessons Learned

NetworkPolicies are tricky: Blocking external traffic can inadvertently block internal cluster communication
Start simple: Configuration-based solutions are often more reliable than complex automation
Test thoroughly: Always verify pods remain healthy after applying NetworkPolicies
Understand the feature: Longhorn's polling is for DR volumes, which we don't use
24-hour polling is sufficient: For non-DR use cases, frequent polling isn't necessary

Success Metrics

Monitor these over the next week:

✅ Longhorn manager pods: 2/2 Ready
✅ Daily backups: Completing successfully
✅ S3 API calls: <1,000/day (down from 145,000)
✅ Backblaze costs: Within free tier
✅ No manual intervention required

7.1 KiB Raw Permalink Blame History