
Longhorn S3 API Call Reduction - Final Solution

Problem Summary

Longhorn was making 145,000+ Class C API calls/day to Backblaze B2, primarily s3_list_objects operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.

Root Cause

Longhorn's backupstore-poll-interval setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With three manager pods each polling on a short interval, the aggregate call volume became excessive.
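As a rough sanity check on the numbers (the per-poll request count is an assumption, not something Longhorn documents precisely): polling every ~5 seconds is 17,280 polls per manager per day, or roughly 51,800 polls across 3 managers; at 2-3 s3_list_objects requests per poll, that lands in the observed 100,000-150,000 calls/day range. The same arithmetic at a 24-hour interval leaves only a handful of polling calls per day.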

Solution History

Attempt 1: NetworkPolicy-Based Access Control

Approach: Use NetworkPolicies dynamically managed by CronJobs to block S3 access outside backup windows (12:55 AM - 4:00 AM).

Why It Failed:

  • NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
  • Longhorn manager pods couldn't perform leader election or webhook operations
  • Pods entered a 1/2 Ready state with errors such as `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout` (a reachability check for this failure mode is sketched after this list)
  • Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
  • Cilium/NetworkPolicy interaction complexity made it unreliable
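If an egress policy like this is ever revisited, the failure mode can be checked directly from a manager pod. This is a minimal sketch reusing the curl-based exec check from the troubleshooting section below; the 10.96.0.1:443 address is the cluster's kubernetes service IP taken from the error message above, so adjust it if your service CIDR differs.

# Verify the Kubernetes API is still reachable from a manager pod while the policy is active
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- \
  curl -sk -o /dev/null -w '%{http_code}\n' --max-time 5 https://10.96.0.1:443/healthz
# Any HTTP status code (even 401/403) means the API is reachable; a timeout reproduces the 1/2 Ready failure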

Files Created (kept for reference):

  • network-policy-s3-block.yaml - CronJobs and NetworkPolicy definitions; removed from kustomization.yaml but retained in the repository

Final Solution: Increased Poll Interval

Implementation

Change: Set backupstore-poll-interval to 86400 seconds (24 hours) instead of 0.

Location: manifests/infrastructure/longhorn/config-map.yaml

data:
  default-resource.yaml: |-
    "backup-target": "s3://longhorn-keyboard-vagabond@eu-central-003.backblazeb2.com/longhorn-backup"
    "backup-target-credential-secret": "backblaze-credentials"
    "backupstore-poll-interval": "86400"  # 24 hours
    "virtual-hosted-style": "true"

Why This Works

  1. Dramatic Reduction: Polling happens once per day instead of continuously
  2. No Breakage: Kubernetes API, webhooks, and leader election work normally
  3. Simple: No complex NetworkPolicies or CronJobs to manage
  4. Reliable: Well-tested Longhorn configuration option
  5. Sufficient: Backups don't require frequent polling since we use scheduled recurring jobs

Expected Results

| Metric | Before | After | Improvement |
|---|---|---|---|
| Poll Frequency | Every ~5 seconds | Every 24 hours | 99.99% reduction |
| Daily S3 API Calls | 145,000+ | ~300-1,000 | ~99% reduction 📉 |
| Backblaze Costs | Exceeds free tier | Within free tier | |
| System Stability | Affected by NetworkPolicy | Stable | |

Current Status

Applied: ConfigMap updated with backupstore-poll-interval: 86400
Verified: Longhorn manager pods are 2/2 Ready
Backups: Continue working normally via recurring jobs
Monitoring: Backblaze API usage should drop to <1,000 calls/day

Monitoring

Check Longhorn Manager Health

kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods

Check Poll Interval Setting

kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"

Check Backups Continue Working

kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status

Monitor Backblaze API Usage

  1. Log into Backblaze B2 dashboard
  2. Navigate to "Caps and Alerts"
  3. Check "Class C Transactions" (includes s3_list_objects)
  4. Expected: Should drop from 145,000/day to ~300-1,000/day within 24-48 hours

Backup Schedule (Unchanged)

| Type | Schedule | Retention |
|---|---|---|
| Daily | 2:00 AM | 7 days |
| Weekly | 1:00 AM Sundays | 4 weeks |

Backups are triggered by RecurringJob resources, not by polling.
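To see the jobs behind this table, the RecurringJob CRs can be listed with their cron expressions and retention counts; the column paths assume the longhorn.io/v1beta2 RecurringJob schema (spec.task, spec.cron, spec.retain).

# List the RecurringJobs that drive the backup schedule above
kubectl get recurringjobs.longhorn.io -n longhorn-system \
  -o custom-columns=NAME:.metadata.name,TASK:.spec.task,CRON:.spec.cron,RETAIN:.spec.retain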

Why Polling Isn't Critical

Longhorn's backupstore polling is primarily for:

  • Disaster Recovery (DR) volumes that need continuous sync
  • Detecting backups created outside the cluster

We don't use DR volumes, and all backups are created by recurring jobs within the cluster, so:

  • Once-daily polling is more than sufficient
  • Backups work independently of polling frequency
  • Manual backups via Longhorn UI still work immediately
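As a quick confirmation that no DR (standby) volumes exist, the standby flag on the Longhorn Volume CRs can be grepped; the case-insensitive match is deliberate since the field's capitalization has varied across Longhorn API versions.

# No "standby: true" hits means there are no DR volumes relying on frequent polling
kubectl get volumes.longhorn.io -n longhorn-system -o yaml | grep -i 'standby'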

Troubleshooting

If Pods Show 1/2 Ready

Symptom: Longhorn manager pods stuck at 1/2 Ready

Cause: NetworkPolicy may have been accidentally applied

Solution:

# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system

# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# Wait 30 seconds
sleep 30

# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager

If S3 API Calls Remain High

Check poll interval is applied:

kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml

Restart Longhorn managers to pick up changes:

kubectl rollout restart daemonset -n longhorn-system longhorn-manager

If Backups Fail

Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:

# Check recurring jobs
kubectl get recurringjobs -n longhorn-system

# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup

# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://eu-central-003.backblazeb2.com

References

Files Modified

  1. config-map.yaml - Updated backupstore-poll-interval to 86400
  2. kustomization.yaml - Removed network-policy-s3-block.yaml reference
  3. network-policy-s3-block.yaml - Retained for reference (not applied)
  4. S3-API-SOLUTION-FINAL.md - This document

Lessons Learned

  1. NetworkPolicies are tricky: Blocking external traffic can inadvertently block internal cluster communication
  2. Start simple: Configuration-based solutions are often more reliable than complex automation
  3. Test thoroughly: Always verify pods remain healthy after applying NetworkPolicies
  4. Understand the feature: Longhorn's polling is for DR volumes, which we don't use
  5. 24-hour polling is sufficient: For non-DR use cases, frequent polling isn't necessary

Success Metrics

Monitor these over the next week:

  • Longhorn manager pods: 2/2 Ready
  • Daily backups: Completing successfully
  • S3 API calls: <1,000/day (down from 145,000)
  • Backblaze costs: Within free tier
  • No manual intervention required