2025-12-24 14:35:17 +01:00
# Longhorn S3 API Call Reduction - Final Solution
## Problem Summary
Longhorn was making **145,000+ Class C API calls/day ** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.
## Root Cause
Longhorn's `backupstore-poll-interval` setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.
## Solution History
### Attempt 1: NetworkPolicy-Based Access Control ❌
**Approach**: Use NetworkPolicies dynamically managed by CronJobs to block S3 access outside backup windows (12:55 AM - 4:00 AM).
**Why It Failed**:
- NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
- Longhorn manager pods couldn't perform leader election or webhook operations
- Pods entered 1/2 Ready state with errors: `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout`
- Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
- Cilium/NetworkPolicy interaction complexity made it unreliable
**Files Created** (kept for reference):
- `network-policy-s3-block.yaml` - CronJobs and NetworkPolicy definitions
- Removed from `kustomization.yaml` but retained in repository
## Final Solution: Increased Poll Interval ✅
### Implementation
**Change**: Set `backupstore-poll-interval` to `86400` seconds (24 hours) instead of `0` .
**Location**: `manifests/infrastructure/longhorn/config-map.yaml`
```yaml
data:
default-resource.yaml: |-
2025-12-24 14:39:47 +01:00
"backup-target": "s3://<BUCKET_NAME>@<B2_ENDPOINT>/longhorn-backup"
2025-12-24 14:35:17 +01:00
"backup-target-credential-secret": "backblaze-credentials"
"backupstore-poll-interval": "86400" # 24 hours
"virtual-hosted-style": "true"
```
### Why This Works
1. **Dramatic Reduction ** : Polling happens once per day instead of continuously
2. **No Breakage ** : Kubernetes API, webhooks, and leader election work normally
3. **Simple ** : No complex NetworkPolicies or CronJobs to manage
4. **Reliable ** : Well-tested Longhorn configuration option
5. **Sufficient ** : Backups don't require frequent polling since we use scheduled recurring jobs
### Expected Results
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Poll Frequency ** | Every ~5 seconds | Every 24 hours | **99.99% reduction ** |
| **Daily S3 API Calls ** | 145,000+ | ~300-1,000 | **99% reduction ** 📉 |
| **Backblaze Costs ** | Exceeds free tier | Within free tier | ✅ |
| **System Stability ** | Affected by NetworkPolicy | Stable | ✅ |
## Current Status
✅ **Applied ** : ConfigMap updated with `backupstore-poll-interval: 86400`
✅ **Verified ** : Longhorn manager pods are 2/2 Ready
✅ **Backups ** : Continue working normally via recurring jobs
✅ **Monitoring ** : Backblaze API usage should drop to <1,000 calls/day
## Monitoring
### Check Longhorn Manager Health
```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods
```
### Check Poll Interval Setting
```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"
```
### Check Backups Continue Working
```bash
kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status
```
### Monitor Backblaze API Usage
1. Log into Backblaze B2 dashboard
2. Navigate to "Caps and Alerts"
3. Check "Class C Transactions" (includes `s3_list_objects` )
4. **Expected ** : Should drop from 145,000/day to ~300-1,000/day within 24-48 hours
## Backup Schedule (Unchanged)
| Type | Schedule | Retention |
|------|----------|-----------|
| **Daily ** | 2:00 AM | 7 days |
| **Weekly ** | 1:00 AM Sundays | 4 weeks |
Backups are triggered by `RecurringJob` resources, not by polling.
## Why Polling Isn't Critical
**Longhorn's backupstore polling is primarily for**:
- Disaster Recovery (DR) volumes that need continuous sync
- Detecting backups created outside the cluster
**We don't use DR volumes**, and all backups are created by recurring jobs within the cluster, so:
- ✅ Once-daily polling is more than sufficient
- ✅ Backups work independently of polling frequency
- ✅ Manual backups via Longhorn UI still work immediately
## Troubleshooting
### If Pods Show 1/2 Ready
**Symptom**: Longhorn manager pods stuck at 1/2 Ready
**Cause**: NetworkPolicy may have been accidentally applied
**Solution**:
```bash
# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system
# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access
# Wait 30 seconds
sleep 30
# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager
```
### If S3 API Calls Remain High
**Check poll interval is applied**:
```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml
```
**Restart Longhorn managers to pick up changes**:
```bash
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
```
### If Backups Fail
Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:
```bash
# Check recurring jobs
kubectl get recurringjobs -n longhorn-system
# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup
# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
2025-12-24 14:39:47 +01:00
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://<B2_ENDPOINT>
2025-12-24 14:35:17 +01:00
```
## References
- [Longhorn Issue #1547 ](https://github.com/longhorn/longhorn/issues/1547 ) - Original excessive S3 calls issue
- [Longhorn Backup Target Documentation ](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/ )
- Longhorn version: v1.9.0
## Files Modified
1. ✅ `config-map.yaml` - Updated `backupstore-poll-interval` to 86400
2. ✅ `kustomization.yaml` - Removed network-policy-s3-block.yaml reference
3. ✅ `network-policy-s3-block.yaml` - Retained for reference (not applied)
4. ✅ `S3-API-SOLUTION-FINAL.md` - This document
## Lessons Learned
1. **NetworkPolicies are tricky ** : Blocking external traffic can inadvertently block internal cluster communication
2. **Start simple ** : Configuration-based solutions are often more reliable than complex automation
3. **Test thoroughly ** : Always verify pods remain healthy after applying NetworkPolicies
4. **Understand the feature ** : Longhorn's polling is for DR volumes, which we don't use
5. **24-hour polling is sufficient ** : For non-DR use cases, frequent polling isn't necessary
## Success Metrics
Monitor these over the next week:
- ✅ Longhorn manager pods: 2/2 Ready
- ✅ Daily backups: Completing successfully
- ✅ S3 API calls: <1,000/day (down from 145,000)
- ✅ Backblaze costs: Within free tier
- ✅ No manual intervention required