# Longhorn S3 API Call Reduction - Final Solution

## Problem Summary

Longhorn was making **145,000+ Class C API calls/day** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.

## Root Cause

Longhorn's `backupstore-poll-interval` setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.
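
As a rough back-of-envelope check (the ~5-second effective poll rate matches the table further below; the number of `s3_list_objects` requests per poll is an assumption for illustration only), the observed volume is easy to reproduce:

```bash
# Rough estimate of daily list calls (illustrative; LISTS_PER_POLL is an assumption)
MANAGERS=3          # longhorn-manager pods, each polling the backup target
POLL_SECONDS=5      # effective seconds between polls per manager (~5 s, see table below)
LISTS_PER_POLL=3    # assumed s3_list_objects requests issued per poll cycle

echo $(( MANAGERS * 86400 / POLL_SECONDS * LISTS_PER_POLL ))
# => 155520 — the same order of magnitude as the ~145,000 calls/day observed
```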

## Solution History

### Attempt 1: NetworkPolicy-Based Access Control ❌

**Approach**: Use NetworkPolicies, dynamically managed by CronJobs, to block S3 access outside the backup window (12:55 AM - 4:00 AM).

**Why It Failed**:

- NetworkPolicies that blocked external S3 also inadvertently blocked access to the Kubernetes API server
- Longhorn manager pods couldn't perform leader election or webhook operations
- Pods entered a 1/2 Ready state with errors: `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout`
- Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
- The complexity of the Cilium/NetworkPolicy interaction made the approach unreliable

**Files Created** (kept for reference):

- `network-policy-s3-block.yaml` - CronJobs and NetworkPolicy definitions
- Removed from `kustomization.yaml` but retained in the repository

## Final Solution: Increased Poll Interval ✅

### Implementation

**Change**: Set `backupstore-poll-interval` to `86400` seconds (24 hours) instead of `0`.

**Location**: `manifests/infrastructure/longhorn/config-map.yaml`

```yaml
data:
  default-resource.yaml: |-
    "backup-target": "s3://<BUCKET_NAME>@<B2_ENDPOINT>/longhorn-backup"
    "backup-target-credential-secret": "backblaze-credentials"
    "backupstore-poll-interval": "86400" # 24 hours
    "virtual-hosted-style": "true"
```
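
A minimal rollout sketch, assuming the change is applied with kustomize from the directory implied by the path above and that a manager restart is used to pick up the new defaults (the exact apply step depends on how this repository is deployed, e.g. via GitOps):

```bash
# Apply the updated ConfigMap (directory assumed from the repository layout above)
kubectl apply -k manifests/infrastructure/longhorn

# Restart the managers so the new default resource settings are picked up
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
kubectl rollout status daemonset -n longhorn-system longhorn-manager
```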

### Why This Works

1. **Dramatic Reduction**: Polling happens once per day instead of continuously
2. **No Breakage**: Kubernetes API, webhooks, and leader election work normally
3. **Simple**: No complex NetworkPolicies or CronJobs to manage
4. **Reliable**: Well-tested Longhorn configuration option
5. **Sufficient**: Backups don't require frequent polling since we use scheduled recurring jobs

### Expected Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Poll Frequency** | Every ~5 seconds | Every 24 hours | **99.99% reduction** |
| **Daily S3 API Calls** | 145,000+ | ~300-1,000 | **99% reduction** 📉 |
| **Backblaze Costs** | Exceeds free tier | Within free tier | ✅ |
| **System Stability** | Affected by NetworkPolicy | Stable | ✅ |

## Current Status

- ✅ **Applied**: ConfigMap updated with `backupstore-poll-interval: 86400`
- ✅ **Verified**: Longhorn manager pods are 2/2 Ready
- ✅ **Backups**: Continue working normally via recurring jobs
- ✅ **Monitoring**: Backblaze API usage should drop to <1,000 calls/day

## Monitoring

### Check Longhorn Manager Health

```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods
```

### Check Poll Interval Setting

```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"
```

### Check Backups Continue Working

```bash
kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status
```

### Monitor Backblaze API Usage

1. Log into the Backblaze B2 dashboard
2. Navigate to "Caps and Alerts"
3. Check "Class C Transactions" (includes `s3_list_objects`)
4. **Expected**: Should drop from 145,000/day to ~300-1,000/day within 24-48 hours

## Backup Schedule (Unchanged)

| Type | Schedule | Retention |
|------|----------|-----------|
| **Daily** | 2:00 AM | 7 days |
| **Weekly** | 1:00 AM Sundays | 4 weeks |

Backups are triggered by `RecurringJob` resources, not by polling.
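
To confirm those schedules match what is actually deployed, the `RecurringJob` resources can be inspected directly. A sketch using column paths from Longhorn's `recurringjobs.longhorn.io` CRD (`spec.task`, `spec.cron`, `spec.retain`); adjust the columns if your Longhorn version exposes different fields:

```bash
# List recurring jobs with their task type, cron schedule, and retention count
kubectl -n longhorn-system get recurringjobs.longhorn.io \
  -o custom-columns=NAME:.metadata.name,TASK:.spec.task,CRON:.spec.cron,RETAIN:.spec.retain
```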

## Why Polling Isn't Critical

**Longhorn's backupstore polling is primarily for**:

- Disaster Recovery (DR) volumes that need continuous sync
- Detecting backups created outside the cluster

**We don't use DR volumes**, and all backups are created by recurring jobs within the cluster, so:

- ✅ Once-daily polling is more than sufficient
- ✅ Backups work independently of polling frequency
- ✅ Manual backups via the Longhorn UI still work immediately (and a one-off sync can be requested manually; see the sketch below)
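
If an immediate re-scan of the backup store is ever needed between daily polls, one hedged option is to request a sync on the `BackupTarget` resource. This assumes the default `BackupTarget` CR is named `default` and supports `spec.syncRequestedAt`, as in recent Longhorn releases:

```bash
# Request an immediate backup-store sync (CR name and field assumed; verify on your version)
kubectl -n longhorn-system patch backuptargets.longhorn.io default --type=merge \
  -p "{\"spec\":{\"syncRequestedAt\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}}"
```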

## Troubleshooting

### If Pods Show 1/2 Ready

**Symptom**: Longhorn manager pods stuck at 1/2 Ready

**Cause**: NetworkPolicy may have been accidentally applied

**Solution**:

```bash
# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system

# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# Wait 30 seconds
sleep 30

# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager
```

### If S3 API Calls Remain High

**Check the poll interval is applied**:

```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml
```

**Restart Longhorn managers to pick up changes**:

```bash
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
```
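
To see the value the managers are actually running with (rather than just what the ConfigMap says), Longhorn's live settings can be read from its `settings.longhorn.io` resources; a hedged check, assuming that CRD is present in this Longhorn version:

```bash
# Inspect the live setting value Longhorn is using (not just the ConfigMap)
kubectl -n longhorn-system get settings.longhorn.io backupstore-poll-interval \
  -o jsonpath='{.value}'
# Expected output: 86400
```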

### If Backups Fail

Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:

```bash
# Check recurring jobs
kubectl get recurringjobs -n longhorn-system

# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup

# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://<B2_ENDPOINT>
```

## References

- [Longhorn Issue #1547](https://github.com/longhorn/longhorn/issues/1547) - Original excessive S3 calls issue
- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)
- Longhorn version: v1.9.0

## Files Modified

1. ✅ `config-map.yaml` - Updated `backupstore-poll-interval` to 86400
2. ✅ `kustomization.yaml` - Removed the `network-policy-s3-block.yaml` reference
3. ✅ `network-policy-s3-block.yaml` - Retained for reference (not applied)
4. ✅ `S3-API-SOLUTION-FINAL.md` - This document
## Lessons Learned
|
||
|
|
|
||
|
|
1. **NetworkPolicies are tricky**: Blocking external traffic can inadvertently block internal cluster communication
|
||
|
|
2. **Start simple**: Configuration-based solutions are often more reliable than complex automation
|
||
|
|
3. **Test thoroughly**: Always verify pods remain healthy after applying NetworkPolicies
|
||
|
|
4. **Understand the feature**: Longhorn's polling is for DR volumes, which we don't use
|
||
|
|
5. **24-hour polling is sufficient**: For non-DR use cases, frequent polling isn't necessary
|
||
|
|
|
||
|
|

## Success Metrics

Monitor these over the next week:

- ✅ Longhorn manager pods: 2/2 Ready
- ✅ Daily backups: Completing successfully
- ✅ S3 API calls: <1,000/day (down from 145,000)
- ✅ Backblaze costs: Within free tier
- ✅ No manual intervention required