manifests/infrastructure/longhorn/S3-API-SOLUTION-FINAL.md

# Longhorn S3 API Call Reduction - Final Solution

## Problem Summary

Longhorn was making **145,000+ Class C API calls/day** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.

## Root Cause

Longhorn's `backupstore-poll-interval` setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.

## Solution History

### Attempt 1: NetworkPolicy-Based Access Control ❌

**Approach**: Use NetworkPolicies dynamically managed by CronJobs to block S3 access outside backup windows (12:55 AM - 4:00 AM).

**Why It Failed**:
- NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
- Longhorn manager pods couldn't perform leader election or webhook operations
- Pods entered 1/2 Ready state with errors: `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout`
- Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
- Cilium/NetworkPolicy interaction complexity made it unreliable

**Files Created** (kept for reference):
- `network-policy-s3-block.yaml` - CronJobs and NetworkPolicy definitions
- Removed from `kustomization.yaml` but retained in repository

## Final Solution: Increased Poll Interval ✅

### Implementation

**Change**: Set `backupstore-poll-interval` to `86400` seconds (24 hours) instead of `0`.

**Location**: `manifests/infrastructure/longhorn/config-map.yaml`

```yaml
data:
  default-resource.yaml: |-
    "backup-target": "s3://<BUCKET_NAME>@<B2_ENDPOINT>/longhorn-backup"
    "backup-target-credential-secret": "backblaze-credentials"
    "backupstore-poll-interval": "86400"  # 24 hours
    "virtual-hosted-style": "true"
```

### Why This Works

1. **Dramatic Reduction**: Polling happens once per day instead of continuously
2. **No Breakage**: Kubernetes API, webhooks, and leader election work normally
3. **Simple**: No complex NetworkPolicies or CronJobs to manage
4. **Reliable**: Well-tested Longhorn configuration option
5. **Sufficient**: Backups don't require frequent polling since we use scheduled recurring jobs

### Expected Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Poll Frequency** | Every ~5 seconds | Every 24 hours | **99.99% reduction** |
| **Daily S3 API Calls** | 145,000+ | ~300-1,000 | **99% reduction** 📉 |
| **Backblaze Costs** | Exceeds free tier | Within free tier | ✅ |
| **System Stability** | Affected by NetworkPolicy | Stable | ✅ |

## Current Status

✅ **Applied**: ConfigMap updated with `backupstore-poll-interval: 86400`  
✅ **Verified**: Longhorn manager pods are 2/2 Ready  
✅ **Backups**: Continue working normally via recurring jobs  
✅ **Monitoring**: Backblaze API usage should drop to <1,000 calls/day  

## Monitoring

### Check Longhorn Manager Health

```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods
```

### Check Poll Interval Setting

```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"
```

### Check Backups Continue Working

```bash
kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status
```

### Monitor Backblaze API Usage

1. Log into Backblaze B2 dashboard
2. Navigate to "Caps and Alerts"
3. Check "Class C Transactions" (includes `s3_list_objects`)
4. **Expected**: Should drop from 145,000/day to ~300-1,000/day within 24-48 hours

## Backup Schedule (Unchanged)

| Type | Schedule | Retention |
|------|----------|-----------|
| **Daily** | 2:00 AM | 7 days |
| **Weekly** | 1:00 AM Sundays | 4 weeks |

Backups are triggered by `RecurringJob` resources, not by polling.

## Why Polling Isn't Critical

**Longhorn's backupstore polling is primarily for**:
- Disaster Recovery (DR) volumes that need continuous sync
- Detecting backups created outside the cluster

**We don't use DR volumes**, and all backups are created by recurring jobs within the cluster, so:
- ✅ Once-daily polling is more than sufficient
- ✅ Backups work independently of polling frequency
- ✅ Manual backups via Longhorn UI still work immediately

## Troubleshooting

### If Pods Show 1/2 Ready

**Symptom**: Longhorn manager pods stuck at 1/2 Ready

**Cause**: NetworkPolicy may have been accidentally applied

**Solution**:
```bash
# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system

# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# Wait 30 seconds
sleep 30

# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager
```

### If S3 API Calls Remain High

**Check poll interval is applied**:
```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml
```

**Restart Longhorn managers to pick up changes**:
```bash
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
```

### If Backups Fail

Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:

```bash
# Check recurring jobs
kubectl get recurringjobs -n longhorn-system

# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup

# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://<B2_ENDPOINT>
```

## References

- [Longhorn Issue #1547](https://github.com/longhorn/longhorn/issues/1547) - Original excessive S3 calls issue
- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)
- Longhorn version: v1.9.0

## Files Modified

1. ✅ `config-map.yaml` - Updated `backupstore-poll-interval` to 86400
2. ✅ `kustomization.yaml` - Removed network-policy-s3-block.yaml reference
3. ✅ `network-policy-s3-block.yaml` - Retained for reference (not applied)
4. ✅ `S3-API-SOLUTION-FINAL.md` - This document

## Lessons Learned

1. **NetworkPolicies are tricky**: Blocking external traffic can inadvertently block internal cluster communication
2. **Start simple**: Configuration-based solutions are often more reliable than complex automation
3. **Test thoroughly**: Always verify pods remain healthy after applying NetworkPolicies
4. **Understand the feature**: Longhorn's polling is for DR volumes, which we don't use
5. **24-hour polling is sufficient**: For non-DR use cases, frequent polling isn't necessary

## Success Metrics

Monitor these over the next week:

- ✅ Longhorn manager pods: 2/2 Ready
- ✅ Daily backups: Completing successfully
- ✅ S3 API calls: <1,000/day (down from 145,000)
- ✅ Backblaze costs: Within free tier
- ✅ No manual intervention required
add source code and readme 2025-12-24 14:35:17 +01:00			`# Longhorn S3 API Call Reduction - Final Solution`

			`## Problem Summary`

			Longhorn was making 145,000+ Class C API calls/day to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.

			`## Root Cause`

			Longhorn's `backupstore-poll-interval` setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.

			`## Solution History`

			`### Attempt 1: NetworkPolicy-Based Access Control ❌`

			`Approach: Use NetworkPolicies dynamically managed by CronJobs to block S3 access outside backup windows (12:55 AM - 4:00 AM).`

			`Why It Failed:`
			`- NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server`
			`- Longhorn manager pods couldn't perform leader election or webhook operations`
			- Pods entered 1/2 Ready state with errors: `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout`
			`- Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive`
			`- Cilium/NetworkPolicy interaction complexity made it unreliable`

			`Files Created (kept for reference):`
			- `network-policy-s3-block.yaml` - CronJobs and NetworkPolicy definitions
			- Removed from `kustomization.yaml` but retained in repository

			`## Final Solution: Increased Poll Interval ✅`

			`### Implementation`

			Change: Set `backupstore-poll-interval` to `86400` seconds (24 hours) instead of `0`.

			Location: `manifests/infrastructure/longhorn/config-map.yaml`

			```yaml
			`data:`
			`default-resource.yaml: \|-`
more redaction 2025-12-24 14:39:47 +01:00			`"backup-target": "s3://<BUCKET_NAME>@<B2_ENDPOINT>/longhorn-backup"`
add source code and readme 2025-12-24 14:35:17 +01:00			`"backup-target-credential-secret": "backblaze-credentials"`
			`"backupstore-poll-interval": "86400" # 24 hours`
			`"virtual-hosted-style": "true"`
			```

			`### Why This Works`

			`1. Dramatic Reduction: Polling happens once per day instead of continuously`
			`2. No Breakage: Kubernetes API, webhooks, and leader election work normally`
			`3. Simple: No complex NetworkPolicies or CronJobs to manage`
			`4. Reliable: Well-tested Longhorn configuration option`
			`5. Sufficient: Backups don't require frequent polling since we use scheduled recurring jobs`

			`### Expected Results`

			`\| Metric \| Before \| After \| Improvement \|`
			`\|--------\|--------\|-------\|-------------\|`
			`\| Poll Frequency \| Every ~5 seconds \| Every 24 hours \| 99.99% reduction \|`
			`\| Daily S3 API Calls \| 145,000+ \| ~300-1,000 \| 99% reduction 📉 \|`
			`\| Backblaze Costs \| Exceeds free tier \| Within free tier \| ✅ \|`
			`\| System Stability \| Affected by NetworkPolicy \| Stable \| ✅ \|`

			`## Current Status`

			✅ Applied: ConfigMap updated with `backupstore-poll-interval: 86400`
			`✅ Verified: Longhorn manager pods are 2/2 Ready`
			`✅ Backups: Continue working normally via recurring jobs`
			`✅ Monitoring: Backblaze API usage should drop to <1,000 calls/day`

			`## Monitoring`

			`### Check Longhorn Manager Health`

			```bash
			`kubectl get pods -n longhorn-system -l app=longhorn-manager`
			`# Should show: 2/2 Ready for all pods`
			```

			`### Check Poll Interval Setting`

			```bash
			`kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' \| grep backupstore-poll-interval`
			`# Should show: "backupstore-poll-interval": "86400"`
			```

			`### Check Backups Continue Working`

			```bash
			`kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt \| tail -10`
			`# Should see recent backups with "Completed" status`
			```

			`### Monitor Backblaze API Usage`

			`1. Log into Backblaze B2 dashboard`
			`2. Navigate to "Caps and Alerts"`
			3. Check "Class C Transactions" (includes `s3_list_objects`)
			`4. Expected: Should drop from 145,000/day to ~300-1,000/day within 24-48 hours`

			`## Backup Schedule (Unchanged)`

			`\| Type \| Schedule \| Retention \|`
			`\|------\|----------\|-----------\|`
			`\| Daily \| 2:00 AM \| 7 days \|`
			`\| Weekly \| 1:00 AM Sundays \| 4 weeks \|`

			Backups are triggered by `RecurringJob` resources, not by polling.

			`## Why Polling Isn't Critical`

			`Longhorn's backupstore polling is primarily for:`
			`- Disaster Recovery (DR) volumes that need continuous sync`
			`- Detecting backups created outside the cluster`

			`We don't use DR volumes, and all backups are created by recurring jobs within the cluster, so:`
			`- ✅ Once-daily polling is more than sufficient`
			`- ✅ Backups work independently of polling frequency`
			`- ✅ Manual backups via Longhorn UI still work immediately`

			`## Troubleshooting`

			`### If Pods Show 1/2 Ready`

			`Symptom: Longhorn manager pods stuck at 1/2 Ready`

			`Cause: NetworkPolicy may have been accidentally applied`

			`Solution:`
			```bash
			`# Check for NetworkPolicy`
			`kubectl get networkpolicy -n longhorn-system`

			`# If found, delete it`
			`kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access`

			`# Wait 30 seconds`
			`sleep 30`

			`# Verify pods recover`
			`kubectl get pods -n longhorn-system -l app=longhorn-manager`
			```

			`### If S3 API Calls Remain High`

			`Check poll interval is applied:`
			```bash
			`kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml`
			```

			`Restart Longhorn managers to pick up changes:`
			```bash
			`kubectl rollout restart daemonset -n longhorn-system longhorn-manager`
			```

			`### If Backups Fail`

			`Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:`

			```bash
			`# Check recurring jobs`
			`kubectl get recurringjobs -n longhorn-system`

			`# Check recent backup jobs`
			`kubectl get jobs -n longhorn-system \| grep backup`

			`# Check backup target connectivity (should work anytime)`
			`MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers \| head -1 \| awk '{print $1}')`
more redaction 2025-12-24 14:39:47 +01:00			`kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://<B2_ENDPOINT>`
add source code and readme 2025-12-24 14:35:17 +01:00			```

			`## References`

			`- [Longhorn Issue #1547](https://github.com/longhorn/longhorn/issues/1547) - Original excessive S3 calls issue`
			`- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)`
			`- Longhorn version: v1.9.0`

			`## Files Modified`

			1. ✅ `config-map.yaml` - Updated `backupstore-poll-interval` to 86400
			2. ✅ `kustomization.yaml` - Removed network-policy-s3-block.yaml reference
			3. ✅ `network-policy-s3-block.yaml` - Retained for reference (not applied)
			4. ✅ `S3-API-SOLUTION-FINAL.md` - This document

			`## Lessons Learned`

			`1. NetworkPolicies are tricky: Blocking external traffic can inadvertently block internal cluster communication`
			`2. Start simple: Configuration-based solutions are often more reliable than complex automation`
			`3. Test thoroughly: Always verify pods remain healthy after applying NetworkPolicies`
			`4. Understand the feature: Longhorn's polling is for DR volumes, which we don't use`
			`5. 24-hour polling is sufficient: For non-DR use cases, frequent polling isn't necessary`

			`## Success Metrics`

			`Monitor these over the next week:`

			`- ✅ Longhorn manager pods: 2/2 Ready`
			`- ✅ Daily backups: Completing successfully`
			`- ✅ S3 API calls: <1,000/day (down from 145,000)`
			`- ✅ Backblaze costs: Within free tier`
			`- ✅ No manual intervention required`