# Longhorn S3 API Call Reduction - Final Solution

## Problem Summary

Longhorn was making **145,000+ Class C API calls/day** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.

## Root Cause

Longhorn's `backupstore-poll-interval` setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.

## Solution History

### Attempt 1: NetworkPolicy-Based Access Control ❌

**Approach**: Use NetworkPolicies, dynamically managed by CronJobs, to block S3 access outside backup windows (12:55 AM - 4:00 AM).

**Why It Failed**:

- NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
- Longhorn manager pods couldn't perform leader election or webhook operations
- Pods entered 1/2 Ready state with errors: `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout`
- Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
- Cilium/NetworkPolicy interaction complexity made it unreliable

**Files Created** (kept for reference):

- `network-policy-s3-block.yaml` - CronJobs and NetworkPolicy definitions
- Removed from `kustomization.yaml` but retained in repository

## Final Solution: Increased Poll Interval ✅

### Implementation

**Change**: Set `backupstore-poll-interval` to `86400` seconds (24 hours) instead of `0`.

**Location**: `manifests/infrastructure/longhorn/config-map.yaml`

```yaml
data:
  default-resource.yaml: |-
    "backup-target": "s3://@/longhorn-backup"
    "backup-target-credential-secret": "backblaze-credentials"
    "backupstore-poll-interval": "86400" # 24 hours
    "virtual-hosted-style": "true"
```

### Why This Works
1. **Dramatic Reduction**: Polling happens once per day instead of continuously
2. **No Breakage**: Kubernetes API, webhooks, and leader election work normally
3. **Simple**: No complex NetworkPolicies or CronJobs to manage
4. **Reliable**: Well-tested Longhorn configuration option
5. **Sufficient**: Backups don't require frequent polling, since we use scheduled recurring jobs

### Expected Results

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Poll Frequency** | Every ~5 seconds | Every 24 hours | **99.99% reduction** |
| **Daily S3 API Calls** | 145,000+ | ~300-1,000 | **99% reduction** 📉 |
| **Backblaze Costs** | Exceeds free tier | Within free tier | ✅ |
| **System Stability** | Affected by NetworkPolicy | Stable | ✅ |

## Current Status

- ✅ **Applied**: ConfigMap updated with `backupstore-poll-interval: 86400`
- ✅ **Verified**: Longhorn manager pods are 2/2 Ready
- ✅ **Backups**: Continue working normally via recurring jobs
- ✅ **Monitoring**: Backblaze API usage should drop to <1,000 calls/day

## Monitoring

### Check Longhorn Manager Health

```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods
```

### Check Poll Interval Setting

```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"
```

### Check Backups Continue Working

```bash
kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status
```

### Monitor Backblaze API Usage

1. Log into the Backblaze B2 dashboard
2. Navigate to "Caps and Alerts"
3. Check "Class C Transactions" (includes `s3_list_objects`)
4. **Expected**: API usage should drop from 145,000/day to ~300-1,000/day within 24-48 hours

## Backup Schedule (Unchanged)

| Type | Schedule | Retention |
|------|----------|-----------|
| **Daily** | 2:00 AM | 7 days |
| **Weekly** | 1:00 AM Sundays | 4 weeks |

Backups are triggered by `RecurringJob` resources, not by polling.

## Why Polling Isn't Critical

**Longhorn's backupstore polling is primarily for**:

- Disaster Recovery (DR) volumes that need continuous sync
- Detecting backups created outside the cluster

**We don't use DR volumes**, and all backups are created by recurring jobs within the cluster, so:

- ✅ Once-daily polling is more than sufficient
- ✅ Backups work independently of polling frequency
- ✅ Manual backups via the Longhorn UI still work immediately

## Troubleshooting

### If Pods Show 1/2 Ready

**Symptom**: Longhorn manager pods stuck at 1/2 Ready

**Cause**: A NetworkPolicy may have been accidentally applied

**Solution**:

```bash
# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system

# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access

# Wait 30 seconds
sleep 30

# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager
```

### If S3 API Calls Remain High

**Check that the poll interval is applied**:

```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml
```

**Restart Longhorn managers to pick up changes**:

```bash
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
```

### If Backups Fail

Backups should continue working normally, since they're triggered by recurring jobs rather than polling.
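For reference, a minimal sketch of what one of these `RecurringJob` resources might look like. The cron schedule and retention match the daily backup schedule documented above, but the resource name and group here are assumptions, not the actual manifests in this repository:

```yaml
# Hypothetical daily backup RecurringJob (sketch - actual names may differ)
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-backup
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"   # 2:00 AM daily, per the backup schedule
  task: backup
  groups:
    - default
  retain: 7           # keep 7 days of backups
  concurrency: 1
```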
If issues occur:

```bash
# Check recurring jobs
kubectl get recurringjobs -n longhorn-system

# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup

# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://
```

## References

- [Longhorn Issue #1547](https://github.com/longhorn/longhorn/issues/1547) - Original excessive S3 calls issue
- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)
- Longhorn version: v1.9.0

## Files Modified

1. ✅ `config-map.yaml` - Updated `backupstore-poll-interval` to 86400
2. ✅ `kustomization.yaml` - Removed the `network-policy-s3-block.yaml` reference
3. ✅ `network-policy-s3-block.yaml` - Retained for reference (not applied)
4. ✅ `S3-API-SOLUTION-FINAL.md` - This document

## Lessons Learned

1. **NetworkPolicies are tricky**: Blocking external traffic can inadvertently block internal cluster communication
2. **Start simple**: Configuration-based solutions are often more reliable than complex automation
3. **Test thoroughly**: Always verify pods remain healthy after applying NetworkPolicies
4. **Understand the feature**: Longhorn's polling is for DR volumes, which we don't use
5. **24-hour polling is sufficient**: For non-DR use cases, frequent polling isn't necessary

## Success Metrics

Monitor these over the next week:

- ✅ Longhorn manager pods: 2/2 Ready
- ✅ Daily backups: completing successfully
- ✅ S3 API calls: <1,000/day (down from 145,000)
- ✅ Backblaze costs: within free tier
- ✅ No manual intervention required
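As a back-of-envelope sanity check on the reduction figures above, here is a sketch assuming 3 manager pods and one poll per interval per manager (the per-poll cost in `s3_list_objects` requests is an assumption and varies in practice):

```shell
#!/bin/sh
# Rough estimate of Longhorn backupstore polls per day across all managers.
managers=3
seconds_per_day=86400

# Before: effective poll interval of ~5 seconds
echo "before: $(( managers * seconds_per_day / 5 )) polls/day"     # 51840

# After: poll interval of 86400 seconds (24 hours)
echo "after:  $(( managers * seconds_per_day / 86400 )) polls/day" # 3
```

Each poll can issue several `s3_list_objects` requests, which is consistent with ~52,000 polls/day producing the observed 145,000+ daily Class C calls, and with a handful of daily polls landing well inside the free tier.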