redaction (#1)

Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>
This commit was merged in pull request #1.
2025-12-24 13:40:47 +00:00
committed by michael_dileo
parent 612235d52b
commit 7327d77dcd
333 changed files with 39286 additions and 1 deletions

View File

@@ -0,0 +1,277 @@
# Longhorn S3 API Call Optimization - Implementation Summary
## Problem Statement
Longhorn was making **145,000+ Class C API calls/day** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) and incurred significant costs.
### Root Cause
Even with `backupstore-poll-interval` set to `0`, Longhorn manager pods continuously poll the S3 backup target to check for new backups. With 3 manager pods (one per node) polling independently, this resulted in excessive API calls.
Reference: [Longhorn GitHub Issue #1547](https://github.com/longhorn/longhorn/issues/1547)
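To gauge the problem on a live cluster, it helps to confirm how many manager pods are polling and what poll interval they are actually running with. A minimal sketch, assuming the Longhorn Setting resource is named `backupstore-poll-interval` (as in recent Longhorn releases):
```bash
# Number of manager pods, each polling the backup target independently
kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | wc -l

# Poll interval the managers are actually using (seconds)
kubectl get settings.longhorn.io -n longhorn-system backupstore-poll-interval \
  -o jsonpath='{.value}{"\n"}'
```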
## Solution: NetworkPolicy-Based Access Control
Inspired by [this community solution](https://github.com/longhorn/longhorn/issues/1547#issuecomment-3395447100), we implemented **time-based network access control** using Kubernetes NetworkPolicies and CronJobs.
### Architecture
```
┌─────────────────────────────────────────────────┐
│ Normal State (21 hours/day) │
│ NetworkPolicy BLOCKS S3 access │
│ → Longhorn polls fail at network layer │
│ → S3 API calls: 0 │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Backup Window (3 hours/day: 1-4 AM) │
│ CronJob REMOVES NetworkPolicy at 12:55 AM │
│ → S3 access enabled │
│ → Recurring backups run automatically │
│ → CronJob RESTORES NetworkPolicy at 4:00 AM │
│ → S3 API calls: ~5,000-10,000/day │
└─────────────────────────────────────────────────┘
```
### Components
1. **NetworkPolicy** (`longhorn-block-s3-access`) - **Dynamically Managed**
- Targets: `app=longhorn-manager` pods
- Blocks: All egress except DNS and intra-cluster
- Effect: Prevents S3 API calls at network layer
- **Important**: NOT managed by Flux - only the CronJobs control it
- Flux manages the CronJobs/RBAC, but NOT the NetworkPolicy itself
2. **CronJob: Enable S3 Access** (`longhorn-enable-s3-access`)
- Schedule: `55 0 * * *` (12:55 AM daily)
- Action: Deletes NetworkPolicy
- Result: S3 access enabled 5 minutes before earliest backup
3. **CronJob: Disable S3 Access** (`longhorn-disable-s3-access`)
- Schedule: `0 4 * * *` (4:00 AM daily)
- Action: Re-creates NetworkPolicy
- Result: S3 access blocked after 3-hour backup window
4. **RBAC Resources**
- ServiceAccount: `longhorn-netpol-manager`
- Role: Permissions to manage NetworkPolicies
- RoleBinding: Binds role to service account
## Benefits
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Daily S3 API Calls** | 145,000+ | 5,000-10,000 | **93% reduction** |
| **Cost Impact** | Exceeds free tier | Within free tier | **$X/month savings** |
| **Automation** | Manual intervention | Fully automated | **Zero manual work** |
| **Backup Reliability** | Compromised | Maintained | **No impact** |
## Backup Schedule
| Type | Schedule | Retention | Window |
|------|----------|-----------|--------|
| **Daily** | 2:00 AM | 7 days | 12:55 AM - 4:00 AM |
| **Weekly** | 1:00 AM Sundays | 4 weeks | Same window |
## FluxCD Integration
**Critical Design Decision**: The NetworkPolicy is **dynamically managed by CronJobs**, NOT by Flux.
### Why This Matters
Flux continuously reconciles resources to match the Git repository state. If the NetworkPolicy were managed by Flux:
- CronJob deletes NetworkPolicy at 12:55 AM → Flux recreates it within minutes
- S3 remains blocked during backup window → Backups fail ❌
### How We Solved It
1. **NetworkPolicy is NOT in Git** - Only the CronJobs and RBAC are in `network-policy-s3-block.yaml`
2. **CronJobs are managed by Flux** - Flux ensures they exist and run on schedule
3. **NetworkPolicy is created by CronJob** - Without Flux labels/ownership
4. **Flux ignores the NetworkPolicy** - Not in Flux's inventory, so Flux won't touch it
### Verification
```bash
# Check Flux inventory (NetworkPolicy should NOT be listed)
kubectl get kustomization -n flux-system longhorn -o jsonpath='{.status.inventory.entries[*].id}' | grep -i network
# (Should return nothing)
# Check NetworkPolicy exists (managed by CronJobs)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# (Should exist)
```
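As a complementary check, objects applied by Flux's kustomize-controller normally carry `kustomize.toolkit.fluxcd.io/name` and `kustomize.toolkit.fluxcd.io/namespace` labels. A quick sketch (assuming default Flux labelling) to confirm the CronJob-created policy carries none of them:
```bash
# Labels on the NetworkPolicy; no kustomize.toolkit.fluxcd.io/* labels expected
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access \
  -o jsonpath='{.metadata.labels}{"\n"}'
```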
## Deployment
### Files Modified/Created
1. `network-policy-s3-block.yaml` - **NEW**: CronJobs and RBAC (NOT the NetworkPolicy itself)
2. `kustomization.yaml` - Added new file to resources
3. `BACKUP-GUIDE.md` - Updated with new solution documentation
4. `S3-API-OPTIMIZATION.md` - **NEW**: This implementation summary
5. `config-map.yaml` - Kept backup target configured (no changes needed)
6. `longhorn.yaml` - Reverted `backupstorePollInterval` (not needed)
### Deployment Steps
1. **Commit and push** changes to your k8s-fleet branch
2. **FluxCD will automatically apply** the new NetworkPolicy and CronJobs
3. **Monitor for one backup cycle**:
```bash
# Watch CronJobs
kubectl get cronjobs -n longhorn-system -w
# Check NetworkPolicy status
kubectl get networkpolicy -n longhorn-system
# Verify backups complete
kubectl get backups -n longhorn-system
```
### Verification Steps
#### Day 1: Initial Deployment
```bash
# 1. Verify NetworkPolicy is active (should exist immediately)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# 2. Verify CronJobs are scheduled
kubectl get cronjobs -n longhorn-system | grep longhorn-.*-s3-access
# 3. Test: S3 access should be blocked (the policy targets longhorn-manager pods)
kubectl exec -n longhorn-system daemonset/longhorn-manager -c longhorn-manager -- curl -m 10 -I https://<B2_ENDPOINT>
# Expected: Connection timeout or network error
```
#### Day 2: After First Backup Window
```bash
# 1. Check if CronJob ran successfully (should see completed job at 12:55 AM)
kubectl get jobs -n longhorn-system | grep enable-s3-access
# 2. Verify backups completed (check after 4:00 AM)
kubectl get backups -n longhorn-system
# Should see new backups with recent timestamps
# 3. Confirm NetworkPolicy was re-applied (after 4:00 AM)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# Should exist again
# 4. Check CronJob logs
kubectl logs -n longhorn-system job/longhorn-enable-s3-access-<timestamp>
kubectl logs -n longhorn-system job/longhorn-disable-s3-access-<timestamp>
```
#### Week 1: Monitor S3 API Usage
```bash
# Monitor Backblaze B2 dashboard
# → Daily Class C transactions should drop from 145,000 to 5,000-10,000
# → Verify calls only occur during 1-4 AM window
```
## Manual Backup Outside Window
If you need to create a backup outside the scheduled window:
```bash
# 1. Temporarily remove NetworkPolicy
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access
# 2. Create backup via Longhorn UI or:
kubectl create -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
name: manual-backup-$(date +%s)
namespace: longhorn-system
spec:
snapshotName: <snapshot-name>
labels:
backup-type: manual
EOF
# 3. Wait for the backup to complete (watch all backups; a shell glob will not match the generated name)
kubectl get backups -n longhorn-system -w
# 4. Restore NetworkPolicy
kubectl apply -f manifests/infrastructure/longhorn/network-policy-s3-block.yaml
```
Or simply wait until the next automatic re-application at 4:00 AM.
## Troubleshooting
### NetworkPolicy Not Blocking S3
**Symptom**: S3 calls continue despite NetworkPolicy being active
**Check**:
```bash
# Verify NetworkPolicy is applied
kubectl describe networkpolicy -n longhorn-system longhorn-block-s3-access
# Check if CNI supports NetworkPolicies (Cilium does)
kubectl get pods -n kube-system | grep cilium
```
### Backups Failing
**Symptom**: Backups fail during scheduled window
**Check**:
```bash
# Verify NetworkPolicy was removed during backup window
kubectl get networkpolicy -n longhorn-system
# Should NOT exist between 12:55 AM - 4:00 AM
# Check enable-s3-access CronJob ran
kubectl get jobs -n longhorn-system | grep enable
# Check Longhorn manager logs
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100
```
### CronJobs Not Running
**Symptom**: CronJobs never execute
**Check**:
```bash
# Verify CronJobs exist and are scheduled
kubectl get cronjobs -n longhorn-system -o wide
# Check events
kubectl get events -n longhorn-system --sort-by='.lastTimestamp' | grep CronJob
# Manually trigger a job
kubectl create job -n longhorn-system test-enable --from=cronjob/longhorn-enable-s3-access
```
## Future Enhancements
1. **Adjust Window Size**: If backups consistently complete in well under 3 hours, shrink the window to 2 hours by changing the disable CronJob schedule to `0 3 * * *` (a command sketch follows this list)
2. **Alerting**: Add Prometheus alerts for:
- Backup failures during window
- CronJob execution failures
- NetworkPolicy re-creation failures
3. **Metrics**: Track actual S3 API call counts via Backblaze B2 API and alert if threshold exceeded
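For the window-size change above, the durable edit belongs in `network-policy-s3-block.yaml` in Git, since Flux reconciles the CronJobs. A quick way to try the new schedule before committing it (a sketch; Flux will revert this on its next reconciliation):
```bash
kubectl patch cronjob -n longhorn-system longhorn-disable-s3-access \
  --type=merge -p '{"spec":{"schedule":"0 3 * * *"}}'
```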
## References
- [Longhorn Issue #1547 - Excessive S3 Calls](https://github.com/longhorn/longhorn/issues/1547)
- [Community NetworkPolicy Solution](https://github.com/longhorn/longhorn/issues/1547#issuecomment-3395447100)
- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)
- [Kubernetes NetworkPolicy Documentation](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
## Success Metrics
After 1 week of operation, you should observe:
- ✅ S3 API calls reduced by 85-93%
- ✅ Backblaze costs within free tier
- ✅ All scheduled backups completing successfully
- ✅ Zero manual intervention required
- ✅ Longhorn polls fail silently (network errors) outside backup window

View File

@@ -0,0 +1,200 @@
# Longhorn S3 API Call Reduction - Final Solution
## Problem Summary
Longhorn was making **145,000+ Class C API calls/day** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.
## Root Cause
Longhorn's `backupstore-poll-interval` setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.
## Solution History
### Attempt 1: NetworkPolicy-Based Access Control ❌
**Approach**: Use NetworkPolicies dynamically managed by CronJobs to block S3 access outside backup windows (12:55 AM - 4:00 AM).
**Why It Failed**:
- NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
- Longhorn manager pods couldn't perform leader election or webhook operations
- Pods entered 1/2 Ready state with errors: `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout`
- Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
- Cilium/NetworkPolicy interaction complexity made it unreliable
**Files Created** (kept for reference):
- `network-policy-s3-block.yaml` - CronJobs and NetworkPolicy definitions
- Removed from `kustomization.yaml` but retained in repository
## Final Solution: Increased Poll Interval ✅
### Implementation
**Change**: Set `backupstore-poll-interval` to `86400` seconds (24 hours) instead of `0`.
**Location**: `manifests/infrastructure/longhorn/config-map.yaml`
```yaml
data:
default-resource.yaml: |-
"backup-target": "s3://<BUCKET_NAME>@<B2_ENDPOINT>/longhorn-backup"
"backup-target-credential-secret": "backblaze-credentials"
"backupstore-poll-interval": "86400" # 24 hours
"virtual-hosted-style": "true"
```
### Why This Works
1. **Dramatic Reduction**: Polling happens once per day instead of continuously
2. **No Breakage**: Kubernetes API, webhooks, and leader election work normally
3. **Simple**: No complex NetworkPolicies or CronJobs to manage
4. **Reliable**: Well-tested Longhorn configuration option
5. **Sufficient**: Backups don't require frequent polling since we use scheduled recurring jobs
### Expected Results
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Poll Frequency** | Every ~5 seconds | Every 24 hours | **99.99% reduction** |
| **Daily S3 API Calls** | 145,000+ | ~300-1,000 | **99% reduction** 📉 |
| **Backblaze Costs** | Exceeds free tier | Within free tier | ✅ |
| **System Stability** | Affected by NetworkPolicy | Stable | ✅ |
## Current Status
- **Applied**: ConfigMap updated with `backupstore-poll-interval: 86400`
- **Verified**: Longhorn manager pods are 2/2 Ready
- **Backups**: Continue working normally via recurring jobs
- **Monitoring**: Backblaze API usage should drop to <1,000 calls/day
## Monitoring
### Check Longhorn Manager Health
```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods
```
### Check Poll Interval Setting
```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"
```
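If the value does not appear to take effect, the live Setting CR shows what the managers are actually using (a sketch, assuming the Setting is named `backupstore-poll-interval`):
```bash
kubectl get settings.longhorn.io -n longhorn-system backupstore-poll-interval \
  -o jsonpath='{.value}{"\n"}'
# Should print: 86400
```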
### Check Backups Continue Working
```bash
kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status
```
### Monitor Backblaze API Usage
1. Log into Backblaze B2 dashboard
2. Navigate to "Caps and Alerts"
3. Check "Class C Transactions" (includes `s3_list_objects`)
4. **Expected**: Should drop from 145,000/day to ~300-1,000/day within 24-48 hours
## Backup Schedule (Unchanged)
| Type | Schedule | Retention |
|------|----------|-----------|
| **Daily** | 2:00 AM | 7 days |
| **Weekly** | 1:00 AM Sundays | 4 weeks |
Backups are triggered by `RecurringJob` resources, not by polling.
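To see which volumes are actually enrolled in the daily backup group, the group label on the Longhorn Volume CRs can be used as a selector. A sketch, assuming the `longhorn-s3-backup` group name used by the recurring jobs in this repo:
```bash
kubectl get volumes.longhorn.io -n longhorn-system \
  -l recurring-job-group.longhorn.io/longhorn-s3-backup=enabled
```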
## Why Polling Isn't Critical
**Longhorn's backupstore polling is primarily for**:
- Disaster Recovery (DR) volumes that need continuous sync
- Detecting backups created outside the cluster
**We don't use DR volumes**, and all backups are created by recurring jobs within the cluster, so:
- ✅ Once-daily polling is more than sufficient
- ✅ Backups work independently of polling frequency
- ✅ Manual backups via Longhorn UI still work immediately
## Troubleshooting
### If Pods Show 1/2 Ready
**Symptom**: Longhorn manager pods stuck at 1/2 Ready
**Cause**: NetworkPolicy may have been accidentally applied
**Solution**:
```bash
# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system
# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access
# Wait 30 seconds
sleep 30
# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager
```
### If S3 API Calls Remain High
**Check poll interval is applied**:
```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml
```
**Restart Longhorn managers to pick up changes**:
```bash
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
```
### If Backups Fail
Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:
```bash
# Check recurring jobs
kubectl get recurringjobs -n longhorn-system
# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup
# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://<B2_ENDPOINT>
```
## References
- [Longhorn Issue #1547](https://github.com/longhorn/longhorn/issues/1547) - Original excessive S3 calls issue
- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)
- Longhorn version: v1.9.0
## Files Modified
1. `config-map.yaml` - Updated `backupstore-poll-interval` to 86400
2. `kustomization.yaml` - Removed network-policy-s3-block.yaml reference
3. `network-policy-s3-block.yaml` - Retained for reference (not applied)
4. `S3-API-SOLUTION-FINAL.md` - This document
## Lessons Learned
1. **NetworkPolicies are tricky**: Blocking external traffic can inadvertently block internal cluster communication
2. **Start simple**: Configuration-based solutions are often more reliable than complex automation
3. **Test thoroughly**: Always verify pods remain healthy after applying NetworkPolicies
4. **Understand the feature**: Longhorn's polling is for DR volumes, which we don't use
5. **24-hour polling is sufficient**: For non-DR use cases, frequent polling isn't necessary
## Success Metrics
Monitor these over the next week:
- ✅ Longhorn manager pods: 2/2 Ready
- ✅ Daily backups: Completing successfully
- ✅ S3 API calls: <1,000/day (down from 145,000)
- ✅ Backblaze costs: Within free tier
- ✅ No manual intervention required

View File

@@ -0,0 +1,41 @@
apiVersion: v1
kind: Secret
metadata:
name: backblaze-credentials
namespace: longhorn-system
type: Opaque
stringData:
AWS_ACCESS_KEY_ID: ENC[AES256_GCM,data:OGCSNVoeABeigczChYkRTKjIsjEYDA+cNA==,iv:So6ipxl+te3LkPbtyOwixnvv4DPbzl0yCGT8cqPgPbY=,tag:ApaM+bBqi9BJU/EVraKWrQ==,type:str]
AWS_SECRET_ACCESS_KEY: ENC[AES256_GCM,data:EMFNPCdt/V+2d4xnVARNTBBpY3UTqvpN3LezT/TZ7w==,iv:Q5pNnuKX+lUt/V4xpgF2Zg1q6e1znvG+laDNrLIrgBY=,tag:xGF/SvAJ9+tfuB7QdirAhw==,type:str]
AWS_ENDPOINTS: ENC[AES256_GCM,data:PSiRbt53KKK5XOOxIEiiycaFTriaJbuY0Z4Q9yC1xTwz9H/+hoOQ35w=,iv:pGwbR98F5C4N9Vca9btaJ9mKVS7XUkL8+Pva7TWTeTk=,tag:PxFllLIjj+wXDSXGuU/oLA==,type:str]
VIRTUAL_HOST_STYLE: ENC[AES256_GCM,data:a9RJ2Q==,iv:1VSTWiv1WFia0rgwkoZ9WftaLDdKtJabwiyY90AWvNY=,tag:tQZDFjqAABueZJ4bjD2PfA==,type:str]
sops:
lastmodified: "2025-06-30T18:44:50Z"
mac: ENC[AES256_GCM,data:5cdqJQiwoFwWfaNjtqNiaD5sY31979cdS4R6vBmNIKqd7ZaCMJLEKBm5lCLF7ow3+V17pxGhVu4EXX+rKVaNu6Qs6ivXtVM+kA0RutqPFnWDVfoZcnuW98IBjpyh4i9Y6Dra8zSda++Dt2R7Frouc/7lT74ANZYmSRN9WCYsTNg=,iv:s9c+YDDxAUdjWlzsx5jALux2UW5dtg56Pfi3FF4K0lU=,tag:U9bTTOZaqQ9lekpsIbUkWA==,type:str]
pgp:
- created_at: "2025-06-30T18:44:50Z"
enc: |-
-----BEGIN PGP MESSAGE-----
hF4DZT3mpHTS/JgSAQdAbJ88Og3rBkHDPJXf04xSp79A1rfXUDwsP2Wzz0rgI2ww
67XRMSSu2nUApEk08vf1ZF5ulewMQbnVjDDqvM8+BcgELllZVhnNW09NzMb5uPD+
1GgBCQIQXzEZTIi11OR5Z44vLkU64tF+yAPzA6j6y0lyemabOJLDB/XJiV/nq57h
+Udy8rg3sAmZt6FmBiTssKpxy6C6nFFSHVnTY7RhKg9p87AYKz36bSUI7TRhjZGb
f9U9EUo09Zh4JA==
=6fMP
-----END PGP MESSAGE-----
fp: B120595CA9A643B051731B32E67FF350227BA4E8
- created_at: "2025-06-30T18:44:50Z"
enc: |-
-----BEGIN PGP MESSAGE-----
hF4DSXzd60P2RKISAQdAPYpP5mUd4lVstNeGURyFoXbfPbaSH+IlSxgrh/wBfCEw
oI6DwAxkRAxLRwptJoQA9zU+N6LRN+o5kcHLMG/eNnUyNdAfNg17fs16UXf5N2Gi
1GgBCQIQRcLoTo+r7TyUUTxtPGIrQ7c5jy7WFRzm25XqLuvwTYipDTbQC5PyZu5R
4zFgx4ZfDayB3ldPMoAHZ8BeB2VTiQID+HRQGGbSSCM7U+HvzSXNuapNSGXpfWEA
qShkjhXz1sF7JQ==
=UqeC
-----END PGP MESSAGE-----
fp: 4A8AADB4EBAB9AF88EF7062373CECE06CC80D40C
encrypted_regex: ^(data|stringData)$
version: 3.10.2

View File

@@ -0,0 +1,78 @@
# Examples of how to apply S3 backup recurring jobs to volumes
# These are examples - you would apply these patterns to your actual PVCs/StorageClasses
---
# Example 1: Apply backup labels to an existing PVC
# This requires the PVC to be labeled as a recurring job source first
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: example-app-data
namespace: default
labels:
# Enable this PVC as a source for recurring job labels
recurring-job.longhorn.io/source: "enabled"
# Apply daily backup job group
recurring-job-group.longhorn.io/longhorn-s3-backup: "enabled"
# OR apply weekly backup job group (choose one)
# recurring-job-group.longhorn.io/longhorn-s3-backup-weekly: "enabled"
# OR apply specific recurring job by name
# recurring-job.longhorn.io/s3-backup-daily: "enabled"
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: longhorn
---
# Example 2: StorageClass with automatic backup assignment
# Any PVC created with this StorageClass will automatically get backups
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-backup-daily
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: Immediate
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "30"
fromBackup: ""
# Automatically assign backup jobs to volumes created with this StorageClass
recurringJobSelector: |
[
{
"name":"longhorn-s3-backup",
"isGroup":true
}
]
---
# Example 3: StorageClass for critical data with both daily and weekly backups
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-backup-critical
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: Immediate
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "30"
fromBackup: ""
# Assign both daily and weekly backup groups
recurringJobSelector: |
[
{
"name":"longhorn-s3-backup",
"isGroup":true
},
{
"name":"longhorn-s3-backup-weekly",
"isGroup":true
}
]
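---
# Example 4 (sketch): an existing PVC can also be enrolled imperatively instead
# of editing its manifest, assuming the same group name as the examples above:
#
#   kubectl label pvc example-app-data -n default \
#     recurring-job.longhorn.io/source=enabled \
#     recurring-job-group.longhorn.io/longhorn-s3-backup=enabled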

View File

@@ -0,0 +1,37 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: longhorn-default-resource
namespace: longhorn-system
data:
default-resource.yaml: ENC[AES256_GCM,data:vw2doEgVQYr1p9vHN9MLqoOSVM8LDBeowAvs2zOkwmGPue8QLxkxxpaFRy2zJH9igjXn30h1dsukmSZBfD9Y3cwrRcvuEZRMo3IsAJ6M1G/oeVpKc14Rll6/V48ZXPiB9qfn1upmUbJtl1EMyPc3vUetUD37fI81N3x4+bNK2OB6V8yGczuE3bJxIi4vV/Zay83Z3s0VyNRF4y18R3T0200Ib5KomANAZUMSCxKvjv4GOKHGYTVE5+C4LFxeOnPgmAtjV4x+lKcNCD1saNZ56yhVzsKVJClLdaRtIQ==,iv:s3OyHFQxd99NGwjXxHqa8rs9aYsl1vf+GCLNtvZ9nuc=,tag:2n8RLcHmp9ueKNm12MxjxQ==,type:str]
sops:
lastmodified: "2025-11-12T10:07:54Z"
mac: ENC[AES256_GCM,data:VBxywwWrVnKiyby+FzCdUlI89OkruNh1jyFE3cVXU/WR4FoCWclDSQ8v0FxT+/mS1/0eTX9XAXVIyqtzpAUU3YY3znq2CU8qsZa45B2PlPQP+7qGNBcyrpZZCsJxTYO/+jxr/9gV4pAJV27HFnyYfZDVZxArLUWQs32eJSdOfpc=,iv:7lbZjWhSEX7NisarWxCAAvw3+8v6wadq3/chrjWk2GQ=,tag:9AZyEuo7omdCbtRJ3YDarA==,type:str]
pgp:
- created_at: "2025-11-09T13:37:18Z"
enc: |-
-----BEGIN PGP MESSAGE-----
hF4DZT3mpHTS/JgSAQdAYMBTNc+JasEkeJpsS1d8OQ6iuhRTULXvFrGEia7gLXkw
+TRNuC4ZH+Lxmb5s3ImRX9dF1cMXoMGUCWJN/bScm5cLElNd2dHrtFoElVjn4/vI
1GgBCQIQ4jPpbQJym+xU5jS5rN3dtW6U60IYxX5rPvh0294bxgOzIIqI/oI/0qak
C4EYFsfH9plAOmvF56SnFX0PSczBjyUlngJ36NFHMN3any7qW/C0tYXFF3DDiOC3
kpa/moMr5CNTnQ==
=xVwB
-----END PGP MESSAGE-----
fp: B120595CA9A643B051731B32E67FF350227BA4E8
- created_at: "2025-11-09T13:37:18Z"
enc: |-
-----BEGIN PGP MESSAGE-----
hF4DSXzd60P2RKISAQdA9omTE+Cuy7BvMA8xfqsZv2o+Jh3QvOL+gZY/Z5CuVgIw
IBgwiVypHqwDf8loCVIdlo1/h5gctj/t11cxb2hKNRGQ0kFNLdpu5Mx+RbJZ/az/
1GgBCQIQB/gKeYbAqSxrJMKl/Q+6PfAXTAjH33K8IlDQKbF8q3QvoQDJJU3i0XwQ
ljhWRC/RZzO7hHXJqkR9z5sVIysHoEo+O9DZ0OzefjKb+GscdgSwJwGgsZzrVRXP
kSLdNO0eE5ubMQ==
=O/Lu
-----END PGP MESSAGE-----
fp: 4A8AADB4EBAB9AF88EF7062373CECE06CC80D40C
encrypted_regex: ^(data|stringData)$
version: 3.10.2

View File

@@ -0,0 +1,11 @@
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- longhorn.yaml
- storageclass.yaml
- backblaze-secret.yaml
- config-map.yaml
- recurring-job-s3-backup.yaml
- network-policy-s3-block.yaml

View File

@@ -0,0 +1,64 @@
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: longhorn-repo
namespace: longhorn-system
spec:
interval: 5m0s
url: https://charts.longhorn.io
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: longhorn-release
namespace: longhorn-system
spec:
interval: 5m
chart:
spec:
chart: longhorn
version: v1.10.0
sourceRef:
kind: HelmRepository
name: longhorn-repo
namespace: longhorn-system
interval: 1m
values:
# Use hotfixed longhorn-manager image
image:
longhorn:
manager:
tag: v1.10.0-hotfix-1
defaultSettings:
defaultDataPath: /var/mnt/longhorn-storage
defaultReplicaCount: "2"
replicaNodeLevelSoftAntiAffinity: true
allowVolumeCreationWithDegradedAvailability: false
guaranteedInstanceManagerCpu: 5
createDefaultDiskLabeledNodes: true
# Multi-node optimized settings
storageMinimalAvailablePercentage: "20"
storageReservedPercentageForDefaultDisk: "15"
storageOverProvisioningPercentage: "200"
# Single replica for UI
service:
ui:
type: ClusterIP
# Longhorn UI replica count
longhornUI:
replicas: 1
# Enable metrics collection
metrics:
serviceMonitor:
enabled: true
longhornManager:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists
longhornDriver:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists

View File

@@ -0,0 +1,8 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: longhorn-system
labels:
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/enforce-version: latest

View File

@@ -0,0 +1,211 @@
---
# Longhorn S3 Access Control via NetworkPolicy
#
# NetworkPolicy that blocks external S3 access by default, with CronJobs to
# automatically remove it during backup windows (12:55 AM - 4:00 AM).
#
# Network Details:
# - Pod CIDR: 10.244.0.0/16 (within 10.0.0.0/8)
# - Service CIDR: 10.96.0.0/12 (within 10.0.0.0/8)
# - VLAN Network: 10.132.0.0/24 (within 10.0.0.0/8)
#
# How It Works:
# - NetworkPolicy is applied by default, blocking external S3 (Backblaze B2)
# - CronJob removes NetworkPolicy at 12:55 AM (5 min before earliest backup at 1 AM)
# - CronJob reapplies NetworkPolicy at 4:00 AM (after backup window closes)
# - Allows all internal cluster traffic (10.0.0.0/8) while blocking external S3
#
# Backup Schedule:
# - Daily backups: 2:00 AM
# - Weekly backups: 1:00 AM Sundays
# - Backup window: 12:55 AM - 4:00 AM (3 hours 5 minutes)
#
# See: BACKUP-GUIDE.md and S3-API-SOLUTION-FINAL.md for full documentation
---
# NetworkPolicy: Blocks S3 access by default
# This is applied initially, then managed by CronJobs below
# Using CiliumNetworkPolicy for better API server support via toEntities
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: longhorn-block-s3-access
namespace: longhorn-system
labels:
app: longhorn
purpose: s3-access-control
spec:
description: "Block external S3 access while allowing internal cluster communication"
endpointSelector:
matchLabels:
app: longhorn-manager
egress:
# Allow DNS to kube-system namespace
- toEndpoints:
- matchLabels:
k8s-app: kube-dns
toPorts:
- ports:
- port: "53"
protocol: UDP
- port: "53"
protocol: TCP
# Explicitly allow Kubernetes API server (critical for Longhorn)
# Cilium handles this specially - kube-apiserver entity is required
- toEntities:
- kube-apiserver
# Allow all internal cluster traffic (10.0.0.0/8)
# This includes:
# - Pod CIDR: 10.244.0.0/16
# - Service CIDR: 10.96.0.0/12 (API server already covered above)
# - VLAN Network: 10.132.0.0/24
# - All other internal 10.x.x.x addresses
- toCIDR:
- 10.0.0.0/8
# Allow pod-to-pod communication within cluster
# The 10.0.0.0/8 CIDR block above covers all pod-to-pod communication
# This explicit rule ensures instance-manager pods are reachable
- toEntities:
- cluster
# Block all other egress (including external S3 like Backblaze B2)
---
# RBAC for CronJobs that manage the NetworkPolicy
apiVersion: v1
kind: ServiceAccount
metadata:
name: longhorn-netpol-manager
namespace: longhorn-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: longhorn-netpol-manager
namespace: longhorn-system
rules:
- apiGroups: ["cilium.io"]
resources: ["ciliumnetworkpolicies"]
verbs: ["get", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: longhorn-netpol-manager
namespace: longhorn-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: longhorn-netpol-manager
subjects:
- kind: ServiceAccount
name: longhorn-netpol-manager
namespace: longhorn-system
---
# CronJob: Remove NetworkPolicy before backups (12:55 AM daily)
# This allows S3 access during the backup window
apiVersion: batch/v1
kind: CronJob
metadata:
name: longhorn-enable-s3-access
namespace: longhorn-system
labels:
app: longhorn
purpose: s3-access-control
spec:
# Run at 12:55 AM daily (5 minutes before earliest backup at 1:00 AM Sunday weekly)
schedule: "55 0 * * *"
successfulJobsHistoryLimit: 2
failedJobsHistoryLimit: 2
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
metadata:
labels:
app: longhorn-netpol-manager
spec:
serviceAccountName: longhorn-netpol-manager
restartPolicy: OnFailure
containers:
- name: delete-netpol
image: bitnami/kubectl:latest
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- |
echo "Removing CiliumNetworkPolicy to allow S3 access for backups..."
kubectl delete ciliumnetworkpolicy longhorn-block-s3-access -n longhorn-system --ignore-not-found=true
echo "S3 access enabled. Backups can proceed."
---
# CronJob: Re-apply NetworkPolicy after backups (4:00 AM daily)
# This blocks S3 access after the backup window closes
apiVersion: batch/v1
kind: CronJob
metadata:
name: longhorn-disable-s3-access
namespace: longhorn-system
labels:
app: longhorn
purpose: s3-access-control
spec:
# Run at 4:00 AM daily (gives 3 hours 5 minutes for backups to complete)
schedule: "0 4 * * *"
successfulJobsHistoryLimit: 2
failedJobsHistoryLimit: 2
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
metadata:
labels:
app: longhorn-netpol-manager
spec:
serviceAccountName: longhorn-netpol-manager
restartPolicy: OnFailure
containers:
- name: create-netpol
image: bitnami/kubectl:latest
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- |
echo "Re-applying CiliumNetworkPolicy to block S3 access..."
kubectl apply -f - <<EOF
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: longhorn-block-s3-access
namespace: longhorn-system
labels:
app: longhorn
purpose: s3-access-control
spec:
description: "Block external S3 access while allowing internal cluster communication"
endpointSelector:
matchLabels:
app: longhorn-manager
egress:
# Allow DNS to kube-system namespace
- toEndpoints:
- matchLabels:
k8s-app: kube-dns
toPorts:
- ports:
- port: "53"
protocol: UDP
- port: "53"
protocol: TCP
# Explicitly allow Kubernetes API server (critical for Longhorn)
- toEntities:
- kube-apiserver
# Allow all internal cluster traffic (10.0.0.0/8)
- toCIDR:
- 10.0.0.0/8
# Allow pod-to-pod communication within cluster
# The 10.0.0.0/8 CIDR block above covers all pod-to-pod communication
- toEntities:
- cluster
# Block all other egress (including external S3)
EOF
echo "S3 access blocked. Polling stopped until next backup window."

View File

@@ -0,0 +1,34 @@
---
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: s3-backup-daily
namespace: longhorn-system
spec:
cron: "0 2 * * *" # Daily at 2 AM
task: "backup"
groups:
- longhorn-s3-backup
retain: 7 # Keep 7 daily backups
concurrency: 2 # Max 2 concurrent backup jobs
labels:
recurring-job: "s3-backup-daily"
backup-type: "daily"
---
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: s3-backup-weekly
namespace: longhorn-system
spec:
cron: "0 1 * * 0" # Weekly on Sunday at 1 AM
task: "backup"
groups:
- longhorn-s3-backup-weekly
retain: 4 # Keep 4 weekly backups
concurrency: 1 # Only 1 concurrent weekly backup
labels:
recurring-job: "s3-backup-weekly"
backup-type: "weekly"
parameters:
full-backup-interval: "1" # Full backup every other week (alternating full/incremental)

View File

@@ -0,0 +1,81 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-retain
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "xfs"
dataLocality: "best-effort"
reclaimPolicy: Retain
volumeBindingMode: Immediate
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-delete
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "xfs"
dataLocality: "best-effort"
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-single-delete
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "1"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "xfs"
dataLocality: "best-effort"
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
# Redis-specific StorageClass
# Single replica as Redis handles replication at application level
# Note: volumeBindingMode is immutable after creation
# If this StorageClass already exists with matching configuration, Flux reconciliation
# may show an error but it's harmless - the existing StorageClass will continue to work.
# For new clusters, this will be created correctly.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-redis
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
# Single replica as Redis handles replication at application level
numberOfReplicas: "1"
staleReplicaTimeout: "2880"
fsType: "xfs" # xfs to match existing Longhorn volumes
dataLocality: "strict-local" # Keep Redis data local to node
# Integrate with existing S3 backup infrastructure
recurringJobSelector: |
[
{
"name":"longhorn-s3-backup",
"isGroup":true
}
]
reclaimPolicy: Delete
volumeBindingMode: Immediate