redaction (#1)

Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>
This commit was merged in pull request #1.
2025-12-24 13:40:47 +00:00
committed by michael_dileo
parent 612235d52b
commit 7327d77dcd
333 changed files with 39286 additions and 1 deletions

View File

@@ -0,0 +1,277 @@
# Longhorn S3 API Call Optimization - Implementation Summary
## Problem Statement
Longhorn was making **145,000+ Class C API calls/day** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) and incurred significant costs.
### Root Cause
Even with `backupstore-poll-interval` set to `0`, Longhorn manager pods continuously poll the S3 backup target to check for new backups. With 3 manager pods (one per node) polling independently, this resulted in excessive API calls.
Reference: [Longhorn GitHub Issue #1547](https://github.com/longhorn/longhorn/issues/1547)
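To gauge the problem on a live cluster, it helps to confirm how many manager pods are polling and what poll interval they are actually running with. A minimal sketch, assuming the Longhorn Setting resource is named `backupstore-poll-interval` (as in recent Longhorn releases):
```bash
# Number of manager pods, each polling the backup target independently
kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | wc -l

# Poll interval the managers are actually using (seconds)
kubectl get settings.longhorn.io -n longhorn-system backupstore-poll-interval \
  -o jsonpath='{.value}{"\n"}'
```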
## Solution: NetworkPolicy-Based Access Control
Inspired by [this community solution](https://github.com/longhorn/longhorn/issues/1547#issuecomment-3395447100), we implemented **time-based network access control** using Kubernetes NetworkPolicies and CronJobs.
### Architecture
```
┌─────────────────────────────────────────────────┐
│ Normal State (21 hours/day) │
│ NetworkPolicy BLOCKS S3 access │
│ → Longhorn polls fail at network layer │
│ → S3 API calls: 0 │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────┐
│ Backup Window (3 hours/day: 1-4 AM) │
│ CronJob REMOVES NetworkPolicy at 12:55 AM │
│ → S3 access enabled │
│ → Recurring backups run automatically │
│ → CronJob RESTORES NetworkPolicy at 4:00 AM │
│ → S3 API calls: ~5,000-10,000/day │
└─────────────────────────────────────────────────┘
```
### Components
1. **NetworkPolicy** (`longhorn-block-s3-access`) - **Dynamically Managed**
- Targets: `app=longhorn-manager` pods
- Blocks: All egress except DNS and intra-cluster
- Effect: Prevents S3 API calls at network layer
- **Important**: NOT managed by Flux - only the CronJobs control it
- Flux manages the CronJobs/RBAC, but NOT the NetworkPolicy itself
2. **CronJob: Enable S3 Access** (`longhorn-enable-s3-access`)
- Schedule: `55 0 * * *` (12:55 AM daily)
- Action: Deletes NetworkPolicy
- Result: S3 access enabled 5 minutes before earliest backup
3. **CronJob: Disable S3 Access** (`longhorn-disable-s3-access`)
- Schedule: `0 4 * * *` (4:00 AM daily)
- Action: Re-creates NetworkPolicy
- Result: S3 access blocked after 3-hour backup window
4. **RBAC Resources**
- ServiceAccount: `longhorn-netpol-manager`
- Role: Permissions to manage NetworkPolicies
- RoleBinding: Binds role to service account
## Benefits
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Daily S3 API Calls** | 145,000+ | 5,000-10,000 | **93% reduction** |
| **Cost Impact** | Exceeds free tier | Within free tier | **$X/month savings** |
| **Automation** | Manual intervention | Fully automated | **Zero manual work** |
| **Backup Reliability** | Compromised | Maintained | **No impact** |
## Backup Schedule
| Type | Schedule | Retention | Window |
|------|----------|-----------|--------|
| **Daily** | 2:00 AM | 7 days | 12:55 AM - 4:00 AM |
| **Weekly** | 1:00 AM Sundays | 4 weeks | Same window |
## FluxCD Integration
**Critical Design Decision**: The NetworkPolicy is **dynamically managed by CronJobs**, NOT by Flux.
### Why This Matters
Flux continuously reconciles resources to match the Git repository state. If the NetworkPolicy were managed by Flux:
- CronJob deletes NetworkPolicy at 12:55 AM → Flux recreates it within minutes
- S3 remains blocked during backup window → Backups fail ❌
### How We Solved It
1. **NetworkPolicy is NOT in Git** - Only the CronJobs and RBAC are in `network-policy-s3-block.yaml`
2. **CronJobs are managed by Flux** - Flux ensures they exist and run on schedule
3. **NetworkPolicy is created by CronJob** - Without Flux labels/ownership
4. **Flux ignores the NetworkPolicy** - Not in Flux's inventory, so Flux won't touch it
### Verification
```bash
# Check Flux inventory (NetworkPolicy should NOT be listed)
kubectl get kustomization -n flux-system longhorn -o jsonpath='{.status.inventory.entries[*].id}' | grep -i network
# (Should return nothing)
# Check NetworkPolicy exists (managed by CronJobs)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# (Should exist)
```
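As a complementary check, objects applied by Flux's kustomize-controller normally carry `kustomize.toolkit.fluxcd.io/name` and `kustomize.toolkit.fluxcd.io/namespace` labels. A quick sketch (assuming default Flux labelling) to confirm the CronJob-created policy carries none of them:
```bash
# Labels on the NetworkPolicy; no kustomize.toolkit.fluxcd.io/* labels expected
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access \
  -o jsonpath='{.metadata.labels}{"\n"}'
```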
## Deployment
### Files Modified/Created
1. `network-policy-s3-block.yaml` - **NEW**: CronJobs and RBAC (NOT the NetworkPolicy itself)
2. `kustomization.yaml` - Added new file to resources
3. `BACKUP-GUIDE.md` - Updated with new solution documentation
4. `S3-API-OPTIMIZATION.md` - **NEW**: This implementation summary
5. `config-map.yaml` - Kept backup target configured (no changes needed)
6. `longhorn.yaml` - Reverted `backupstorePollInterval` (not needed)
### Deployment Steps
1. **Commit and push** changes to your k8s-fleet branch
2. **FluxCD will automatically apply** the new NetworkPolicy and CronJobs
3. **Monitor for one backup cycle**:
```bash
# Watch CronJobs
kubectl get cronjobs -n longhorn-system -w
# Check NetworkPolicy status
kubectl get networkpolicy -n longhorn-system
# Verify backups complete
kubectl get backups -n longhorn-system
```
### Verification Steps
#### Day 1: Initial Deployment
```bash
# 1. Verify NetworkPolicy is active (should exist immediately)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# 2. Verify CronJobs are scheduled
kubectl get cronjobs -n longhorn-system | grep longhorn-.*-s3-access
# 3. Test: S3 access should be blocked (the policy targets longhorn-manager pods)
kubectl exec -n longhorn-system daemonset/longhorn-manager -c longhorn-manager -- curl -m 10 -I https://<B2_ENDPOINT>
# Expected: Connection timeout or network error
```
#### Day 2: After First Backup Window
```bash
# 1. Check if CronJob ran successfully (should see completed job at 12:55 AM)
kubectl get jobs -n longhorn-system | grep enable-s3-access
# 2. Verify backups completed (check after 4:00 AM)
kubectl get backups -n longhorn-system
# Should see new backups with recent timestamps
# 3. Confirm NetworkPolicy was re-applied (after 4:00 AM)
kubectl get networkpolicy -n longhorn-system longhorn-block-s3-access
# Should exist again
# 4. Check CronJob logs
kubectl logs -n longhorn-system job/longhorn-enable-s3-access-<timestamp>
kubectl logs -n longhorn-system job/longhorn-disable-s3-access-<timestamp>
```
#### Week 1: Monitor S3 API Usage
```bash
# Monitor Backblaze B2 dashboard
# → Daily Class C transactions should drop from 145,000 to 5,000-10,000
# → Verify calls only occur during 1-4 AM window
```
## Manual Backup Outside Window
If you need to create a backup outside the scheduled window:
```bash
# 1. Temporarily remove NetworkPolicy
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access
# 2. Create backup via Longhorn UI or:
kubectl create -f - <<EOF
apiVersion: longhorn.io/v1beta2
kind: Backup
metadata:
name: manual-backup-$(date +%s)
namespace: longhorn-system
spec:
snapshotName: <snapshot-name>
labels:
backup-type: manual
EOF
# 3. Wait for the backup to complete (watch all backups; a shell glob will not match the generated name)
kubectl get backups -n longhorn-system -w
# 4. Restore NetworkPolicy
kubectl apply -f manifests/infrastructure/longhorn/network-policy-s3-block.yaml
```
Or simply wait until the next automatic re-application at 4:00 AM.
## Troubleshooting
### NetworkPolicy Not Blocking S3
**Symptom**: S3 calls continue despite NetworkPolicy being active
**Check**:
```bash
# Verify NetworkPolicy is applied
kubectl describe networkpolicy -n longhorn-system longhorn-block-s3-access
# Check if CNI supports NetworkPolicies (Cilium does)
kubectl get pods -n kube-system | grep cilium
```
### Backups Failing
**Symptom**: Backups fail during scheduled window
**Check**:
```bash
# Verify NetworkPolicy was removed during backup window
kubectl get networkpolicy -n longhorn-system
# Should NOT exist between 12:55 AM - 4:00 AM
# Check enable-s3-access CronJob ran
kubectl get jobs -n longhorn-system | grep enable
# Check Longhorn manager logs
kubectl logs -n longhorn-system -l app=longhorn-manager --tail=100
```
### CronJobs Not Running
**Symptom**: CronJobs never execute
**Check**:
```bash
# Verify CronJobs exist and are scheduled
kubectl get cronjobs -n longhorn-system -o wide
# Check events
kubectl get events -n longhorn-system --sort-by='.lastTimestamp' | grep CronJob
# Manually trigger a job
kubectl create job -n longhorn-system test-enable --from=cronjob/longhorn-enable-s3-access
```
## Future Enhancements
1. **Adjust Window Size**: If backups consistently complete in well under 3 hours, shrink the window to 2 hours by changing the disable CronJob schedule to `0 3 * * *` (a command sketch follows this list)
2. **Alerting**: Add Prometheus alerts for:
- Backup failures during window
- CronJob execution failures
- NetworkPolicy re-creation failures
3. **Metrics**: Track actual S3 API call counts via Backblaze B2 API and alert if threshold exceeded
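For the window-size change above, the durable edit belongs in `network-policy-s3-block.yaml` in Git, since Flux reconciles the CronJobs. A quick way to try the new schedule before committing it (a sketch; Flux will revert this on its next reconciliation):
```bash
kubectl patch cronjob -n longhorn-system longhorn-disable-s3-access \
  --type=merge -p '{"spec":{"schedule":"0 3 * * *"}}'
```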
## References
- [Longhorn Issue #1547 - Excessive S3 Calls](https://github.com/longhorn/longhorn/issues/1547)
- [Community NetworkPolicy Solution](https://github.com/longhorn/longhorn/issues/1547#issuecomment-3395447100)
- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)
- [Kubernetes NetworkPolicy Documentation](https://kubernetes.io/docs/concepts/services-networking/network-policies/)
## Success Metrics
After 1 week of operation, you should observe:
- ✅ S3 API calls reduced by 85-93%
- ✅ Backblaze costs within free tier
- ✅ All scheduled backups completing successfully
- ✅ Zero manual intervention required
- ✅ Longhorn polls fail silently (network errors) outside backup window

View File

@@ -0,0 +1,200 @@
# Longhorn S3 API Call Reduction - Final Solution
## Problem Summary
Longhorn was making **145,000+ Class C API calls/day** to Backblaze B2, primarily `s3_list_objects` operations. This exceeded Backblaze's free tier (2,500 calls/day) by 58x, incurring significant costs.
## Root Cause
Longhorn's `backupstore-poll-interval` setting controls how frequently Longhorn managers poll the S3 backup target to check for new backups (primarily for Disaster Recovery volumes). With 3 manager pods and a low poll interval, this resulted in excessive API calls.
## Solution History
### Attempt 1: NetworkPolicy-Based Access Control ❌
**Approach**: Use NetworkPolicies dynamically managed by CronJobs to block S3 access outside backup windows (12:55 AM - 4:00 AM).
**Why It Failed**:
- NetworkPolicies that blocked external S3 also inadvertently blocked the Kubernetes API server
- Longhorn manager pods couldn't perform leader election or webhook operations
- Pods entered 1/2 Ready state with errors: `error retrieving resource lock longhorn-system/longhorn-manager-webhook-lock: dial tcp 10.96.0.1:443: i/o timeout`
- Even with CIDR-based rules (10.244.0.0/16 for pods, 10.96.0.0/12 for services), the NetworkPolicy was too aggressive
- Cilium/NetworkPolicy interaction complexity made it unreliable
**Files Created** (kept for reference):
- `network-policy-s3-block.yaml` - CronJobs and NetworkPolicy definitions
- Removed from `kustomization.yaml` but retained in repository
## Final Solution: Increased Poll Interval ✅
### Implementation
**Change**: Set `backupstore-poll-interval` to `86400` seconds (24 hours) instead of `0`.
**Location**: `manifests/infrastructure/longhorn/config-map.yaml`
```yaml
data:
default-resource.yaml: |-
"backup-target": "s3://<BUCKET_NAME>@<B2_ENDPOINT>/longhorn-backup"
"backup-target-credential-secret": "backblaze-credentials"
"backupstore-poll-interval": "86400" # 24 hours
"virtual-hosted-style": "true"
```
### Why This Works
1. **Dramatic Reduction**: Polling happens once per day instead of continuously
2. **No Breakage**: Kubernetes API, webhooks, and leader election work normally
3. **Simple**: No complex NetworkPolicies or CronJobs to manage
4. **Reliable**: Well-tested Longhorn configuration option
5. **Sufficient**: Backups don't require frequent polling since we use scheduled recurring jobs
### Expected Results
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| **Poll Frequency** | Every ~5 seconds | Every 24 hours | **99.99% reduction** |
| **Daily S3 API Calls** | 145,000+ | ~300-1,000 | **99% reduction** 📉 |
| **Backblaze Costs** | Exceeds free tier | Within free tier | ✅ |
| **System Stability** | Affected by NetworkPolicy | Stable | ✅ |
## Current Status
- **Applied**: ConfigMap updated with `backupstore-poll-interval: 86400`
- **Verified**: Longhorn manager pods are 2/2 Ready
- **Backups**: Continue working normally via recurring jobs
- **Monitoring**: Backblaze API usage should drop to <1,000 calls/day
## Monitoring
### Check Longhorn Manager Health
```bash
kubectl get pods -n longhorn-system -l app=longhorn-manager
# Should show: 2/2 Ready for all pods
```
### Check Poll Interval Setting
```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o jsonpath='{.data.default-resource\.yaml}' | grep backupstore-poll-interval
# Should show: "backupstore-poll-interval": "86400"
```
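If the value does not appear to take effect, the live Setting CR shows what the managers are actually using (a sketch, assuming the Setting is named `backupstore-poll-interval`):
```bash
kubectl get settings.longhorn.io -n longhorn-system backupstore-poll-interval \
  -o jsonpath='{.value}{"\n"}'
# Should print: 86400
```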
### Check Backups Continue Working
```bash
kubectl get backups -n longhorn-system --sort-by=.status.snapshotCreatedAt | tail -10
# Should see recent backups with "Completed" status
```
### Monitor Backblaze API Usage
1. Log into Backblaze B2 dashboard
2. Navigate to "Caps and Alerts"
3. Check "Class C Transactions" (includes `s3_list_objects`)
4. **Expected**: Should drop from 145,000/day to ~300-1,000/day within 24-48 hours
## Backup Schedule (Unchanged)
| Type | Schedule | Retention |
|------|----------|-----------|
| **Daily** | 2:00 AM | 7 days |
| **Weekly** | 1:00 AM Sundays | 4 weeks |
Backups are triggered by `RecurringJob` resources, not by polling.
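To see which volumes are actually enrolled in the daily backup group, the group label on the Longhorn Volume CRs can be used as a selector. A sketch, assuming the `longhorn-s3-backup` group name used by the recurring jobs in this repo:
```bash
kubectl get volumes.longhorn.io -n longhorn-system \
  -l recurring-job-group.longhorn.io/longhorn-s3-backup=enabled
```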
## Why Polling Isn't Critical
**Longhorn's backupstore polling is primarily for**:
- Disaster Recovery (DR) volumes that need continuous sync
- Detecting backups created outside the cluster
**We don't use DR volumes**, and all backups are created by recurring jobs within the cluster, so:
- ✅ Once-daily polling is more than sufficient
- ✅ Backups work independently of polling frequency
- ✅ Manual backups via Longhorn UI still work immediately
## Troubleshooting
### If Pods Show 1/2 Ready
**Symptom**: Longhorn manager pods stuck at 1/2 Ready
**Cause**: NetworkPolicy may have been accidentally applied
**Solution**:
```bash
# Check for NetworkPolicy
kubectl get networkpolicy -n longhorn-system
# If found, delete it
kubectl delete networkpolicy -n longhorn-system longhorn-block-s3-access
# Wait 30 seconds
sleep 30
# Verify pods recover
kubectl get pods -n longhorn-system -l app=longhorn-manager
```
### If S3 API Calls Remain High
**Check poll interval is applied**:
```bash
kubectl get configmap -n longhorn-system longhorn-default-resource -o yaml
```
**Restart Longhorn managers to pick up changes**:
```bash
kubectl rollout restart daemonset -n longhorn-system longhorn-manager
```
### If Backups Fail
Backups should continue working normally since they're triggered by recurring jobs, not polling. If issues occur:
```bash
# Check recurring jobs
kubectl get recurringjobs -n longhorn-system
# Check recent backup jobs
kubectl get jobs -n longhorn-system | grep backup
# Check backup target connectivity (should work anytime)
MANAGER_POD=$(kubectl get pods -n longhorn-system -l app=longhorn-manager --no-headers | head -1 | awk '{print $1}')
kubectl exec -n longhorn-system "$MANAGER_POD" -c longhorn-manager -- curl -I https://<B2_ENDPOINT>
```
## References
- [Longhorn Issue #1547](https://github.com/longhorn/longhorn/issues/1547) - Original excessive S3 calls issue
- [Longhorn Backup Target Documentation](https://longhorn.io/docs/1.9.0/snapshots-and-backups/backup-and-restore/set-backup-target/)
- Longhorn version: v1.9.0
## Files Modified
1. `config-map.yaml` - Updated `backupstore-poll-interval` to 86400
2. `kustomization.yaml` - Removed network-policy-s3-block.yaml reference
3. `network-policy-s3-block.yaml` - Retained for reference (not applied)
4. `S3-API-SOLUTION-FINAL.md` - This document
## Lessons Learned
1. **NetworkPolicies are tricky**: Blocking external traffic can inadvertently block internal cluster communication
2. **Start simple**: Configuration-based solutions are often more reliable than complex automation
3. **Test thoroughly**: Always verify pods remain healthy after applying NetworkPolicies
4. **Understand the feature**: Longhorn's polling is for DR volumes, which we don't use
5. **24-hour polling is sufficient**: For non-DR use cases, frequent polling isn't necessary
## Success Metrics
Monitor these over the next week:
- ✅ Longhorn manager pods: 2/2 Ready
- ✅ Daily backups: Completing successfully
- ✅ S3 API calls: <1,000/day (down from 145,000)
- ✅ Backblaze costs: Within free tier
- ✅ No manual intervention required

View File

@@ -0,0 +1,41 @@
apiVersion: v1
kind: Secret
metadata:
name: backblaze-credentials
namespace: longhorn-system
type: Opaque
stringData:
AWS_ACCESS_KEY_ID: ENC[AES256_GCM,data:OGCSNVoeABeigczChYkRTKjIsjEYDA+cNA==,iv:So6ipxl+te3LkPbtyOwixnvv4DPbzl0yCGT8cqPgPbY=,tag:ApaM+bBqi9BJU/EVraKWrQ==,type:str]
AWS_SECRET_ACCESS_KEY: ENC[AES256_GCM,data:EMFNPCdt/V+2d4xnVARNTBBpY3UTqvpN3LezT/TZ7w==,iv:Q5pNnuKX+lUt/V4xpgF2Zg1q6e1znvG+laDNrLIrgBY=,tag:xGF/SvAJ9+tfuB7QdirAhw==,type:str]
AWS_ENDPOINTS: ENC[AES256_GCM,data:PSiRbt53KKK5XOOxIEiiycaFTriaJbuY0Z4Q9yC1xTwz9H/+hoOQ35w=,iv:pGwbR98F5C4N9Vca9btaJ9mKVS7XUkL8+Pva7TWTeTk=,tag:PxFllLIjj+wXDSXGuU/oLA==,type:str]
VIRTUAL_HOST_STYLE: ENC[AES256_GCM,data:a9RJ2Q==,iv:1VSTWiv1WFia0rgwkoZ9WftaLDdKtJabwiyY90AWvNY=,tag:tQZDFjqAABueZJ4bjD2PfA==,type:str]
sops:
lastmodified: "2025-06-30T18:44:50Z"
mac: ENC[AES256_GCM,data:5cdqJQiwoFwWfaNjtqNiaD5sY31979cdS4R6vBmNIKqd7ZaCMJLEKBm5lCLF7ow3+V17pxGhVu4EXX+rKVaNu6Qs6ivXtVM+kA0RutqPFnWDVfoZcnuW98IBjpyh4i9Y6Dra8zSda++Dt2R7Frouc/7lT74ANZYmSRN9WCYsTNg=,iv:s9c+YDDxAUdjWlzsx5jALux2UW5dtg56Pfi3FF4K0lU=,tag:U9bTTOZaqQ9lekpsIbUkWA==,type:str]
pgp:
- created_at: "2025-06-30T18:44:50Z"
enc: |-
-----BEGIN PGP MESSAGE-----
hF4DZT3mpHTS/JgSAQdAbJ88Og3rBkHDPJXf04xSp79A1rfXUDwsP2Wzz0rgI2ww
67XRMSSu2nUApEk08vf1ZF5ulewMQbnVjDDqvM8+BcgELllZVhnNW09NzMb5uPD+
1GgBCQIQXzEZTIi11OR5Z44vLkU64tF+yAPzA6j6y0lyemabOJLDB/XJiV/nq57h
+Udy8rg3sAmZt6FmBiTssKpxy6C6nFFSHVnTY7RhKg9p87AYKz36bSUI7TRhjZGb
f9U9EUo09Zh4JA==
=6fMP
-----END PGP MESSAGE-----
fp: B120595CA9A643B051731B32E67FF350227BA4E8
- created_at: "2025-06-30T18:44:50Z"
enc: |-
-----BEGIN PGP MESSAGE-----
hF4DSXzd60P2RKISAQdAPYpP5mUd4lVstNeGURyFoXbfPbaSH+IlSxgrh/wBfCEw
oI6DwAxkRAxLRwptJoQA9zU+N6LRN+o5kcHLMG/eNnUyNdAfNg17fs16UXf5N2Gi
1GgBCQIQRcLoTo+r7TyUUTxtPGIrQ7c5jy7WFRzm25XqLuvwTYipDTbQC5PyZu5R
4zFgx4ZfDayB3ldPMoAHZ8BeB2VTiQID+HRQGGbSSCM7U+HvzSXNuapNSGXpfWEA
qShkjhXz1sF7JQ==
=UqeC
-----END PGP MESSAGE-----
fp: 4A8AADB4EBAB9AF88EF7062373CECE06CC80D40C
encrypted_regex: ^(data|stringData)$
version: 3.10.2

View File

@@ -0,0 +1,78 @@
# Examples of how to apply S3 backup recurring jobs to volumes
# These are examples - you would apply these patterns to your actual PVCs/StorageClasses
---
# Example 1: Apply backup labels to an existing PVC
# This requires the PVC to be labeled as a recurring job source first
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: example-app-data
namespace: default
labels:
# Enable this PVC as a source for recurring job labels
recurring-job.longhorn.io/source: "enabled"
# Apply daily backup job group
recurring-job-group.longhorn.io/longhorn-s3-backup: "enabled"
# OR apply weekly backup job group (choose one)
# recurring-job-group.longhorn.io/longhorn-s3-backup-weekly: "enabled"
# OR apply specific recurring job by name
# recurring-job.longhorn.io/s3-backup-daily: "enabled"
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 10Gi
storageClassName: longhorn
---
# Example 2: StorageClass with automatic backup assignment
# Any PVC created with this StorageClass will automatically get backups
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-backup-daily
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: Immediate
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "30"
fromBackup: ""
# Automatically assign backup jobs to volumes created with this StorageClass
recurringJobSelector: |
[
{
"name":"longhorn-s3-backup",
"isGroup":true
}
]
---
# Example 3: StorageClass for critical data with both daily and weekly backups
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-backup-critical
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Retain
volumeBindingMode: Immediate
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "30"
fromBackup: ""
# Assign both daily and weekly backup groups
recurringJobSelector: |
[
{
"name":"longhorn-s3-backup",
"isGroup":true
},
{
"name":"longhorn-s3-backup-weekly",
"isGroup":true
}
]
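---
# Example 4 (sketch): an existing PVC can also be enrolled imperatively instead
# of editing its manifest, assuming the same group name as the examples above:
#
#   kubectl label pvc example-app-data -n default \
#     recurring-job.longhorn.io/source=enabled \
#     recurring-job-group.longhorn.io/longhorn-s3-backup=enabled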

View File

@@ -0,0 +1,37 @@
apiVersion: v1
kind: ConfigMap
metadata:
name: longhorn-default-resource
namespace: longhorn-system
data:
default-resource.yaml: ENC[AES256_GCM,data:vw2doEgVQYr1p9vHN9MLqoOSVM8LDBeowAvs2zOkwmGPue8QLxkxxpaFRy2zJH9igjXn30h1dsukmSZBfD9Y3cwrRcvuEZRMo3IsAJ6M1G/oeVpKc14Rll6/V48ZXPiB9qfn1upmUbJtl1EMyPc3vUetUD37fI81N3x4+bNK2OB6V8yGczuE3bJxIi4vV/Zay83Z3s0VyNRF4y18R3T0200Ib5KomANAZUMSCxKvjv4GOKHGYTVE5+C4LFxeOnPgmAtjV4x+lKcNCD1saNZ56yhVzsKVJClLdaRtIQ==,iv:s3OyHFQxd99NGwjXxHqa8rs9aYsl1vf+GCLNtvZ9nuc=,tag:2n8RLcHmp9ueKNm12MxjxQ==,type:str]
sops:
lastmodified: "2025-11-12T10:07:54Z"
mac: ENC[AES256_GCM,data:VBxywwWrVnKiyby+FzCdUlI89OkruNh1jyFE3cVXU/WR4FoCWclDSQ8v0FxT+/mS1/0eTX9XAXVIyqtzpAUU3YY3znq2CU8qsZa45B2PlPQP+7qGNBcyrpZZCsJxTYO/+jxr/9gV4pAJV27HFnyYfZDVZxArLUWQs32eJSdOfpc=,iv:7lbZjWhSEX7NisarWxCAAvw3+8v6wadq3/chrjWk2GQ=,tag:9AZyEuo7omdCbtRJ3YDarA==,type:str]
pgp:
- created_at: "2025-11-09T13:37:18Z"
enc: |-
-----BEGIN PGP MESSAGE-----
hF4DZT3mpHTS/JgSAQdAYMBTNc+JasEkeJpsS1d8OQ6iuhRTULXvFrGEia7gLXkw
+TRNuC4ZH+Lxmb5s3ImRX9dF1cMXoMGUCWJN/bScm5cLElNd2dHrtFoElVjn4/vI
1GgBCQIQ4jPpbQJym+xU5jS5rN3dtW6U60IYxX5rPvh0294bxgOzIIqI/oI/0qak
C4EYFsfH9plAOmvF56SnFX0PSczBjyUlngJ36NFHMN3any7qW/C0tYXFF3DDiOC3
kpa/moMr5CNTnQ==
=xVwB
-----END PGP MESSAGE-----
fp: B120595CA9A643B051731B32E67FF350227BA4E8
- created_at: "2025-11-09T13:37:18Z"
enc: |-
-----BEGIN PGP MESSAGE-----
hF4DSXzd60P2RKISAQdA9omTE+Cuy7BvMA8xfqsZv2o+Jh3QvOL+gZY/Z5CuVgIw
IBgwiVypHqwDf8loCVIdlo1/h5gctj/t11cxb2hKNRGQ0kFNLdpu5Mx+RbJZ/az/
1GgBCQIQB/gKeYbAqSxrJMKl/Q+6PfAXTAjH33K8IlDQKbF8q3QvoQDJJU3i0XwQ
ljhWRC/RZzO7hHXJqkR9z5sVIysHoEo+O9DZ0OzefjKb+GscdgSwJwGgsZzrVRXP
kSLdNO0eE5ubMQ==
=O/Lu
-----END PGP MESSAGE-----
fp: 4A8AADB4EBAB9AF88EF7062373CECE06CC80D40C
encrypted_regex: ^(data|stringData)$
version: 3.10.2

View File

@@ -0,0 +1,11 @@
---
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- namespace.yaml
- longhorn.yaml
- storageclass.yaml
- backblaze-secret.yaml
- config-map.yaml
- recurring-job-s3-backup.yaml
- network-policy-s3-block.yaml

View File

@@ -0,0 +1,64 @@
---
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
name: longhorn-repo
namespace: longhorn-system
spec:
interval: 5m0s
url: https://charts.longhorn.io
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
name: longhorn-release
namespace: longhorn-system
spec:
interval: 5m
chart:
spec:
chart: longhorn
version: v1.10.0
sourceRef:
kind: HelmRepository
name: longhorn-repo
namespace: longhorn-system
interval: 1m
values:
# Use hotfixed longhorn-manager image
image:
longhorn:
manager:
tag: v1.10.0-hotfix-1
defaultSettings:
defaultDataPath: /var/mnt/longhorn-storage
defaultReplicaCount: "2"
replicaNodeLevelSoftAntiAffinity: true
allowVolumeCreationWithDegradedAvailability: false
guaranteedInstanceManagerCpu: 5
createDefaultDiskLabeledNodes: true
# Multi-node optimized settings
storageMinimalAvailablePercentage: "20"
storageReservedPercentageForDefaultDisk: "15"
storageOverProvisioningPercentage: "200"
# Single replica for UI
service:
ui:
type: ClusterIP
# Longhorn UI replica count
longhornUI:
replicas: 1
# Enable metrics collection
metrics:
serviceMonitor:
enabled: true
longhornManager:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists
longhornDriver:
tolerations:
- effect: NoSchedule
key: node-role.kubernetes.io/control-plane
operator: Exists

View File

@@ -0,0 +1,8 @@
---
apiVersion: v1
kind: Namespace
metadata:
name: longhorn-system
labels:
pod-security.kubernetes.io/enforce: privileged
pod-security.kubernetes.io/enforce-version: latest

View File

@@ -0,0 +1,211 @@
---
# Longhorn S3 Access Control via NetworkPolicy
#
# NetworkPolicy that blocks external S3 access by default, with CronJobs to
# automatically remove it during backup windows (12:55 AM - 4:00 AM).
#
# Network Details:
# - Pod CIDR: 10.244.0.0/16 (within 10.0.0.0/8)
# - Service CIDR: 10.96.0.0/12 (within 10.0.0.0/8)
# - VLAN Network: 10.132.0.0/24 (within 10.0.0.0/8)
#
# How It Works:
# - NetworkPolicy is applied by default, blocking external S3 (Backblaze B2)
# - CronJob removes NetworkPolicy at 12:55 AM (5 min before earliest backup at 1 AM)
# - CronJob reapplies NetworkPolicy at 4:00 AM (after backup window closes)
# - Allows all internal cluster traffic (10.0.0.0/8) while blocking external S3
#
# Backup Schedule:
# - Daily backups: 2:00 AM
# - Weekly backups: 1:00 AM Sundays
# - Backup window: 12:55 AM - 4:00 AM (3 hours 5 minutes)
#
# See: BACKUP-GUIDE.md and S3-API-SOLUTION-FINAL.md for full documentation
---
# NetworkPolicy: Blocks S3 access by default
# This is applied initially, then managed by CronJobs below
# Using CiliumNetworkPolicy for better API server support via toEntities
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: longhorn-block-s3-access
namespace: longhorn-system
labels:
app: longhorn
purpose: s3-access-control
spec:
description: "Block external S3 access while allowing internal cluster communication"
endpointSelector:
matchLabels:
app: longhorn-manager
egress:
# Allow DNS to kube-system namespace
- toEndpoints:
- matchLabels:
k8s-app: kube-dns
toPorts:
- ports:
- port: "53"
protocol: UDP
- port: "53"
protocol: TCP
# Explicitly allow Kubernetes API server (critical for Longhorn)
# Cilium handles this specially - kube-apiserver entity is required
- toEntities:
- kube-apiserver
# Allow all internal cluster traffic (10.0.0.0/8)
# This includes:
# - Pod CIDR: 10.244.0.0/16
# - Service CIDR: 10.96.0.0/12 (API server already covered above)
# - VLAN Network: 10.132.0.0/24
# - All other internal 10.x.x.x addresses
- toCIDR:
- 10.0.0.0/8
# Allow pod-to-pod communication within cluster
# The 10.0.0.0/8 CIDR block above covers all pod-to-pod communication
# This explicit rule ensures instance-manager pods are reachable
- toEntities:
- cluster
# Block all other egress (including external S3 like Backblaze B2)
---
# RBAC for CronJobs that manage the NetworkPolicy
apiVersion: v1
kind: ServiceAccount
metadata:
name: longhorn-netpol-manager
namespace: longhorn-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: longhorn-netpol-manager
namespace: longhorn-system
rules:
- apiGroups: ["cilium.io"]
resources: ["ciliumnetworkpolicies"]
verbs: ["get", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: longhorn-netpol-manager
namespace: longhorn-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: longhorn-netpol-manager
subjects:
- kind: ServiceAccount
name: longhorn-netpol-manager
namespace: longhorn-system
---
# CronJob: Remove NetworkPolicy before backups (12:55 AM daily)
# This allows S3 access during the backup window
apiVersion: batch/v1
kind: CronJob
metadata:
name: longhorn-enable-s3-access
namespace: longhorn-system
labels:
app: longhorn
purpose: s3-access-control
spec:
# Run at 12:55 AM daily (5 minutes before earliest backup at 1:00 AM Sunday weekly)
schedule: "55 0 * * *"
successfulJobsHistoryLimit: 2
failedJobsHistoryLimit: 2
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
metadata:
labels:
app: longhorn-netpol-manager
spec:
serviceAccountName: longhorn-netpol-manager
restartPolicy: OnFailure
containers:
- name: delete-netpol
image: bitnami/kubectl:latest
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- |
echo "Removing CiliumNetworkPolicy to allow S3 access for backups..."
kubectl delete ciliumnetworkpolicy longhorn-block-s3-access -n longhorn-system --ignore-not-found=true
echo "S3 access enabled. Backups can proceed."
---
# CronJob: Re-apply NetworkPolicy after backups (4:00 AM daily)
# This blocks S3 access after the backup window closes
apiVersion: batch/v1
kind: CronJob
metadata:
name: longhorn-disable-s3-access
namespace: longhorn-system
labels:
app: longhorn
purpose: s3-access-control
spec:
# Run at 4:00 AM daily (gives 3 hours 5 minutes for backups to complete)
schedule: "0 4 * * *"
successfulJobsHistoryLimit: 2
failedJobsHistoryLimit: 2
concurrencyPolicy: Forbid
jobTemplate:
spec:
template:
metadata:
labels:
app: longhorn-netpol-manager
spec:
serviceAccountName: longhorn-netpol-manager
restartPolicy: OnFailure
containers:
- name: create-netpol
image: bitnami/kubectl:latest
imagePullPolicy: IfNotPresent
command:
- /bin/sh
- -c
- |
echo "Re-applying CiliumNetworkPolicy to block S3 access..."
kubectl apply -f - <<EOF
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
name: longhorn-block-s3-access
namespace: longhorn-system
labels:
app: longhorn
purpose: s3-access-control
spec:
description: "Block external S3 access while allowing internal cluster communication"
endpointSelector:
matchLabels:
app: longhorn-manager
egress:
# Allow DNS to kube-system namespace
- toEndpoints:
- matchLabels:
k8s-app: kube-dns
toPorts:
- ports:
- port: "53"
protocol: UDP
- port: "53"
protocol: TCP
# Explicitly allow Kubernetes API server (critical for Longhorn)
- toEntities:
- kube-apiserver
# Allow all internal cluster traffic (10.0.0.0/8)
- toCIDR:
- 10.0.0.0/8
# Allow pod-to-pod communication within cluster
# The 10.0.0.0/8 CIDR block above covers all pod-to-pod communication
- toEntities:
- cluster
# Block all other egress (including external S3)
EOF
echo "S3 access blocked. Polling stopped until next backup window."

View File

@@ -0,0 +1,34 @@
---
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: s3-backup-daily
namespace: longhorn-system
spec:
cron: "0 2 * * *" # Daily at 2 AM
task: "backup"
groups:
- longhorn-s3-backup
retain: 7 # Keep 7 daily backups
concurrency: 2 # Max 2 concurrent backup jobs
labels:
recurring-job: "s3-backup-daily"
backup-type: "daily"
---
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: s3-backup-weekly
namespace: longhorn-system
spec:
cron: "0 1 * * 0" # Weekly on Sunday at 1 AM
task: "backup"
groups:
- longhorn-s3-backup-weekly
retain: 4 # Keep 4 weekly backups
concurrency: 1 # Only 1 concurrent weekly backup
labels:
recurring-job: "s3-backup-weekly"
backup-type: "weekly"
parameters:
full-backup-interval: "1" # Full backup every other week (alternating full/incremental)

View File

@@ -0,0 +1,81 @@
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-retain
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "xfs"
dataLocality: "best-effort"
reclaimPolicy: Retain
volumeBindingMode: Immediate
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-delete
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "2"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "xfs"
dataLocality: "best-effort"
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-single-delete
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
numberOfReplicas: "1"
staleReplicaTimeout: "2880"
fromBackup: ""
fsType: "xfs"
dataLocality: "best-effort"
reclaimPolicy: Delete
volumeBindingMode: Immediate
---
# Redis-specific StorageClass
# Single replica as Redis handles replication at application level
# Note: volumeBindingMode is immutable after creation
# If this StorageClass already exists with matching configuration, Flux reconciliation
# may show an error but it's harmless - the existing StorageClass will continue to work.
# For new clusters, this will be created correctly.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: longhorn-redis
annotations:
storageclass.kubernetes.io/is-default-class: "false"
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
# Single replica as Redis handles replication at application level
numberOfReplicas: "1"
staleReplicaTimeout: "2880"
fsType: "xfs" # xfs to match existing Longhorn volumes
dataLocality: "strict-local" # Keep Redis data local to node
# Integrate with existing S3 backup infrastructure
recurringJobSelector: |
[
{
"name":"longhorn-s3-backup",
"isGroup":true
}
]
reclaimPolicy: Delete
volumeBindingMode: Immediate