**This one was generated by AI and I don't think it's quite right. I'll go through it later.** I'm leaving it for reference.

# PostgreSQL CloudNativePG Disaster Recovery Guide

## 🚨 **CRITICAL: When to Use This Guide**

This guide is for **catastrophic failure scenarios** where:

- ✅ The CloudNativePG cluster is completely broken or corrupted
- ✅ Longhorn volume backups are available (S3 or local snapshots)
- ✅ Normal CloudNativePG recovery methods have failed
- ✅ You need to restore from Longhorn backup volumes

**⚠️ WARNING**: This process involves temporary data exposure and should only be used when standard recovery fails.

---

## 📋 **Overview: Volume Adoption Strategy**

The key insight for CloudNativePG disaster recovery is **volume adoption**:

1. **Restore Longhorn volumes** from backup
2. **Create fresh PVCs** with adoption annotations
3. **Deploy the cluster in hibernation** so initdb cannot erase the restored data
4. **Retarget the PVCs** to the restored volumes
5. **Wake the cluster** so it adopts the existing data

---

## 🛠️ **Step 1: Prepare for Recovery**

### 1.1 Clean Up Failed Cluster

```bash
# Remove the broken cluster (DANGER: this deletes the cluster)
kubectl delete cluster postgres-shared -n postgresql-system

# Remove old PVCs if corrupted
kubectl delete pvc -n postgresql-system -l cnpg.io/cluster=postgres-shared
```

### 1.2 Identify Backup Volumes

```bash
# List available Longhorn backups (BackupVolume and Backup custom resources)
kubectl get backupvolumes.longhorn.io -n longhorn-system
kubectl get backups.longhorn.io -n longhorn-system

# Note the backup names for the data and WAL volumes, e.g.:
# - postgres-shared-data-backup-20240809
# - postgres-shared-wal-backup-20240809
```

---

## 🔄 **Step 2: Restore Longhorn Volumes**

### 2.1 Create Volume Restore Manifests

```yaml
# longhorn-restore-data.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-shared-data-recovered
  namespace: longhorn-system
spec:
  size: "400Gi"
  numberOfReplicas: 2
  fromBackup: "s3://your-bucket/@/longhorn?backup=backup-abcd1234&volume=postgres-shared-data"
  # Replace with the actual backup URL from the Longhorn UI
---
# longhorn-restore-wal.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-shared-wal-recovered
  namespace: longhorn-system
spec:
  size: "100Gi"
  numberOfReplicas: 2
  fromBackup: "s3://your-bucket/@/longhorn?backup=backup-efgh5678&volume=postgres-shared-wal"
  # Replace with the actual backup URL from the Longhorn UI
```

Apply the restores:

```bash
kubectl apply -f longhorn-restore-data.yaml
kubectl apply -f longhorn-restore-wal.yaml

# Monitor restore progress
kubectl get volumes.longhorn.io -n longhorn-system | grep recovered
```

### 2.2 Create PersistentVolumes for Restored Data

```yaml
# postgres-recovered-pvs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-shared-data-recovered-pv
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
    volumeHandle: postgres-shared-data-recovered
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-shared-wal-recovered-pv
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
    volumeHandle: postgres-shared-wal-recovered
```

```bash
kubectl apply -f postgres-recovered-pvs.yaml
```
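Before moving on, it can help to confirm that both restores completed and that the static PVs registered as `Available`. A quick check, using the volume and PV names assumed in the manifests above:

```bash
# Restored Longhorn volumes should report a ready/detached state once the
# restore finishes; the PVs should show STATUS = Available (no claim yet).
kubectl get volumes.longhorn.io -n longhorn-system \
  postgres-shared-data-recovered postgres-shared-wal-recovered
kubectl get pv postgres-shared-data-recovered-pv postgres-shared-wal-recovered-pv
```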
---

## 🎯 **Step 3: Create Fresh Cluster with Volume Adoption**

### 3.1 Create Adoption PVCs

```yaml
# postgres-adoption-pvcs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-shared-1
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: CloudNativePG adoption annotations
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceName: postgres-shared-1
    cnpg.io/podRole: instance
    # 🔑 CRITICAL: Prevent volume binding to the wrong PV
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 400Gi
  storageClassName: longhorn-retain
  # 🔑 CRITICAL: This will be updated to point to the recovered data later
  volumeName: ""  # Leave empty initially
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-shared-1-wal
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: CloudNativePG adoption annotations
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceName: postgres-shared-1
    cnpg.io/podRole: instance
    cnpg.io/pvcRole: wal
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: longhorn-retain
  # 🔑 CRITICAL: This will be updated to point to the recovered WAL later
  volumeName: ""  # Leave empty initially
```

```bash
kubectl apply -f postgres-adoption-pvcs.yaml
```

### 3.2 Deploy Cluster in Hibernation Mode

**🚨 CRITICAL**: The cluster MUST start in hibernation to prevent initdb from erasing your data!

```yaml
# postgres-shared-recovery.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-shared
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: Hibernation prevents startup and data erasure
    cnpg.io/hibernation: "on"
spec:
  instances: 1  # 🔑 CRITICAL: Single instance prevents replication conflicts during recovery
  minSyncReplicas: 0
  maxSyncReplicas: 0

  postgresql:
    parameters:
      # Performance and stability settings for recovery
      max_connections: "200"
      shared_buffers: "256MB"
      effective_cache_size: "1GB"
      maintenance_work_mem: "64MB"
      checkpoint_completion_target: "0.9"
      wal_buffers: "16MB"
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      # 🔑 CRITICAL: Minimal logging during recovery
      log_min_messages: "warning"
      log_min_error_statement: "error"
      log_statement: "none"

  bootstrap:
    # 🔑 CRITICAL: initdb bootstrap (NOT recovery mode).
    # Hibernation keeps this from running until the PVCs point at the restored data.
    initdb:
      database: postgres
      owner: postgres

  storage:
    size: 400Gi
    storageClass: longhorn-retain
  walStorage:
    size: 100Gi
    storageClass: longhorn-retain

  # 🔑 CRITICAL: Extended timeouts for recovery scenarios
  startDelay: 3600       # 1 hour delay
  stopDelay: 1800        # 30 minute stop delay
  switchoverDelay: 1800  # 30 minute switchover delay

  monitoring:
    enablePodMonitor: true

  # Backup configuration (restore after recovery)
  backup:
    retentionPolicy: "7d"
    barmanObjectStore:
      destinationPath: "s3://your-backup-bucket/postgres-shared"
      # Configure after the cluster is stable
```

```bash
kubectl apply -f postgres-shared-recovery.yaml

# Verify the cluster is hibernated (pods should NOT start)
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Hibernation
```
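Before retargeting anything, double-check that hibernation actually held and that no instance pod was created; if a pod did start, it may already have run initdb against the empty PVCs. A quick sanity check, using the names from the manifests above:

```bash
# No instance pods should exist while the cluster is hibernated
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared

# The adoption PVCs should still show STATUS = Pending (they get bound in Step 4)
kubectl get pvc postgres-shared-1 postgres-shared-1-wal -n postgresql-system
```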
---

## 🔗 **Step 4: Retarget PVCs to Restored Data**

### 4.1 Capture the PVC UIDs

PVC UIDs are assigned by the API server and cannot be changed, so read them from the cluster and reuse them in the PV `claimRef`:

```bash
# Read the UIDs Kubernetes assigned to the adoption PVCs
DATA_PVC_UID=$(kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.metadata.uid}')
WAL_PVC_UID=$(kubectl get pvc postgres-shared-1-wal -n postgresql-system -o jsonpath='{.metadata.uid}')

echo "Data PVC UID: $DATA_PVC_UID"
echo "WAL PVC UID: $WAL_PVC_UID"
```

### 4.2 Patch PVs with claimRef Binding

```bash
# Point the data PV at the data PVC
kubectl patch pv postgres-shared-data-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"name\": \"postgres-shared-1\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$DATA_PVC_UID\"
    }
  }
}"

# Point the WAL PV at the WAL PVC
kubectl patch pv postgres-shared-wal-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"name\": \"postgres-shared-1-wal\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$WAL_PVC_UID\"
    }
  }
}"
```

### 4.3 Patch PVCs with volumeName Binding

```bash
# Patch data PVC (volumeName can be set because it was left empty at creation)
kubectl patch pvc postgres-shared-1 -n postgresql-system -p '{
  "spec": {
    "volumeName": "postgres-shared-data-recovered-pv"
  }
}'

# Patch WAL PVC
kubectl patch pvc postgres-shared-1-wal -n postgresql-system -p '{
  "spec": {
    "volumeName": "postgres-shared-wal-recovered-pv"
  }
}'
```

### 4.4 Verify PVC Binding

```bash
kubectl get pvc -n postgresql-system
# Both PVCs should show STATUS = Bound
```

---

## 🌅 **Step 5: Wake Cluster from Hibernation**

### 5.1 Turn Off Hibernation

```bash
# 🔑 CRITICAL: This starts the cluster with your restored data
kubectl annotate cluster postgres-shared -n postgresql-system cnpg.io/hibernation=off --overwrite

# Monitor cluster startup
kubectl get cluster postgres-shared -n postgresql-system -w
```

### 5.2 Monitor Pod Startup

```bash
# Watch pod creation and startup
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w

# Check logs for successful data adoption
kubectl logs postgres-shared-1 -n postgresql-system -f
```

**🔍 Expected Log Messages** (exact wording varies by version; the point is that an existing database is detected and initialization is skipped):

```
INFO: PostgreSQL Database directory appears to contain a database
INFO: Looking at the contents of PostgreSQL database directory
INFO: Database found, skipping initialization
INFO: Starting PostgreSQL with recovered data
```

---

## 🔍 **Step 6: Verify Data Recovery**

### 6.1 Check Cluster Status

```bash
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Cluster in healthy state, PRIMARY = postgres-shared-1
```

### 6.2 Test Database Connectivity

```bash
# Test connection
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "\l"

# Verify all application databases exist
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;
"
```

### 6.3 Verify Application Data

```bash
# Test specific application tables (example for Mastodon)
kubectl exec postgres-shared-1 -n postgresql-system -- psql mastodon_production -c "
SELECT COUNT(*) AS total_accounts FROM accounts;
SELECT COUNT(*) AS total_statuses FROM statuses;
"
```

---

## 📈 **Step 7: Scale to High Availability (Optional)**

### 7.1 Enable Replica Creation

```bash
# Scale the cluster to 2 instances for HA
# (--type merge is required when patching custom resources such as Cluster)
kubectl patch cluster postgres-shared -n postgresql-system --type merge -p '{
  "spec": {
    "instances": 2,
    "minSyncReplicas": 0,
    "maxSyncReplicas": 1
  }
}'
```

### 7.2 Monitor Replica Join

```bash
# Watch replica creation and sync
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w

# Monitor replication lag
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
"
```
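If the `cnpg` kubectl plugin is installed (an assumption — it ships separately from the operator), it gives a consolidated view of primary/replica roles and replication status, which is handy while the replica is joining:

```bash
# Requires the kubectl-cnpg plugin
kubectl cnpg status postgres-shared -n postgresql-system
```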
---

## 🔧 **Step 8: Application Connectivity (Service Aliases)**

### 8.1 Create Service Aliases for Application Compatibility

If your applications expect different service names (e.g., `postgresql-shared-*` vs `postgres-shared-*`):

```yaml
# postgresql-service-aliases.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql-shared-rw
  namespace: postgresql-system
  labels:
    cnpg.io/cluster: postgres-shared
spec:
  type: ClusterIP
  ports:
    - name: postgres
      port: 5432
      protocol: TCP
      targetPort: 5432
  selector:
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceRole: primary
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql-shared-ro
  namespace: postgresql-system
  labels:
    cnpg.io/cluster: postgres-shared
spec:
  type: ClusterIP
  ports:
    - name: postgres
      port: 5432
      protocol: TCP
      targetPort: 5432
  selector:
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceRole: replica
```

```bash
kubectl apply -f postgresql-service-aliases.yaml
```

### 8.2 Test Application Connectivity

```bash
# Test from a throwaway pod
kubectl run test-connectivity --image=busybox --rm -it --restart=Never -- \
  nc -zv postgresql-shared-rw.postgresql-system.svc.cluster.local 5432
```

---

## 🚨 **Troubleshooting Common Issues**

### Issue 1: Cluster Starts in initdb Mode (Data Loss Risk!)

**Symptoms**: Logs show "Initializing empty database"

**Solution**:
1. **IMMEDIATELY** scale the cluster to 0 instances
2. Verify the PVC adoption annotations are correct
3. Check that hibernation was properly used

```bash
kubectl patch cluster postgres-shared -n postgresql-system --type merge -p '{"spec":{"instances":0}}'
```

### Issue 2: PVC Binding Fails

**Symptoms**: PVCs stuck in "Pending" state

**Solution**:
1. Check that the PV `claimRef` UID matches the PVC UID
2. Verify the PV `claimRef` points to the correct PVC name and namespace
3. Ensure the storage class exists

```bash
kubectl describe pvc postgres-shared-1 -n postgresql-system
kubectl describe pv postgres-shared-data-recovered-pv
```

### Issue 3: Pod Restart Loops

**Symptoms**: Pod continuously restarting with health check failures

**Solutions**:
1. Check that Cilium network policies allow PostgreSQL traffic
2. Verify PostgreSQL data directory permissions
3. Check for TLS/SSL configuration issues

```bash
# Inspect data directory ownership (ownership fixes usually have to happen on the
# volume itself, since CNPG containers run as the unprivileged postgres user)
kubectl exec postgres-shared-1 -n postgresql-system -- ls -ld /var/lib/postgresql/data/pgdata
```

### Issue 4: Replica Won't Join

**Symptoms**: Second instance fails to join with replication errors

**Solutions**:
1. Check that the primary is stable before adding a replica
2. Verify network connectivity between pods
3. Monitor WAL streaming logs (see the log-scan sketch below)

```bash
# Check replication status
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "SELECT * FROM pg_stat_replication;"
```
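For Issue 4 in particular, a quick way to see why a replica is stuck is to scan its logs for WAL/streaming errors. A minimal sketch, assuming the replica pod is named `postgres-shared-2` (adjust to whatever instance CNPG actually created):

```bash
# Show the most recent WAL/streaming/replication-related log lines from the replica
kubectl logs postgres-shared-2 -n postgresql-system | grep -iE 'wal|stream|replication' | tail -n 50
```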
---

## 📋 **Recovery Checklist**

**Pre-Recovery:**
- [ ] Backup current cluster state (if any)
- [ ] Identify Longhorn backup volume names
- [ ] Prepare a fresh namespace if needed
- [ ] Verify the Longhorn operator is functional

**Volume Restoration:**
- [ ] Restore the data volume from Longhorn backup
- [ ] Restore the WAL volume from Longhorn backup
- [ ] Create PersistentVolumes for the restored data
- [ ] Verify the volumes are healthy in the Longhorn UI

**Cluster Recovery:**
- [ ] Create adoption PVCs with correct annotations
- [ ] Deploy the cluster in hibernation mode
- [ ] Capture the PVC UIDs
- [ ] Patch PVs with claimRef binding
- [ ] Patch PVCs with volumeName binding
- [ ] Verify PVC binding before proceeding

**Startup:**
- [ ] Turn off hibernation
- [ ] Monitor pod startup logs for data adoption
- [ ] Verify the cluster reaches a healthy state
- [ ] Test database connectivity

**Validation:**
- [ ] Verify all application databases exist
- [ ] Test application table row counts
- [ ] Check database sizes match expectations
- [ ] Test application connectivity

**HA Setup (Optional):**
- [ ] Scale to 2+ instances
- [ ] Monitor the replica join process
- [ ] Verify replication is working
- [ ] Test failover scenarios

**Cleanup:**
- [ ] Remove temporary PVs/PVCs
- [ ] Update backup configurations
- [ ] Document any configuration changes
- [ ] Test regular backup/restore procedures

---

## ⚠️ **CRITICAL SUCCESS FACTORS**

1. **🔑 Hibernation is MANDATORY**: Never start a cluster without hibernation when adopting existing data
2. **🔑 Single Instance First**: Always recover to a single instance, then scale to HA
3. **🔑 UID Matching**: The PV `claimRef` UID must match the PVC's UID exactly for binding
4. **🔑 Adoption Annotations**: The CloudNativePG annotations must be present on the PVCs
5. **🔑 Volume Naming**: PVC names must match the CloudNativePG instance naming convention
6. **🔑 Network Policies**: Ensure Cilium policies allow PostgreSQL traffic
7. **🔑 Monitor Logs**: Watch startup logs carefully for data adoption confirmation

---

## 📚 **Additional Resources**

- [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
- [Longhorn Backup & Restore](https://longhorn.io/docs/1.4.0/volumes-and-nodes/backup-and-restore/)
- [Kubernetes Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)
- [PostgreSQL Recovery Documentation](https://www.postgresql.org/docs/current/backup-dump.html)