**This one was generated by AI and I don't think it's quite right. I'll go through it later.** I'm leaving it here for reference.
# PostgreSQL CloudNativePG Disaster Recovery Guide
## 🚨 **CRITICAL: When to Use This Guide**
This guide is for **catastrophic failure scenarios** where:
- ✅ CloudNativePG cluster is completely broken/corrupted
- ✅ Longhorn volume backups are available (S3 or local snapshots)
- ✅ Normal CloudNativePG recovery methods have failed
- ✅ You need to restore from Longhorn backup volumes
**⚠️ WARNING**: This process involves a period where the restored data is exposed to risk (a single instance with no working backups) and should only be used when standard recovery fails.
---
## 📋 **Overview: Volume Adoption Strategy**
The key insight for CloudNativePG disaster recovery is using **Volume Adoption**:
1. **Restore Longhorn volumes** from backup
2. **Create fresh PVCs** with adoption annotations
3. **Deploy cluster with hibernation** to prevent initdb data erasure
4. **Retarget PVCs** to restored volumes
5. **Wake cluster** to adopt existing data
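Before starting, it is worth confirming the moving parts this strategy depends on are actually present. A minimal pre-flight sketch; it assumes the CRD names below and the `app=longhorn-manager` label, which may differ slightly in your installation:
```bash
# Confirm the CloudNativePG and Longhorn CRDs, target namespaces, and
# Longhorn manager pods are all reachable before touching anything
kubectl get crd clusters.postgresql.cnpg.io volumes.longhorn.io
kubectl get ns postgresql-system longhorn-system
kubectl -n longhorn-system get pods -l app=longhorn-manager
```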
---
## 🛠️ **Step 1: Prepare for Recovery**
### 1.1 Clean Up Failed Cluster
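Before running the destructive commands below, consider capturing the broken cluster's current state for later reference; the file names here are only suggestions:
```bash
# Snapshot the broken cluster definition and volume objects so names, sizes,
# and annotations can be consulted after deletion
kubectl get cluster postgres-shared -n postgresql-system -o yaml > broken-cluster.yaml
kubectl get pvc -n postgresql-system -o yaml > broken-pvcs.yaml
kubectl get pv -o yaml > broken-pvs.yaml
```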
```bash
# Remove broken cluster (DANGER: This deletes the cluster)
kubectl delete cluster postgres-shared -n postgresql-system
# Remove old PVCs if corrupted
kubectl delete pvc -n postgresql-system -l cnpg.io/cluster=postgres-shared
```
### 1.2 Identify Backup Volumes
```bash
# List available Longhorn backups (the exact CRD/resource names vary by
# Longhorn version; the Backup page in the Longhorn UI is the authoritative view)
kubectl get backups.longhorn.io -n longhorn-system
# Note the backup names for the data and WAL volumes, e.g.:
# - postgres-shared-data-backup-20240809
# - postgres-shared-wal-backup-20240809
```
---
## 🔄 **Step 2: Restore Longhorn Volumes**
### 2.1 Create Volume Restore Jobs
```yaml
# longhorn-restore-data.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-shared-data-recovered
  namespace: longhorn-system
spec:
  size: "400Gi"
  numberOfReplicas: 2
  # Replace with the actual backup URL from the Longhorn UI
  fromBackup: "s3://your-bucket/@/longhorn?backup=backup-abcd1234&volume=postgres-shared-data"
---
# longhorn-restore-wal.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-shared-wal-recovered
  namespace: longhorn-system
spec:
  size: "100Gi"
  numberOfReplicas: 2
  # Replace with the actual backup URL from the Longhorn UI
  fromBackup: "s3://your-bucket/@/longhorn?backup=backup-efgh5678&volume=postgres-shared-wal"
```
Apply the restores:
```bash
kubectl apply -f longhorn-restore-data.yaml
kubectl apply -f longhorn-restore-wal.yaml
# Monitor restore progress
kubectl get volumes.longhorn.io -n longhorn-system | grep recovered
```
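Do not move on until both restores have finished. One simple way to watch them settle; this relies on the `volumes.longhorn.io` resource and its STATE/ROBUSTNESS printer columns, which may vary slightly by Longhorn version:
```bash
# Watch both recovered volumes until their state and robustness look healthy
kubectl get volumes.longhorn.io -n longhorn-system \
  postgres-shared-data-recovered postgres-shared-wal-recovered -w
```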
### 2.2 Create PersistentVolumes for Restored Data
```yaml
# postgres-recovered-pvs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-shared-data-recovered-pv
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
    volumeHandle: postgres-shared-data-recovered
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-shared-wal-recovered-pv
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
    volumeHandle: postgres-shared-wal-recovered
```
```bash
kubectl apply -f postgres-recovered-pvs.yaml
```
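A quick check that the new PVs exist and are still unbound before creating the adoption PVCs:
```bash
# Both PVs should report STATUS=Available with the longhorn-retain storage class
kubectl get pv postgres-shared-data-recovered-pv postgres-shared-wal-recovered-pv
```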
---
## 🎯 **Step 3: Create Fresh Cluster with Volume Adoption**
### 3.1 Create Adoption PVCs
```yaml
# postgres-adoption-pvcs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-shared-1
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: CloudNativePG adoption annotations.
    # NOTE: mirror whatever labels/annotations your CNPG operator version puts
    # on the PVCs of a healthy cluster -- they differ between releases.
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceName: postgres-shared-1
    cnpg.io/podRole: instance
    # Records the CSI provisioner responsible for this claim
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 400Gi
  storageClassName: longhorn-retain
  # 🔑 CRITICAL: This will be retargeted to the recovered data PV in Step 4.
  # Leaving it empty only works if the storage class binds on first consumer;
  # if your class provisions immediately, set the recovered PV name here instead.
  volumeName: "" # Leave empty initially
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-shared-1-wal
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: CloudNativePG adoption annotations
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceName: postgres-shared-1
    cnpg.io/podRole: instance
    cnpg.io/pvcRole: wal
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: longhorn-retain
  # 🔑 CRITICAL: This will be retargeted to the recovered WAL PV in Step 4.
  volumeName: "" # Leave empty initially
```
```bash
kubectl apply -f postgres-adoption-pvcs.yaml
```
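The claims must stay unbound until Step 4. A quick sanity check, assuming `longhorn-retain` is the storage class used above; if a claim is already Bound, the class provisioned a fresh empty volume and you should recreate that PVC with `spec.volumeName` pointing at the recovered PV instead:
```bash
# Both PVCs should be Pending at this point
kubectl get pvc postgres-shared-1 postgres-shared-1-wal -n postgresql-system
# A WaitForFirstConsumer binding mode is what keeps them Pending
kubectl get storageclass longhorn-retain -o jsonpath='{.volumeBindingMode}{"\n"}'
```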
### 3.2 Deploy Cluster in Hibernation Mode
**🚨 CRITICAL**: The cluster MUST start in hibernation to prevent initdb from erasing your data!
```yaml
# postgres-shared-recovery.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-shared
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: Hibernation prevents startup and data erasure
    cnpg.io/hibernation: "on"
spec:
  instances: 1
  # 🔑 CRITICAL: Single instance prevents replication conflicts during recovery
  minSyncReplicas: 0
  maxSyncReplicas: 0
  postgresql:
    parameters:
      # Performance and stability settings for recovery
      max_connections: "200"
      shared_buffers: "256MB"
      effective_cache_size: "1GB"
      maintenance_work_mem: "64MB"
      checkpoint_completion_target: "0.9"
      wal_buffers: "16MB"
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      # 🔑 CRITICAL: Minimal logging during recovery
      log_min_messages: "warning"
      log_min_error_statement: "error"
      log_statement: "none"
  bootstrap:
    # 🔑 CRITICAL: initdb bootstrap is declared for completeness, but it does
    # NOT run while the cluster is hibernated -- hibernation is what protects
    # the restored data from being overwritten
    initdb:
      database: postgres
      owner: postgres
  storage:
    size: 400Gi
    storageClass: longhorn-retain
  walStorage:
    size: 100Gi
    storageClass: longhorn-retain
  # 🔑 CRITICAL: Extended timeouts for recovery scenarios
  startDelay: 3600      # 1 hour delay
  stopDelay: 1800       # 30 minute stop delay
  switchoverDelay: 1800 # 30 minute switchover delay
  monitoring:
    enablePodMonitor: true
  # Backup configuration (restore after recovery)
  backup:
    retentionPolicy: "7d"
    barmanObjectStore:
      destinationPath: "s3://your-backup-bucket/postgres-shared"
      # Configure credentials after the cluster is stable
```
```bash
kubectl apply -f postgres-shared-recovery.yaml
# Verify cluster is hibernated (pods should NOT start)
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Hibernation
```
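To be certain hibernation is actually protecting the restored data, confirm that no instance pods exist and the annotation is still in place:
```bash
# Expect "No resources found" here while the cluster is hibernated
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared
# The hibernation annotation should still read "on"
kubectl get cluster postgres-shared -n postgresql-system -o yaml | grep hibernation
```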
---
## 🔗 **Step 4: Retarget PVCs to Restored Data**
### 4.1 Read the PVC UIDs
```bash
# The adoption PVCs from Step 3.1 already have server-assigned UIDs
# (metadata.uid is set by the API server and cannot be patched)
DATA_PVC_UID=$(kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.metadata.uid}')
WAL_PVC_UID=$(kubectl get pvc postgres-shared-1-wal -n postgresql-system -o jsonpath='{.metadata.uid}')
echo "Data PVC UID: $DATA_PVC_UID"
echo "WAL PVC UID: $WAL_PVC_UID"
```
### 4.2 Patch PVs with claimRef Binding
```bash
# Point each recovered PV at its claim; the volume binder completes the bind
# Patch data PV
kubectl patch pv postgres-shared-data-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"name\": \"postgres-shared-1\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$DATA_PVC_UID\"
    }
  }
}"
# Patch WAL PV
kubectl patch pv postgres-shared-wal-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"name\": \"postgres-shared-1-wal\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$WAL_PVC_UID\"
    }
  }
}"
```
### 4.3 Patch PVCs with volumeName Binding
```bash
# spec.volumeName may be set once while it is still empty; this pins each
# claim to its recovered PV
# Patch data PVC
kubectl patch pvc postgres-shared-1 -n postgresql-system \
  -p '{"spec":{"volumeName":"postgres-shared-data-recovered-pv"}}'
# Patch WAL PVC
kubectl patch pvc postgres-shared-1-wal -n postgresql-system \
  -p '{"spec":{"volumeName":"postgres-shared-wal-recovered-pv"}}'
```
### 4.4 Verify PVC Binding
```bash
kubectl get pvc -n postgresql-system
# Both PVCs should show STATUS = Bound
```
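It is also worth confirming that each claim bound to the recovered PV rather than a freshly provisioned one:
```bash
# VOLUME should list the *-recovered-pv names created in Step 2.2
kubectl get pvc postgres-shared-1 postgres-shared-1-wal -n postgresql-system \
  -o custom-columns=NAME:.metadata.name,VOLUME:.spec.volumeName,STATUS:.status.phase
```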
---
## 🌅 **Step 5: Wake Cluster from Hibernation**
### 5.1 Remove Hibernation Annotation
```bash
# 🔑 CRITICAL: This starts the cluster with your restored data
kubectl annotate cluster postgres-shared -n postgresql-system cnpg.io/hibernation=off --overwrite
# Monitor cluster startup
kubectl get cluster postgres-shared -n postgresql-system -w
```
### 5.2 Monitor Pod Startup
```bash
# Watch pod creation and startup
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w
# Check logs for successful data adoption
kubectl logs postgres-shared-1 -n postgresql-system -f
```
**🔍 Expected Log Messages** (exact wording varies by image and operator version; the key signal is that an existing database was found and initialization was skipped, not a message about initializing an empty database):
```
INFO: PostgreSQL Database directory appears to contain a database
INFO: Looking at the contents of PostgreSQL database directory
INFO: Database found, skipping initialization
INFO: Starting PostgreSQL with recovered data
```
---
## 🔍 **Step 6: Verify Data Recovery**
### 6.1 Check Cluster Status
```bash
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Cluster in healthy state, PRIMARY = postgres-shared-1
```
### 6.2 Test Database Connectivity
```bash
# Test connection
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "\l"
# Verify all application databases exist
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT datname, pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;
"
```
### 6.3 Verify Application Data
```bash
# Test specific application tables (example for Mastodon)
kubectl exec postgres-shared-1 -n postgresql-system -- psql mastodon_production -c "
SELECT COUNT(*) as total_accounts FROM accounts;
SELECT COUNT(*) as total_statuses FROM statuses;
"
```
---
## 📈 **Step 7: Scale to High Availability (Optional)**
### 7.1 Enable Replica Creation
```bash
# Scale cluster to 2 instances for HA
kubectl patch cluster postgres-shared -n postgresql-system -p '{
"spec": {
"instances": 2,
"minSyncReplicas": 0,
"maxSyncReplicas": 1
}
}'
```
### 7.2 Monitor Replica Join
```bash
# Watch replica creation and sync
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w
# Monitor replication lag
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
"
```
---
## 🔧 **Step 8: Application Connectivity (Service Aliases)**
### 8.1 Create Service Aliases for Application Compatibility
If your applications expect different service names (e.g., `postgresql-shared-*` vs `postgres-shared-*`):
```yaml
# postgresql-service-aliases.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql-shared-rw
  namespace: postgresql-system
  labels:
    cnpg.io/cluster: postgres-shared
spec:
  type: ClusterIP
  ports:
    - name: postgres
      port: 5432
      protocol: TCP
      targetPort: 5432
  selector:
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceRole: primary
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql-shared-ro
  namespace: postgresql-system
  labels:
    cnpg.io/cluster: postgres-shared
spec:
  type: ClusterIP
  ports:
    - name: postgres
      port: 5432
      protocol: TCP
      targetPort: 5432
  selector:
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceRole: replica
```
```bash
kubectl apply -f postgresql-service-aliases.yaml
```
### 8.2 Test Application Connectivity
```bash
# Test from application namespace
kubectl run test-connectivity --image=busybox --rm -it --restart=Never -- nc -zv postgresql-shared-rw.postgresql-system.svc.cluster.local 5432
```
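The `nc` probe only proves the TCP port is reachable. For an end-to-end check you can run a throwaway psql client against the alias; this sketch assumes a public `postgres:16` image is pullable and that you substitute real credentials for the placeholder:
```bash
# One-off psql client pod; replace <app-password> with a real role password
kubectl run psql-check --rm -it --restart=Never --image=postgres:16 \
  --env="PGPASSWORD=<app-password>" -- \
  psql -h postgresql-shared-rw.postgresql-system.svc.cluster.local \
       -U postgres -c "SELECT version();"
```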
---
## 🚨 **Troubleshooting Common Issues**
### Issue 1: Cluster Starts in initdb Mode (Data Loss Risk!)
**Symptoms**: Logs show "Initializing empty database"
**Solution**:
1. **IMMEDIATELY** re-hibernate the cluster to stop the instance pods (CloudNativePG does not allow `instances: 0`)
2. Verify PVC adoption annotations are correct
3. Check that hibernation was properly applied before waking
```bash
kubectl annotate cluster postgres-shared -n postgresql-system cnpg.io/hibernation=on --overwrite
```
### Issue 2: PVC Binding Fails
**Symptoms**: PVCs stuck in "Pending" state
**Solution**:
1. Check PV/PVC UUID matching
2. Verify PV `claimRef` points to correct PVC
3. Ensure storage class exists
```bash
kubectl describe pvc postgres-shared-1 -n postgresql-system
kubectl describe pv postgres-shared-data-recovered-pv
```
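Comparing the two sides of the binding directly often reveals the mismatch faster than reading full describe output:
```bash
# The PV's claimRef and the PVC's metadata/volumeName must agree
kubectl get pv postgres-shared-data-recovered-pv \
  -o jsonpath='claimRef: {.spec.claimRef.namespace}/{.spec.claimRef.name} uid={.spec.claimRef.uid}{"\n"}'
kubectl get pvc postgres-shared-1 -n postgresql-system \
  -o jsonpath='uid={.metadata.uid} volumeName={.spec.volumeName}{"\n"}'
```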
### Issue 3: Pod Restart Loops
**Symptoms**: Pod continuously restarting with health check failures
**Solutions**:
1. Check Cilium network policies allow PostgreSQL traffic
2. Verify PostgreSQL data directory permissions
3. Check for TLS/SSL configuration issues
```bash
# Check data directory ownership and permissions. CNPG containers run as a
# non-root user, so running chown inside the pod will usually fail; fix
# ownership through the volume's fsGroup / Longhorn settings instead.
kubectl exec postgres-shared-1 -n postgresql-system -- ls -ld /var/lib/postgresql/data/pgdata
```
### Issue 4: Replica Won't Join
**Symptoms**: Second instance fails to join with replication errors
**Solutions**:
1. Check primary is stable before adding replica
2. Verify network connectivity between pods
3. Monitor WAL streaming logs
```bash
# Check replication status
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "SELECT * FROM pg_stat_replication;"
```
---
## 📋 **Recovery Checklist**
**Pre-Recovery:**
- [ ] Backup current cluster state (if any)
- [ ] Identify Longhorn backup volume names
- [ ] Prepare fresh namespace if needed
- [ ] Verify Longhorn operator is functional
**Volume Restoration:**
- [ ] Restore data volume from Longhorn backup
- [ ] Restore WAL volume from Longhorn backup
- [ ] Create PersistentVolumes for restored data
- [ ] Verify volumes are healthy in Longhorn UI
**Cluster Recovery:**
- [ ] Create adoption PVCs with correct annotations
- [ ] Deploy cluster in hibernation mode
- [ ] Read PVC UIDs for the claimRef binding
- [ ] Patch PVs with claimRef binding
- [ ] Patch PVCs with volumeName binding
- [ ] Verify PVC binding before proceeding
**Startup:**
- [ ] Remove hibernation annotation
- [ ] Monitor pod startup logs for data adoption
- [ ] Verify cluster reaches healthy state
- [ ] Test database connectivity
**Validation:**
- [ ] Verify all application databases exist
- [ ] Test application table row counts
- [ ] Check database sizes match expectations
- [ ] Test application connectivity
**HA Setup (Optional):**
- [ ] Scale to 2+ instances
- [ ] Monitor replica join process
- [ ] Verify replication is working
- [ ] Test failover scenarios
**Cleanup:**
- [ ] Remove temporary PVs/PVCs
- [ ] Update backup configurations
- [ ] Document any configuration changes
- [ ] Test regular backup/restore procedures
---
## ⚠️ **CRITICAL SUCCESS FACTORS**
1. **🔑 Hibernation is MANDATORY**: Never start a cluster without hibernation when adopting existing data
2. **🔑 Single Instance First**: Always recover to single instance, then scale to HA
3. **🔑 Binding References**: The PV's `claimRef` must reference the PVC (name, namespace, UID) and the PVC's `volumeName` must name the recovered PV
4. **🔑 Adoption Annotations**: CloudNativePG annotations must be present on PVCs
5. **🔑 Volume Naming**: PVC names must match CloudNativePG instance naming convention
6. **🔑 Network Policies**: Ensure Cilium policies allow PostgreSQL traffic
7. **🔑 Monitor Logs**: Watch startup logs carefully for data adoption confirmation
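A quick way to spot-check several of these factors at once (the last command assumes the `cnpg` kubectl plugin is installed):
```bash
# Cluster phase, adopted PVCs, and instance/replication overview
kubectl get cluster postgres-shared -n postgresql-system
kubectl get pvc -n postgresql-system -l cnpg.io/cluster=postgres-shared
kubectl cnpg status postgres-shared -n postgresql-system
```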
---
## 📚 **Additional Resources**
- [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
- [Longhorn Backup & Restore](https://longhorn.io/docs/1.4.0/volumes-and-nodes/backup-and-restore/)
- [Kubernetes Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)
- [PostgreSQL Recovery Documentation](https://www.postgresql.org/docs/current/backup-dump.html)
---
**🎉 Walk through this procedure end to end in a non-production environment and confirm each step before relying on it in production.**