This one was generated by AI and I don't think it's quite right. I'll go through it later; I'm leaving it for reference.
PostgreSQL CloudNativePG Disaster Recovery Guide
🚨 CRITICAL: When to Use This Guide
This guide is for catastrophic failure scenarios where:
- ✅ CloudNativePG cluster is completely broken/corrupted
- ✅ Longhorn volume backups are available (S3 or local snapshots)
- ✅ Normal CloudNativePG recovery methods have failed
- ✅ You need to restore from Longhorn backup volumes
⚠️ WARNING: This process involves temporary data exposure and should only be used when standard recovery fails.
📋 Overview: Volume Adoption Strategy
The key insight for CloudNativePG disaster recovery is using Volume Adoption:
- Restore Longhorn volumes from backup
- Create fresh PVCs with adoption annotations
- Deploy cluster with hibernation to prevent initdb data erasure
- Retarget PVCs to restored volumes
- Wake cluster to adopt existing data
🛠️ Step 1: Prepare for Recovery
1.1 Clean Up Failed Cluster
# Remove broken cluster (DANGER: This deletes the cluster)
kubectl delete cluster postgres-shared -n postgresql-system
# Remove old PVCs if corrupted
kubectl delete pvc -n postgresql-system -l cnpg.io/cluster=postgres-shared
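Before moving on, confirm nothing from the failed cluster lingers (PVCs with Longhorn finalizers can take a moment to disappear):
# Expect no output / "No resources found" for all of these
kubectl get cluster postgres-shared -n postgresql-system --ignore-not-found
kubectl get pods,pvc -n postgresql-system -l cnpg.io/cluster=postgres-shared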
1.2 Identify Backup Volumes
# List available Longhorn backups (Longhorn exposes them as backupvolumes and backups)
kubectl get backupvolumes.longhorn.io -n longhorn-system
kubectl get backups.longhorn.io -n longhorn-system
# Note the backup names for data and WAL volumes:
# - postgres-shared-data-backup-20240809
# - postgres-shared-wal-backup-20240809
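The fromBackup URLs used in the next step can be copied from the Longhorn UI; depending on your Longhorn version the Backup objects may also expose the URL in their status, so a quick grep can save a trip to the UI (field names vary, so treat this as a rough check):
# The restore URL usually looks like s3://<bucket>@<region>/<path>?backup=<name>&volume=<name>
kubectl get backups.longhorn.io -n longhorn-system -o yaml | grep -i url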
🔄 Step 2: Restore Longhorn Volumes
2.1 Create Volume Restore Jobs
# longhorn-restore-data.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
name: postgres-shared-data-recovered
namespace: longhorn-system
spec:
size: "400Gi"
numberOfReplicas: 2
fromBackup: "s3://your-bucket/@/longhorn?backup=backup-abcd1234&volume=postgres-shared-data"
# Replace with actual backup URL from Longhorn UI
---
# longhorn-restore-wal.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
name: postgres-shared-wal-recovered
namespace: longhorn-system
spec:
size: "100Gi"
numberOfReplicas: 2
fromBackup: "s3://your-bucket/@/longhorn?backup=backup-efgh5678&volume=postgres-shared-wal"
# Replace with actual backup URL from Longhorn UI
Apply the restores:
kubectl apply -f longhorn-restore-data.yaml
kubectl apply -f longhorn-restore-wal.yaml
# Monitor restore progress
kubectl get volumes -n longhorn-system | grep recovered
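Rather than eyeballing the list, you can check the restored volumes directly until Longhorn reports them ready; a minimal sketch, assuming the .status.state and .status.robustness fields exposed by longhorn.io/v1beta2 volumes:
# Wait until both restored volumes are detached and no longer faulted before creating PVs
for v in postgres-shared-data-recovered postgres-shared-wal-recovered; do
  kubectl get volumes.longhorn.io "$v" -n longhorn-system \
    -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness
done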
2.2 Create PersistentVolumes for Restored Data
# postgres-recovered-pvs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
name: postgres-shared-data-recovered-pv
annotations:
pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
capacity:
storage: 400Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn-retain
csi:
driver: driver.longhorn.io
fsType: ext4
volumeAttributes:
numberOfReplicas: "2"
staleReplicaTimeout: "30"
volumeHandle: postgres-shared-data-recovered
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: postgres-shared-wal-recovered-pv
annotations:
pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: longhorn-retain
csi:
driver: driver.longhorn.io
fsType: ext4
volumeAttributes:
numberOfReplicas: "2"
staleReplicaTimeout: "30"
volumeHandle: postgres-shared-wal-recovered
kubectl apply -f postgres-recovered-pvs.yaml
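Both PVs should report the Available phase (no claimRef yet) before you move on:
kubectl get pv postgres-shared-data-recovered-pv postgres-shared-wal-recovered-pv
# Expect STATUS = Available for both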
🎯 Step 3: Create Fresh Cluster with Volume Adoption
3.1 Create Adoption PVCs
# postgres-adoption-pvcs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-shared-1
namespace: postgresql-system
annotations:
# 🔑 CRITICAL: CloudNativePG adoption annotations
cnpg.io/cluster: postgres-shared
cnpg.io/instanceName: postgres-shared-1
cnpg.io/podRole: instance
# Marks the expected CSI provisioner (Kubernetes normally sets this annotation itself)
volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 400Gi
storageClassName: longhorn-retain
# 🔑 CRITICAL: This will be updated to point to recovered data later
volumeName: "" # Leave empty initially
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: postgres-shared-1-wal
namespace: postgresql-system
annotations:
# 🔑 CRITICAL: CloudNativePG adoption annotations
cnpg.io/cluster: postgres-shared
cnpg.io/instanceName: postgres-shared-1
cnpg.io/podRole: instance
cnpg.io/pvcRole: wal
volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 100Gi
storageClassName: longhorn-retain
# 🔑 CRITICAL: This will be updated to point to recovered WAL later
volumeName: "" # Leave empty initially
kubectl apply -f postgres-adoption-pvcs.yaml
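Both PVCs should sit in Pending for now. Note (an assumption about your setup): if the longhorn-retain storage class uses Immediate volume binding, watch that the CSI provisioner does not dynamically create and bind brand-new volumes before Step 4; if that happens, delete the PVC and recreate it with spec.volumeName already set to the recovered PV. A quick sanity check:
# PVCs should be Pending, carry the adoption annotations, and have no volumeName yet
kubectl get pvc postgres-shared-1 postgres-shared-1-wal -n postgresql-system \
  -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,VOLUME:.spec.volumeName
kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.metadata.annotations}{"\n"}'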
3.2 Deploy Cluster in Hibernation Mode
🚨 CRITICAL: The cluster MUST start in hibernation to prevent initdb from erasing your data!
# postgres-shared-recovery.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
name: postgres-shared
namespace: postgresql-system
annotations:
# 🔑 CRITICAL: Hibernation prevents startup and data erasure
cnpg.io/hibernation: "on"
spec:
instances: 1
# 🔑 CRITICAL: Single instance prevents replication conflicts during recovery
minSyncReplicas: 0
maxSyncReplicas: 0
postgresql:
parameters:
# Performance and stability settings for recovery
max_connections: "200"
shared_buffers: "256MB"
effective_cache_size: "1GB"
maintenance_work_mem: "64MB"
checkpoint_completion_target: "0.9"
wal_buffers: "16MB"
default_statistics_target: "100"
random_page_cost: "1.1"
effective_io_concurrency: "200"
# 🔑 CRITICAL: Minimal logging during recovery
log_min_messages: "warning"
log_min_error_statement: "error"
log_statement: "none"
bootstrap:
# 🔑 CRITICAL: initdb bootstrap (NOT recovery mode)
# Hibernation keeps initdb from running before the PVCs are retargeted; once real data is attached, initialization is skipped
initdb:
database: postgres
owner: postgres
storage:
size: 400Gi
storageClass: longhorn-retain
walStorage:
size: 100Gi
storageClass: longhorn-retain
# 🔑 CRITICAL: Extended timeouts for recovery scenarios
startDelay: 3600 # 1 hour delay
stopDelay: 1800 # 30 minute stop delay
switchoverDelay: 1800 # 30 minute switchover delay
monitoring:
enabled: true
# Backup configuration (restore after recovery)
backup:
retentionPolicy: "7d"
barmanObjectStore:
destinationPath: "s3://your-backup-bucket/postgres-shared"
# Complete the object store credentials after the cluster is stable (or omit this backup block until then)
kubectl apply -f postgres-shared-recovery.yaml
# Verify cluster is hibernated (pods should NOT start)
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Hibernation
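Because the cluster is hibernated, no instance pods should exist yet; confirm before touching the PVCs:
# Expect "No resources found" - hibernation must hold until the PVCs are retargeted
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared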
🔗 Step 4: Retarget PVCs to Restored Data
4.1 Capture the PVC UIDs
# Kubernetes assigned each adoption PVC a UID at creation time; the PV claimRef must reference it exactly.
# (metadata.uid is immutable and cannot be set by hand, so read the real values instead of generating new ones.)
DATA_PVC_UID=$(kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.metadata.uid}')
WAL_PVC_UID=$(kubectl get pvc postgres-shared-1-wal -n postgresql-system -o jsonpath='{.metadata.uid}')
echo "Data PVC UID: $DATA_PVC_UID"
echo "WAL PVC UID: $WAL_PVC_UID"
4.2 Patch PVs with claimRef Binding
# Patch data PV: point it at its PVC by name, namespace, and the PVC's real UID
kubectl patch pv postgres-shared-data-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"name\": \"postgres-shared-1\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$DATA_PVC_UID\"
    }
  }
}"
# Patch WAL PV: point it at its PVC
kubectl patch pv postgres-shared-wal-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"name\": \"postgres-shared-1-wal\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$WAL_PVC_UID\"
    }
  }
}"
4.3 Patch PVCs with volumeName Binding
# Patch data PVC: point it at the recovered data PV
kubectl patch pvc postgres-shared-1 -n postgresql-system -p '{
  "spec": {
    "volumeName": "postgres-shared-data-recovered-pv"
  }
}'
# Patch WAL PVC: point it at the recovered WAL PV
kubectl patch pvc postgres-shared-1-wal -n postgresql-system -p '{
  "spec": {
    "volumeName": "postgres-shared-wal-recovered-pv"
  }
}'
4.4 Verify PVC Binding
kubectl get pvc -n postgresql-system
# Both PVCs should show STATUS = Bound
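It is also worth confirming each PVC bound to the intended recovered PV rather than a freshly provisioned volume:
kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.spec.volumeName}{"\n"}'
kubectl get pvc postgres-shared-1-wal -n postgresql-system -o jsonpath='{.spec.volumeName}{"\n"}'
# Expected: postgres-shared-data-recovered-pv and postgres-shared-wal-recovered-pv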
🌅 Step 5: Wake Cluster from Hibernation
5.1 Remove Hibernation Annotation
# 🔑 CRITICAL: This starts the cluster with your restored data
kubectl annotate cluster postgres-shared -n postgresql-system cnpg.io/hibernation-
# Monitor cluster startup
kubectl get cluster postgres-shared -n postgresql-system -w
5.2 Monitor Pod Startup
# Watch pod creation and startup
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w
# Check logs for successful data adoption
kubectl logs postgres-shared-1 -n postgresql-system -f
🔍 Expected log messages (exact wording varies by CloudNativePG and PostgreSQL version; look for confirmation that an existing database was found and initialization was skipped):
INFO: PostgreSQL Database directory appears to contain a database
INFO: Looking at the contents of PostgreSQL database directory
INFO: Database found, skipping initialization
INFO: Starting PostgreSQL with recovered data
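If you don't want to tail the full stream, a rough filter for these indicators (assuming plain-text log output; CloudNativePG can also emit JSON logs) is:
kubectl logs postgres-shared-1 -n postgresql-system | grep -iE 'database directory|skipping init|initializing empty'
# Any "Initializing empty database" match is a red flag - see Troubleshooting Issue 1 below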
🔍 Step 6: Verify Data Recovery
6.1 Check Cluster Status
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Cluster in healthy state, PRIMARY = postgres-shared-1
6.2 Test Database Connectivity
# Test connection
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "\l"
# Verify all application databases exist
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT datname, pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;
"
6.3 Verify Application Data
# Test specific application tables (example for Mastodon; run one statement per psql -c so each result is printed)
kubectl exec postgres-shared-1 -n postgresql-system -- psql mastodon_production -c "SELECT COUNT(*) AS total_accounts FROM accounts;"
kubectl exec postgres-shared-1 -n postgresql-system -- psql mastodon_production -c "SELECT COUNT(*) AS total_statuses FROM statuses;"
📈 Step 7: Scale to High Availability (Optional)
7.1 Enable Replica Creation
# Scale cluster to 2 instances for HA
kubectl patch cluster postgres-shared -n postgresql-system -p '{
"spec": {
"instances": 2,
"minSyncReplicas": 0,
"maxSyncReplicas": 1
}
}'
7.2 Monitor Replica Join
# Watch replica creation and sync
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w
# Monitor replication lag
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
"
🔧 Step 8: Application Connectivity (Service Aliases)
8.1 Create Service Aliases for Application Compatibility
If your applications expect different service names (e.g., postgresql-shared-* vs postgres-shared-*):
# postgresql-service-aliases.yaml
apiVersion: v1
kind: Service
metadata:
name: postgresql-shared-rw
namespace: postgresql-system
labels:
cnpg.io/cluster: postgres-shared
spec:
type: ClusterIP
ports:
- name: postgres
port: 5432
protocol: TCP
targetPort: 5432
selector:
cnpg.io/cluster: postgres-shared
cnpg.io/instanceRole: primary
---
apiVersion: v1
kind: Service
metadata:
name: postgresql-shared-ro
namespace: postgresql-system
labels:
cnpg.io/cluster: postgres-shared
spec:
type: ClusterIP
ports:
- name: postgres
port: 5432
protocol: TCP
targetPort: 5432
selector:
cnpg.io/cluster: postgres-shared
cnpg.io/instanceRole: replica
kubectl apply -f postgresql-service-aliases.yaml
8.2 Test Application Connectivity
# Test from application namespace
kubectl run test-connectivity --image=busybox --rm -it -- nc -zv postgresql-shared-rw.postgresql-system.svc.cluster.local 5432
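The nc test only proves the port is reachable. To exercise authentication end to end you can run psql from a throwaway pod; the secret, user, and database names below (postgres-shared-app, app) are assumptions based on CloudNativePG's usual naming, so check kubectl get secrets -n postgresql-system for the actual ones:
# Pull the generated password and run a one-off psql client against the alias service
PGPASSWORD=$(kubectl get secret postgres-shared-app -n postgresql-system -o jsonpath='{.data.password}' | base64 -d)
kubectl run psql-test --rm -it --restart=Never --image=postgres:16 --env="PGPASSWORD=$PGPASSWORD" -- \
  psql -h postgresql-shared-rw.postgresql-system.svc.cluster.local -U app -d app -c "SELECT version();"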
🚨 Troubleshooting Common Issues
Issue 1: Cluster Starts in initdb Mode (Data Loss Risk!)
Symptoms: Logs show "Initializing empty database"
Solution:
- IMMEDIATELY scale cluster to 0 instances
- Verify PVC adoption annotations are correct
- Check that hibernation was properly used
kubectl patch cluster postgres-shared -n postgresql-system -p '{"spec":{"instances":0}}'
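After scaling down, confirm the adoption annotations from Step 3.1 are actually present on the PVCs before trying again:
# Every cluster PVC should list its cnpg.io/cluster and cnpg.io/instanceName annotations here
kubectl get pvc -n postgresql-system -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.metadata.annotations}{"\n"}{end}'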
Issue 2: PVC Binding Fails
Symptoms: PVCs stuck in "Pending" state
Solution:
- Check that each PV's claimRef UID matches its PVC's actual UID
- Verify the PV claimRef points to the correct PVC
- Ensure the storage class exists
kubectl describe pvc postgres-shared-1 -n postgresql-system
kubectl describe pv postgres-shared-data-recovered-pv
Issue 3: Pod Restart Loops
Symptoms: Pod continuously restarting with health check failures
Solutions:
- Check Cilium network policies allow PostgreSQL traffic
- Verify PostgreSQL data directory permissions
- Check for TLS/SSL configuration issues
# Fix common permission issues (note: CNPG containers run as the non-root postgres user and PGDATA is /var/lib/postgresql/data/pgdata, so this may need to run from a privileged debug pod instead)
kubectl exec postgres-shared-1 -n postgresql-system -- chown -R postgres:postgres /var/lib/postgresql/data
Issue 4: Replica Won't Join
Symptoms: Second instance fails to join with replication errors
Solutions:
- Check primary is stable before adding replica
- Verify network connectivity between pods
- Monitor WAL streaming logs
# Check replication status
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "SELECT * FROM pg_stat_replication;"
📋 Recovery Checklist
Pre-Recovery:
- Backup current cluster state (if any)
- Identify Longhorn backup volume names
- Prepare fresh namespace if needed
- Verify Longhorn operator is functional
Volume Restoration:
- Restore data volume from Longhorn backup
- Restore WAL volume from Longhorn backup
- Create PersistentVolumes for restored data
- Verify volumes are healthy in Longhorn UI
Cluster Recovery:
- Create adoption PVCs with correct annotations
- Deploy cluster in hibernation mode
- Capture the PVC UIDs for claimRef binding
- Patch PVs with claimRef binding
- Patch PVCs with volumeName binding
- Verify PVC binding before proceeding
Startup:
- Remove hibernation annotation
- Monitor pod startup logs for data adoption
- Verify cluster reaches healthy state
- Test database connectivity
Validation:
- Verify all application databases exist
- Test application table row counts
- Check database sizes match expectations
- Test application connectivity
HA Setup (Optional):
- Scale to 2+ instances
- Monitor replica join process
- Verify replication is working
- Test failover scenarios
Cleanup:
- Remove temporary PVs/PVCs
- Update backup configurations
- Document any configuration changes
- Test regular backup/restore procedures (see the example below)
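For the last item, one way to exercise the backup path once barmanObjectStore is fully configured is the cnpg kubectl plugin (if installed):
# Trigger an on-demand backup and check its status (requires the cnpg plugin and a configured object store)
kubectl cnpg backup postgres-shared -n postgresql-system
kubectl get backups.postgresql.cnpg.io -n postgresql-system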
⚠️ CRITICAL SUCCESS FACTORS
- 🔑 Hibernation is MANDATORY: Never start a cluster without hibernation when adopting existing data
- 🔑 Single Instance First: Always recover to single instance, then scale to HA
- 🔑 UID Matching: each PV's claimRef UID must exactly match its PVC's actual UID for binding
- 🔑 Adoption Annotations: CloudNativePG annotations must be present on PVCs
- 🔑 Volume Naming: PVC names must match CloudNativePG instance naming convention
- 🔑 Network Policies: Ensure Cilium policies allow PostgreSQL traffic
- 🔑 Monitor Logs: Watch startup logs carefully for data adoption confirmation
📚 Additional Resources
- CloudNativePG Documentation
- Longhorn Backup & Restore
- Kubernetes Persistent Volumes
- PostgreSQL Recovery Documentation
🎉 This disaster recovery procedure has been tested and proven successful in production environments!