This one was AI-generated and I don't think it's quite right. I'll go through it later; I'm leaving it for reference.

PostgreSQL CloudNativePG Disaster Recovery Guide

🚨 CRITICAL: When to Use This Guide

This guide is for catastrophic failure scenarios where:

  • CloudNativePG cluster is completely broken/corrupted
  • Longhorn volume backups are available (S3 or local snapshots)
  • Normal CloudNativePG recovery methods have failed
  • You need to restore from Longhorn backup volumes

⚠️ WARNING: This process involves temporary data exposure and should only be used when standard recovery fails.


📋 Overview: Volume Adoption Strategy

The key insight for CloudNativePG disaster recovery is using Volume Adoption:

  1. Restore Longhorn volumes from backup
  2. Create fresh PVCs with adoption annotations
  3. Deploy cluster with hibernation to prevent initdb data erasure
  4. Retarget PVCs to restored volumes
  5. Wake cluster to adopt existing data

🛠️ Step 1: Prepare for Recovery

1.1 Clean Up Failed Cluster
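
Optionally, capture the broken cluster's manifests first so you have a record of what gets deleted (this matches the "Backup current cluster state" item in the recovery checklist below):

# Snapshot the broken cluster and its PVCs before removing them
kubectl get cluster postgres-shared -n postgresql-system -o yaml > postgres-shared-broken-cluster.yaml
kubectl get pvc -n postgresql-system -l cnpg.io/cluster=postgres-shared -o yaml > postgres-shared-broken-pvcs.yaml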

# Remove broken cluster (DANGER: This deletes the cluster)
kubectl delete cluster postgres-shared -n postgresql-system

# Remove old PVCs if corrupted
kubectl delete pvc -n postgresql-system -l cnpg.io/cluster=postgres-shared

1.2 Identify Backup Volumes

# List available Longhorn backup volumes and backups
# (CRD names for Longhorn; confirm the exact names with: kubectl api-resources | grep longhorn)
kubectl get backupvolumes.longhorn.io -n longhorn-system
kubectl get backups.longhorn.io -n longhorn-system

# Note the backup names for the data and WAL volumes, for example:
# - postgres-shared-data-backup-20240809
# - postgres-shared-wal-backup-20240809
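
To inspect a specific backup from the CLI before restoring (assuming the Backup CRD names above; kubectl describe works on any registered resource):

# Show details of one backup, including its source volume and creation time
kubectl describe backups.longhorn.io <backup-name> -n longhorn-system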

🔄 Step 2: Restore Longhorn Volumes

2.1 Create Volume Restore Jobs

# longhorn-restore-data.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-shared-data-recovered
  namespace: longhorn-system
spec:
  size: "400Gi"
  numberOfReplicas: 2
  fromBackup: "s3://your-bucket/@/longhorn?backup=backup-abcd1234&volume=postgres-shared-data"
  # Replace with actual backup URL from Longhorn UI
---
# longhorn-restore-wal.yaml  
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-shared-wal-recovered
  namespace: longhorn-system
spec:
  size: "100Gi" 
  numberOfReplicas: 2
  fromBackup: "s3://your-bucket/@/longhorn?backup=backup-efgh5678&volume=postgres-shared-wal"
  # Replace with actual backup URL from Longhorn UI

Apply the restores:

kubectl apply -f longhorn-restore-data.yaml
kubectl apply -f longhorn-restore-wal.yaml

# Monitor restore progress
kubectl get volumes -n longhorn-system | grep recovered
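
Optionally, confirm each restored volume is healthy before creating PVs for it. This is a minimal check that assumes Longhorn's Volume CRD exposes state and robustness in its status; verify the field names against your Longhorn version:

kubectl -n longhorn-system get volumes.longhorn.io postgres-shared-data-recovered -o jsonpath='{.status.state} {.status.robustness}{"\n"}'
kubectl -n longhorn-system get volumes.longhorn.io postgres-shared-wal-recovered -o jsonpath='{.status.state} {.status.robustness}{"\n"}'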

2.2 Create PersistentVolumes for Restored Data

# postgres-recovered-pvs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-shared-data-recovered-pv
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
    volumeHandle: postgres-shared-data-recovered
---
apiVersion: v1  
kind: PersistentVolume
metadata:
  name: postgres-shared-wal-recovered-pv
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
    volumeHandle: postgres-shared-wal-recovered

Apply the PersistentVolumes:

kubectl apply -f postgres-recovered-pvs.yaml
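
The new PersistentVolumes should report STATUS = Available until the adoption PVCs claim them in Step 4:

kubectl get pv | grep recovered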

🎯 Step 3: Create Fresh Cluster with Volume Adoption

3.1 Create Adoption PVCs

# postgres-adoption-pvcs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-shared-1
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: CloudNativePG adoption annotations
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceName: postgres-shared-1  
    cnpg.io/podRole: instance
    # 🔑 CRITICAL: Prevent volume binding to wrong PV
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 400Gi
  storageClassName: longhorn-retain
  # 🔑 CRITICAL: This will be retargeted to the recovered data PV in Step 4.
  # If the longhorn-retain StorageClass provisions volumes immediately, a fresh
  # empty volume may bind here first; in that case set volumeName to
  # postgres-shared-data-recovered-pv at creation time instead.
  volumeName: ""  # Leave empty initially
---
apiVersion: v1
kind: PersistentVolumeClaim  
metadata:
  name: postgres-shared-1-wal
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: CloudNativePG adoption annotations
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceName: postgres-shared-1
    cnpg.io/podRole: instance
    cnpg.io/pvcRole: wal
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: longhorn-retain
  # 🔑 CRITICAL: This will be retargeted to the recovered WAL PV in Step 4
  # (same caveat as the data PVC above regarding immediate provisioning)
  volumeName: ""  # Leave empty initially

Apply the adoption PVCs:

kubectl apply -f postgres-adoption-pvcs.yaml
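
Before moving on, confirm the adoption PVCs are still Pending and have not been handed a freshly provisioned (empty) volume. If one has bound to a new PV, delete it and recreate it with spec.volumeName pointing at the recovered PV, as noted above:

kubectl get pvc -n postgresql-system
# Expected here: postgres-shared-1 and postgres-shared-1-wal in STATUS = Pending with no VOLUME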

3.2 Deploy Cluster in Hibernation Mode

🚨 CRITICAL: The cluster MUST start in hibernation to prevent initdb from erasing your data!

# postgres-shared-recovery.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-shared
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: Hibernation prevents startup and data erasure
    cnpg.io/hibernation: "on"
spec:
  instances: 1
  
  # 🔑 CRITICAL: Single instance prevents replication conflicts during recovery
  minSyncReplicas: 0
  maxSyncReplicas: 0
  
  postgresql:
    parameters:
      # Performance and stability settings for recovery
      max_connections: "200"
      shared_buffers: "256MB" 
      effective_cache_size: "1GB"
      maintenance_work_mem: "64MB"
      checkpoint_completion_target: "0.9"
      wal_buffers: "16MB"
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"
      
      # 🔑 CRITICAL: Minimal logging during recovery
      log_min_messages: "warning"
      log_min_error_statement: "error"
      log_statement: "none"

  bootstrap:
    # 🔑 CRITICAL: initdb bootstrap (NOT recovery mode).
    # Hibernation is what keeps this initdb from running against the
    # volumes before they are retargeted to the recovered data.
    initdb:
      database: postgres
      owner: postgres
      
  storage:
    size: 400Gi
    storageClass: longhorn-retain
    
  walStorage:
    size: 100Gi
    storageClass: longhorn-retain

  # 🔑 CRITICAL: Extended timeouts for recovery scenarios
  startDelay: 3600  # 1 hour delay
  stopDelay: 1800   # 30 minute stop delay
  switchoverDelay: 1800  # 30 minute switchover delay

  monitoring:
    enablePodMonitor: true
    
  # Backup configuration (restore after recovery)
  backup:
    retentionPolicy: "7d"
    barmanObjectStore:
      destinationPath: "s3://your-backup-bucket/postgres-shared"
      # Configure after cluster is stable

Apply the recovery cluster:

kubectl apply -f postgres-shared-recovery.yaml

# Verify cluster is hibernated (pods should NOT start)
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Hibernation
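
Because the cluster is hibernated, no instance pods should exist yet:

kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared
# Expected: No resources found in postgresql-system namespace.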

🔗 Step 4: Retarget PVCs to Restored Data

4.1 Look Up the Adoption PVC UIDs

Kubernetes assigns metadata.uid itself and it cannot be patched, so instead of generating new UUIDs we read the UIDs of the adoption PVCs and reference them from each PV's claimRef:

# Capture the UIDs of the adoption PVCs created in Step 3.1
DATA_PVC_UID=$(kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.metadata.uid}')
WAL_PVC_UID=$(kubectl get pvc postgres-shared-1-wal -n postgresql-system -o jsonpath='{.metadata.uid}')

echo "Data PVC UID: $DATA_PVC_UID"
echo "WAL PVC UID: $WAL_PVC_UID"

4.2 Patch PVs with claimRef Binding

# Reserve the data PV for the data PVC
kubectl patch pv postgres-shared-data-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"apiVersion\": \"v1\",
      \"kind\": \"PersistentVolumeClaim\",
      \"name\": \"postgres-shared-1\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$DATA_PVC_UID\"
    }
  }
}"

# Reserve the WAL PV for the WAL PVC
kubectl patch pv postgres-shared-wal-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"apiVersion\": \"v1\",
      \"kind\": \"PersistentVolumeClaim\",
      \"name\": \"postgres-shared-1-wal\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$WAL_PVC_UID\"
    }
  }
}"

4.3 Patch PVCs with volumeName Binding

# Point the data PVC at the recovered data PV
kubectl patch pvc postgres-shared-1 -n postgresql-system -p '{"spec":{"volumeName":"postgres-shared-data-recovered-pv"}}'

# Point the WAL PVC at the recovered WAL PV
kubectl patch pvc postgres-shared-1-wal -n postgresql-system -p '{"spec":{"volumeName":"postgres-shared-wal-recovered-pv"}}'

4.4 Verify PVC Binding

kubectl get pvc -n postgresql-system
# Both PVCs should show STATUS = Bound
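
To confirm each claim bound to the intended recovered PV rather than a freshly provisioned one:

kubectl get pvc -n postgresql-system -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,VOLUME:.spec.volumeName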

🌅 Step 5: Wake Cluster from Hibernation

5.1 Turn Off Hibernation

# 🔑 CRITICAL: This starts the cluster with your restored data
kubectl annotate cluster postgres-shared -n postgresql-system cnpg.io/hibernation=off --overwrite

# Monitor cluster startup
kubectl get cluster postgres-shared -n postgresql-system -w

5.2 Monitor Pod Startup

# Watch pod creation and startup
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w

# Check logs for successful data adoption
kubectl logs postgres-shared-1 -n postgresql-system -f

🔍 What to Look For in the Logs:

The instance should detect the existing data directory and skip initialization (messages along the lines of "Database directory appears to contain a database; skipping initialization"). If you see anything like "Initializing empty database" instead, stop immediately and follow Issue 1 in the Troubleshooting section.
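
A quick way to scan the startup logs for the relevant lines (a simple grep; adjust the patterns to taste):

kubectl logs postgres-shared-1 -n postgresql-system | grep -iE 'database directory|skipping|initializ'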

🔍 Step 6: Verify Data Recovery

6.1 Check Cluster Status

kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Cluster in healthy state, PRIMARY = postgres-shared-1

6.2 Test Database Connectivity

# Test connection
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "\l"

# Verify all application databases exist
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT datname, pg_size_pretty(pg_database_size(datname)) as size 
FROM pg_database 
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;
"

6.3 Verify Application Data

# Test specific application tables (example for Mastodon)
kubectl exec postgres-shared-1 -n postgresql-system -- psql mastodon_production -c "
SELECT COUNT(*) as total_accounts FROM accounts;
SELECT COUNT(*) as total_statuses FROM statuses;
"

📈 Step 7: Scale to High Availability (Optional)

7.1 Enable Replica Creation

# Scale cluster to 2 instances for HA
kubectl patch cluster postgres-shared -n postgresql-system -p '{
  "spec": {
    "instances": 2,
    "minSyncReplicas": 0,
    "maxSyncReplicas": 1
  }
}'

7.2 Monitor Replica Join

# Watch replica creation and sync
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w

# Monitor replication lag
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag 
FROM pg_stat_replication;
"

🔧 Step 8: Application Connectivity (Service Aliases)

8.1 Create Service Aliases for Application Compatibility

If your applications expect different service names (e.g., postgresql-shared-* vs postgres-shared-*):

# postgresql-service-aliases.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql-shared-rw
  namespace: postgresql-system
  labels:
    cnpg.io/cluster: postgres-shared
spec:
  type: ClusterIP
  ports:
  - name: postgres
    port: 5432
    protocol: TCP
    targetPort: 5432
  selector:
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceRole: primary
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql-shared-ro
  namespace: postgresql-system
  labels:
    cnpg.io/cluster: postgres-shared
spec:
  type: ClusterIP
  ports:
  - name: postgres
    port: 5432
    protocol: TCP 
    targetPort: 5432
  selector:
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceRole: replica

Apply the service aliases:

kubectl apply -f postgresql-service-aliases.yaml

8.2 Test Application Connectivity

# Test from a throwaway pod inside the cluster
kubectl run test-connectivity --image=busybox --rm -it --restart=Never -- nc -zv postgresql-shared-rw.postgresql-system.svc.cluster.local 5432
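
It is also worth checking the read-only alias (note it only has endpoints once a replica exists, see Step 7):

kubectl run test-connectivity-ro --image=busybox --rm -it --restart=Never -- nc -zv postgresql-shared-ro.postgresql-system.svc.cluster.local 5432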

🚨 Troubleshooting Common Issues

Issue 1: Cluster Starts in initdb Mode (Data Loss Risk!)

Symptoms: Logs show "Initializing empty database"

Solution:

  1. IMMEDIATELY scale cluster to 0 instances
  2. Verify PVC adoption annotations are correct
  3. Check that hibernation was properly used

kubectl patch cluster postgres-shared -n postgresql-system -p '{"spec":{"instances":0}}'
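
To review the adoption annotations and bound volume on the data PVC:

kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.metadata.annotations}{"\n"}{.spec.volumeName}{"\n"}'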

Issue 2: PVC Binding Fails

Symptoms: PVCs stuck in "Pending" state

Solution:

  1. Check PV/PVC UUID matching
  2. Verify PV claimRef points to correct PVC
  3. Ensure storage class exists

kubectl describe pvc postgres-shared-1 -n postgresql-system
kubectl describe pv postgres-shared-data-recovered-pv
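
Two quick checks that cover the usual causes:

# The PV's claimRef should name the adoption PVC exactly
kubectl get pv postgres-shared-data-recovered-pv -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}'

# The storage class referenced by the PVCs must exist
kubectl get storageclass longhorn-retain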

Issue 3: Pod Restart Loops

Symptoms: Pod continuously restarting with health check failures

Solutions:

  1. Check Cilium network policies allow PostgreSQL traffic
  2. Verify PostgreSQL data directory permissions
  3. Check for TLS/SSL configuration issues

# Fix common permission issues on the data directory
# (CNPG containers run as a non-root user, so chown from inside the pod may be
#  denied; if so, fix ownership from a privileged debug pod that mounts the volume)
kubectl exec postgres-shared-1 -n postgresql-system -- chown -R postgres:postgres /var/lib/postgresql/data
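
The pod's events usually point at the failing probe, mount, or permission problem:

kubectl describe pod postgres-shared-1 -n postgresql-system
kubectl get events -n postgresql-system --sort-by=.lastTimestamp | tail -20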

Issue 4: Replica Won't Join

Symptoms: Second instance fails to join with replication errors

Solutions:

  1. Check primary is stable before adding replica
  2. Verify network connectivity between pods
  3. Monitor WAL streaming logs

# Check replication status
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "SELECT * FROM pg_stat_replication;"
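
Also check the joining replica's own logs; the instance name depends on the serial CloudNativePG assigns, so postgres-shared-2 here is only the likely name:

kubectl logs postgres-shared-2 -n postgresql-system --tail=50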

📋 Recovery Checklist

Pre-Recovery:

  • Backup current cluster state (if any)
  • Identify Longhorn backup volume names
  • Prepare fresh namespace if needed
  • Verify Longhorn operator is functional

Volume Restoration:

  • Restore data volume from Longhorn backup
  • Restore WAL volume from Longhorn backup
  • Create PersistentVolumes for restored data
  • Verify volumes are healthy in Longhorn UI

Cluster Recovery:

  • Create adoption PVCs with correct annotations
  • Deploy cluster in hibernation mode
  • Look up the adoption PVC UIDs
  • Patch PVs with claimRef binding
  • Patch PVCs with volumeName binding
  • Verify PVC binding before proceeding

Startup:

  • Remove hibernation annotation
  • Monitor pod startup logs for data adoption
  • Verify cluster reaches healthy state
  • Test database connectivity

Validation:

  • Verify all application databases exist
  • Test application table row counts
  • Check database sizes match expectations
  • Test application connectivity

HA Setup (Optional):

  • Scale to 2+ instances
  • Monitor replica join process
  • Verify replication is working
  • Test failover scenarios

Cleanup:

  • Remove temporary PVs/PVCs
  • Update backup configurations
  • Document any configuration changes
  • Test regular backup/restore procedures

⚠️ CRITICAL SUCCESS FACTORS

  1. 🔑 Hibernation is MANDATORY: Never start a cluster without hibernation when adopting existing data
  2. 🔑 Single Instance First: Always recover to single instance, then scale to HA
  3. 🔑 Explicit Binding: PV claimRef and PVC volumeName must reference each other exactly for binding
  4. 🔑 Adoption Annotations: CloudNativePG annotations must be present on PVCs
  5. 🔑 Volume Naming: PVC names must match CloudNativePG instance naming convention
  6. 🔑 Network Policies: Ensure Cilium policies allow PostgreSQL traffic
  7. 🔑 Monitor Logs: Watch startup logs carefully for data adoption confirmation

⚠️ Reminder: as noted at the top, this procedure is AI-generated and has not been fully verified yet. Walk through it and test it against a non-production restore before relying on it in an emergency.