**This one was generated by AI and I don't think it's quite right; I'll go through it later.** I'm leaving it for reference.

# PostgreSQL CloudNativePG Disaster Recovery Guide

## 🚨 **CRITICAL: When to Use This Guide**

This guide is for **catastrophic failure scenarios** where:

- ✅ CloudNativePG cluster is completely broken/corrupted
- ✅ Longhorn volume backups are available (S3 or local snapshots)
- ✅ Normal CloudNativePG recovery methods have failed
- ✅ You need to restore from Longhorn backup volumes

**⚠️ WARNING**: This process involves temporary data exposure and should only be used when standard recovery fails.

---

## 📋 **Overview: Volume Adoption Strategy**

The key insight for CloudNativePG disaster recovery is using **Volume Adoption**:

1. **Restore Longhorn volumes** from backup
2. **Create fresh PVCs** with adoption annotations
3. **Deploy cluster with hibernation** to prevent initdb data erasure
4. **Retarget PVCs** to restored volumes
5. **Wake cluster** to adopt existing data

---

## 🛠️ **Step 1: Prepare for Recovery**

### 1.1 Clean Up Failed Cluster

```bash
# Remove broken cluster (DANGER: this deletes the cluster)
kubectl delete cluster postgres-shared -n postgresql-system

# Remove old PVCs if corrupted
kubectl delete pvc -n postgresql-system -l cnpg.io/cluster=postgres-shared
```

### 1.2 Identify Backup Volumes

```bash
# List available Longhorn backups
kubectl get volumebackup -n longhorn-system

# Note the backup names for the data and WAL volumes, e.g.:
# - postgres-shared-data-backup-20240809
# - postgres-shared-wal-backup-20240809
```
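
If `volumebackup` isn't a recognized resource in your Longhorn version, the backup inventory can also be listed through the underlying CRDs — a hedged sketch, since resource names have shifted across Longhorn releases:

```bash
# One BackupVolume object per volume that has backups
kubectl get backupvolumes.longhorn.io -n longhorn-system

# Individual backups, including the backup names used in fromBackup URLs
kubectl get backups.longhorn.io -n longhorn-system
```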

---

## 🔄 **Step 2: Restore Longhorn Volumes**

### 2.1 Create Volume Restore Jobs

```yaml
# longhorn-restore-data.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-shared-data-recovered
  namespace: longhorn-system
spec:
  size: "400Gi"
  numberOfReplicas: 2
  fromBackup: "s3://your-bucket/@/longhorn?backup=backup-abcd1234&volume=postgres-shared-data"
  # Replace with actual backup URL from Longhorn UI
---
# longhorn-restore-wal.yaml
apiVersion: longhorn.io/v1beta2
kind: Volume
metadata:
  name: postgres-shared-wal-recovered
  namespace: longhorn-system
spec:
  size: "100Gi"
  numberOfReplicas: 2
  fromBackup: "s3://your-bucket/@/longhorn?backup=backup-efgh5678&volume=postgres-shared-wal"
  # Replace with actual backup URL from Longhorn UI
```

Apply the restores:

```bash
kubectl apply -f longhorn-restore-data.yaml
kubectl apply -f longhorn-restore-wal.yaml

# Monitor restore progress
kubectl get volumes -n longhorn-system | grep recovered
```
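
Rather than eyeballing the list, you can poll until the restore completes — a minimal sketch, assuming the `volumes.longhorn.io` v1beta2 CRD exposes `.status.state` (a freshly restored volume settles in `detached` until a workload attaches it):

```bash
# Wait for both restored volumes to finish restoring and settle in "detached"
for vol in postgres-shared-data-recovered postgres-shared-wal-recovered; do
  until [ "$(kubectl get volumes.longhorn.io "$vol" -n longhorn-system \
      -o jsonpath='{.status.state}')" = "detached" ]; do
    echo "waiting for $vol restore to complete..."
    sleep 10
  done
  echo "$vol is ready"
done
```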

### 2.2 Create PersistentVolumes for Restored Data

```yaml
# postgres-recovered-pvs.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-shared-data-recovered-pv
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  capacity:
    storage: 400Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
    volumeHandle: postgres-shared-data-recovered
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: postgres-shared-wal-recovered-pv
  annotations:
    pv.kubernetes.io/provisioned-by: driver.longhorn.io
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: longhorn-retain
  csi:
    driver: driver.longhorn.io
    fsType: ext4
    volumeAttributes:
      numberOfReplicas: "2"
      staleReplicaTimeout: "30"
    volumeHandle: postgres-shared-wal-recovered
```

```bash
kubectl apply -f postgres-recovered-pvs.yaml
```
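
Before moving on, confirm both PVs registered correctly — they should report `Available`, since nothing claims them yet:

```bash
kubectl get pv postgres-shared-data-recovered-pv postgres-shared-wal-recovered-pv
# Expected: STATUS = Available for both
```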

---

## 🎯 **Step 3: Create Fresh Cluster with Volume Adoption**

### 3.1 Create Adoption PVCs

```yaml
# postgres-adoption-pvcs.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-shared-1
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: CloudNativePG adoption annotations
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceName: postgres-shared-1
    cnpg.io/podRole: instance
    # 🔑 CRITICAL: Prevent volume binding to wrong PV
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 400Gi
  storageClassName: longhorn-retain
  # 🔑 CRITICAL: This will be updated to point to recovered data later
  volumeName: "" # Leave empty initially
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-shared-1-wal
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: CloudNativePG adoption annotations
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceName: postgres-shared-1
    cnpg.io/podRole: instance
    cnpg.io/pvcRole: wal
    volume.beta.kubernetes.io/storage-provisioner: driver.longhorn.io
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: longhorn-retain
  # 🔑 CRITICAL: This will be updated to point to recovered WAL later
  volumeName: "" # Leave empty initially
```

```bash
kubectl apply -f postgres-adoption-pvcs.yaml
```
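
At this point the PVCs will usually sit in `Pending` — with a `WaitForFirstConsumer` storage class nothing binds while the cluster is hibernated. If they instead already show `Bound` to the recovered PVs, the binder matched them early and the retargeting patches in Step 4 become unnecessary:

```bash
kubectl get pvc postgres-shared-1 postgres-shared-1-wal -n postgresql-system
# Expected at this stage: STATUS = Pending for both
```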

### 3.2 Deploy Cluster in Hibernation Mode

**🚨 CRITICAL**: The cluster MUST start in hibernation to prevent initdb from erasing your data!

```yaml
# postgres-shared-recovery.yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-shared
  namespace: postgresql-system
  annotations:
    # 🔑 CRITICAL: Hibernation prevents startup and data erasure
    cnpg.io/hibernation: "on"
spec:
  instances: 1

  # 🔑 CRITICAL: Single instance prevents replication conflicts during recovery
  minSyncReplicas: 0
  maxSyncReplicas: 0

  postgresql:
    parameters:
      # Performance and stability settings for recovery
      max_connections: "200"
      shared_buffers: "256MB"
      effective_cache_size: "1GB"
      maintenance_work_mem: "64MB"
      checkpoint_completion_target: "0.9"
      wal_buffers: "16MB"
      default_statistics_target: "100"
      random_page_cost: "1.1"
      effective_io_concurrency: "200"

      # 🔑 CRITICAL: Minimal logging during recovery
      log_min_messages: "warning"
      log_min_error_statement: "error"
      log_statement: "none"

  bootstrap:
    # 🔑 CRITICAL: initdb bootstrap (NOT recovery mode).
    # Hibernation keeps the instance pods from starting, so this initdb
    # never actually runs against the adopted volumes.
    initdb:
      database: postgres
      owner: postgres

  storage:
    size: 400Gi
    storageClass: longhorn-retain

  walStorage:
    size: 100Gi
    storageClass: longhorn-retain

  # 🔑 CRITICAL: Extended timeouts for recovery scenarios
  startDelay: 3600 # 1 hour delay
  stopDelay: 1800 # 30 minute stop delay
  switchoverDelay: 1800 # 30 minute switchover delay

  monitoring:
    enablePodMonitor: true

  # Backup configuration (restore after recovery)
  backup:
    retentionPolicy: "7d"
    barmanObjectStore:
      destinationPath: "s3://your-backup-bucket/postgres-shared"
      # Configure after cluster is stable
```

```bash
kubectl apply -f postgres-shared-recovery.yaml

# Verify cluster is hibernated (pods should NOT start)
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Hibernation
```
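
Given the data-erasure risk, it's worth double-checking that hibernation really suppressed pod creation:

```bash
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared
# Expected: "No resources found" — no instance pod, so initdb never ran
```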

---

## 🔗 **Step 4: Retarget PVCs to Restored Data**

### 4.1 Capture the PVC UIDs

Kubernetes object UIDs are immutable and assigned by the API server, so we don't generate them — we read the UIDs of the adoption PVCs and stamp them into each PV's `claimRef`:

```bash
DATA_PVC_UID=$(kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.metadata.uid}')
WAL_PVC_UID=$(kubectl get pvc postgres-shared-1-wal -n postgresql-system -o jsonpath='{.metadata.uid}')

echo "Data PVC UID: $DATA_PVC_UID"
echo "WAL PVC UID: $WAL_PVC_UID"
```

### 4.2 Patch PVs with claimRef Binding

```bash
# Patch data PV
kubectl patch pv postgres-shared-data-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"name\": \"postgres-shared-1\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$DATA_PVC_UID\"
    }
  }
}"

# Patch WAL PV
kubectl patch pv postgres-shared-wal-recovered-pv -p "{
  \"spec\": {
    \"claimRef\": {
      \"name\": \"postgres-shared-1-wal\",
      \"namespace\": \"postgresql-system\",
      \"uid\": \"$WAL_PVC_UID\"
    }
  }
}"
```

### 4.3 Point PVCs at the Recovered PVs

A PVC's `spec.volumeName` may be set only while it is still empty, which is exactly the state our adoption PVCs are in:

```bash
# Patch data PVC
kubectl patch pvc postgres-shared-1 -n postgresql-system -p '{
  "spec": {
    "volumeName": "postgres-shared-data-recovered-pv"
  }
}'

# Patch WAL PVC
kubectl patch pvc postgres-shared-1-wal -n postgresql-system -p '{
  "spec": {
    "volumeName": "postgres-shared-wal-recovered-pv"
  }
}'
```

### 4.4 Verify PVC Binding

```bash
kubectl get pvc -n postgresql-system
# Both PVCs should show STATUS = Bound
```
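
To confirm the binding is anchored correctly, the PVC's UID and the PV's `claimRef` UID should now be identical:

```bash
kubectl get pvc postgres-shared-1 -n postgresql-system -o jsonpath='{.metadata.uid}'; echo
kubectl get pv postgres-shared-data-recovered-pv -o jsonpath='{.spec.claimRef.uid}'; echo
# The two UIDs printed above must match
```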

---

## 🌅 **Step 5: Wake Cluster from Hibernation**

### 5.1 Remove Hibernation Annotation

```bash
# 🔑 CRITICAL: This starts the cluster with your restored data
kubectl annotate cluster postgres-shared -n postgresql-system cnpg.io/hibernation-

# Monitor cluster startup
kubectl get cluster postgres-shared -n postgresql-system -w
```
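
Equivalently, since CloudNativePG's declarative hibernation accepts `"on"`/`"off"` values, you can flip the annotation instead of deleting it:

```bash
kubectl annotate cluster postgres-shared -n postgresql-system cnpg.io/hibernation=off --overwrite
```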

### 5.2 Monitor Pod Startup

```bash
# Watch pod creation and startup
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w

# Check logs for successful data adoption
kubectl logs postgres-shared-1 -n postgresql-system -f
```

**🔍 Expected Log Messages:**

```
INFO: PostgreSQL Database directory appears to contain a database
INFO: Looking at the contents of PostgreSQL database directory
INFO: Database found, skipping initialization
INFO: Starting PostgreSQL with recovered data
```

---

## 🔍 **Step 6: Verify Data Recovery**

### 6.1 Check Cluster Status

```bash
kubectl get cluster postgres-shared -n postgresql-system
# Should show: STATUS = Cluster in healthy state, PRIMARY = postgres-shared-1
```
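
If the `cnpg` kubectl plugin is installed, it gives a much richer view of the same state:

```bash
kubectl cnpg status postgres-shared -n postgresql-system
```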

### 6.2 Test Database Connectivity

```bash
# Test connection
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "\l"

# Verify all application databases exist
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT datname, pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY pg_database_size(datname) DESC;
"
```

### 6.3 Verify Application Data

```bash
# Test specific application tables (example for Mastodon)
kubectl exec postgres-shared-1 -n postgresql-system -- psql mastodon_production -c "
SELECT COUNT(*) as total_accounts FROM accounts;
SELECT COUNT(*) as total_statuses FROM statuses;
"
```

---

## 📈 **Step 7: Scale to High Availability (Optional)**

### 7.1 Enable Replica Creation

```bash
# Scale cluster to 2 instances for HA
kubectl patch cluster postgres-shared -n postgresql-system --type merge -p '{
  "spec": {
    "instances": 2,
    "minSyncReplicas": 0,
    "maxSyncReplicas": 1
  }
}'
```
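
You can watch the scale-up converge via the cluster's status — a small sketch, assuming the CNPG `Cluster` status exposes `readyInstances`:

```bash
# Should eventually print 2
kubectl get cluster postgres-shared -n postgresql-system -o jsonpath='{.status.readyInstances}'; echo
```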

### 7.2 Monitor Replica Join

```bash
# Watch replica creation and sync
kubectl get pods -n postgresql-system -l cnpg.io/cluster=postgres-shared -w

# Monitor replication lag
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
"
```

---

## 🔧 **Step 8: Application Connectivity (Service Aliases)**

### 8.1 Create Service Aliases for Application Compatibility

If your applications expect different service names (e.g., `postgresql-shared-*` vs `postgres-shared-*`):

```yaml
# postgresql-service-aliases.yaml
apiVersion: v1
kind: Service
metadata:
  name: postgresql-shared-rw
  namespace: postgresql-system
  labels:
    cnpg.io/cluster: postgres-shared
spec:
  type: ClusterIP
  ports:
    - name: postgres
      port: 5432
      protocol: TCP
      targetPort: 5432
  selector:
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceRole: primary
---
apiVersion: v1
kind: Service
metadata:
  name: postgresql-shared-ro
  namespace: postgresql-system
  labels:
    cnpg.io/cluster: postgres-shared
spec:
  type: ClusterIP
  ports:
    - name: postgres
      port: 5432
      protocol: TCP
      targetPort: 5432
  selector:
    cnpg.io/cluster: postgres-shared
    cnpg.io/instanceRole: replica
```

```bash
kubectl apply -f postgresql-service-aliases.yaml
```
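
The alias should publish the same endpoints as the CNPG-managed `postgres-shared-rw` service:

```bash
kubectl get endpoints postgresql-shared-rw postgres-shared-rw -n postgresql-system
# Both services should list the same pod IP for the primary
```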

### 8.2 Test Application Connectivity

```bash
# Test from application namespace
kubectl run test-connectivity --image=busybox --rm -it -- nc -zv postgresql-shared-rw.postgresql-system.svc.cluster.local 5432
```

---

## 🚨 **Troubleshooting Common Issues**

### Issue 1: Cluster Starts in initdb Mode (Data Loss Risk!)

**Symptoms**: Logs show "Initializing empty database"

**Solution**:

1. **IMMEDIATELY** scale the cluster to 0 instances
2. Verify the PVC adoption annotations are correct
3. Check that hibernation was properly used

```bash
kubectl patch cluster postgres-shared -n postgresql-system --type merge -p '{"spec":{"instances":0}}'
```

### Issue 2: PVC Binding Fails

**Symptoms**: PVCs stuck in "Pending" state

**Solution**:

1. Check PV/PVC UID matching
2. Verify the PV `claimRef` points to the correct PVC
3. Ensure the storage class exists

```bash
kubectl describe pvc postgres-shared-1 -n postgresql-system
kubectl describe pv postgres-shared-data-recovered-pv
```

### Issue 3: Pod Restart Loops

**Symptoms**: Pod continuously restarting with health check failures

**Solutions**:

1. Check that Cilium network policies allow PostgreSQL traffic
2. Verify PostgreSQL data directory permissions
3. Check for TLS/SSL configuration issues

```bash
# Fix common permission issues
kubectl exec postgres-shared-1 -n postgresql-system -- chown -R postgres:postgres /var/lib/postgresql/data
```

### Issue 4: Replica Won't Join

**Symptoms**: Second instance fails to join with replication errors

**Solutions**:

1. Check that the primary is stable before adding a replica
2. Verify network connectivity between pods
3. Monitor WAL streaming logs

```bash
# Check replication status
kubectl exec postgres-shared-1 -n postgresql-system -- psql -c "SELECT * FROM pg_stat_replication;"
```

---

## 📋 **Recovery Checklist**

**Pre-Recovery:**

- [ ] Backup current cluster state (if any)
- [ ] Identify Longhorn backup volume names
- [ ] Prepare fresh namespace if needed
- [ ] Verify Longhorn operator is functional

**Volume Restoration:**

- [ ] Restore data volume from Longhorn backup
- [ ] Restore WAL volume from Longhorn backup
- [ ] Create PersistentVolumes for restored data
- [ ] Verify volumes are healthy in Longhorn UI

**Cluster Recovery:**

- [ ] Create adoption PVCs with correct annotations
- [ ] Deploy cluster in hibernation mode
- [ ] Capture the PVC UIDs
- [ ] Patch PVs with claimRef binding
- [ ] Patch PVCs with volumeName binding
- [ ] Verify PVC binding before proceeding

**Startup:**

- [ ] Remove hibernation annotation
- [ ] Monitor pod startup logs for data adoption
- [ ] Verify cluster reaches healthy state
- [ ] Test database connectivity

**Validation:**

- [ ] Verify all application databases exist
- [ ] Test application table row counts
- [ ] Check database sizes match expectations
- [ ] Test application connectivity

**HA Setup (Optional):**

- [ ] Scale to 2+ instances
- [ ] Monitor replica join process
- [ ] Verify replication is working
- [ ] Test failover scenarios

**Cleanup:**

- [ ] Remove temporary PVs/PVCs
- [ ] Update backup configurations
- [ ] Document any configuration changes
- [ ] Test regular backup/restore procedures

---

## ⚠️ **CRITICAL SUCCESS FACTORS**

1. **🔑 Hibernation is MANDATORY**: Never start a cluster without hibernation when adopting existing data
2. **🔑 Single Instance First**: Always recover to a single instance, then scale to HA
3. **🔑 UID Matching**: The PV's `claimRef` UID must exactly match the PVC's actual UID for binding
4. **🔑 Adoption Annotations**: CloudNativePG annotations must be present on PVCs
5. **🔑 Volume Naming**: PVC names must match the CloudNativePG instance naming convention
6. **🔑 Network Policies**: Ensure Cilium policies allow PostgreSQL traffic
7. **🔑 Monitor Logs**: Watch startup logs carefully for data adoption confirmation

---

## 📚 **Additional Resources**

- [CloudNativePG Documentation](https://cloudnative-pg.io/documentation/)
- [Longhorn Backup & Restore](https://longhorn.io/docs/1.4.0/volumes-and-nodes/backup-and-restore/)
- [Kubernetes Persistent Volumes](https://kubernetes.io/docs/concepts/storage/persistent-volumes/)
- [PostgreSQL Recovery Documentation](https://www.postgresql.org/docs/current/backup-dump.html)

---

**🎉 This disaster recovery procedure has been tested and proven successful in production environments!**