Files
Keybard-Vagabond-Demo/manifests/infrastructure/postgresql/README.md

341 lines
12 KiB
Markdown
Raw Normal View History

# PostgreSQL Infrastructure
This directory contains the CloudNativePG setup for high-availability PostgreSQL on the Kubernetes cluster.
## Architecture
- **3 PostgreSQL instances**: 1 primary + 2 replicas for high availability
- **Synchronous replication**: Zero data loss (RPO=0) configuration
- **Node distribution**: Instances distributed across n1, n2, and n3 nodes
- **Current cluster**: `postgres-shared` with instances `postgres-shared-2` (primary), `postgres-shared-4`, `postgres-shared-5`
- **Longhorn storage**: Single replica (PostgreSQL handles replication)
- **Shared cluster**: One PostgreSQL cluster that applications can share
## Components
### **Core Components**
- `namespace.yaml`: PostgreSQL system namespace
- `repository.yaml`: CloudNativePG Helm repository
- `operator.yaml`: CloudNativePG operator deployment
- `postgresql-storageclass.yaml`: Optimized storage class for PostgreSQL
- `cluster-shared.yaml`: Shared PostgreSQL cluster configuration
### **Monitoring Components**
- `postgresql-dashboard-metrics.yaml`: Custom metrics ConfigMap for enhanced monitoring
- `postgresql-dashboard-rbac.yaml`: RBAC permissions for metrics collection
- Built-in ServiceMonitor: Automatically configured for OpenObserve integration
### **Backup Components**
- `backup-config.yaml`: CloudNativePG backup configuration
- Longhorn integration: S3 backup via label-based volume selection
## Services Created
CloudNativePG automatically creates these services:
- `postgresql-shared-rw`: Write operations (connects to primary)
- `postgresql-shared-ro`: Read-only operations (connects to replicas)
- `postgresql-shared-r`: Read operations (connects to any instance)
## Connection Information
### For Applications
Applications should connect using these connection parameters:
**Write Operations:**
```yaml
host: postgresql-shared-rw.postgresql-system.svc.cluster.local
port: 5432
database: shared_db
username: shared_user
```
**Read Operations:**
```yaml
host: postgresql-shared-ro.postgresql-system.svc.cluster.local
port: 5432
database: shared_db
username: shared_user
```
### Getting Credentials
The PostgreSQL password is auto-generated and stored in a secret:
```bash
# Get the password for the shared_user
kubectl get secret postgresql-shared-app -n postgresql-system -o jsonpath="{.data.password}" | base64 -d
# Get the superuser password
kubectl get secret postgresql-shared-superuser -n postgresql-system -o jsonpath="{.data.password}" | base64 -d
```
## Application Integration Example
Here's how an application deployment would connect:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: example-app
spec:
template:
spec:
containers:
- name: app
image: example-app:latest
env:
- name: DB_HOST
value: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
- name: DB_PORT
value: "5432"
- name: DB_NAME
value: "shared_db"
- name: DB_USER
value: "shared_user"
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: postgresql-shared-app
key: password
```
## Monitoring
The PostgreSQL cluster includes comprehensive monitoring and observability:
### **Metrics & Monitoring** ✅ **OPERATIONAL**
- **Metrics Port**: 9187 - PostgreSQL metrics endpoint
- **ServiceMonitor**: Configured for OpenObserve integration
- **Built-in Metrics**: CloudNativePG provides extensive default metrics including:
- **Connection Metrics**: `cnpg_backends_total`, `cnpg_pg_settings_setting{name="max_connections"}`
- **Performance Metrics**: `cnpg_pg_stat_database_xact_commit`, `cnpg_pg_stat_database_xact_rollback`
- **Storage Metrics**: `cnpg_pg_database_size_bytes`, `cnpg_pg_stat_database_blks_hit`, `cnpg_pg_stat_database_blks_read`
- **Cluster Health**: `cnpg_collector_up`, `cnpg_collector_postgres_version`
- **Replication**: `cnpg_pg_stat_replication_*` metrics for streaming replication status
### **Custom Metrics System**
- **ConfigMap Support**: Custom queries can be defined via ConfigMaps
- **RBAC Configured**: PostgreSQL service account has permissions to read custom metrics ConfigMaps
- **Predefined Queries**: CloudNativePG includes `cnpg-default-monitoring` ConfigMap with standard queries
- **Monitoring Role**: Uses `pg_monitor` role for secure metrics collection
### **Dashboard Integration**
- **OpenObserve Ready**: All metrics automatically ingested into OpenObserve
- **Key Performance Indicators**:
- Connection utilization: `cnpg_backends_total / cnpg_pg_settings_setting{name="max_connections"} * 100`
- Buffer cache hit ratio: `cnpg_pg_stat_database_blks_hit / (cnpg_pg_stat_database_blks_hit + cnpg_pg_stat_database_blks_read) * 100`
- Transaction rate: `rate(cnpg_pg_stat_database_xact_commit[5m])`
- Rollback ratio: `cnpg_pg_stat_database_xact_rollback / (cnpg_pg_stat_database_xact_commit + cnpg_pg_stat_database_xact_rollback) * 100`
### **High Availability Monitoring**
- **Automatic Failover**: CloudNativePG handles primary/replica failover automatically
- **Health Checks**: Continuous health monitoring with automatic recovery
- **Streaming Replication**: Real-time replication status monitoring
## Backup Strategy
### **Longhorn Storage-Level Backups (Incremental)**
- **Daily backups**: 2 AM UTC, retain 14 days (2 weeks)
- **Weekly backups**: 1 AM Sunday, retain 8 weeks (2 months)
- **Snapshot cleanup**: 3 AM daily, keep 5 local snapshots
- **Target**: Backblaze B2 S3 storage via existing setup
- **Type**: Incremental (efficient change block detection)
### **CloudNativePG Application-Level Backups**
- **WAL archiving**: Continuous transaction log archiving
- **Point-in-time recovery**: Available via CloudNativePG
- **Retention**: 30-day backup retention policy
### **Backup Labels**
PostgreSQL volumes are automatically backed up based on labels:
```yaml
backup.longhorn.io/enable: "true"
app: postgresql-shared
```
## Scaling
To add more read replicas:
```yaml
# Edit cluster-shared.yaml
spec:
instances: 4 # Increase from 3 to 4 for additional read replica
```
## Troubleshooting
### **Cluster Status**
```bash
# Check cluster status
kubectl get cluster -n postgresql-system
kubectl describe cluster postgresql-shared -n postgresql-system
# Check pods
kubectl get pods -n postgresql-system
kubectl logs postgres-shared-2 -n postgresql-system # Current primary
```
### **Monitoring & Metrics**
```bash
# Check ServiceMonitor
kubectl get servicemonitor -n postgresql-system
kubectl describe servicemonitor postgresql-shared -n postgresql-system
# Check metrics endpoint directly
kubectl port-forward -n postgresql-system postgres-shared-2 9187:9187 # Primary instance
curl http://localhost:9187/metrics
# Check custom metrics ConfigMap
kubectl get configmap -n postgresql-system
kubectl describe configmap postgresql-dashboard-metrics -n postgresql-system
# Check RBAC permissions
kubectl get role,rolebinding -n postgresql-system
kubectl describe rolebinding postgresql-dashboard-metrics-reader -n postgresql-system
```
### **Port Forwarding**
Port forwarding allows you to connect to PostgreSQL from your local machine using standard database tools.
**⚠️ Important**: PostgreSQL requires SSL/TLS connections. When port forwarding, you must configure your client to handle SSL properly.
**Read-Only Replica (Load Balanced):**
```bash
# Forward to read-only service (load balances across all replicas)
kubectl port-forward -n postgresql-system svc/postgresql-shared-ro 5432:5432
# Get the password for shared_user
kubectl get secret postgres-shared-app -n postgresql-system -o jsonpath='{.data.password}' | base64 -d && echo
# Connect with SSL required (recommended):
# Connection string: postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=require
# Or configure your client:
# - host: localhost
# - port: 5432
# - database: shared_db
# - username: shared_user
# - password: <from secret above>
# - SSL mode: require (or disable for testing only)
```
**Specific Replica Pod:**
```bash
# List replica pods
kubectl get pods -n postgresql-system -l cnpg.io/instanceRole=replica
# Forward to specific replica pod (e.g., postgres-shared-4)
kubectl port-forward -n postgresql-system pod/postgres-shared-4 5432:5432
# Get the password for shared_user
kubectl get secret postgres-shared-app -n postgresql-system -o jsonpath='{.data.password}' | base64 -d && echo
# Connect with SSL required (recommended):
# Connection string: postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=require
# Or configure your client with SSL mode: require
```
**Primary (Read-Write) - For Maintenance Only:**
```bash
# Forward to read-write service (connects to primary)
kubectl port-forward -n postgresql-system svc/postgresql-shared-rw 5433:5432
# Note: Using port 5433 locally to avoid conflict if read-only is on 5432
# Get the password
kubectl get secret postgres-shared-app -n postgresql-system -o jsonpath='{.data.password}' | base64 -d && echo
# Connect using localhost:5433 with SSL mode: require
```
**SSL Configuration Notes:**
- **SSL is enabled** on PostgreSQL (ssl = on)
- For **port forwarding**, clients must explicitly configure SSL mode
- The server uses self-signed certificates, so clients will need to accept untrusted certificates
- For production clients connecting directly (not via port-forward), use proper SSL with CA verification
**Troubleshooting Port Forward "Broken Pipe" Errors:**
If you see `error: lost connection to pod` or `broken pipe` errors:
1. **Use direct pod port forwarding** instead of service port forwarding (more reliable):
```bash
# List available replica pods
kubectl get pods -n postgresql-system -l cnpg.io/instanceRole=replica
# Forward to specific replica pod (more stable)
kubectl port-forward -n postgresql-system pod/postgres-shared-4 5432:5432
```
2. **Configure your client with explicit SSL mode**:
- Use `sslmode=require` in your connection string (recommended)
- Or `sslmode=prefer` (allows fallback to non-SSL if SSL fails)
- Or `sslmode=disable` for testing only (not recommended)
3. **Connection string examples**:
```bash
# With SSL required (recommended)
postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=require
# With SSL preferred (allows fallback)
postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=prefer
# Without SSL (testing only)
postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=disable
```
**Getting the CA Certificate (for proper SSL verification):**
```bash
# Get the CA certificate from the cluster secret
kubectl get secret postgres-shared-ca -n postgresql-system -o jsonpath='{.data.ca\.crt}' | base64 -d > postgres-ca.crt
# Use with your client:
# Connection string: postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=verify-ca&sslrootcert=postgres-ca.crt
# Or configure your client to use the CA certificate file for SSL verification
```
### **Database Connection**
```bash
# Connect to PostgreSQL via exec
kubectl exec -it postgres-shared-2 -n postgresql-system -- psql -U shared_user -d shared_db
# Check replication status
kubectl exec -it postgres-shared-2 -n postgresql-system -- psql -U postgres -c "SELECT * FROM pg_stat_replication;"
# Check cluster health
kubectl exec -it postgres-shared-2 -n postgresql-system -- psql -U postgres -c "SELECT pg_is_in_recovery();"
```
### **Backup & Storage**
```bash
# Check PVC status
kubectl get pvc -n postgresql-system
kubectl describe pvc postgres-shared-2 -n postgresql-system # Primary instance PVC
# Check Longhorn volumes
kubectl get volumes -n longhorn-system
kubectl describe volume -n longhorn-system | grep postgresql
```
### **Long Running Queries**
When a long running query is happening, use this command on a node
```bash
kubectl exec -n postgresql-system postgres-shared-2 -- psql -U postgres -c "
SELECT
pid,
datname,
usename,
application_name,
now() - xact_start AS tx_duration,
now() - query_start AS query_duration,
state,
wait_event_type,
wait_event,
query
FROM pg_stat_activity
WHERE state != 'idle'
AND query NOT LIKE '%pg_stat_activity%'
AND (now() - xact_start > interval '10 seconds' OR now() - query_start > interval '10 seconds')
ORDER BY GREATEST(now() - xact_start, now() - query_start) DESC;
"
```