Files
Michael DiLeo 7327d77dcd redaction (#1)
Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>
2025-12-24 13:40:47 +00:00
..
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00
2025-12-24 13:40:47 +00:00

PostgreSQL Infrastructure

This directory contains the CloudNativePG setup for high-availability PostgreSQL on the Kubernetes cluster.

Architecture

  • 3 PostgreSQL instances: 1 primary + 2 replicas for high availability
  • Synchronous replication: Zero data loss (RPO=0) configuration
  • Node distribution: Instances distributed across n1, n2, and n3 nodes
  • Current cluster: postgres-shared with instances postgres-shared-2 (primary), postgres-shared-4, postgres-shared-5
  • Longhorn storage: Single replica (PostgreSQL handles replication)
  • Shared cluster: One PostgreSQL cluster that applications can share

Components

Core Components

  • namespace.yaml: PostgreSQL system namespace
  • repository.yaml: CloudNativePG Helm repository
  • operator.yaml: CloudNativePG operator deployment
  • postgresql-storageclass.yaml: Optimized storage class for PostgreSQL
  • cluster-shared.yaml: Shared PostgreSQL cluster configuration

Monitoring Components

  • postgresql-dashboard-metrics.yaml: Custom metrics ConfigMap for enhanced monitoring
  • postgresql-dashboard-rbac.yaml: RBAC permissions for metrics collection
  • Built-in ServiceMonitor: Automatically configured for OpenObserve integration

Backup Components

  • backup-config.yaml: CloudNativePG backup configuration
  • Longhorn integration: S3 backup via label-based volume selection

Services Created

CloudNativePG automatically creates these services:

  • postgresql-shared-rw: Write operations (connects to primary)
  • postgresql-shared-ro: Read-only operations (connects to replicas)
  • postgresql-shared-r: Read operations (connects to any instance)

Connection Information

For Applications

Applications should connect using these connection parameters:

Write Operations:

host: postgresql-shared-rw.postgresql-system.svc.cluster.local
port: 5432
database: shared_db
username: shared_user

Read Operations:

host: postgresql-shared-ro.postgresql-system.svc.cluster.local
port: 5432
database: shared_db
username: shared_user

Getting Credentials

The PostgreSQL password is auto-generated and stored in a secret:

# Get the password for the shared_user
kubectl get secret postgresql-shared-app -n postgresql-system -o jsonpath="{.data.password}" | base64 -d

# Get the superuser password
kubectl get secret postgresql-shared-superuser -n postgresql-system -o jsonpath="{.data.password}" | base64 -d

Application Integration Example

Here's how an application deployment would connect:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  template:
    spec:
      containers:
      - name: app
        image: example-app:latest
        env:
        - name: DB_HOST
          value: "postgresql-shared-rw.postgresql-system.svc.cluster.local"
        - name: DB_PORT
          value: "5432"
        - name: DB_NAME
          value: "shared_db"
        - name: DB_USER
          value: "shared_user"
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: postgresql-shared-app
              key: password

Monitoring

The PostgreSQL cluster includes comprehensive monitoring and observability:

Metrics & Monitoring OPERATIONAL

  • Metrics Port: 9187 - PostgreSQL metrics endpoint
  • ServiceMonitor: Configured for OpenObserve integration
  • Built-in Metrics: CloudNativePG provides extensive default metrics including:
    • Connection Metrics: cnpg_backends_total, cnpg_pg_settings_setting{name="max_connections"}
    • Performance Metrics: cnpg_pg_stat_database_xact_commit, cnpg_pg_stat_database_xact_rollback
    • Storage Metrics: cnpg_pg_database_size_bytes, cnpg_pg_stat_database_blks_hit, cnpg_pg_stat_database_blks_read
    • Cluster Health: cnpg_collector_up, cnpg_collector_postgres_version
    • Replication: cnpg_pg_stat_replication_* metrics for streaming replication status

Custom Metrics System

  • ConfigMap Support: Custom queries can be defined via ConfigMaps
  • RBAC Configured: PostgreSQL service account has permissions to read custom metrics ConfigMaps
  • Predefined Queries: CloudNativePG includes cnpg-default-monitoring ConfigMap with standard queries
  • Monitoring Role: Uses pg_monitor role for secure metrics collection

Dashboard Integration

  • OpenObserve Ready: All metrics automatically ingested into OpenObserve
  • Key Performance Indicators:
    • Connection utilization: cnpg_backends_total / cnpg_pg_settings_setting{name="max_connections"} * 100
    • Buffer cache hit ratio: cnpg_pg_stat_database_blks_hit / (cnpg_pg_stat_database_blks_hit + cnpg_pg_stat_database_blks_read) * 100
    • Transaction rate: rate(cnpg_pg_stat_database_xact_commit[5m])
    • Rollback ratio: cnpg_pg_stat_database_xact_rollback / (cnpg_pg_stat_database_xact_commit + cnpg_pg_stat_database_xact_rollback) * 100

High Availability Monitoring

  • Automatic Failover: CloudNativePG handles primary/replica failover automatically
  • Health Checks: Continuous health monitoring with automatic recovery
  • Streaming Replication: Real-time replication status monitoring

Backup Strategy

Longhorn Storage-Level Backups (Incremental)

  • Daily backups: 2 AM UTC, retain 14 days (2 weeks)
  • Weekly backups: 1 AM Sunday, retain 8 weeks (2 months)
  • Snapshot cleanup: 3 AM daily, keep 5 local snapshots
  • Target: Backblaze B2 S3 storage via existing setup
  • Type: Incremental (efficient change block detection)

CloudNativePG Application-Level Backups

  • WAL archiving: Continuous transaction log archiving
  • Point-in-time recovery: Available via CloudNativePG
  • Retention: 30-day backup retention policy

Backup Labels

PostgreSQL volumes are automatically backed up based on labels:

backup.longhorn.io/enable: "true"
app: postgresql-shared

Scaling

To add more read replicas:

# Edit cluster-shared.yaml  
spec:
  instances: 4  # Increase from 3 to 4 for additional read replica

Troubleshooting

Cluster Status

# Check cluster status
kubectl get cluster -n postgresql-system
kubectl describe cluster postgresql-shared -n postgresql-system

# Check pods
kubectl get pods -n postgresql-system
kubectl logs postgres-shared-2 -n postgresql-system  # Current primary

Monitoring & Metrics

# Check ServiceMonitor
kubectl get servicemonitor -n postgresql-system
kubectl describe servicemonitor postgresql-shared -n postgresql-system

# Check metrics endpoint directly
kubectl port-forward -n postgresql-system postgres-shared-2 9187:9187  # Primary instance
curl http://localhost:9187/metrics

# Check custom metrics ConfigMap
kubectl get configmap -n postgresql-system
kubectl describe configmap postgresql-dashboard-metrics -n postgresql-system

# Check RBAC permissions
kubectl get role,rolebinding -n postgresql-system
kubectl describe rolebinding postgresql-dashboard-metrics-reader -n postgresql-system

Port Forwarding

Port forwarding allows you to connect to PostgreSQL from your local machine using standard database tools.

⚠️ Important: PostgreSQL requires SSL/TLS connections. When port forwarding, you must configure your client to handle SSL properly.

Read-Only Replica (Load Balanced):

# Forward to read-only service (load balances across all replicas)
kubectl port-forward -n postgresql-system svc/postgresql-shared-ro 5432:5432

# Get the password for shared_user
kubectl get secret postgres-shared-app -n postgresql-system -o jsonpath='{.data.password}' | base64 -d && echo

# Connect with SSL required (recommended):
# Connection string: postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=require
# Or configure your client:
#   - host: localhost
#   - port: 5432
#   - database: shared_db
#   - username: shared_user
#   - password: <from secret above>
#   - SSL mode: require (or disable for testing only)

Specific Replica Pod:

# List replica pods
kubectl get pods -n postgresql-system -l cnpg.io/instanceRole=replica

# Forward to specific replica pod (e.g., postgres-shared-4)
kubectl port-forward -n postgresql-system pod/postgres-shared-4 5432:5432

# Get the password for shared_user
kubectl get secret postgres-shared-app -n postgresql-system -o jsonpath='{.data.password}' | base64 -d && echo

# Connect with SSL required (recommended):
# Connection string: postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=require
# Or configure your client with SSL mode: require

Primary (Read-Write) - For Maintenance Only:

# Forward to read-write service (connects to primary)
kubectl port-forward -n postgresql-system svc/postgresql-shared-rw 5433:5432

# Note: Using port 5433 locally to avoid conflict if read-only is on 5432
# Get the password
kubectl get secret postgres-shared-app -n postgresql-system -o jsonpath='{.data.password}' | base64 -d && echo

# Connect using localhost:5433 with SSL mode: require

SSL Configuration Notes:

  • SSL is enabled on PostgreSQL (ssl = on)
  • For port forwarding, clients must explicitly configure SSL mode
  • The server uses self-signed certificates, so clients will need to accept untrusted certificates
  • For production clients connecting directly (not via port-forward), use proper SSL with CA verification

Troubleshooting Port Forward "Broken Pipe" Errors: If you see error: lost connection to pod or broken pipe errors:

  1. Use direct pod port forwarding instead of service port forwarding (more reliable):

    # List available replica pods
    kubectl get pods -n postgresql-system -l cnpg.io/instanceRole=replica
    
    # Forward to specific replica pod (more stable)
    kubectl port-forward -n postgresql-system pod/postgres-shared-4 5432:5432
    
  2. Configure your client with explicit SSL mode:

    • Use sslmode=require in your connection string (recommended)
    • Or sslmode=prefer (allows fallback to non-SSL if SSL fails)
    • Or sslmode=disable for testing only (not recommended)
  3. Connection string examples:

    # With SSL required (recommended)
    postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=require
    
    # With SSL preferred (allows fallback)
    postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=prefer
    
    # Without SSL (testing only)
    postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=disable
    

Getting the CA Certificate (for proper SSL verification):

# Get the CA certificate from the cluster secret
kubectl get secret postgres-shared-ca -n postgresql-system -o jsonpath='{.data.ca\.crt}' | base64 -d > postgres-ca.crt

# Use with your client:
# Connection string: postgresql://shared_user:<password>@localhost:5432/shared_db?sslmode=verify-ca&sslrootcert=postgres-ca.crt
# Or configure your client to use the CA certificate file for SSL verification

Database Connection

# Connect to PostgreSQL via exec
kubectl exec -it postgres-shared-2 -n postgresql-system -- psql -U shared_user -d shared_db

# Check replication status
kubectl exec -it postgres-shared-2 -n postgresql-system -- psql -U postgres -c "SELECT * FROM pg_stat_replication;"

# Check cluster health
kubectl exec -it postgres-shared-2 -n postgresql-system -- psql -U postgres -c "SELECT pg_is_in_recovery();"

Backup & Storage

# Check PVC status
kubectl get pvc -n postgresql-system
kubectl describe pvc postgres-shared-2 -n postgresql-system  # Primary instance PVC

# Check Longhorn volumes
kubectl get volumes -n longhorn-system
kubectl describe volume -n longhorn-system | grep postgresql

Long Running Queries

When a long running query is happening, use this command on a node

kubectl exec -n postgresql-system postgres-shared-2 -- psql -U postgres -c "
SELECT 
  pid,
  datname,
  usename,
  application_name,
  now() - xact_start AS tx_duration,
  now() - query_start AS query_duration,
  state,
  wait_event_type,
  wait_event,
  query
FROM pg_stat_activity 
WHERE state != 'idle' 
  AND query NOT LIKE '%pg_stat_activity%'
  AND (now() - xact_start > interval '10 seconds' OR now() - query_start > interval '10 seconds')
ORDER BY GREATEST(now() - xact_start, now() - query_start) DESC;
"