Keybard-Vagabond-Demo/.cursor/rules/infrastructure.mdc

---
description: Infrastructure components configuration and deployment patterns
globs: ["manifests/infrastructure/**/*", "manifests/cluster/**/*"]
alwaysApply: false
---

# Infrastructure Components ✅ OPERATIONAL

## Core Infrastructure Stack
Located in `manifests/infrastructure/`:
- **Networking**: Cilium CNI with host firewall and Hubble UI ✅ **OPERATIONAL**
- **Storage**: Longhorn distributed storage (2-replica configuration) ✅ **OPERATIONAL**
- **Ingress**: NGINX Ingress Controller with hostNetwork enabled (Zero Trust mode) ✅ **OPERATIONAL**
- **Zero Trust Tunnels**: Cloudflared deployment in `cloudflared-system` namespace ✅ **OPERATIONAL**
- **Registry**: Harbor container registry (`<YOUR_REGISTRY_URL>`) ✅ **OPERATIONAL**
- **Monitoring**: OpenTelemetry Operator + OpenObserve (O2) ✅ **OPERATIONAL**
- **Database**: PostgreSQL with CloudNativePG operator ✅ **OPERATIONAL**
- **Identity**: Authentik open-source IAM ✅ **OPERATIONAL**
- **VPN**: Tailscale mesh VPN for administrative access ✅ **OPERATIONAL**

## Component Status Matrix
### Active Components ✅ OPERATIONAL
- **Cilium**: CNI with kube-proxy replacement, host firewall
- **Longhorn**: Distributed storage with S3 backup to Backblaze B2
- **PostgreSQL**: 3-instance HA cluster with comprehensive monitoring
- **Harbor**: Container registry (direct HTTPS - Zero Trust incompatible)
- **OpenObserve**: Monitoring and observability platform
- **Authentik**: Open-source identity and access management
- **Renovate**: Automated dependency updates ✅ **ACTIVE**

### Disabled/Deprecated Components
- **external-dns**: ❌ **REMOVED** (replaced by Zero Trust tunnels)
- **cert-manager**: ❌ **REMOVED** (replaced by Cloudflare edge TLS)
- **Rook-Ceph**: ⏸️ **DISABLED** (complexity - using Longhorn instead)
- **Flux GitOps**: ⏸️ **DISABLED** (manual deployment - ready for re-activation)

### Development/Optional Components
- **Elasticsearch**: ✅ **OPERATIONAL** (log aggregation)
- **Kibana**: ✅ **OPERATIONAL** (log analytics via Zero Trust tunnel)

## Network Configuration ✅ OPERATIONAL
- **NetCup Cloud vLAN**: VLAN ID 1004963 for internal cluster communication
- **Control Plane VIP**: `10.132.0.5` (shared VIP, nodes elect primary for HA)
- **Node IPs** (all control plane nodes):
  - n1 (152.53.107.24): Public + 10.132.0.10/24 (VLAN)
  - n2 (152.53.105.81): Public + 10.132.0.20/24 (VLAN)
  - n3 (152.53.200.111): Public + 10.132.0.30/24 (VLAN)
- **DNS Domain**: Uses standard `cluster.local` for maximum compatibility
- **CNI**: Cilium with kube-proxy replacement
- **Service Mesh**: Cilium with Hubble for observability
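
The Cilium host firewall is configured through policies that select nodes rather than pods. A minimal sketch, assuming a rule that admits Kubernetes API traffic (port 6443) from the cluster VLAN; the policy name, node label, and port are illustrative and are not taken from this repository's `host-fw-*.yaml` files:

```yaml
# Hypothetical host firewall policy (name, selector, and port are assumptions)
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: host-fw-allow-vlan-example
spec:
  description: Allow Kubernetes API traffic to control plane nodes from the VLAN
  nodeSelector:                      # selects nodes, not pods
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
    - fromCIDR:
        - 10.132.0.0/24              # VLAN subnet listed above
      toPorts:
        - ports:
            - port: "6443"
              protocol: TCP
```

Once a host policy selects a node, Cilium treats that node's ingress as default-deny for anything not explicitly allowed, so a real policy set also has to admit etcd, kubelet, and other node-to-node traffic before it is applied.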

## Storage Configuration ✅ OPERATIONAL
### Longhorn Storage
- **Default Path**: `/var/lib/longhorn`
- **Replica Count**: 2 (distributed across nodes)
- **Storage Class**: `longhorn-retain` for data preservation
- **S3 Backup**: Backblaze B2 integration with label-based volume selection
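
A retain-on-delete class such as `longhorn-retain` could be defined roughly as follows. This is a sketch: only the class name and the replica count come from this document, and the remaining values are assumptions.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-retain
provisioner: driver.longhorn.io
reclaimPolicy: Retain            # keep the PersistentVolume after its PVC is deleted
allowVolumeExpansion: true       # assumed
volumeBindingMode: Immediate     # assumed
parameters:
  numberOfReplicas: "2"          # matches the 2-replica configuration above
  staleReplicaTimeout: "2880"    # minutes; assumed value
```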

### S3 Backup Configuration
- **Provider**: Backblaze B2 Cloud Storage
- **Cost**: $6/TB/month storage with $0 egress fees via Cloudflare partnership
- **Volume Selection**: Label-based tagging system for selective backup
- **Disaster Recovery**: Automated backup scheduling and restore capabilities
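
One way Longhorn supports label-based selection is through recurring job groups. The sketch below assumes that mechanism; the job name, group name, schedule, retention, and the example PVC are all hypothetical, and whether this repository uses the same labels is not confirmed here.

```yaml
# Recurring backup job that targets every volume in the "s3-backup" group
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: s3-backup-daily          # hypothetical
  namespace: longhorn-system
spec:
  task: backup                   # writes to the configured backup target
  cron: "0 3 * * *"              # assumed schedule: daily at 03:00
  groups:
    - s3-backup                  # hypothetical group name
  retain: 7                      # assumed number of backups to keep
  concurrency: 2
---
# A PVC opts in by carrying the group label
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-data             # hypothetical
  namespace: example-app         # hypothetical
  labels:
    recurring-job.longhorn.io/source: enabled
    recurring-job-group.longhorn.io/s3-backup: enabled
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-retain
  resources:
    requests:
      storage: 10Gi
```

PVC-level labels are only picked up by newer Longhorn releases; on older ones the same group label goes on the Longhorn Volume object instead.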

## Database Configuration ✅ OPERATIONAL
### PostgreSQL with CloudNativePG
- **Cluster Name**: `postgres-shared` in `postgresql-system` namespace
- **High Availability**: 3-instance cluster with automatic failover
- **Instances**: `postgres-shared-2` (primary), `postgres-shared-4`, `postgres-shared-5`
- **Monitoring**: Port 9187 for comprehensive metrics export
- **Backup Strategy**: Integrated with S3 backup system via Longhorn volume labels
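
A minimal CloudNativePG `Cluster` matching the description above might look like this. The storage size is an assumption, and the real manifest will carry more settings (monitoring, resources, bootstrap).

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-shared
  namespace: postgresql-system
spec:
  instances: 3                   # one primary plus two replicas with automatic failover
  storage:
    storageClass: longhorn-retain
    size: 20Gi                   # assumed size
```

CloudNativePG numbers instances with a serial that keeps increasing as pods are recreated, which is why the current members are `-2`, `-4`, and `-5` rather than `-1` to `-3`.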

## Cache Configuration ✅ OPERATIONAL
### Redis HA Cluster
- **Helm Chart**: `redis-ha` from `dandydeveloper/charts` (replaced deprecated Bitnami chart)
- **Namespace**: `redis-system`
- **Architecture**: 3 Redis replicas with Sentinel for HA, 3 HAProxy pods for load balancing
- **Connection String**: `redis-ha-haproxy.redis-system.svc.cluster.local:6379`
- **HAProxy**: Provides a unified read/write endpoint served by the 3 HAProxy pods
- **Storage**: Longhorn persistent volumes (20Gi per Redis instance)
- **Authentication**: SOPS-encrypted credentials in `redis-credentials` secret
- **Monitoring**: Redis exporter and HAProxy metrics via ServiceMonitor
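
An application would typically consume the Redis endpoint through environment variables. A sketch, assuming a secret key named `redis-password` (the actual key inside `redis-credentials` is not stated here) and a placeholder workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: redis-client-example     # hypothetical
  namespace: example-app         # hypothetical
spec:
  containers:
    - name: app
      image: <YOUR_REGISTRY_URL>/library/example-app:latest   # placeholder image
      env:
        - name: REDIS_HOST
          value: redis-ha-haproxy.redis-system.svc.cluster.local
        - name: REDIS_PORT
          value: "6379"
        - name: REDIS_PASSWORD
          valueFrom:
            secretKeyRef:
              name: redis-credentials
              key: redis-password   # assumed key name
```

`secretKeyRef` resolves only within the pod's own namespace, so the application namespace needs its own copy of the secret.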

## PostgreSQL Comprehensive Metrics ✅ OPERATIONAL
- **Connection Metrics**: `cnpg_backends_total`, `cnpg_pg_settings_setting{name="max_connections"}`
- **Performance Metrics**: `cnpg_pg_stat_database_xact_commit`, `cnpg_pg_stat_database_xact_rollback`
- **Storage Metrics**: `cnpg_pg_database_size_bytes`, `cnpg_pg_stat_database_blks_hit`
- **Cluster Health**: `cnpg_collector_up`, `cnpg_collector_postgres_version`
- **Security**: Role-based access control with `pg_monitor` role for metrics collection
- **Backup Integration**: Native support for WAL archiving and point-in-time recovery
- **Custom Queries**: ConfigMap-based custom query system with proper RBAC permissions
- **Dashboard Integration**: Native OpenObserve integration with predefined monitoring queries
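
CloudNativePG reads custom queries from a ConfigMap that the `Cluster` references under `spec.monitoring.customQueriesConfigMap`. A sketch of such a ConfigMap; the ConfigMap name, the query name, and the SQL are illustrative and not taken from this repository:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-custom-queries-example   # hypothetical
  namespace: postgresql-system
  labels:
    cnpg.io/reload: ""                     # lets the operator reload changes
data:
  custom-queries: |
    pg_database_connections:
      query: |
        SELECT datname, count(*) AS connections
        FROM pg_stat_activity
        WHERE datname IS NOT NULL
        GROUP BY datname
      metrics:
        - datname:
            usage: "LABEL"
            description: "Database name"
        - connections:
            usage: "GAUGE"
            description: "Backends currently connected to the database"
```

The `Cluster` side would then list an entry with `name: postgres-custom-queries-example` and `key: custom-queries`.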

## Security & Access Control ✅ ZERO TRUST ARCHITECTURE
### Zero Trust Migration ✅ COMPLETED
- **Migration Status**: 10 of 11 external services migrated to Cloudflare Zero Trust tunnels
- **Harbor Exception**: Harbor keeps direct port exposure (80/443) because header modification issues make it incompatible with the tunnels
- **Dependencies Removed**: external-dns and cert-manager no longer needed
- **Security Improvement**: No external ingress ports exposed, apart from the Harbor exception (80/443)
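
For a locally managed tunnel, routing from a public hostname to an in-cluster service is expressed as cloudflared ingress rules. The following is a sketch under that assumption; the tunnel ID, hostname, credentials path, and ingress service address are placeholders. If the tunnel is remotely managed with a token, these rules live in the Cloudflare dashboard instead.

```yaml
# cloudflared config.yaml (sketch; all values are placeholders)
tunnel: <TUNNEL_ID>
credentials-file: /etc/cloudflared/creds/credentials.json
ingress:
  - hostname: app.example.com
    service: http://ingress-nginx-controller.ingress-nginx.svc.cluster.local:80
  # cloudflared requires a final catch-all rule
  - service: http_status:404
```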

### Tailscale Administrative Access ✅ IMPLEMENTED
- **Deployment Model**: Tailscale Operator Helm Chart (v1.90.x)
- **Operator**: Deployed in `tailscale-system` namespace with 2 replicas
- **Subnet Router**: Connector resource advertising internal networks (Pod: 10.244.0.0/16, Service: 10.96.0.0/12, VLAN: 10.132.0.0/24)
- **MagicDNS**: Services can be exposed via the Tailscale operator using metadata annotations for DNS resolution
- **OAuth Integration**: Device authentication and tagging with `tag:k8s-operator`
- **Hostname**: `keyboardvagabond-operator` for operator, `keyboardvagabond-cluster` for subnet router
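
The subnet router described above maps onto the operator's `Connector` resource. A minimal sketch using the routes and hostname listed in this section; the resource name is an assumption, and tags are omitted because the tag applied to the connector device is not stated here:

```yaml
apiVersion: tailscale.com/v1alpha1
kind: Connector
metadata:
  name: keyboardvagabond-cluster   # assumed to match the hostname
spec:
  hostname: keyboardvagabond-cluster
  subnetRouter:
    advertiseRoutes:
      - 10.244.0.0/16              # Pod network
      - 10.96.0.0/12               # Service network
      - 10.132.0.0/24              # VLAN
```

Advertised routes normally still need to be approved in the Tailscale admin console or through an ACL auto-approver before clients can reach them.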

## Infrastructure Deployment Patterns
### Kustomize Configuration
```yaml
# Standard kustomization.yaml structure
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: component-namespace
resources:
  - namespace.yaml
  - component.yaml
  - monitoring.yaml
```

### Helm Integration
```yaml
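# HelmRelease for complex applications
# Note: Flux GitOps is currently disabled, so this applies once Flux is re-activated.
# Match apiVersion to the installed Flux CRDs; newer Flux releases serve helm.toolkit.fluxcd.io/v2.
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: component-name
  namespace: component-namespace
spec:
  interval: 10m  # required: how often Flux reconciles the release
  chart:
    spec:
      chart: chart-name
      sourceRef:
        kind: HelmRepository
        name: repo-name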
```

## Operational Procedures

### Node Addition and Scaling
Adding a node to the cluster requires a few manual steps so that monitoring and metrics collection keep working:

- **NGINX Ingress Metrics**: See `docs/NODE-ADDITION-GUIDE.md` for complete procedures
  - NGINX Ingress Controller deploys automatically (DaemonSet)
  - OpenTelemetry collector static scrape configuration requires manual update
  - Must add new node IP to targets list in `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
  - Verification steps include checking metrics endpoints and collector logs
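
The static scrape configuration in question is a Prometheus receiver block inside the collector config. A sketch of what the targets list might look like; the job name, the use of VLAN addresses, and port 10254 (the ingress-nginx default metrics port) are assumptions to check against `gateway-collector.yaml`:

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nginx-ingress          # hypothetical job name
          scrape_interval: 30s             # assumed
          metrics_path: /metrics
          static_configs:
            - targets:
                - 10.132.0.10:10254        # n1
                - 10.132.0.20:10254        # n2
                - 10.132.0.30:10254        # n3
                # Add one entry per new node, e.g. <NEW_NODE_IP>:10254
```

Because the targets are static, a node that is missing from this list is silently left out of the ingress metrics.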

### Key Files for Node Operations
- **Monitoring Configuration**: `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
- **Network Policies**: `manifests/infrastructure/cluster-policies/host-fw-*.yaml`
- **Node Addition Guide**: `docs/NODE-ADDITION-GUIDE.md`

@zero-trust-ingress-template.yaml
@longhorn-storage-template.yaml
@postgresql-database-template.yaml