---
description: Infrastructure components configuration and deployment patterns
globs: ["manifests/infrastructure/**/*", "manifests/cluster/**/*"]
alwaysApply: false
---

# Infrastructure Components ✅ OPERATIONAL

## Core Infrastructure Stack

Located in `manifests/infrastructure/`:

- **Networking**: Cilium CNI with host firewall and Hubble UI ✅ **OPERATIONAL**
- **Storage**: Longhorn distributed storage (2-replica configuration) ✅ **OPERATIONAL**
- **Ingress**: NGINX Ingress Controller with hostNetwork enabled (Zero Trust mode) ✅ **OPERATIONAL**
- **Zero Trust Tunnels**: Cloudflared deployment in `cloudflared-system` namespace ✅ **OPERATIONAL**
- **Registry**: Harbor container registry ✅ **OPERATIONAL**
- **Monitoring**: OpenTelemetry Operator + OpenObserve (O2) ✅ **OPERATIONAL**
- **Database**: PostgreSQL with CloudNativePG operator ✅ **OPERATIONAL**
- **Identity**: Authentik open-source IAM ✅ **OPERATIONAL**
- **VPN**: Tailscale mesh VPN for administrative access ✅ **OPERATIONAL**

## Component Status Matrix

### Active Components ✅ OPERATIONAL

- **Cilium**: CNI with kube-proxy replacement and host firewall
- **Longhorn**: Distributed storage with S3 backup to Backblaze B2
- **PostgreSQL**: 3-instance HA cluster with comprehensive monitoring
- **Harbor**: Container registry (direct HTTPS; incompatible with Zero Trust tunnels)
- **OpenObserve**: Monitoring and observability platform
- **Authentik**: Open-source identity and access management
- **Renovate**: Automated dependency updates ✅ **ACTIVE**

### Disabled/Deprecated Components

- **external-dns**: ❌ **REMOVED** (replaced by Zero Trust tunnels)
- **cert-manager**: ❌ **REMOVED** (replaced by Cloudflare edge TLS)
- **Rook-Ceph**: ⏸️ **DISABLED** (complexity; using Longhorn instead)
- **Flux GitOps**: ⏸️ **DISABLED** (manual deployment; ready for re-activation)

### Development/Optional Components

- **Elasticsearch**: ✅ **OPERATIONAL** (log aggregation)
- **Kibana**: ✅ **OPERATIONAL** (log analytics via Zero Trust tunnel)

## Network Configuration ✅ OPERATIONAL

- **NetCup Cloud vLAN**: VLAN ID 1004963 for internal cluster communication
- **Control Plane VIP**: `10.132.0.5` (shared VIP; nodes elect a primary for HA)
- **Node IPs** (all control plane nodes):
  - n1 (152.53.107.24): Public + 10.132.0.10/24 (VLAN)
  - n2 (152.53.105.81): Public + 10.132.0.20/24 (VLAN)
  - n3 (152.53.200.111): Public + 10.132.0.30/24 (VLAN)
- **DNS Domain**: Standard `cluster.local` for maximum compatibility
- **CNI**: Cilium with kube-proxy replacement
- **Service Mesh**: Cilium with Hubble for observability

## Storage Configuration ✅ OPERATIONAL

### Longhorn Storage

- **Default Path**: `/var/lib/longhorn`
- **Replica Count**: 2 (distributed across nodes)
- **Storage Class**: `longhorn-retain` for data preservation
- **S3 Backup**: Backblaze B2 integration with label-based volume selection

### S3 Backup Configuration

- **Provider**: Backblaze B2 Cloud Storage
- **Cost**: $6/TB storage with $0 egress fees via the Cloudflare partnership
- **Volume Selection**: Label-based tagging system for selective backup
- **Disaster Recovery**: Automated backup scheduling and restore capabilities

## Database Configuration ✅ OPERATIONAL

### PostgreSQL with CloudNativePG

- **Cluster Name**: `postgres-shared` in the `postgresql-system` namespace
- **High Availability**: 3-instance cluster with automatic failover
- **Instances**: `postgres-shared-2` (primary), `postgres-shared-4`, `postgres-shared-5`
- **Monitoring**: Port 9187 for comprehensive metrics export
- **Backup Strategy**: Integrated with the S3 backup system via Longhorn volume labels

### PostgreSQL Comprehensive Metrics ✅ OPERATIONAL

- **Connection Metrics**: `cnpg_backends_total`, `cnpg_pg_settings_setting{name="max_connections"}`
- **Performance Metrics**: `cnpg_pg_stat_database_xact_commit`, `cnpg_pg_stat_database_xact_rollback`
- **Storage Metrics**: `cnpg_pg_database_size_bytes`, `cnpg_pg_stat_database_blks_hit`
- **Cluster Health**: `cnpg_collector_up`, `cnpg_collector_postgres_version`
- **Security**: Role-based access control with the `pg_monitor` role for metrics collection
- **Backup Integration**: Native support for WAL archiving and point-in-time recovery
- **Custom Queries**: ConfigMap-based custom query system with proper RBAC permissions
- **Dashboard Integration**: Native OpenObserve integration with predefined monitoring queries
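For orientation, a minimal CloudNativePG `Cluster` sketch consistent with the settings above; the storage size and the custom-queries ConfigMap name are illustrative assumptions, not values copied from the live manifests:

```yaml
# Illustrative sketch only; size and ConfigMap name below are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: postgres-shared
  namespace: postgresql-system
spec:
  instances: 3                     # one primary plus two replicas, automatic failover
  storage:
    storageClass: longhorn-retain  # Longhorn volumes, covered by label-based S3 backup
    size: 20Gi                     # assumed volume size
  monitoring:
    customQueriesConfigMap:        # ConfigMap-based custom query system (metrics on port 9187)
      - name: postgres-custom-queries  # hypothetical ConfigMap name
        key: queries.yaml
```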
## Cache Configuration ✅ OPERATIONAL

### Redis HA Cluster

- **Helm Chart**: `redis-ha` from `dandydeveloper/charts` (replaces the deprecated Bitnami chart)
- **Namespace**: `redis-system`
- **Architecture**: 3 Redis replicas with Sentinel for HA, fronted by 3 HAProxy pods for load balancing
- **Connection String**: `redis-ha-haproxy.redis-system.svc.cluster.local:6379`
- **HAProxy**: The 3 HAProxy pods provide a unified read/write endpoint
- **Storage**: Longhorn persistent volumes (20Gi per Redis instance)
- **Authentication**: SOPS-encrypted credentials in the `redis-credentials` secret
- **Monitoring**: Redis exporter and HAProxy metrics via ServiceMonitor

## Security & Access Control ✅ ZERO TRUST ARCHITECTURE

### Zero Trust Migration ✅ COMPLETED

- **Migration Status**: 10 of 11 external services migrated to Cloudflare Zero Trust tunnels
- **Harbor Exception**: Direct port exposure (80/443) due to header modification issues
- **Dependencies Removed**: external-dns and cert-manager are no longer needed
- **Security Improvement**: No external ingress ports exposed

### Tailscale Administrative Access ✅ IMPLEMENTED

- **Deployment Model**: Tailscale Operator Helm chart (v1.90.x)
- **Operator**: Deployed in the `tailscale-system` namespace with 2 replicas
- **Subnet Router**: Connector resource advertising internal networks (Pod: 10.244.0.0/16, Service: 10.96.0.0/12, VLAN: 10.132.0.0/24)
- **MagicDNS**: Services can be exposed via the Tailscale operator with meta attributes for DNS resolution
- **OAuth Integration**: Device authentication and tagging with `tag:k8s-operator`
- **Hostnames**: `keyboardvagabond-operator` for the operator, `keyboardvagabond-cluster` for the subnet router

## Infrastructure Deployment Patterns

### Kustomize Configuration

```yaml
# Standard kustomization.yaml structure
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: component-namespace
resources:
  - namespace.yaml
  - component.yaml
  - monitoring.yaml
```

### Helm Integration

```yaml
# HelmRelease for complex applications
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: component-name
  namespace: component-namespace
spec:
  interval: 1h        # reconciliation interval (required by the HelmRelease API)
  chart:
    spec:
      chart: chart-name
      sourceRef:
        kind: HelmRepository
        name: repo-name
```
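The `sourceRef` above points at a `HelmRepository` object that must exist in the cluster. A sketch of such a source for the `redis-ha` chart named earlier; the repository URL follows the project's GitHub Pages convention and is an assumption, not taken from the manifests:

```yaml
# Companion HelmRepository for the redis-ha HelmRelease (sketch; URL is an assumption)
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: dandydeveloper
  namespace: redis-system
spec:
  interval: 1h
  url: https://dandydeveloper.github.io/charts
```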
## Operational Procedures

### Node Addition and Scaling

When adding new nodes to the cluster, specific steps are required to ensure monitoring and metrics collection continue working properly:

- **NGINX Ingress Metrics**: See `docs/NODE-ADDITION-GUIDE.md` for the complete procedure
  - The NGINX ingress controller deploys automatically (DaemonSet)
  - The OpenTelemetry collector's static scrape configuration requires a manual update
  - The new node IP must be added to the targets list in `manifests/infrastructure/openobserve-collector/gateway-collector.yaml` (an illustrative snippet of this block appears at the end of this rule)
  - Verification steps include checking metrics endpoints and collector logs

### Key Files for Node Operations

- **Monitoring Configuration**: `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
- **Network Policies**: `manifests/infrastructure/cluster-policies/host-fw-*.yaml`
- **Node Addition Guide**: `docs/NODE-ADDITION-GUIDE.md`

@zero-trust-ingress-template.yaml
@longhorn-storage-template.yaml
@postgresql-database-template.yaml
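For orientation, a hedged sketch of the static scrape block that the node-addition step edits in `gateway-collector.yaml`; the job name and metrics port (10254, the ingress-nginx default) are assumptions, while the VLAN IPs come from the network table above:

```yaml
# Hypothetical excerpt of the collector's Prometheus receiver; job name and port are assumed.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: nginx-ingress          # hypothetical job name
          static_configs:
            - targets:
                - 10.132.0.10:10254        # n1
                - 10.132.0.20:10254        # n2
                - 10.132.0.30:10254        # n3 (append new node IPs here)
```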