157 lines
7.6 KiB
Plaintext
157 lines
7.6 KiB
Plaintext
|
|
---
|
||
|
|
description: Infrastructure components configuration and deployment patterns
|
||
|
|
globs: ["manifests/infrastructure/**/*", "manifests/cluster/**/*"]
|
||
|
|
alwaysApply: false
|
||
|
|
---
|
||
|
|
|
||
|
|
# Infrastructure Components ✅ OPERATIONAL
|
||
|
|
|
||
|
|
## Core Infrastructure Stack
|
||
|
|
Located in `manifests/infrastructure/`:
|
||
|
|
- **Networking**: Cilium CNI with host firewall and Hubble UI ✅ **OPERATIONAL**
|
||
|
|
- **Storage**: Longhorn distributed storage (2-replica configuration) ✅ **OPERATIONAL**
|
||
|
|
- **Ingress**: NGINX Ingress Controller with hostNetwork enabled (Zero Trust mode) ✅ **OPERATIONAL**
|
||
|
|
- **Zero Trust Tunnels**: Cloudflared deployment in `cloudflared-system` namespace ✅ **OPERATIONAL**
|
||
|
|
- **Registry**: Harbor container registry (`<YOUR_REGISTRY_URL>`) ✅ **OPERATIONAL**
|
||
|
|
- **Monitoring**: OpenTelemetry Operator + OpenObserve (O2) ✅ **OPERATIONAL**
|
||
|
|
- **Database**: PostgreSQL with CloudNativePG operator ✅ **OPERATIONAL**
|
||
|
|
- **Identity**: Authentik open-source IAM ✅ **OPERATIONAL**
|
||
|
|
- **VPN**: Tailscale mesh VPN for administrative access ✅ **OPERATIONAL**
|
||
|
|
|
||
|
|
## Component Status Matrix
|
||
|
|
### Active Components ✅ OPERATIONAL
|
||
|
|
- **Cilium**: CNI with kube-proxy replacement, host firewall
|
||
|
|
- **Longhorn**: Distributed storage with S3 backup to Backblaze B2
|
||
|
|
- **PostgreSQL**: 3-instance HA cluster with comprehensive monitoring
|
||
|
|
- **Harbor**: Container registry (direct HTTPS - Zero Trust incompatible)
|
||
|
|
- **OpenObserve**: Monitoring and observability platform
|
||
|
|
- **Authentik**: Open-source identity and access management
|
||
|
|
- **Renovate**: Automated dependency updates ✅ **ACTIVE**
|
||
|
|
|
||
|
|
### Disabled/Deprecated Components
|
||
|
|
- **external-dns**: ❌ **REMOVED** (replaced by Zero Trust tunnels)
|
||
|
|
- **cert-manager**: ❌ **REMOVED** (replaced by Cloudflare edge TLS)
|
||
|
|
- **Rook-Ceph**: ⏸️ **DISABLED** (complexity - using Longhorn instead)
|
||
|
|
- **Flux GitOps**: ⏸️ **DISABLED** (manual deployment - ready for re-activation)
|
||
|
|
|
||
|
|
### Development/Optional Components
|
||
|
|
- **Elasticsearch**: ✅ **OPERATIONAL** (log aggregation)
|
||
|
|
- **Kibana**: ✅ **OPERATIONAL** (log analytics via Zero Trust tunnel)
|
||
|
|
|
||
|
|
## Network Configuration ✅ OPERATIONAL
|
||
|
|
- **NetCup Cloud vLAN**: VLAN ID 1004963 for internal cluster communication
|
||
|
|
- **Control Plane VIP**: `10.132.0.5` (shared VIP, nodes elect primary for HA)
|
||
|
|
- **Node IPs** (all control plane nodes):
|
||
|
|
- n1 (152.53.107.24): Public + 10.132.0.10/24 (VLAN)
|
||
|
|
- n2 (152.53.105.81): Public + 10.132.0.20/24 (VLAN)
|
||
|
|
- n3 (152.53.200.111): Public + 10.132.0.30/24 (VLAN)
|
||
|
|
- **DNS Domain**: Uses standard `cluster.local` for maximum compatibility
|
||
|
|
- **CNI**: Cilium with kube-proxy replacement
|
||
|
|
- **Service Mesh**: Cilium with Hubble for observability
|
||
|
|
|
||
|
|
## Storage Configuration ✅ OPERATIONAL
|
||
|
|
### Longhorn Storage
|
||
|
|
- **Default Path**: `/var/lib/longhorn`
|
||
|
|
- **Replica Count**: 2 (distributed across nodes)
|
||
|
|
- **Storage Class**: `longhorn-retain` for data preservation
|
||
|
|
- **S3 Backup**: Backblaze B2 integration with label-based volume selection
|
||
|
|
|
||
|
|
### S3 Backup Configuration
|
||
|
|
- **Provider**: Backblaze B2 Cloud Storage
|
||
|
|
- **Cost**: $6/TB storage with $0 egress fees via Cloudflare partnership
|
||
|
|
- **Volume Selection**: Label-based tagging system for selective backup
|
||
|
|
- **Disaster Recovery**: Automated backup scheduling and restore capabilities
|
||
|
|
|
||
|
|
## Database Configuration ✅ OPERATIONAL
|
||
|
|
### PostgreSQL with CloudNativePG
|
||
|
|
- **Cluster Name**: `postgres-shared` in `postgresql-system` namespace
|
||
|
|
- **High Availability**: 3-instance cluster with automatic failover
|
||
|
|
- **Instances**: `postgres-shared-2` (primary), `postgres-shared-4`, `postgres-shared-5`
|
||
|
|
- **Monitoring**: Port 9187 for comprehensive metrics export
|
||
|
|
- **Backup Strategy**: Integrated with S3 backup system via Longhorn volume labels
|
||
|
|
|
||
|
|
## Cache Configuration ✅ OPERATIONAL
|
||
|
|
### Redis HA Cluster
|
||
|
|
- **Helm Chart**: `redis-ha` from `dandydeveloper/charts` (replaced deprecated Bitnami chart)
|
||
|
|
- **Namespace**: `redis-system`
|
||
|
|
- **Architecture**: 3 Redis replicas with Sentinel for HA, 3 HAProxy pods for load balancing
|
||
|
|
- **Connection String**: `redis-ha-haproxy.redis-system.svc.cluster.local:6379`
|
||
|
|
- **HAProxy**: Provides unified read/write endpoint managed by 3 HAProxy pods
|
||
|
|
- **Storage**: Longhorn persistent volumes (20Gi per Redis instance)
|
||
|
|
- **Authentication**: SOPS-encrypted credentials in `redis-credentials` secret
|
||
|
|
- **Monitoring**: Redis exporter and HAProxy metrics via ServiceMonitor
|
||
|
|
|
||
|
|
### PostgreSQL Comprehensive Metrics ✅ OPERATIONAL
|
||
|
|
- **Connection Metrics**: `cnpg_backends_total`, `cnpg_pg_settings_setting{name="max_connections"}`
|
||
|
|
- **Performance Metrics**: `cnpg_pg_stat_database_xact_commit`, `cnpg_pg_stat_database_xact_rollback`
|
||
|
|
- **Storage Metrics**: `cnpg_pg_database_size_bytes`, `cnpg_pg_stat_database_blks_hit`
|
||
|
|
- **Cluster Health**: `cnpg_collector_up`, `cnpg_collector_postgres_version`
|
||
|
|
- **Security**: Role-based access control with `pg_monitor` role for metrics collection
|
||
|
|
- **Backup Integration**: Native support for WAL archiving and point-in-time recovery
|
||
|
|
- **Custom Queries**: ConfigMap-based custom query system with proper RBAC permissions
|
||
|
|
- **Dashboard Integration**: Native OpenObserve integration with predefined monitoring queries
|
||
|
|
|
||
|
|
## Security & Access Control ✅ ZERO TRUST ARCHITECTURE
|
||
|
|
### Zero Trust Migration ✅ COMPLETED
|
||
|
|
- **Migration Status**: 10 of 11 external services migrated to Cloudflare Zero Trust tunnels
|
||
|
|
- **Harbor Exception**: Direct port exposure (80/443) due to header modification issues
|
||
|
|
- **Dependencies Removed**: external-dns and cert-manager no longer needed
|
||
|
|
- **Security Improvement**: No external ingress ports exposed
|
||
|
|
|
||
|
|
### Tailscale Administrative Access ✅ IMPLEMENTED
|
||
|
|
- **Deployment Model**: Tailscale Operator Helm Chart (v1.90.x)
|
||
|
|
- **Operator**: Deployed in `tailscale-system` namespace with 2 replicas
|
||
|
|
- **Subnet Router**: Connector resource advertising internal networks (Pod: 10.244.0.0/16, Service: 10.96.0.0/12, VLAN: 10.132.0.0/24)
|
||
|
|
- **Magic DNS**: Services can be exposed via Tailscale operator with meta attributes for DNS resolution
|
||
|
|
- **OAuth Integration**: Device authentication and tagging with `tag:k8s-operator`
|
||
|
|
- **Hostname**: `keyboardvagabond-operator` for operator, `keyboardvagabond-cluster` for subnet router
|
||
|
|
|
||
|
|
## Infrastructure Deployment Patterns
|
||
|
|
### Kustomize Configuration
|
||
|
|
```yaml
|
||
|
|
# Standard kustomization.yaml structure
|
||
|
|
apiVersion: kustomize.config.k8s.io/v1beta1
|
||
|
|
kind: Kustomization
|
||
|
|
namespace: component-namespace
|
||
|
|
resources:
|
||
|
|
- namespace.yaml
|
||
|
|
- component.yaml
|
||
|
|
- monitoring.yaml
|
||
|
|
```
|
||
|
|
|
||
|
|
### Helm Integration
|
||
|
|
```yaml
|
||
|
|
# HelmRelease for complex applications
|
||
|
|
apiVersion: helm.toolkit.fluxcd.io/v2beta1
|
||
|
|
kind: HelmRelease
|
||
|
|
metadata:
|
||
|
|
name: component-name
|
||
|
|
namespace: component-namespace
|
||
|
|
spec:
|
||
|
|
chart:
|
||
|
|
spec:
|
||
|
|
chart: chart-name
|
||
|
|
sourceRef:
|
||
|
|
kind: HelmRepository
|
||
|
|
name: repo-name
|
||
|
|
```
|
||
|
|
|
||
|
|
## Operational Procedures
|
||
|
|
|
||
|
|
### Node Addition and Scaling
|
||
|
|
When adding new nodes to the cluster, specific steps are required to ensure monitoring and metrics collection continue working properly:
|
||
|
|
|
||
|
|
- **Nginx Ingress Metrics**: See `docs/NODE-ADDITION-GUIDE.md` for complete procedures
|
||
|
|
- Nginx ingress controller deploys automatically (DaemonSet)
|
||
|
|
- OpenTelemetry collector static scrape configuration requires manual update
|
||
|
|
- Must add new node IP to targets list in `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
|
||
|
|
- Verification steps include checking metrics endpoints and collector logs
|
||
|
|
|
||
|
|
### Key Files for Node Operations
|
||
|
|
- **Monitoring Configuration**: `manifests/infrastructure/openobserve-collector/gateway-collector.yaml`
|
||
|
|
- **Network Policies**: `manifests/infrastructure/cluster-policies/host-fw-*.yaml`
|
||
|
|
- **Node Addition Guide**: `docs/NODE-ADDITION-GUIDE.md`
|
||
|
|
|
||
|
|
@zero-trust-ingress-template.yaml
|
||
|
|
@longhorn-storage-template.yaml
|
||
|
|
@postgresql-database-template.yaml
|