add source code and readme

2025-12-24 14:35:17 +01:00
parent 7c92e1e610
commit 74324d5a1b
331 changed files with 39272 additions and 1 deletions

@@ -0,0 +1,189 @@
---
description: Detailed technical specifications for nodes, network, and Talos configuration
globs: ["machineconfigs/**/*", "patches/**/*", "talosconfig", "kubeconfig*"]
alwaysApply: false
---
# Technical Specifications & Low-Level Configuration
## Talos Configuration ✅ OPERATIONAL
### Custom Talos Image
- **Factory Image**: `613e1592b2da41ae5e265e8789429f22e121aab91cb4deb6bc3c0b6262961245:v1.10.4`, built with the two system extensions that Longhorn requires
- **Extensions**: The Longhorn prerequisites above, included to support distributed storage
- **Version**: Talos v1.10.4 with custom factory build
- **Architecture**: ARM64 optimized for NetCup Cloud infrastructure
### Patch Configuration
Applied via `patches/` directory for cluster customization:
- **allow-controlplane-workloads.yaml**: Enables workload scheduling on control plane
- **cluster-name.yaml**: Sets cluster name to `keyboardvagabond.com`
- **disable-kube-proxy-and-cni.yaml**: Disables built-in networking for Cilium
- **etcd-patch.yaml**: etcd optimization and configuration
- **registry-patch.yaml**: Container registry configuration
- **worker-discovery-patch.yaml**: Worker node discovery settings
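The patch files themselves are not reproduced in this document. As a rough sketch of what two of them might contain, the snippet below prints patch bodies using the standard Talos machine config fields `cluster.allowSchedulingOnControlPlanes`, `cluster.network.cni.name`, and `cluster.proxy.disabled`. The shell function names are hypothetical, and the real files in `patches/` may differ.

```bash
#!/usr/bin/env bash
# Illustrative only: function names are hypothetical, and the actual
# contents of the files under patches/ are assumed, not copied.

# Patch body that lets regular workloads schedule on control plane nodes.
patch_allow_controlplane_workloads() {
  cat <<'EOF'
cluster:
  allowSchedulingOnControlPlanes: true
EOF
}

# Patch body that turns off the default CNI and kube-proxy so Cilium can take over.
patch_disable_kube_proxy_and_cni() {
  cat <<'EOF'
cluster:
  network:
    cni:
      name: none
  proxy:
    disabled: true
EOF
}

patch_allow_controlplane_workloads
patch_disable_kube_proxy_and_cni
```

Output like this can be saved to a file and passed to `talosctl gen config` with `--config-patch @<file>`.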
## Network Configuration ✅ OPERATIONAL
### NetCup Cloud Infrastructure
- **vLAN ID**: 1004963 for internal cluster communication
- **Network Range**: 10.132.0.0/24 (private VLAN)
- **DNS Domain**: `cluster.local` (standard Kubernetes domain)
- **Cluster Name**: `keyboardvagabond.com`
### Node Network Configuration
| Node | Public IP | VLAN IP | Role | Status |
|------|-----------|---------|------|--------|
| **n1** | 152.53.107.24 | 10.132.0.10/24 | Control Plane | ✅ Schedulable |
| **n2** | 152.53.105.81 | 10.132.0.20/24 | Control Plane | ✅ Schedulable |
| **n3** | 152.53.200.111 | 10.132.0.30/24 | Control Plane | ✅ Schedulable |
- **Control Plane VIP**: `10.132.0.5` (shared VIP, nodes elect primary for HA)
- **All nodes are control plane**: High availability with etcd quorum (2 of 3 required)
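The "2 of 3" figure follows from etcd's majority rule, `floor(N/2) + 1`. A minimal sketch of that arithmetic, with hypothetical helper names that are not part of `talosctl` or etcd:

```bash
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names).

# Majority quorum for an etcd cluster of N members: floor(N/2) + 1.
etcd_quorum() {
  local members=$1
  echo $(( members / 2 + 1 ))
}

# Number of members that can fail while quorum is still reachable.
etcd_fault_tolerance() {
  local members=$1
  echo $(( members - (members / 2 + 1) ))
}

echo "3 members: quorum=$(etcd_quorum 3), tolerated failures=$(etcd_fault_tolerance 3)"
```

With three control plane nodes the cluster therefore survives the loss of one node but not two.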
### Network Interface Configuration
- **`enp7s0`**: Public interface (DHCP + static configuration)
- **`enp9s0`**: Private VLAN interface (static configuration)
- **Internal Traffic**: Uses private VLAN for pod-to-pod and storage replication
- **External Access**: Cloudflare Zero Trust tunnels (no direct port exposure)
## Administrative Access Configuration ✅ SECURED
### Kubernetes API Access
- **Internal Context**: `admin@keyboardvagabond-tailscale`
  - **VIP Endpoint**: `10.132.0.5:6443` (shared VIP, recommended for HA)
  - **Node Endpoints**: `10.132.0.10:6443`, `10.132.0.20:6443`, `10.132.0.30:6443` (individual nodes)
- **Public Context**: `admin@keyboardvagabond.com` (blocked by firewall)
  - **Public Endpoint**: `api.keyboardvagabond.com:6443` (Tailscale-only)
- **Access Method**: Tailscale mesh VPN required (CGNAT 100.64.0.0/10)
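When debugging access problems it can help to confirm that a client address really falls inside the Tailscale CGNAT range. The sketch below does the CIDR check in plain bash. The function names are hypothetical, and `100.101.102.103` is a made-up tailnet address used only for illustration.

```bash
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names), IPv4 only.

# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local a b c d
  IFS=. read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# Succeed (exit 0) if address $1 lies inside CIDR block $2.
in_cidr() {
  local ip=$1 net=${2%/*} bits=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ( $(ip_to_int "$ip") & mask ) == ( $(ip_to_int "$net") & mask ) ))
}

# Example addresses: one hypothetical tailnet IP, one public node IP.
for ip in 100.101.102.103 152.53.107.24; do
  if in_cidr "$ip" 100.64.0.0/10; then
    echo "$ip: inside 100.64.0.0/10"
  else
    echo "$ip: outside 100.64.0.0/10"
  fi
done
```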
### Talos API Access
```bash
# Talos configuration (VIP recommended for HA; the VIP is only up while etcd has quorum)
talosctl config endpoint 10.132.0.5  # VIP endpoint
talosctl config node 10.132.0.5      # VIP node
# Alternative: individual node endpoints (use these if the VIP is unreachable)
talosctl config endpoint 10.132.0.10 10.132.0.20 10.132.0.30
talosctl config node 10.132.0.10     # Default target node
```
### Essential Management Commands
```bash
# Cluster health check
talosctl health --nodes 10.132.0.10,10.132.0.20,10.132.0.30
# Cluster membership (discovered nodes)
talosctl get members
# Kubernetes context switching
kubectl config use-context admin@keyboardvagabond-tailscale
# Node status verification
kubectl get nodes -o wide
```
## Storage Configuration Details ✅ OPERATIONAL
### Longhorn Distributed Storage
- **Installation Path**: `/var/lib/longhorn` on each node
- **Replica Policy**: 2-replica configuration across nodes
- **Storage Class**: `longhorn-retain` for data preservation
- **Node Allocation**: 400GB+ per node on system disk
- **Auto-balance**: Enabled for optimal distribution
### Volume Configuration
- **System Disk**: `/dev/vda` with ephemeral storage
- **Longhorn Volume**: 400GB minimum allocation per node
- **Backup Strategy**: Label-based S3 backup selection
- **Reclaim Policy**: Retain (prevents data loss)
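As a back-of-the-envelope check on these numbers, usable capacity is roughly the raw capacity divided by the replica count. The sketch below uses a hypothetical helper name, treats the 400GB per-node figure as a lower bound, and ignores any Longhorn reservation or over-provisioning settings.

```bash
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): rough usable capacity in GB,
# computed as (nodes * per-node GB) / replicas with integer division.
usable_capacity_gb() {
  local nodes=$1 per_node_gb=$2 replicas=$3
  echo $(( nodes * per_node_gb / replicas ))
}

# 3 nodes, 400 GB each, 2 replicas per volume
usable_capacity_gb 3 400 2   # prints 600
```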
## Tailscale Mesh VPN Configuration ✅ OPERATIONAL
### Tailscale Operator Deployment
- **Helm Chart**: `tailscale-operator` from Tailscale Helm repository
- **Version**: v1.90.x (operator v1.90.8)
- **Namespace**: `tailscale-system`
- **Replicas**: 2 operator pods with anti-affinity
- **Hostname**: `keyboardvagabond-operator`
### Subnet Router Configuration (Connector Resource)
- **Resource Type**: `Connector` (tailscale.com/v1alpha1)
- **Device Name**: `keyboardvagabond-cluster`
- **Advertised Networks**:
  - **Pod Network**: 10.244.0.0/16
  - **Service Network**: 10.96.0.0/12
  - **VLAN Network**: 10.132.0.0/24
- **OAuth Integration**: Client credentials for device authentication
- **Device Tagging**: `tag:k8s-operator` for ACL management
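A `Connector` manifest matching the settings above could look roughly like the one printed below. This is a sketch, not the deployed resource: the `metadata.name` value and the shell function name are hypothetical, and mapping the device name to `spec.hostname` is an assumption.

```bash
#!/usr/bin/env bash
# Illustrative only: prints a Connector manifest assembled from the values
# listed in this section. The resource name is a placeholder.
render_connector() {
  cat <<'EOF'
apiVersion: tailscale.com/v1alpha1
kind: Connector
metadata:
  name: cluster-subnet-router   # hypothetical resource name
spec:
  hostname: keyboardvagabond-cluster   # assumed to carry the device name
  tags:
    - tag:k8s-operator
  subnetRouter:
    advertiseRoutes:
      - 10.244.0.0/16   # pod network
      - 10.96.0.0/12    # service network
      - 10.132.0.0/24   # VLAN network
EOF
}

render_connector
```

The rendered manifest can be piped into `kubectl apply -f -` once the operator and its OAuth credentials are in place.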
### Service Exposure via MagicDNS
- **Capability**: Services can be exposed through the Tailscale operator by setting metadata annotations on them
- **MagicDNS**: Automatic DNS resolution for exposed services
- **Annotations**: Also control how a service is exposed and routed
- **Access Control**: Cilium host firewall restricts to Tailscale only
- **Current CGNAT Range**: 100.64.0.0/10 (Tailscale assigned)
## Component Status Matrix ✅ CURRENT STATE
### Active Components
| Component | Status | Access Method | Notes |
|-----------|--------|---------------|-------|
| **Cilium CNI** | ✅ Operational | Internal | Host firewall + Hubble UI |
| **Longhorn Storage** | ✅ Operational | Internal | 2-replica with S3 backup |
| **PostgreSQL HA** | ✅ Operational | Internal | 3-instance CloudNativePG |
| **Harbor Registry** | ✅ Operational | Direct HTTPS | Zero Trust incompatible |
| **OpenObserve** | ✅ Operational | Zero Trust | Monitoring platform |
| **Tailscale VPN** | ✅ Operational | Mesh Network | Administrative access |
### Disabled/Deprecated Components
| Component | Status | Reason | Alternative |
|-----------|--------|--------|-------------|
| **external-dns** | ❌ Removed | Zero Trust migration | Manual DNS in Cloudflare |
| **cert-manager** | ❌ Removed | Zero Trust migration | Cloudflare edge TLS |
| **Rook-Ceph** | ❌ Disabled | Complexity and lack of support for partitioning a single drive | Longhorn storage |
| **Flux GitOps** | ⏸️ Disabled | Manual deployment | Ready for re-activation |
### Development Components
| Component | Status | Purpose | Access |
|-----------|--------|---------|--------|
| **Renovate** | ✅ Operational | Dependency updates | Automated |
| **Elasticsearch** | ✅ Operational | Log aggregation | Internal |
| **Kibana** | ✅ Operational | Log analytics | Zero Trust |
## Network Security Configuration ✅ HARDENED
### Cilium Host Firewall Rules
```yaml
# Control plane API access (Tailscale only); excerpt from an ingress rule list
- fromCIDR: ["100.64.0.0/10"]  # Tailscale CGNAT
  toPorts:
    - ports:
        - port: "6443"
          protocol: TCP
# World access to HTTP/HTTPS:
#   - HTTP/HTTPS ports blocked from 0.0.0.0/0
#   - Only cluster-internal and Tailscale access permitted
```
### Zero Trust Architecture
- **External Applications**: All via Cloudflare tunnels
- **Administrative APIs**: Tailscale mesh VPN only
- **Harbor Exception**: Direct ports 80/443 (header modification issues)
- **Internal Services**: Cluster-local communication only
## Future Scaling Specifications
### Node Addition Process
1. **Network**: Add to NetCup Cloud vLAN 1004963
2. **IP Assignment**: Sequential (10.132.0.40/24, 10.132.0.50/24, etc.)
3. **Talos Config**: Apply machine config with proper networking
4. **Longhorn**: Automatic storage distribution across new nodes
5. **Workload**: Immediate scheduling capability
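The sequential IP scheme in step 2 advances the last octet in steps of 10. A small sketch of that rule, using a hypothetical helper name and assuming the /24 VLAN range stays in use:

```bash
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): given the highest VLAN address in
# use, print the next one, advancing the last octet by a step (default 10).
next_vlan_ip() {
  local current=$1 step=${2:-10}
  local prefix=${current%.*} last=${current##*.}
  local next=$(( last + step ))
  if (( next > 254 )); then
    echo "no free address left in the /24 with step $step" >&2
    return 1
  fi
  echo "${prefix}.${next}/24"
}

next_vlan_ip 10.132.0.30   # prints 10.132.0.40/24
```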
### High Availability Expansion
- **Additional Control Planes**: More can be added beyond the current three to tolerate further node failures
- **Load Balancing**: MetalLB or cloud LB integration ready
- **Database Scaling**: PostgreSQL can expand to more replicas
- **Storage Scaling**: Longhorn distributed across all nodes
@talos-machine-config-template.yaml
@cilium-network-policy-template.yaml
@longhorn-volume-template.yaml