# Kubernetes Metrics Server
## Overview
This deploys the Kubernetes Metrics Server to provide resource metrics for nodes and pods. The metrics server enables `kubectl top` commands and provides metrics for Horizontal Pod Autoscaling (HPA) and Vertical Pod Autoscaling (VPA).
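For illustration, a minimal HorizontalPodAutoscaler that consumes the resource metrics this server provides could look like the following sketch. The `example-app` Deployment name and the utilization target are placeholders, not part of this repository:

```yaml
# Illustrative only: assumes a Deployment named "example-app" exists in the target namespace.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # placeholder target; tune per workload
```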
## Architecture
### Current Deployment (Simple)
- **Version**: v0.7.2 (latest stable)
- **Replicas**: 2 (HA across both cluster nodes)
- **TLS Mode**: Insecure TLS for initial deployment (`--kubelet-insecure-tls=true`)
- **Integration**: OpenObserve monitoring via ServiceMonitor
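The `--kubelet-insecure-tls=true` flag is passed as a container argument. A trimmed excerpt of what the container spec looks like with that flag (other arguments and fields omitted; the exact flag set in this repository may differ):

```yaml
# Excerpt sketch of the metrics-server container spec; not the full manifest.
containers:
  - name: metrics-server
    image: registry.k8s.io/metrics-server/metrics-server:v0.7.2
    args:
      - --kubelet-insecure-tls=true                    # skip kubelet cert verification (Talos compatibility)
      - --kubelet-preferred-address-types=InternalIP   # reach kubelets via their internal IPs
```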
### Security Configuration
The current deployment uses `--kubelet-insecure-tls=true` for compatibility with Talos Linux. This is acceptable for internal cluster metrics because:
- Metrics traffic stays within the cluster network
- The VLAN provides network isolation
- No sensitive data is exposed via metrics
- Proper RBAC controls access to the metrics API
### Future Enhancements (Optional)
For production hardening, the repository includes:
- `certificate.yaml`: cert-manager certificates for proper TLS
- `metrics-server.yaml`: Full TLS-enabled deployment
- To switch to secure TLS, update `kustomization.yaml` to reference these files when needed (see the sketch after this list)
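A hedged sketch of what `kustomization.yaml` might look like after that switch; the actual resource list and namespace handling in this repository may differ:

```yaml
# Sketch of manifests/infrastructure/metrics-server/kustomization.yaml after enabling TLS.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: metrics-server-system
resources:
  - certificate.yaml       # cert-manager Certificate for the serving cert
  - metrics-server.yaml    # full TLS-enabled deployment
```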
## Usage
### Basic Commands
```bash
# View node resource usage
kubectl top nodes
# View pod resource usage (all namespaces)
kubectl top pods --all-namespaces
# View pod resource usage (specific namespace)
kubectl top pods -n kube-system
# View pod resource usage with containers
kubectl top pods --containers
```
### Integration with Monitoring
The metrics server is automatically discovered by OpenObserve via a ServiceMonitor, which enables:
- Metrics server performance monitoring
- Resource usage dashboards
- Alerting on high resource consumption
## Troubleshooting
### Common Issues
1. **"Metrics API not available"**: Check pod status with `kubectl get pods -n metrics-server-system`
2. **TLS certificate errors**: Verify APIService with `kubectl get apiservice v1beta1.metrics.k8s.io`
3. **Resource limits**: Pods may be OOMKilled if cluster load is high; check for restarts with `kubectl get pods -n metrics-server-system` and raise the memory limit if this recurs
### Verification
```bash
# Check metrics server status
kubectl get pods -n metrics-server-system
# Verify API registration
kubectl get apiservice v1beta1.metrics.k8s.io
# Test metrics collection
kubectl top nodes
kubectl top pods -n metrics-server-system
```
## Configuration
### Resource Requests/Limits
- **CPU**: 100m request, 500m limit
- **Memory**: 200Mi request, 500Mi limit
- **Priority**: system-cluster-critical
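In the Deployment's pod spec, these settings correspond to a block roughly like this excerpt (fields shown in isolation):

```yaml
# Excerpt sketch of the resource and priority settings described above.
priorityClassName: system-cluster-critical
containers:
  - name: metrics-server
    resources:
      requests:
        cpu: 100m
        memory: 200Mi
      limits:
        cpu: 500m
        memory: 500Mi
```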
### Node Scheduling
- Tolerates control plane taints
- Can schedule on both n1 (control plane) and n2 (worker)
- Uses node selector for Linux nodes only
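Expressed in the pod spec, that scheduling policy looks roughly like the following sketch; the exact toleration key in the repository's manifest may differ:

```yaml
# Sketch of the scheduling settings described above.
nodeSelector:
  kubernetes.io/os: linux
tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: Exists
    effect: NoSchedule
```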
## Monitoring Integration
- **ServiceMonitor**: Automatically scraped by OpenObserve
- **Metrics Path**: `/metrics` on HTTPS port
- **Scrape Interval**: 30 seconds
- **Dashboard**: Available in OpenObserve for resource analysis
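Put together, a ServiceMonitor matching these parameters might look like the sketch below; the selector labels and endpoint port name are assumptions, and the repository's actual object may differ:

```yaml
# Hedged sketch of a ServiceMonitor for metrics-server; selector labels are assumed.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: metrics-server
  namespace: metrics-server-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: metrics-server   # assumed Service label
  endpoints:
    - port: https          # assumed Service port name
      scheme: https
      path: /metrics
      interval: 30s
      tlsConfig:
        insecureSkipVerify: true   # matches the insecure-TLS posture of the current deployment
```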