# OpenObserve Dashboard PromQL Queries
This document provides PromQL queries for rebuilding OpenObserve dashboards after disaster recovery. The queries are organized by metric type and application.
## Metric Sources
Your cluster has multiple metric sources:
1. **OpenTelemetry spanmetrics** - Generates metrics from traces (`calls_total`, `latency`)
2. **Ingress-nginx** - HTTP request metrics at the ingress layer
3. **Application metrics** - Direct metrics from applications (Mastodon, BookWyrm, etc.)
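Before rebuilding panels, it is worth confirming that each source is actually emitting data. A minimal discovery sketch, using only metric names referenced later in this document (adjust the regex to whatever families you expect):
```promql
# One result row per metric family that is currently reporting
count by (__name__) (
{__name__=~"nginx_ingress_controller_requests|calls_total|latency_count"}
)
```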
## Applications
- **Mastodon** (`mastodon-application`)
- **Pixelfed** (`pixelfed-application`)
- **PieFed** (`piefed-application`)
- **BookWyrm** (`bookwyrm-application`)
- **Picsur** (`picsur`)
- **Write Freely** (`write-freely`)
---
## 1. Requests Per Second (RPS) by Application
### Using Ingress-Nginx Metrics (Recommended - Most Reliable)
```promql
# Total RPS by application (via ingress)
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
# RPS by application and status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)
# RPS by application and HTTP method
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, method)
# RPS for specific applications
sum(rate(nginx_ingress_controller_requests{namespace=~"mastodon-application|pixelfed-application|piefed-application|bookwyrm-application"}[5m])) by (ingress, namespace)
```
### Using OpenTelemetry spanmetrics
```promql
# RPS from spanmetrics (if service names are properly labeled)
sum(rate(calls_total[5m])) by (service_name)
# RPS by application namespace (if k8s attributes are present; dotted OTel
# attribute names are assumed to be exported with underscores)
sum(rate(calls_total[5m])) by (k8s_namespace_name, service_name)
# RPS by application and HTTP method
sum(rate(calls_total[5m])) by (service_name, http_method)
# RPS by application and status code
sum(rate(calls_total[5m])) by (service_name, http_status_code)
```
### Combined View (All Applications)
```promql
# All applications RPS
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
```
---
## 2. Request Duration by Application
### Using Ingress-Nginx Metrics
```promql
# Average request duration by application
sum(rate(nginx_ingress_controller_request_duration_seconds_sum[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress, namespace)
# P50 (median) request duration
histogram_quantile(0.50,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
# P95 request duration
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
# P99 request duration
histogram_quantile(0.99,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
# P99.9 request duration (for tail latency)
histogram_quantile(0.999,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
# Approximate max request duration (capped at the highest finite histogram bucket)
histogram_quantile(1.00,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```
### Using OpenTelemetry spanmetrics
```promql
# Average latency from spanmetrics
sum(rate(latency_sum[5m])) by (service_name)
/
sum(rate(latency_count[5m])) by (service_name)
# P50 latency
histogram_quantile(0.50,
sum(rate(latency_bucket[5m])) by (service_name, le)
)
# P95 latency
histogram_quantile(0.95,
sum(rate(latency_bucket[5m])) by (service_name, le)
)
# P99 latency
histogram_quantile(0.99,
sum(rate(latency_bucket[5m])) by (service_name, le)
)
# Latency by HTTP method (method label assumed to be exported as http_method)
histogram_quantile(0.95,
sum(rate(latency_bucket[5m])) by (service_name, http_method, le)
)
```
### Response Duration (Backend Processing Time)
```promql
# Average backend response duration
sum(rate(nginx_ingress_controller_response_duration_seconds_sum[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_response_duration_seconds_count[5m])) by (ingress, namespace)
# P95 backend response duration
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```
---
## 3. Success Rate by Application
### Using Ingress-Nginx Metrics
```promql
# Success rate (2xx / total requests) by application
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
# Success rate as percentage
(
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100
# Error rate (4xx + 5xx) by application
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
# Error rate as percentage
(
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100
# Breakdown by status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)
# 5xx errors specifically
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace)
```
### Using OpenTelemetry spanmetrics
```promql
# Success rate from spanmetrics (status label assumed to be exported as http_status_code)
sum(rate(calls_total{http_status_code=~"2.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)
# Error rate from spanmetrics
sum(rate(calls_total{http_status_code=~"4..|5.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)
# Breakdown by status code
sum(rate(calls_total[5m])) by (service_name, http_status_code)
```
---
## 4. Additional Best Practice Metrics
### Request Volume Trends
```promql
# Requests per minute (for trend analysis)
sum(rate(nginx_ingress_controller_requests[1m])) by (namespace) * 60
# Total requests in last hour
sum(increase(nginx_ingress_controller_requests[1h])) by (namespace)
```
### Top Endpoints
```promql
# Top endpoints by request volume
topk(10, sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, path))
# Top slowest endpoints (P95)
topk(10,
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, path, le)
)
)
```
### Error Analysis
```promql
# 4xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"4.."}[5m])) by (ingress, namespace, status)
# 5xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace, status)
# Error rate trend (detect spikes)
rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])
```
### Throughput Metrics
```promql
# Bytes sent per second
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)
# Bytes received per second
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)
# Total bandwidth usage
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)
+
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)
```
### Connection Metrics
```promql
# Active connections
sum(nginx_ingress_controller_connections) by (ingress, namespace, state)
# Connection rate
sum(rate(nginx_ingress_controller_connections[5m])) by (ingress, namespace, state)
```
### Application-Specific Metrics
#### Mastodon
```promql
# Mastodon-specific metrics (if exposed)
sum(rate(mastodon_http_requests_total[5m])) by (method, status)
sum(rate(mastodon_http_request_duration_seconds[5m])) by (method)
```
#### BookWyrm
```promql
# BookWyrm-specific metrics (if exposed)
sum(rate(bookwyrm_requests_total[5m])) by (method, status)
```
### Database Connection Metrics (PostgreSQL)
```promql
# Active database connections by application
pg_application_connections{state="active"}
# Total connections by application
sum(pg_application_connections) by (app_name)
# Connection pool utilization
sum(pg_application_connections) by (app_name) / 100 # Adjust divisor based on max connections
```
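If the standard postgres_exporter is also scraped, its `pg_settings_max_connections` gauge can replace the hardcoded divisor above. This is an assumption about the exporter setup; keep the manual divisor if that metric is not present.
```promql
# Pool utilization against the server-wide limit (assumes a single PostgreSQL
# instance; with several instances, scalar() returns NaN and explicit vector
# matching is needed instead)
sum(pg_application_connections) by (app_name)
/
scalar(pg_settings_max_connections)
```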
### Celery Queue Metrics
```promql
# Queue length by application
sum(celery_queue_length{queue_name!="_total"}) by (database)
# Queue processing rate (tasks drained per minute; the -60 flips the negative slope into a positive number)
sum(rate(celery_queue_length{queue_name!="_total"}[5m])) by (database) * -60
# Stalled queues (no change in 15 minutes)
changes(celery_queue_length{queue_name="_total"}[15m]) == 0
and celery_queue_length{queue_name="_total"} > 100
```
#### Redis-Backed Queue Dashboard Panels
Use these two panel queries to rebuild the Redis/Celery queue dashboard after a wipe. Both panels assume metrics are flowing from the `celery-metrics-exporter` in the `celery-monitoring` namespace.
- **Queue Depth per Queue (stacked area or line)**
```promql
sum by (database, queue_name) (
celery_queue_length{
queue_name!~"_total|_staging",
database=~"piefed|bookwyrm|mastodon"
}
)
```
This shows the absolute number of pending items in every discovered queue. Filter the `database` regex if you only want a single app. Switch the panel legend to `{{database}}/{{queue_name}}` so per-queue trends stand out.
- **Processing Rate per Queue (tasks/minute)**
```promql
-60 * sum by (database, queue_name) (
rate(
celery_queue_length{
queue_name!~"_total|_staging",
database=~"piefed|bookwyrm|mastodon"
}[5m]
)
)
```
The queue length decreases when workers drain tasks, so multiply the `rate()` by `-60` to turn that negative slope into a positive “tasks per minute processed” number. For example, a queue that drains from 600 to 300 items over the 5-minute window has a `rate()` of roughly -1/s, which becomes 60 tasks/minute. Values that stay near zero for a busy queue are a red flag that workers are stuck.
> **Fallback**: If the custom exporter is down, you can build the same dashboards off the upstream Redis exporter metric `redis_list_length{alias="redis-ha",key=~"celery|.*_priority|high|low"}`. Replace `celery_queue_length` with `redis_list_length` in both queries and swap the `database`/`queue_name` grouping for that exporter's `alias`/`key` labels; the rest of the panel configuration stays the same (see the sketch below).
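For example, the queue-depth panel becomes the sketch below; this assumes the upstream exporter keeps the `alias` and `key` labels shown in the selector above.
```promql
# Fallback queue depth per Redis list, via the upstream Redis exporter
sum by (alias, key) (
redis_list_length{
alias="redis-ha",
key=~"celery|.*_priority|high|low"
}
)
```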
An import-ready OpenObserve dashboard that contains these two panels lives at `docs/dashboards/openobserve-redis-queue-dashboard.json`. Import it via *Dashboards → Import* to jump-start the rebuild after a disaster recovery.
### Redis Metrics
```promql
# Redis connection status
redis_connection_status
# Redis memory usage (if available)
redis_memory_used_bytes
```
### Pod/Container Metrics
```promql
# CPU usage by application
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
# Memory usage by application
sum(container_memory_working_set_bytes) by (namespace, pod)
# Pod restarts
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod)
```
---
## 5. Dashboard Panel Recommendations
### Panel 1: Overview
- **Total RPS** (all applications)
- **Total Error Rate** (all applications)
- **Average Response Time** (P95, all applications; example queries below)
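A minimal query set for this overview panel, built by dropping the per-application grouping from the ingress-nginx queries above:
```promql
# Total RPS across all applications
sum(rate(nginx_ingress_controller_requests[5m]))
# Overall error rate (4xx + 5xx share of all requests)
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m]))
/
sum(rate(nginx_ingress_controller_requests[5m]))
# Overall P95 request duration
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le)
)
```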
### Panel 2: Per-Application RPS
- Time series graph showing RPS for each application
- Use `sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)`
### Panel 3: Per-Application Latency
- P50, P95, P99 latency for each application
- Use histogram quantiles from ingress-nginx metrics
### Panel 4: Success/Error Rates
- Success rate (2xx) by application
- Error rate (4xx + 5xx) by application
- Status code breakdown
### Panel 5: Top Endpoints
- Top 10 endpoints by volume
- Top 10 slowest endpoints
### Panel 6: Database Health
- Active connections by application
- Connection pool utilization
### Panel 7: Queue Health (Celery)
- Queue lengths by application
- Processing rates
### Panel 8: Resource Usage
- CPU usage by application
- Memory usage by application
- Pod restart counts
---
## 6. Alerting Queries
### High Error Rate
```promql
# Alert if error rate > 5% for any application
(
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) > 0.05
```
### High Latency
```promql
# Alert if P95 latency > 2 seconds
histogram_quantile(0.95,
sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (namespace, le)
) > 2
```
### Low Success Rate
```promql
# Alert if success rate < 95%
(
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) < 0.95
```
### High Request Volume (Spike Detection)
```promql
# Alert if RPS increases by 3x in 5 minutes
rate(nginx_ingress_controller_requests[5m])
>
3 * rate(nginx_ingress_controller_requests[5m] offset 5m)
```
---
## 7. Notes on Metric Naming
- **Ingress-nginx metrics** are the most reliable source of HTTP request data
- **spanmetrics** may have different label names depending on k8s attribute processor configuration
- Check actual metric names in OpenObserve using: `{__name__=~".*request.*|.*http.*|.*latency.*"}`
- Service names from spanmetrics may need to be mapped to application names (see the `label_replace` sketch below)
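One way to do that mapping inside a panel query is `label_replace`. The regex below is only an illustration and assumes the spanmetrics `service_name` values contain the application name somewhere in them:
```promql
# Derive an "app" label from the spanmetrics service name (illustrative regex)
label_replace(
sum(rate(calls_total[5m])) by (service_name),
"app", "$1", "service_name", ".*(mastodon|pixelfed|piefed|bookwyrm|picsur|write-freely).*"
)
```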
## 8. Troubleshooting
If metrics don't appear:
1. **Check ServiceMonitors are active:**
```bash
kubectl get servicemonitors -A
```
2. **Verify the Prometheus receiver is scraping:**
Check the OpenTelemetry Collector logs for scrape errors (see the scrape-health query after this list)
3. **Verify metric names:**
Query OpenObserve for available metrics:
```promql
# List every reporting series (expensive; prefer the narrower pattern from the notes above)
{__name__=~".+"}
```
4. **Check label names:**
The actual label names may vary. Common variations:
- `namespace` vs `k8s.namespace.name`
- `service_name` vs `service.name`
- `ingress` vs `ingress_name`
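For step 2, the collector's Prometheus receiver normally reports the standard `up` series per scrape target (whether it is exported depends on the pipeline configuration, so treat this as an assumption). A quick scrape-health sketch:
```promql
# Targets that failed every scrape in the last 5 minutes
max_over_time(up[5m]) == 0
# Scrape health summarised by job
min(up) by (job)
```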
---
## Quick Reference: Application Namespaces
- Mastodon: `mastodon-application`
- Pixelfed: `pixelfed-application`
- PieFed: `piefed-application`
- BookWyrm: `bookwyrm-application`
- Picsur: `picsur`
- Write Freely: `write-freely`