# OpenObserve Dashboard PromQL Queries

This document provides PromQL queries for rebuilding OpenObserve dashboards after disaster recovery. The queries are organized by metric type and application.

## Metric Sources

Your cluster has multiple metric sources:

1. **OpenTelemetry spanmetrics** - Generates metrics from traces (`calls_total`, `latency`)
2. **Ingress-nginx** - HTTP request metrics at the ingress layer
3. **Application metrics** - Direct metrics from applications (Mastodon, BookWyrm, etc.)

## Applications

- **Mastodon** (`mastodon-application`)
- **Pixelfed** (`pixelfed-application`)
- **PieFed** (`piefed-application`)
- **BookWyrm** (`bookwyrm-application`)
- **Picsur** (`picsur`)
- **Write Freely** (`write-freely`)

---

## 1. Requests Per Second (RPS) by Application

### Using Ingress-Nginx Metrics (Recommended - Most Reliable)

```promql
# Total RPS by application (via ingress)
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# RPS by application and status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)

# RPS by application and HTTP method
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, method)

# RPS for specific applications
sum(rate(nginx_ingress_controller_requests{namespace=~"mastodon-application|pixelfed-application|piefed-application|bookwyrm-application"}[5m])) by (ingress, namespace)
```

### Using OpenTelemetry spanmetrics

```promql
# RPS from spanmetrics (if service names are properly labeled)
sum(rate(calls_total[5m])) by (service_name)

# RPS by application namespace (if k8s attributes are present)
sum(rate(calls_total[5m])) by (k8s.namespace.name, service_name)

# RPS by application and HTTP method
sum(rate(calls_total[5m])) by (service_name, http.method)

# RPS by application and status code
sum(rate(calls_total[5m])) by (service_name, http.status_code)
```

### Combined View (All Applications)

```promql
# All applications RPS
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
```
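If you want the spanmetrics view to line up with the ingress panels, you can map `service_name` onto the ingress `namespace` label. The sketch below is an assumption-laden example: it presumes the spanmetrics service names begin with the application name, which you should verify against your data (see the mapping note in section 7).

```promql
# Hypothetical mapping: derive a "namespace" label from service_name so both RPS
# sources can share a legend. Adjust the regex to match your actual service names.
sum by (namespace) (
  label_replace(
    rate(calls_total{service_name=~"(mastodon|pixelfed|piefed|bookwyrm).*"}[5m]),
    "namespace", "$1-application", "service_name", "(mastodon|pixelfed|piefed|bookwyrm).*"
  )
)
```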
---

## 2. Request Duration by Application

### Using Ingress-Nginx Metrics

```promql
# Average request duration by application
sum(rate(nginx_ingress_controller_request_duration_seconds_sum[5m])) by (ingress, namespace)
/ sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress, namespace)

# P50 (median) request duration
histogram_quantile(0.50,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P95 request duration
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P99 request duration
histogram_quantile(0.99,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P99.9 request duration (for tail latency)
histogram_quantile(0.999,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# Approximate max request duration (the metric is a histogram, so this returns the
# upper bound of the highest populated bucket rather than a true maximum)
histogram_quantile(1,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```

### Using OpenTelemetry spanmetrics

```promql
# Average latency from spanmetrics
sum(rate(latency_sum[5m])) by (service_name)
/ sum(rate(latency_count[5m])) by (service_name)

# P50 latency
histogram_quantile(0.50,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# P95 latency
histogram_quantile(0.95,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# P99 latency
histogram_quantile(0.99,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# Latency by HTTP method
histogram_quantile(0.95,
  sum(rate(latency_bucket[5m])) by (service_name, http.method, le)
)
```

### Response Duration (Backend Processing Time)

```promql
# Average backend response duration
sum(rate(nginx_ingress_controller_response_duration_seconds_sum[5m])) by (ingress, namespace)
/ sum(rate(nginx_ingress_controller_response_duration_seconds_count[5m])) by (ingress, namespace)

# P95 backend response duration
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```
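The two ingress histograms above can also be combined to estimate how much of the end-to-end latency is spent outside the backend (nginx processing plus client transfer). This is a rough sketch built only from the metrics already shown; treat a persistently large gap as a hint to investigate, not a precise measurement.

```promql
# Approximate non-backend time per application:
# average total request duration minus average upstream response duration
(
  sum(rate(nginx_ingress_controller_request_duration_seconds_sum[5m])) by (ingress, namespace)
  / sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress, namespace)
)
-
(
  sum(rate(nginx_ingress_controller_response_duration_seconds_sum[5m])) by (ingress, namespace)
  / sum(rate(nginx_ingress_controller_response_duration_seconds_count[5m])) by (ingress, namespace)
)
```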
---

## 3. Success Rate by Application

### Using Ingress-Nginx Metrics

```promql
# Success rate (2xx / total requests) by application
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
/ sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# Success rate as percentage
(
  sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
  / sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100

# Error rate (4xx + 5xx) by application
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
/ sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# Error rate as percentage
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
  / sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100

# Breakdown by status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)

# 5xx errors specifically
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace)
```

### Using OpenTelemetry spanmetrics

```promql
# Success rate from spanmetrics
sum(rate(calls_total{http.status_code=~"2.."}[5m])) by (service_name)
/ sum(rate(calls_total[5m])) by (service_name)

# Error rate from spanmetrics
sum(rate(calls_total{http.status_code=~"4..|5.."}[5m])) by (service_name)
/ sum(rate(calls_total[5m])) by (service_name)

# Breakdown by status code
sum(rate(calls_total[5m])) by (service_name, http.status_code)
```

---

## 4. Additional Best Practice Metrics

### Request Volume Trends

```promql
# Requests per minute (for trend analysis)
sum(rate(nginx_ingress_controller_requests[1m])) by (namespace) * 60

# Total requests in last hour
sum(increase(nginx_ingress_controller_requests[1h])) by (namespace)
```

### Top Endpoints

```promql
# Top endpoints by request volume
topk(10, sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, path))

# Top slowest endpoints (P95)
topk(10,
  histogram_quantile(0.95,
    sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, path, le)
  )
)
```

### Error Analysis

```promql
# 4xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"4.."}[5m])) by (ingress, namespace, status)

# 5xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace, status)

# Error rate trend (detect spikes)
rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])
```

### Throughput Metrics

```promql
# Bytes sent per second
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)

# Bytes received per second
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)

# Total bandwidth usage
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)
+ sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)
```

### Connection Metrics

```promql
# Active connections
sum(nginx_ingress_controller_connections) by (ingress, namespace, state)

# Connection rate
sum(rate(nginx_ingress_controller_connections[5m])) by (ingress, namespace, state)
```

### Application-Specific Metrics

#### Mastodon

```promql
# Mastodon-specific metrics (if exposed)
sum(rate(mastodon_http_requests_total[5m])) by (method, status)
sum(rate(mastodon_http_request_duration_seconds[5m])) by (method)
```

#### BookWyrm

```promql
# BookWyrm-specific metrics (if exposed)
sum(rate(bookwyrm_requests_total[5m])) by (method, status)
```
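The application-specific examples above only cover Mastodon and BookWyrm. For the remaining applications (Pixelfed, PieFed, Picsur, Write Freely), or whenever a native exporter is not exposed, fall back to the ingress-nginx series from section 1, for example:

```promql
# Ingress-level request rate and status breakdown for apps without native exporters
sum(rate(nginx_ingress_controller_requests{namespace=~"pixelfed-application|piefed-application|picsur|write-freely"}[5m])) by (namespace, status)
```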
### Database Connection Metrics (PostgreSQL)

```promql
# Active database connections by application
pg_application_connections{state="active"}

# Total connections by application
sum(pg_application_connections) by (app_name)

# Connection pool utilization
sum(pg_application_connections) by (app_name) / 100  # Adjust divisor based on max connections
```

### Celery Queue Metrics

```promql
# Queue length by application
sum(celery_queue_length{queue_name!="_total"}) by (database)

# Queue processing rate (tasks/minute; deriv() gives the gauge's slope, which is negative while draining)
sum(deriv(celery_queue_length{queue_name!="_total"}[5m])) by (database) * -60

# Stalled queues (no change in 15 minutes)
changes(celery_queue_length{queue_name="_total"}[15m]) == 0
and celery_queue_length{queue_name="_total"} > 100
```

#### Redis-Backed Queue Dashboard Panels

Use these two panel queries to rebuild the Redis/Celery queue dashboard after a wipe. Both panels assume metrics are flowing from the `celery-metrics-exporter` in the `celery-monitoring` namespace.

- **Queue Depth per Queue (stacked area or line)**

  ```promql
  sum by (database, queue_name) (
    celery_queue_length{
      queue_name!~"_total|_staging",
      database=~"piefed|bookwyrm|mastodon"
    }
  )
  ```

  This shows the absolute number of pending items in every discovered queue. Filter the `database` regex if you only want a single app. Switch the panel legend to `{{database}}/{{queue_name}}` so per-queue trends stand out.

- **Processing Rate per Queue (tasks/minute)**

  ```promql
  -60 * sum by (database, queue_name) (
    deriv(
      celery_queue_length{
        queue_name!~"_total|_staging",
        database=~"piefed|bookwyrm|mastodon"
      }[5m]
    )
  )
  ```

  The queue length is a gauge that falls as workers drain tasks, so take its slope with `deriv()` and multiply by `-60` to turn that negative slope into a positive "tasks per minute processed" number. (Avoid `rate()` here: it is meant for counters and treats every decrease as a counter reset.) Values that stay near zero for a busy queue are a red flag that workers are stuck.

> **Fallback**: If the custom exporter is down, you can build the same dashboards off the upstream Redis exporter metric `redis_list_length{alias="redis-ha",key=~"celery|.*_priority|high|low"}`. Replace `celery_queue_length` with `redis_list_length` in both queries and keep the rest of the panel configuration identical.

An import-ready OpenObserve dashboard that contains these two panels lives at `docs/dashboards/openobserve-redis-queue-dashboard.json`. Import it via *Dashboards → Import* to jump-start the rebuild after a disaster recovery.

### Redis Metrics

```promql
# Redis connection status
redis_connection_status

# Redis memory usage (if available)
redis_memory_used_bytes
```

### Pod/Container Metrics

```promql
# CPU usage by application
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)

# Memory usage by application
sum(container_memory_working_set_bytes) by (namespace, pod)

# Pod restarts
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod)
```
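If kube-state-metrics is being scraped (the pod-restart query above already relies on it), you can also chart CPU usage relative to configured limits. This is a sketch that assumes the standard `kube_pod_container_resource_limits` series is available in your cluster:

```promql
# CPU utilization as a fraction of the pod's CPU limit (values near 1 indicate throttling risk)
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)
/ sum(kube_pod_container_resource_limits{resource="cpu"}) by (namespace, pod)
```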
---

## 5. Dashboard Panel Recommendations

### Panel 1: Overview
- **Total RPS** (all applications)
- **Total Error Rate** (all applications)
- **Average Response Time** (P95, all applications)

### Panel 2: Per-Application RPS
- Time series graph showing RPS for each application
- Use `sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)`

### Panel 3: Per-Application Latency
- P50, P95, P99 latency for each application
- Use histogram quantiles from ingress-nginx metrics

### Panel 4: Success/Error Rates
- Success rate (2xx) by application
- Error rate (4xx + 5xx) by application
- Status code breakdown

### Panel 5: Top Endpoints
- Top 10 endpoints by volume
- Top 10 slowest endpoints

### Panel 6: Database Health
- Active connections by application
- Connection pool utilization

### Panel 7: Queue Health (Celery)
- Queue lengths by application
- Processing rates

### Panel 8: Resource Usage
- CPU usage by application
- Memory usage by application
- Pod restart counts

---

## 6. Alerting Queries

### High Error Rate

```promql
# Alert if error rate > 5% for any application
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (namespace)
  / sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) > 0.05
```

### High Latency

```promql
# Alert if P95 latency > 2 seconds
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (namespace, le)
) > 2
```

### Low Success Rate

```promql
# Alert if success rate < 95%
(
  sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (namespace)
  / sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) < 0.95
```

### High Request Volume (Spike Detection)

```promql
# Alert if RPS is more than 3x the rate from 5 minutes ago
rate(nginx_ingress_controller_requests[5m]) > 3 * rate(nginx_ingress_controller_requests[5m] offset 5m)
```

---

## 7. Notes on Metric Naming

- **Ingress-nginx metrics** are the most reliable for HTTP request metrics
- **spanmetrics** may have different label names depending on k8s attribute processor configuration
- Check actual metric names in OpenObserve using: `{__name__=~".*request.*|.*http.*|.*latency.*"}`
- Service names from spanmetrics may need to be mapped to application names

## 8. Troubleshooting

If metrics don't appear:

1. **Check ServiceMonitors are active:**
   ```bash
   kubectl get servicemonitors -A
   ```

2. **Verify Prometheus receiver is scraping:**
   Check OpenTelemetry collector logs for scraping errors

3. **Verify metric names:**
   Query OpenObserve for available metrics:
   ```promql
   {__name__=~".*"}
   ```

4. **Check label names:**
   The actual label names may vary. Common variations:
   - `namespace` vs `k8s.namespace.name`
   - `service_name` vs `service.name`
   - `ingress` vs `ingress_name`

---

## Quick Reference: Application Namespaces

- Mastodon: `mastodon-application`
- Pixelfed: `pixelfed-application`
- PieFed: `piefed-application`
- BookWyrm: `bookwyrm-application`
- Picsur: `picsur`
- Write Freely: `write-freely`
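Where a dashboard variable is not convenient, a single regex selector covering all six namespaces can be dropped into any of the ingress-nginx queries above, for example:

```promql
# RPS across every application namespace listed in this document
sum(rate(nginx_ingress_controller_requests{namespace=~"mastodon-application|pixelfed-application|piefed-application|bookwyrm-application|picsur|write-freely"}[5m])) by (namespace)
```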