# OpenObserve Dashboard PromQL Queries

This document provides PromQL queries for rebuilding OpenObserve dashboards after disaster recovery. The queries are organized by metric type and application.

## Metric Sources

Your cluster has multiple metric sources:

1. **OpenTelemetry spanmetrics** - Generates metrics from traces (`calls_total`, `latency`)
2. **Ingress-nginx** - HTTP request metrics at the ingress layer
3. **Application metrics** - Direct metrics from applications (Mastodon, BookWyrm, etc.)

## Applications

- **Mastodon** (`mastodon-application`)
- **Pixelfed** (`pixelfed-application`)
- **PieFed** (`piefed-application`)
- **BookWyrm** (`bookwyrm-application`)
- **Picsur** (`picsur`)
- **Write Freely** (`write-freely`)

---
## 1. Requests Per Second (RPS) by Application

### Using Ingress-Nginx Metrics (Recommended - Most Reliable)

```promql
# Total RPS by application (via ingress)
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# RPS by application and status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)

# RPS by application and HTTP method
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, method)

# RPS for specific applications
sum(rate(nginx_ingress_controller_requests{namespace=~"mastodon-application|pixelfed-application|piefed-application|bookwyrm-application"}[5m])) by (ingress, namespace)
```
### Using OpenTelemetry spanmetrics

```promql
# RPS from spanmetrics (if service names are properly labeled)
sum(rate(calls_total[5m])) by (service_name)

# RPS by application namespace (if k8s attributes are present; dotted OTel attribute names are typically exported with underscores)
sum(rate(calls_total[5m])) by (k8s_namespace_name, service_name)

# RPS by application and HTTP method
sum(rate(calls_total[5m])) by (service_name, http_method)

# RPS by application and status code
sum(rate(calls_total[5m])) by (service_name, http_status_code)
```
### Combined View (All Applications)

```promql
# All applications RPS
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
```

---
## 2. Request Duration by Application

### Using Ingress-Nginx Metrics

```promql
# Average request duration by application
sum(rate(nginx_ingress_controller_request_duration_seconds_sum[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress, namespace)

# P50 (median) request duration
histogram_quantile(0.50,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P95 request duration
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P99 request duration
histogram_quantile(0.99,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P99.9 request duration (for tail latency)
histogram_quantile(0.999,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# Approximate max request duration (the histogram has no exact max; this estimate is capped at the bucket boundaries)
histogram_quantile(1.0,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```
### Using OpenTelemetry spanmetrics

```promql
# Average latency from spanmetrics
sum(rate(latency_sum[5m])) by (service_name)
/
sum(rate(latency_count[5m])) by (service_name)

# P50 latency
histogram_quantile(0.50,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# P95 latency
histogram_quantile(0.95,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# P99 latency
histogram_quantile(0.99,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# Latency by HTTP method
histogram_quantile(0.95,
  sum(rate(latency_bucket[5m])) by (service_name, http_method, le)
)
```
### Response Duration (Backend Processing Time)

```promql
# Average backend response duration
sum(rate(nginx_ingress_controller_response_duration_seconds_sum[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_response_duration_seconds_count[5m])) by (ingress, namespace)

# P95 backend response duration
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```

---
## 3. Success Rate by Application

### Using Ingress-Nginx Metrics

```promql
# Success rate (2xx / total requests) by application
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# Success rate as percentage
(
  sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100

# Error rate (4xx + 5xx) by application
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# Error rate as percentage
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100

# Breakdown by status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)

# 5xx errors specifically
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace)
```
### Using OpenTelemetry spanmetrics

```promql
# Success rate from spanmetrics
sum(rate(calls_total{http_status_code=~"2.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)

# Error rate from spanmetrics
sum(rate(calls_total{http_status_code=~"4..|5.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)

# Breakdown by status code
sum(rate(calls_total[5m])) by (service_name, http_status_code)
```

---
## 4. Additional Best Practice Metrics

### Request Volume Trends

```promql
# Requests per minute (for trend analysis)
sum(rate(nginx_ingress_controller_requests[1m])) by (namespace) * 60

# Total requests in last hour
sum(increase(nginx_ingress_controller_requests[1h])) by (namespace)
```
### Top Endpoints

```promql
# Top endpoints by request volume
topk(10, sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, path))

# Top slowest endpoints (P95)
topk(10,
  histogram_quantile(0.95,
    sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, path, le)
  )
)
```
### Error Analysis

```promql
# 4xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"4.."}[5m])) by (ingress, namespace, status)

# 5xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace, status)

# Error rate trend (detect spikes)
rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])
```
### Throughput Metrics

```promql
# Bytes sent per second
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)

# Bytes received per second
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)

# Total bandwidth usage
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)
+
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)
```
### Connection Metrics

```promql
# Active connections
sum(nginx_ingress_controller_connections) by (ingress, namespace, state)

# Connection rate
sum(rate(nginx_ingress_controller_connections[5m])) by (ingress, namespace, state)
```
### Application-Specific Metrics

#### Mastodon

```promql
# Mastodon-specific metrics (if exposed)
sum(rate(mastodon_http_requests_total[5m])) by (method, status)
sum(rate(mastodon_http_request_duration_seconds[5m])) by (method)
```

#### BookWyrm

```promql
# BookWyrm-specific metrics (if exposed)
sum(rate(bookwyrm_requests_total[5m])) by (method, status)
```
### Database Connection Metrics (PostgreSQL)

```promql
# Active database connections by application
pg_application_connections{state="active"}

# Total connections by application
sum(pg_application_connections) by (app_name)

# Connection pool utilization
sum(pg_application_connections) by (app_name) / 100 # Adjust divisor based on max connections
```
### Celery Queue Metrics

```promql
# Queue length by application
sum(celery_queue_length{queue_name!="_total"}) by (database)

# Queue processing rate
sum(rate(celery_queue_length{queue_name!="_total"}[5m])) by (database) * -60

# Stalled queues (no change in 15 minutes)
changes(celery_queue_length{queue_name="_total"}[15m]) == 0
and celery_queue_length{queue_name="_total"} > 100
```
#### Redis-Backed Queue Dashboard Panels

Use these two panel queries to rebuild the Redis/Celery queue dashboard after a wipe. Both panels assume metrics are flowing from the `celery-metrics-exporter` in the `celery-monitoring` namespace.

- **Queue Depth per Queue (stacked area or line)**

```promql
sum by (database, queue_name) (
  celery_queue_length{
    queue_name!~"_total|_staging",
    database=~"piefed|bookwyrm|mastodon"
  }
)
```

This shows the absolute number of pending items in every discovered queue. Filter the `database` regex if you only want a single app (see the example below). Switch the panel legend to `{{database}}/{{queue_name}}` so per-queue trends stand out.

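For a single application, a minimal filtered sketch of the same panel (BookWyrm is used here purely as an example value for `database`):

```promql
# Queue depth for one application only; swap the database value as needed
sum by (queue_name) (
  celery_queue_length{
    queue_name!~"_total|_staging",
    database="bookwyrm"
  }
)
```
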
- **Processing Rate per Queue (tasks/minute)**

```promql
-60 * sum by (database, queue_name) (
  rate(
    celery_queue_length{
      queue_name!~"_total|_staging",
      database=~"piefed|bookwyrm|mastodon"
    }[5m]
  )
)
```

The queue length decreases when workers drain tasks, so multiply the `rate()` by `-60` to turn that negative slope into a positive “tasks per minute processed” number. Values that stay near zero for a busy queue are a red flag that workers are stuck.

> **Fallback**: If the custom exporter is down, you can build the same dashboards off the upstream Redis exporter metric `redis_list_length{alias="redis-ha",key=~"celery|.*_priority|high|low"}`. Replace `celery_queue_length` with `redis_list_length` in both queries and keep the rest of the panel configuration identical.

An import-ready OpenObserve dashboard that contains these two panels lives at `docs/dashboards/openobserve-redis-queue-dashboard.json`. Import it via *Dashboards → Import* to jump-start the rebuild after a disaster recovery.

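As a sketch of the fallback note above, the queue-depth panel would look roughly like this; it assumes the Redis exporter's `alias`/`key` labels take the place of the custom exporter's `database`/`queue_name` grouping:

```promql
# Fallback queue depth built from the upstream Redis exporter
sum by (alias, key) (
  redis_list_length{
    alias="redis-ha",
    key=~"celery|.*_priority|high|low"
  }
)
```
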
### Redis Metrics

```promql
# Redis connection status
redis_connection_status

# Redis memory usage (if available)
redis_memory_used_bytes
```
### Pod/Container Metrics

```promql
# CPU usage by application
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)

# Memory usage by application
sum(container_memory_working_set_bytes) by (namespace, pod)

# Pod restarts
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod)
```

---
## 5. Dashboard Panel Recommendations

### Panel 1: Overview

- **Total RPS** (all applications)
- **Total Error Rate** (all applications)
- **Average Response Time** (P95, all applications; see the combined queries below)

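These overview panels can be derived from the same ingress metrics used in sections 1-3 by dropping the per-application `by` grouping; a minimal sketch:

```promql
# Total RPS across all applications
sum(rate(nginx_ingress_controller_requests[5m]))

# Total error rate (4xx + 5xx) as a percentage of all traffic
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m]))
  /
  sum(rate(nginx_ingress_controller_requests[5m]))
) * 100

# P95 response time across all applications
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le)
)
```
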
### Panel 2: Per-Application RPS

- Time series graph showing RPS for each application
- Use `sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)`

### Panel 3: Per-Application Latency

- P50, P95, P99 latency for each application
- Use histogram quantiles from ingress-nginx metrics

### Panel 4: Success/Error Rates

- Success rate (2xx) by application
- Error rate (4xx + 5xx) by application
- Status code breakdown

### Panel 5: Top Endpoints

- Top 10 endpoints by volume
- Top 10 slowest endpoints

### Panel 6: Database Health

- Active connections by application
- Connection pool utilization

### Panel 7: Queue Health (Celery)

- Queue lengths by application
- Processing rates

### Panel 8: Resource Usage

- CPU usage by application
- Memory usage by application
- Pod restart counts

---
## 6. Alerting Queries

### High Error Rate

```promql
# Alert if error rate > 5% for any application
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) > 0.05
```
### High Latency

```promql
# Alert if P95 latency > 2 seconds
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (namespace, le)
) > 2
```
### Low Success Rate

```promql
# Alert if success rate < 95%
(
  sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) < 0.95
```
### High Request Volume (Spike Detection)

```promql
# Alert if RPS increases by 3x in 5 minutes
rate(nginx_ingress_controller_requests[5m])
>
3 * rate(nginx_ingress_controller_requests[5m] offset 5m)
```

---
## 7. Notes on Metric Naming

- **Ingress-nginx metrics** are the most reliable source for HTTP request metrics
- **spanmetrics** may have different label names depending on the k8s attribute processor configuration
- Check actual metric names in OpenObserve using: `{__name__=~".*request.*|.*http.*|.*latency.*"}`
- Service names from spanmetrics may need to be mapped to application names (see the sketch below)

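One option for that mapping is `label_replace`; the sketch below is illustrative only, and the `"mastodon.*"` regex and the target label value are assumptions to adjust to your actual service names:

```promql
# Hypothetical mapping: attach a namespace label to spanmetrics series whose
# service_name starts with "mastodon"; non-matching series pass through unchanged
label_replace(
  sum(rate(calls_total[5m])) by (service_name),
  "namespace", "mastodon-application", "service_name", "mastodon.*"
)
```
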
## 8. Troubleshooting

If metrics don't appear:

1. **Check ServiceMonitors are active:**

   ```bash
   kubectl get servicemonitors -A
   ```

2. **Verify Prometheus receiver is scraping:**
   Check OpenTelemetry collector logs for scraping errors

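   If the collector's Prometheus receiver exports the standard scrape-health series (an assumption that depends on the receiver configuration), unreachable targets can also be listed from OpenObserve:

   ```promql
   # Scrape targets the Prometheus receiver could not reach (assumes the `up` series is forwarded)
   up == 0
   ```
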
3. **Verify metric names:**
   Query OpenObserve for available metrics:

   ```promql
   {__name__=~".+"}
   ```

4. **Check label names:**
   The actual label names may vary (see the check below). Common variations:
   - `namespace` vs `k8s.namespace.name`
   - `service_name` vs `service.name`
   - `ingress` vs `ingress_name`

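   A quick way to see which variant your data carries is to aggregate a known metric by the candidate labels and check which come back populated (illustrative; substitute the metric you are debugging):

   ```promql
   # Counts series per label combination on the ingress request metric
   count by (namespace, ingress) (nginx_ingress_controller_requests)
   ```
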
---

## Quick Reference: Application Namespaces

- Mastodon: `mastodon-application`
- Pixelfed: `pixelfed-application`
- PieFed: `piefed-application`
- BookWyrm: `bookwyrm-application`
- Picsur: `picsur`
- Write Freely: `write-freely`