# OpenObserve Dashboard PromQL Queries

This document provides PromQL queries for rebuilding OpenObserve dashboards after disaster recovery. The queries are organized by metric type and application.

## Metric Sources

Your cluster has multiple metric sources; a quick presence check for each is sketched after this list:

1. **OpenTelemetry spanmetrics** - Generates metrics from traces (`calls_total`, `latency`)
2. **Ingress-nginx** - HTTP request metrics at the ingress layer
3. **Application metrics** - Direct metrics from applications (Mastodon, BookWyrm, etc.)

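Before rebuilding panels, it helps to confirm which of these sources are actually reporting. A minimal sketch, assuming the metric names used elsewhere in this document (`calls_total` for spanmetrics, `nginx_ingress_controller_requests` for ingress-nginx, and `mastodon_http_requests_total` as one example of a direct application metric):

```promql
# Spanmetrics present? (a non-empty result means traces are being converted to metrics)
count(calls_total)

# Ingress-nginx request metrics present?
count(nginx_ingress_controller_requests)

# Example application metric present? (swap in the metric for the app you care about)
count(mastodon_http_requests_total)
```

If any of these return no data, work through the troubleshooting steps in section 8 before building the corresponding panels.
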
## Applications

- **Mastodon** (`mastodon-application`)
- **Pixelfed** (`pixelfed-application`)
- **PieFed** (`piefed-application`)
- **BookWyrm** (`bookwyrm-application`)
- **Picsur** (`picsur`)
- **Write Freely** (`write-freely`)

---

## 1. Requests Per Second (RPS) by Application

### Using Ingress-Nginx Metrics (Recommended - Most Reliable)

```promql
# Total RPS by application (via ingress)
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# RPS by application and status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)

# RPS by application and HTTP method
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, method)

# RPS for specific applications
sum(rate(nginx_ingress_controller_requests{namespace=~"mastodon-application|pixelfed-application|piefed-application|bookwyrm-application"}[5m])) by (ingress, namespace)
```

### Using OpenTelemetry spanmetrics

```promql
# RPS from spanmetrics (if service names are properly labeled)
sum(rate(calls_total[5m])) by (service_name)

# RPS by application namespace (if k8s attributes are present)
sum(rate(calls_total[5m])) by (k8s.namespace.name, service_name)

# RPS by application and HTTP method
sum(rate(calls_total[5m])) by (service_name, http.method)

# RPS by application and status code
sum(rate(calls_total[5m])) by (service_name, http.status_code)
```

### Combined View (All Applications)

```promql
# All applications RPS
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
```

---

## 2. Request Duration by Application

### Using Ingress-Nginx Metrics

```promql
# Average request duration by application
sum(rate(nginx_ingress_controller_request_duration_seconds_sum[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress, namespace)

# P50 (median) request duration
histogram_quantile(0.50,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P95 request duration
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P99 request duration
histogram_quantile(0.99,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P99.9 request duration (for tail latency)
histogram_quantile(0.999,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# Approximate max request duration (the metric is a histogram, so this is
# clamped to the bucket boundary above the slowest observed request)
histogram_quantile(1,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```

### Using OpenTelemetry spanmetrics

```promql
# Average latency from spanmetrics
sum(rate(latency_sum[5m])) by (service_name)
/
sum(rate(latency_count[5m])) by (service_name)

# P50 latency
histogram_quantile(0.50,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# P95 latency
histogram_quantile(0.95,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# P99 latency
histogram_quantile(0.99,
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# Latency by HTTP method
histogram_quantile(0.95,
  sum(rate(latency_bucket[5m])) by (service_name, http.method, le)
)
```

### Response Duration (Backend Processing Time)

```promql
# Average backend response duration
sum(rate(nginx_ingress_controller_response_duration_seconds_sum[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_response_duration_seconds_count[5m])) by (ingress, namespace)

# P95 backend response duration
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)
```

---

## 3. Success Rate by Application

### Using Ingress-Nginx Metrics

```promql
# Success rate (2xx / total requests) by application
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# Success rate as percentage
(
  sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100

# Error rate (4xx + 5xx) by application
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# Error rate as percentage
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100

# Breakdown by status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)

# 5xx errors specifically
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace)
```

### Using OpenTelemetry spanmetrics

```promql
# Success rate from spanmetrics
sum(rate(calls_total{http.status_code=~"2.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)

# Error rate from spanmetrics
sum(rate(calls_total{http.status_code=~"4..|5.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)

# Breakdown by status code
sum(rate(calls_total[5m])) by (service_name, http.status_code)
```

---

## 4. Additional Best Practice Metrics

### Request Volume Trends

```promql
# Requests per minute (for trend analysis)
sum(rate(nginx_ingress_controller_requests[1m])) by (namespace) * 60

# Total requests in last hour
sum(increase(nginx_ingress_controller_requests[1h])) by (namespace)
```

### Top Endpoints

```promql
# Top endpoints by request volume
topk(10, sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, path))

# Top slowest endpoints (P95)
topk(10,
  histogram_quantile(0.95,
    sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, path, le)
  )
)
```

### Error Analysis

```promql
# 4xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"4.."}[5m])) by (ingress, namespace, status)

# 5xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace, status)

# Error rate trend (detect spikes)
rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])
```

### Throughput Metrics

```promql
# Bytes sent per second
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)

# Bytes received per second
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)

# Total bandwidth usage
sum(rate(nginx_ingress_controller_bytes_sent[5m])) by (ingress, namespace)
+
sum(rate(nginx_ingress_controller_bytes_received[5m])) by (ingress, namespace)
```

### Connection Metrics

```promql
# Active connections
sum(nginx_ingress_controller_connections) by (ingress, namespace, state)

# Connection rate
sum(rate(nginx_ingress_controller_connections[5m])) by (ingress, namespace, state)
```

### Application-Specific Metrics

#### Mastodon

```promql
# Mastodon-specific metrics (if exposed)
sum(rate(mastodon_http_requests_total[5m])) by (method, status)
sum(rate(mastodon_http_request_duration_seconds[5m])) by (method)
```

#### BookWyrm

```promql
# BookWyrm-specific metrics (if exposed)
sum(rate(bookwyrm_requests_total[5m])) by (method, status)
```

### Database Connection Metrics (PostgreSQL)

```promql
# Active database connections by application
pg_application_connections{state="active"}

# Total connections by application
sum(pg_application_connections) by (app_name)

# Connection pool utilization (a variant that reads the server limit is sketched below)
sum(pg_application_connections) by (app_name) / 100  # Adjust divisor based on max connections
```

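If the PostgreSQL exporter also exposes the standard `pg_settings_max_connections` gauge (this depends on your exporter configuration, so treat it as an assumption), the divisor can come from the server instead of being hard-coded. A minimal sketch:

```promql
# Connection pool utilization as a fraction of the server's max_connections
# (assumes pg_settings_max_connections is scraped alongside pg_application_connections)
sum(pg_application_connections) by (app_name)
/
scalar(max(pg_settings_max_connections))
```
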
### Celery Queue Metrics

```promql
# Queue length by application
sum(celery_queue_length{queue_name!="_total"}) by (database)

# Queue processing rate
sum(rate(celery_queue_length{queue_name!="_total"}[5m])) by (database) * -60

# Stalled queues (no change in 15 minutes)
changes(celery_queue_length{queue_name="_total"}[15m]) == 0
and celery_queue_length{queue_name="_total"} > 100
```

#### Redis-Backed Queue Dashboard Panels

Use these two panel queries to rebuild the Redis/Celery queue dashboard after a wipe. Both panels assume metrics are flowing from the `celery-metrics-exporter` in the `celery-monitoring` namespace.

- **Queue Depth per Queue (stacked area or line)**

  ```promql
  sum by (database, queue_name) (
    celery_queue_length{
      queue_name!~"_total|_staging",
      database=~"piefed|bookwyrm|mastodon"
    }
  )
  ```

  This shows the absolute number of pending items in every discovered queue. Filter the `database` regex if you only want a single app. Switch the panel legend to `{{database}}/{{queue_name}}` so per-queue trends stand out.

- **Processing Rate per Queue (tasks/minute)**

  ```promql
  -60 * sum by (database, queue_name) (
    rate(
      celery_queue_length{
        queue_name!~"_total|_staging",
        database=~"piefed|bookwyrm|mastodon"
      }[5m]
    )
  )
  ```

  The queue length decreases when workers drain tasks, so multiply the `rate()` by `-60` to turn that negative slope into a positive “tasks per minute processed” number. Values that stay near zero for a busy queue are a red flag that workers are stuck. A gauge-native alternative using `deriv()` is sketched after this list.

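Because `rate()` is defined for counters, some PromQL backends treat every drop in this gauge as a counter reset and never report a negative slope. If the panel above reads as all zeros, `deriv()` (which fits a slope to a gauge and can go negative) is worth trying instead; this is a sketch under that assumption, not a replacement for the query the dashboard already uses:

```promql
# Tasks per minute drained from each queue, using the gauge's fitted slope
-60 * sum by (database, queue_name) (
  deriv(
    celery_queue_length{
      queue_name!~"_total|_staging",
      database=~"piefed|bookwyrm|mastodon"
    }[5m]
  )
)
```
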
> **Fallback**: If the custom exporter is down, you can build the same dashboards off the upstream Redis exporter metric `redis_list_length{alias="redis-ha",key=~"celery|.*_priority|high|low"}`. Replace `celery_queue_length` with `redis_list_length` in both queries and keep the rest of the panel configuration identical, as in the sketch below.

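For example, the queue-depth panel rebuilt on the Redis exporter metric would look roughly like this (the `alias` and `key` values come from the fallback note above; the `(alias, key)` grouping is an assumption, since `redis_list_length` does not carry the exporter's `database`/`queue_name` labels):

```promql
# Queue depth per Redis list, using the upstream Redis exporter
sum by (alias, key) (
  redis_list_length{
    alias="redis-ha",
    key=~"celery|.*_priority|high|low"
  }
)
```
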
An import-ready OpenObserve dashboard that contains these two panels lives at `docs/dashboards/openobserve-redis-queue-dashboard.json`. Import it via *Dashboards → Import* to jump-start the rebuild after a disaster recovery.

### Redis Metrics

```promql
# Redis connection status
redis_connection_status

# Redis memory usage (if available)
redis_memory_used_bytes
```

### Pod/Container Metrics

```promql
# CPU usage by application
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)

# Memory usage by application
sum(container_memory_working_set_bytes) by (namespace, pod)

# Pod restarts
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod)
```

---

## 5. Dashboard Panel Recommendations

### Panel 1: Overview

- **Total RPS** (all applications)
- **Total Error Rate** (all applications)
- **Average Response Time** (P95, all applications)

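A starting point for these three overview stats, built from the ingress-nginx queries used throughout this document but aggregated across all namespaces rather than broken out per application:

```promql
# Total RPS across all applications
sum(rate(nginx_ingress_controller_requests[5m]))

# Total error rate (4xx + 5xx as a fraction of all requests)
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m]))
/
sum(rate(nginx_ingress_controller_requests[5m]))

# P95 response time across all applications
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le)
)
```
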
### Panel 2: Per-Application RPS

- Time series graph showing RPS for each application
- Use `sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)`

### Panel 3: Per-Application Latency

- P50, P95, P99 latency for each application
- Use histogram quantiles from ingress-nginx metrics

### Panel 4: Success/Error Rates

- Success rate (2xx) by application
- Error rate (4xx + 5xx) by application
- Status code breakdown

### Panel 5: Top Endpoints

- Top 10 endpoints by volume
- Top 10 slowest endpoints

### Panel 6: Database Health

- Active connections by application
- Connection pool utilization

### Panel 7: Queue Health (Celery)

- Queue lengths by application
- Processing rates

### Panel 8: Resource Usage

- CPU usage by application
- Memory usage by application
- Pod restart counts

---

## 6. Alerting Queries

### High Error Rate

```promql
# Alert if error rate > 5% for any application
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) > 0.05
```

### High Latency

```promql
# Alert if P95 latency > 2 seconds
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (namespace, le)
) > 2
```

### Low Success Rate

```promql
# Alert if success rate < 95%
(
  sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) < 0.95
```

### High Request Volume (Spike Detection)

```promql
# Alert if RPS increases by 3x in 5 minutes
rate(nginx_ingress_controller_requests[5m])
>
3 * rate(nginx_ingress_controller_requests[5m] offset 5m)
```

---

## 7. Notes on Metric Naming

- **Ingress-nginx metrics** are the most reliable for HTTP request metrics
- **spanmetrics** may have different label names depending on the k8s attribute processor configuration
- Check actual metric names in OpenObserve using: `{__name__=~".*request.*|.*http.*|.*latency.*"}`
- Service names from spanmetrics may need to be mapped to application names (see the sketch below)

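One way to do that mapping directly in a panel query is `label_replace()`. This is only a sketch; the `service_name` regex and the target label value are assumptions and should be adjusted to the service names your spanmetrics actually report:

```promql
# Attach a namespace-style label to spanmetrics RPS, derived from the service name
label_replace(
  sum(rate(calls_total[5m])) by (service_name),
  "namespace", "mastodon-application", "service_name", "mastodon.*"
)
```

Chain one `label_replace()` per application to cover the remaining services.
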
## 8. Troubleshooting

If metrics don't appear:

1. **Check ServiceMonitors are active:**
   ```bash
   kubectl get servicemonitors -A
   ```

2. **Verify the Prometheus receiver is scraping:**
   Check the OpenTelemetry collector logs for scrape errors (a query-side check is sketched after this list)

3. **Verify metric names:**
   Query OpenObserve for available metrics:
   ```promql
   {__name__=~".*"}
   ```

4. **Check label names:**
   The actual label names may vary. Common variations (the sketch after this list shows how to list a metric's labels):
   - `namespace` vs `k8s.namespace.name`
   - `service_name` vs `service.name`
   - `ingress` vs `ingress_name`

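Two query-side checks that pair with steps 2 and 4. These assume the collector's Prometheus receiver forwards the standard `up` series for each scrape target; if it does not, fall back to the collector logs:

```promql
# Scrape targets that are currently failing (pairs with step 2)
up == 0

# Inspect a handful of raw series to see which label names a metric
# actually carries (pairs with step 4)
topk(5, nginx_ingress_controller_requests)
```
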
---

## Quick Reference: Application Namespaces

- Mastodon: `mastodon-application`
- Pixelfed: `pixelfed-application`
- PieFed: `piefed-application`
- BookWyrm: `bookwyrm-application`
- Picsur: `picsur`
- Write Freely: `write-freely`