OpenObserve Dashboard PromQL Queries

This document provides PromQL queries for rebuilding OpenObserve dashboards after disaster recovery. The queries are organized by metric type and application.

Metric Sources

Your cluster has multiple metric sources:

  1. OpenTelemetry spanmetrics - Generates metrics from traces (calls_total, latency)
  2. Ingress-nginx - HTTP request metrics at the ingress layer
  3. Application metrics - Direct metrics from applications (Mastodon, BookWyrm, etc.)
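
Before rebuilding panels, it can help to confirm each source is actually emitting data. A minimal sanity check, using metric names that appear later in this document:

# 1. Ingress-nginx: should return a non-empty total RPS
sum(rate(nginx_ingress_controller_requests[5m]))

# 2. Spanmetrics: should return per-service call rates
sum(rate(calls_total[5m])) by (service_name)

# 3. Application/exporter metrics (example: the Celery queue exporter)
count(celery_queue_length)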

Applications

  • Mastodon (mastodon-application)
  • Pixelfed (pixelfed-application)
  • PieFed (piefed-application)
  • BookWyrm (bookwyrm-application)
  • Picsur (picsur)
  • Write Freely (write-freely)

1. Requests Per Second (RPS) by Application

# Total RPS by application (via ingress)
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# RPS by application and status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)

# RPS by application and HTTP method
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, method)

# RPS for specific applications
sum(rate(nginx_ingress_controller_requests{namespace=~"mastodon-application|pixelfed-application|piefed-application|bookwyrm-application"}[5m])) by (ingress, namespace)

Using OpenTelemetry spanmetrics

# RPS from spanmetrics (if service names are properly labeled)
sum(rate(calls_total[5m])) by (service_name)

# RPS by application namespace (if k8s attributes are present; dotted OTel attributes are typically exported with underscores)
sum(rate(calls_total[5m])) by (k8s_namespace_name, service_name)

# RPS by application and HTTP method
sum(rate(calls_total[5m])) by (service_name, http_method)

# RPS by application and status code
sum(rate(calls_total[5m])) by (service_name, http_status_code)

Combined View (All Applications)

# All applications RPS
sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)

2. Request Duration by Application

Using Ingress-Nginx Metrics

# Average request duration by application
sum(rate(nginx_ingress_controller_request_duration_seconds_sum[5m])) by (ingress, namespace) 
/ 
sum(rate(nginx_ingress_controller_request_duration_seconds_count[5m])) by (ingress, namespace)

# P50 (median) request duration
histogram_quantile(0.50, 
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P95 request duration
histogram_quantile(0.95, 
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P99 request duration
histogram_quantile(0.99, 
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# P99.9 request duration (for tail latency)
histogram_quantile(0.999, 
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

# Approximate max request duration (upper bound of the highest populated bucket;
# the duration histogram exposes no raw max series)
histogram_quantile(1.0, 
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

Using OpenTelemetry spanmetrics

# Average latency from spanmetrics
sum(rate(latency_sum[5m])) by (service_name) 
/ 
sum(rate(latency_count[5m])) by (service_name)

# P50 latency
histogram_quantile(0.50, 
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# P95 latency
histogram_quantile(0.95, 
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# P99 latency
histogram_quantile(0.99, 
  sum(rate(latency_bucket[5m])) by (service_name, le)
)

# Latency by HTTP method
histogram_quantile(0.95, 
  sum(rate(latency_bucket[5m])) by (service_name, http_method, le)
)

Response Duration (Backend Processing Time)

# Average backend response duration
sum(rate(nginx_ingress_controller_response_duration_seconds_sum[5m])) by (ingress, namespace) 
/ 
sum(rate(nginx_ingress_controller_response_duration_seconds_count[5m])) by (ingress, namespace)

# P95 backend response duration
histogram_quantile(0.95, 
  sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[5m])) by (ingress, namespace, le)
)

3. Success Rate by Application

Using Ingress-Nginx Metrics

# Success rate (2xx / total requests) by application
sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# Success rate as percentage
(
  sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (ingress, namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100

# Error rate (4xx + 5xx) by application
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
/
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)

# Error rate as percentage
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (ingress, namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace)
) * 100

# Breakdown by status code
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, namespace, status)

# 5xx errors specifically
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace)

Using OpenTelemetry spanmetrics

# Success rate from spanmetrics
sum(rate(calls_total{http.status_code=~"2.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)

# Error rate from spanmetrics
sum(rate(calls_total{http.status_code=~"4..|5.."}[5m])) by (service_name)
/
sum(rate(calls_total[5m])) by (service_name)

# Breakdown by status code
sum(rate(calls_total[5m])) by (service_name, http_status_code)

4. Additional Best Practice Metrics

# Requests per minute (for trend analysis)
sum(rate(nginx_ingress_controller_requests[1m])) by (namespace) * 60

# Total requests in last hour
sum(increase(nginx_ingress_controller_requests[1h])) by (namespace)

Top Endpoints

# Top endpoints by request volume
topk(10, sum(rate(nginx_ingress_controller_requests[5m])) by (ingress, path))

# Top slowest endpoints (P95)
topk(10, 
  histogram_quantile(0.95, 
    sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (ingress, path, le)
  )
)

Error Analysis

# 4xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"4.."}[5m])) by (ingress, namespace, status)

# 5xx errors by application
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress, namespace, status)

# Error rate trend (detect spikes)
rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])

Throughput Metrics

# Bytes sent per second (response size histogram)
sum(rate(nginx_ingress_controller_response_size_sum[5m])) by (ingress, namespace)

# Bytes received per second (request size histogram)
sum(rate(nginx_ingress_controller_request_size_sum[5m])) by (ingress, namespace)

# Total bandwidth usage
sum(rate(nginx_ingress_controller_response_size_sum[5m])) by (ingress, namespace) 
+ 
sum(rate(nginx_ingress_controller_request_size_sum[5m])) by (ingress, namespace)

Connection Metrics

# Active connections by state (nginx process metrics are controller-level and carry no ingress/namespace labels)
sum(nginx_ingress_controller_nginx_process_connections) by (state)

# Connection rate (accepted/handled connections per second)
sum(rate(nginx_ingress_controller_nginx_process_connections_total[5m])) by (state)

Application-Specific Metrics

Mastodon

# Mastodon-specific metrics (if exposed)
sum(rate(mastodon_http_requests_total[5m])) by (method, status)
sum(rate(mastodon_http_request_duration_seconds[5m])) by (method)

BookWyrm

# BookWyrm-specific metrics (if exposed)
sum(rate(bookwyrm_requests_total[5m])) by (method, status)

Database Connection Metrics (PostgreSQL)

# Active database connections by application
pg_application_connections{state="active"}

# Total connections by application
sum(pg_application_connections) by (app_name)

# Connection pool utilization
sum(pg_application_connections) by (app_name) / 100  # Adjust divisor based on max connections
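
As a worked example, the same utilization can be expressed as a percentage; the 200 below is a hypothetical max_connections value, so adjust it to your PostgreSQL configuration:

# Connection pool utilization as a percentage (200 is a placeholder for max_connections)
(sum(pg_application_connections) by (app_name) / 200) * 100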

Celery Queue Metrics

# Queue length by application
sum(celery_queue_length{queue_name!="_total"}) by (database)

# Queue processing rate
sum(rate(celery_queue_length{queue_name!="_total"}[5m])) by (database) * -60

# Stalled queues (no change in 15 minutes)
changes(celery_queue_length{queue_name="_total"}[15m]) == 0 
and celery_queue_length{queue_name="_total"} > 100

Redis-Backed Queue Dashboard Panels

Use these two panel queries to rebuild the Redis/Celery queue dashboard after a wipe. Both panels assume metrics are flowing from the celery-metrics-exporter in the celery-monitoring namespace.

  • Queue Depth per Queue (stacked area or line)

    sum by (database, queue_name) (
      celery_queue_length{
        queue_name!~"_total|_staging",
        database=~"piefed|bookwyrm|mastodon"
      }
    )
    

    This shows the absolute number of pending items in every discovered queue. Filter the database regex if you only want a single app. Switch the panel legend to {{database}}/{{queue_name}} so per-queue trends stand out.

  • Processing Rate per Queue (tasks/minute)

    -60 * sum by (database, queue_name) (
      rate(
        celery_queue_length{
          queue_name!~"_total|_staging",
          database=~"piefed|bookwyrm|mastodon"
        }[5m]
      )
    )
    

    The queue length decreases when workers drain tasks, so multiply the rate() by -60 to turn that negative slope into a positive “tasks per minute processed” number. Values that stay near zero for a busy queue are a red flag that workers are stuck.
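    For example, a queue that drains from 600 to 300 pending items over the 5-minute rate window has a rate() of about -1 item/second, so -60 * rate() reports roughly 60 tasks per minute processed.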

Fallback: If the custom exporter is down, you can build the same dashboards off the upstream Redis exporter metric redis_list_length{alias="redis-ha",key=~"celery|.*_priority|high|low"}. Replace celery_queue_length with redis_list_length in both queries (adjusting the grouping labels to the exporter's label set, e.g. alias and key) and keep the rest of the panel configuration identical.
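
A minimal sketch of the fallback queue-depth panel under that assumption; the alias and key values come from the selector above, and the grouping labels are switched to the Redis exporter's label set:

# Fallback: queue depth per Redis list via the upstream Redis exporter
sum by (alias, key) (
  redis_list_length{alias="redis-ha", key=~"celery|.*_priority|high|low"}
)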

An import-ready OpenObserve dashboard that contains these two panels lives at docs/dashboards/openobserve-redis-queue-dashboard.json. Import it via Dashboards → Import to jump-start the rebuild after a disaster recovery.

Redis Metrics

# Redis connection status
redis_connection_status

# Redis memory usage (if available)
redis_memory_used_bytes

Pod/Container Metrics

# CPU usage by application
sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace, pod)

# Memory usage by application
sum(container_memory_working_set_bytes) by (namespace, pod)

# Pod restarts
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod)

5. Dashboard Panel Recommendations

Panel 1: Overview

  • Total RPS (all applications)
  • Total Error Rate (all applications)
  • Average Response Time (P95, all applications)
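
These overview numbers can reuse the queries from sections 1-3 with the by() clauses dropped; a minimal sketch:

# Total RPS (all applications)
sum(rate(nginx_ingress_controller_requests[5m]))

# Total error rate (all applications)
sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m]))
/
sum(rate(nginx_ingress_controller_requests[5m]))

# Overall P95 response time (all applications)
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le)
)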

Panel 2: Per-Application RPS

  • Time series graph showing RPS for each application
  • Use sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)

Panel 3: Per-Application Latency

  • P50, P95, P99 latency for each application
  • Use histogram quantiles from ingress-nginx metrics

Panel 4: Success/Error Rates

  • Success rate (2xx) by application
  • Error rate (4xx + 5xx) by application
  • Status code breakdown

Panel 5: Top Endpoints

  • Top 10 endpoints by volume
  • Top 10 slowest endpoints

Panel 6: Database Health

  • Active connections by application
  • Connection pool utilization

Panel 7: Queue Health (Celery)

  • Queue lengths by application
  • Processing rates

Panel 8: Resource Usage

  • CPU usage by application
  • Memory usage by application
  • Pod restart counts

6. Alerting Queries

High Error Rate

# Alert if error rate > 5% for any application
(
  sum(rate(nginx_ingress_controller_requests{status=~"4..|5.."}[5m])) by (namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) > 0.05

High Latency

# Alert if P95 latency > 2 seconds
histogram_quantile(0.95, 
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (namespace, le)
) > 2

Low Success Rate

# Alert if success rate < 95%
(
  sum(rate(nginx_ingress_controller_requests{status=~"2.."}[5m])) by (namespace)
  /
  sum(rate(nginx_ingress_controller_requests[5m])) by (namespace)
) < 0.95

High Request Volume (Spike Detection)

# Alert if RPS increases by 3x in 5 minutes
rate(nginx_ingress_controller_requests[5m]) 
> 
3 * rate(nginx_ingress_controller_requests[5m] offset 5m)

7. Notes on Metric Naming

  • Ingress-nginx is the most reliable source for HTTP request metrics
  • spanmetrics label names depend on the k8s attributes processor configuration; dotted OTel attributes (e.g. http.method) are usually exported with underscores (http_method)
  • Check actual metric names in OpenObserve using: {__name__=~".*request.*|.*http.*|.*latency.*"}
  • Service names from spanmetrics may need to be mapped to application names (see the label_replace sketch below)
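
A minimal sketch of such a mapping with label_replace; the service name mastodon-web is hypothetical, so substitute the service names your spanmetrics actually report:

# Hypothetical mapping of a spanmetrics service_name onto a namespace-style label
label_replace(
  sum(rate(calls_total{service_name="mastodon-web"}[5m])) by (service_name),
  "namespace", "mastodon-application", "service_name", "mastodon-web"
)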

8. Troubleshooting

If metrics don't appear:

  1. Check ServiceMonitors are active:

    kubectl get servicemonitors -A
    
  2. Verify the Prometheus receiver is scraping: check the OpenTelemetry Collector logs for scrape errors (see the example command after this list)

  3. Verify metric names: Query OpenObserve for available metrics:

    {__name__=~".*"}
    
  4. Check label names: The actual label names may vary. Common variations:

    • namespace vs k8s.namespace.name
    • service_name vs service.name
    • ingress vs ingress_name
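
For step 2, a minimal example of checking the collector logs, assuming it runs as a deployment named opentelemetry-collector in an opentelemetry namespace (both names are placeholders; adjust them to your install):

    kubectl logs -n opentelemetry deployment/opentelemetry-collector --tail=200 | grep -iE "scrape|error"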

Quick Reference: Application Namespaces

  • Mastodon: mastodon-application
  • Pixelfed: pixelfed-application
  • PieFed: piefed-application
  • BookWyrm: bookwyrm-application
  • Picsur: picsur
  • Write Freely: write-freely