Keybard-Vagabond-Demo/manifests/applications/bookwyrm/BEAT-TO-CRONJOB-MIGRATION.md
Michael DiLeo 7327d77dcd redaction (#1)
Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>
2025-12-24 13:40:47 +00:00

BookWyrm Celery Beat to Kubernetes CronJob Migration

Overview

This document outlines the migration from BookWyrm's Celery beat container to Kubernetes CronJobs. The beat container currently runs continuously just to schedule periodic tasks; Kubernetes-native CronJobs can run the same tasks while consuming resources only during each execution.

Current Beat Container Analysis

What Celery Beat Does

The current deployment-beat.yaml runs a Celery beat scheduler that:

  • Uses django_celery_beat.schedulers:DatabaseScheduler to store schedules in the database
  • Manages periodic task execution by queuing tasks to Redis for workers to pick up
  • Runs continuously consuming resources (100m CPU, 256Mi memory)

Scheduled Tasks Identified

Through analysis of the BookWyrm source code, we identified two main periodic tasks:

  1. Automod Task (bookwyrm.models.antispam.automod_task)

    • Function: Scans users and statuses for moderation flags based on AutoMod rules
    • Purpose: Automatically flags suspicious content and users for moderator review
    • Trigger: Only runs when AutoMod rules exist in the database
    • Recommended Schedule: Every 6 hours (adjustable based on community size)
  2. Update Check Task (bookwyrm.models.site.check_for_updates_task)

    • Function: Checks the GitHub API for new BookWyrm releases
    • Purpose: Notifies administrators when updates are available
    • Trigger: Makes an HTTP request to the GitHub releases API
    • Recommended Schedule: Daily at 3:00 AM UTC

Migration Strategy

Phase 1: Parallel Operation

  1. Deploy CronJobs alongside existing beat container
  2. Monitor CronJob execution for several days
  3. Verify tasks execute correctly and at expected intervals
  4. Compare resource usage between approaches

Phase 2: Beat Container Removal

  1. Remove deployment-beat.yaml from kustomization
  2. Clean up any database-stored periodic tasks (if desired)
  3. Monitor for any missed functionality

CronJob Implementation

Key Design Decisions

  1. Direct Task Execution: Instead of going through Celery, CronJobs execute tasks directly using the Django management shell (manage.py shell)
  2. Resource Optimization: Each job uses minimal resources (50-100m CPU, 128-256Mi memory), and only while it is running
  3. Security: Same security context as other BookWyrm containers (non-root, dropped capabilities)
  4. Scheduling: Uses standard cron expressions for predictable timing
  5. Job Management: Configures history limits and TTL for automatic cleanup

CronJob Specifications

Automod CronJob

  • Schedule: 0 */6 * * * (every 6 hours)
  • Command: Direct Python execution of automod_task()
  • Resources: 50m CPU, 128Mi memory
  • Concurrency: Forbid (prevent overlapping executions)
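
The design decisions and the automod specification above could be combined into a manifest along the following lines. This is a minimal sketch, not the contents of the repository's cronjobs.yaml: the image reference, the secret name, the user ID, the history limits, the TTL, the backoff limit, and the explicit timeZone field are all assumptions made for illustration. It also assumes that manage.py sits in the container's working directory and that the documented 50m CPU / 128Mi figures are resource requests.

```yaml
# Sketch of the automod CronJob. Values marked "assumption" are not taken from this document.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: bookwyrm-automod
  namespace: bookwyrm-application
spec:
  schedule: "0 */6 * * *"            # every 6 hours
  timeZone: "Etc/UTC"                # assumption: pin UTC explicitly (field is stable since Kubernetes 1.27)
  concurrencyPolicy: Forbid          # skip a run while the previous one is still active
  startingDeadlineSeconds: 600       # a run that cannot start within 10 minutes counts as missed
  successfulJobsHistoryLimit: 3      # assumption
  failedJobsHistoryLimit: 3          # assumption
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400 # assumption: delete finished Jobs after one day
      backoffLimit: 1                # assumption: retry a failed pod once
      template:
        spec:
          restartPolicy: Never
          securityContext:
            runAsNonRoot: true
            runAsUser: 1000          # assumption: any non-root UID the image supports
          containers:
            - name: automod
              image: registry.example.com/bookwyrm:latest  # placeholder image
              command:
                - "python"
                - "manage.py"
                - "shell"
                - "-c"
                # Calling the Celery task object directly runs it in-process,
                # without queuing it for a worker.
                - "from bookwyrm.models.antispam import automod_task; automod_task()"
              envFrom:
                - secretRef:
                    name: bookwyrm-secrets               # placeholder secret name
              resources:
                requests:
                  cpu: 50m
                  memory: 128Mi
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
```

The update check CronJob would follow the same shape, with the schedule 0 3 * * * and the import changed to bookwyrm.models.site.check_for_updates_task.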

Update Check CronJob

  • Schedule: 0 3 * * * (daily at 3:00 AM UTC)
  • Command: Direct Python execution of check_for_updates_task()
  • Resources: 50m CPU, 128Mi memory
  • Concurrency: Forbid (prevent overlapping executions)

Database Cleanup CronJob (Bonus)

  • Schedule: 0 2 * * 0 (weekly on Sunday at 2:00 AM UTC)
  • Command: Django shell script to clean expired sessions and old notifications
  • Resources: 100m CPU, 256Mi memory
  • Purpose: Maintain database health (not part of original beat functionality)
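
For the session part of this job, Django ships a clearsessions management command that deletes expired sessions, so a shell script may not be needed for that half. The sketch below covers only that command and assumes BookWyrm stores sessions in the database. The CronJob name, image, and secret name are placeholders, and the notification cleanup is left out because this document does not state its retention criteria.

```yaml
# Sketch of the weekly cleanup CronJob (expired sessions only).
apiVersion: batch/v1
kind: CronJob
metadata:
  name: bookwyrm-db-cleanup          # hypothetical name
  namespace: bookwyrm-application
spec:
  schedule: "0 2 * * 0"              # weekly on Sunday at 2:00 AM
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: db-cleanup
              image: registry.example.com/bookwyrm:latest  # placeholder image
              # Built-in Django command that removes expired session rows.
              command: ["python", "manage.py", "clearsessions"]
              envFrom:
                - secretRef:
                    name: bookwyrm-secrets               # placeholder secret name
              resources:
                requests:
                  cpu: 100m
                  memory: 256Mi
```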

Benefits of Migration

Resource Efficiency

  • Before: Beat container runs 24/7 consuming ~100m CPU and 256Mi memory
  • After: CronJobs hold resources only while a job runs, typically under 1 minute per execution
  • Savings: ~99% reduction in CPU and memory time for periodic tasks, because the jobs run for only a few minutes per day instead of continuously
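
The savings figure can be checked with a rough calculation. It assumes each run takes the full minute (an upper bound on the "<1 minute" above), uses the 50m CPU figure for both jobs, and ignores the weekly cleanup job, whose duration is not stated.

```latex
\begin{align*}
\text{runs per day} &= \underbrace{24/6}_{\text{automod}} + \underbrace{1}_{\text{update check}} = 5 \\
\text{CronJob CPU time} &\le 5 \times 1\,\text{min} \times 50\,\text{m} = 250\ \text{m}\cdot\text{min per day} \\
\text{beat CPU time} &= 1440\,\text{min} \times 100\,\text{m} = 144\,000\ \text{m}\cdot\text{min per day} \\
\text{reduction} &\ge 1 - \frac{250}{144\,000} \approx 99.8\%
\end{align*}
```

Memory gives the same ratio, since 128Mi against 256Mi is also a factor of two.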

Operational Benefits

  • Kubernetes Native: Leverage built-in CronJob features (history, TTL, concurrency control)
  • Observability: Better visibility into job execution and failures
  • Resilience: Scheduling moves to the Kubernetes control plane, so a crashed or stuck beat pod can no longer stop all periodic tasks
  • Maintenance: Easier to modify schedules without redeploying beat container

Simplified Architecture

  • Removes dependency on Celery beat scheduler
  • Reduces Redis traffic (periodic tasks are no longer queued through the broker) and no longer relies on the database-stored beat schedule
  • Eliminates one running container (reduced complexity)

Migration Steps

1. Deploy CronJobs

# Apply the new CronJob manifests
kubectl apply -f manifests/applications/bookwyrm/cronjobs.yaml

2. Verify CronJob Creation

# Check CronJobs are created
kubectl get cronjobs -n bookwyrm-application

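# Check for any immediate execution (if testing)
kubectl get jobs -n bookwyrm-application

# To test without waiting for the schedule, create a one-off Job from the CronJob
# (the Job name "bookwyrm-automod-manual" is arbitrary)
kubectl create job bookwyrm-automod-manual --from=cronjob/bookwyrm-automod -n bookwyrm-application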

3. Monitor Execution (Run for 1-2 weeks)

# Watch job execution
kubectl get jobs -n bookwyrm-application -w

# Check job logs
kubectl logs job/bookwyrm-automod-<timestamp> -n bookwyrm-application
kubectl logs job/bookwyrm-update-check-<timestamp> -n bookwyrm-application

4. Optional: Disable Beat Container (Testing)

# Scale down beat deployment temporarily
kubectl scale deployment bookwyrm-beat --replicas=0 -n bookwyrm-application

# Monitor for any issues for several days

5. Permanent Migration

# Remove beat from kustomization.yaml
# Comment out or remove: - deployment-beat.yaml

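# Apply changes
# Note: kubectl apply -k does not delete resources that were dropped from the
# kustomization, so the beat Deployment keeps existing until it is deleted in
# step 6 (unless a GitOps controller with pruning manages this path)
kubectl apply -k manifests/applications/bookwyrm/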
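
After this step the kustomization might look roughly like the sketch below. The actual resource list is not shown in this document, so the entries other than deployment-beat.yaml and cronjobs.yaml are represented only by a comment, and listing cronjobs.yaml here (rather than applying it separately as in step 1) is an assumption.

```yaml
# Sketch of manifests/applications/bookwyrm/kustomization.yaml after the migration.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # ...the other BookWyrm manifests stay unchanged (not shown)...
  # - deployment-beat.yaml   # removed: periodic tasks now run as CronJobs
  - cronjobs.yaml            # assumption: CronJobs are managed by the same kustomization
```

Keeping cronjobs.yaml in the kustomization means a later kubectl apply -k keeps the CronJobs in sync with the rest of the application.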

6. Cleanup (Optional)

# Remove beat deployment entirely
kubectl delete deployment bookwyrm-beat -n bookwyrm-application

# Clean up database periodic tasks (if desired)
# This requires connecting to BookWyrm admin panel or database directly

Schedule Customization

Automod Schedule Adjustment

If your instance has high activity, you might want more frequent automod checks:

# For every 2 hours instead of 6:
schedule: "0 */2 * * *"

# For hourly:
schedule: "0 * * * *"

Update Check Frequency

Update checks can run more often (for example on development instances) or less often (for example on stable instances that are rarely upgraded):

# For twice daily:
schedule: "0 3,15 * * *"

# For weekly instead of daily:
schedule: "0 3 * * 0"

Troubleshooting

CronJob Not Executing

# Check CronJob status
kubectl describe cronjob bookwyrm-automod -n bookwyrm-application

# Check for suspended jobs
kubectl get cronjobs -n bookwyrm-application -o wide

Job Failures

# Check failed job logs
kubectl logs job/bookwyrm-automod-<timestamp> -n bookwyrm-application

# Common issues:
# - Database connection problems
# - Missing environment variables
# - Redis connectivity issues

Missed Executions

# Check for node resource constraints
kubectl top nodes

# Verify startingDeadlineSeconds is appropriate
# Current setting: 600 seconds (10 minutes)

Rollback Plan

If issues arise, rollback is straightforward:

  1. Scale up the beat Deployment (if it was deleted in step 6, re-apply deployment-beat.yaml instead):

    kubectl scale deployment bookwyrm-beat --replicas=1 -n bookwyrm-application
    
  2. Remove CronJobs:

    kubectl delete cronjobs bookwyrm-automod bookwyrm-update-check -n bookwyrm-application
    
  3. Restore the original kustomization.yaml and re-apply it with kubectl apply -k

Monitoring and Alerting

Consider setting up monitoring for:

  • CronJob execution failures
  • Job duration anomalies
  • Missing job executions
  • Resource usage patterns

Example Prometheus alert:

- alert: BookWyrmCronJobFailed
  expr: kube_job_status_failed{namespace="bookwyrm-application"} > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "BookWyrm CronJob failed"
    description: "CronJob {{ $labels.job_name }} failed in namespace {{ $labels.namespace }}"
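
The alert above covers failures but not the "missing job executions" item from the list. One possible approach is to alert when a CronJob has not succeeded for longer than its interval plus some slack. The sketch below assumes a kube-state-metrics version that exposes kube_cronjob_status_last_successful_time; the group name, alert names, and thresholds (7 hours for the 6-hour automod schedule, 25 hours for the daily update check) are illustrative choices. Note that the metric has no value until a CronJob has succeeded at least once, so these rules do not fire for a job that has never run.

```yaml
# Sketch of Prometheus rules for missed CronJob runs. Names and thresholds are assumptions.
groups:
  - name: bookwyrm-cronjobs
    rules:
      - alert: BookWyrmAutomodMissedRuns
        # Seconds since the last successful automod run exceed 7 hours.
        expr: time() - kube_cronjob_status_last_successful_time{namespace="bookwyrm-application", cronjob="bookwyrm-automod"} > 7 * 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "BookWyrm automod has not succeeded recently"
          description: "CronJob {{ $labels.cronjob }} in namespace {{ $labels.namespace }} has had no successful run for over 7 hours"
      - alert: BookWyrmUpdateCheckMissedRuns
        # Seconds since the last successful update check exceed 25 hours.
        expr: time() - kube_cronjob_status_last_successful_time{namespace="bookwyrm-application", cronjob="bookwyrm-update-check"} > 25 * 3600
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "BookWyrm update check has not succeeded recently"
          description: "CronJob {{ $labels.cronjob }} in namespace {{ $labels.namespace }} has had no successful run for over 25 hours"
```

If the cluster uses the Prometheus Operator, the same rules would go under spec.groups of a PrometheusRule resource instead of a standalone rules file.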

Conclusion

This migration replaces the continuously running Celery beat container with efficient Kubernetes CronJobs, providing the same functionality with significantly reduced resource consumption and improved operational characteristics. The migration can be done gradually with minimal risk.