Michael DiLeo 7327d77dcd redaction (#1)

Add the redacted source file for demo purposes

Reviewed-on: https://source.michaeldileo.org/michael_dileo/Keybard-Vagabond-Demo/pulls/1
Co-authored-by: Michael DiLeo <michael_dileo@proton.me>
Co-committed-by: Michael DiLeo <michael_dileo@proton.me>

2025-12-24 13:40:47 +00:00

BookWyrm Celery Beat to Kubernetes CronJob Migration

Overview

This document outlines the migration from BookWyrm's Celery beat container to Kubernetes CronJobs. The beat container runs continuously just to schedule a handful of periodic tasks; Kubernetes-native CronJobs can run the same tasks while consuming resources only when a task actually executes.

Current Beat Container Analysis

What Celery Beat Does

The current deployment-beat.yaml runs a Celery beat scheduler that:

Uses django_celery_beat.schedulers:DatabaseScheduler to store schedules in the database
Manages periodic task execution by queuing tasks to Redis for workers to pick up
Runs continuously, consuming resources (100m CPU, 256Mi memory) even when no task is due

Scheduled Tasks Identified

Through analysis of the BookWyrm source code, we identified two main periodic tasks:

1. Automod Task (bookwyrm.models.antispam.automod_task)
   - Function: Scans users and statuses for moderation flags based on AutoMod rules
   - Purpose: Automatically flags suspicious content and users for moderator review
   - Trigger: Only runs when AutoMod rules exist in the database
   - Recommended Schedule: Every 6 hours (adjustable based on community size)
2. Update Check Task (bookwyrm.models.site.check_for_updates_task)
   - Function: Checks the GitHub API for new BookWyrm releases
   - Purpose: Notifies administrators when updates are available
   - Trigger: Makes an HTTP request to the GitHub releases API
   - Recommended Schedule: Daily at 3:00 AM UTC

Migration Strategy

Phase 1: Parallel Operation (Recommended)

Deploy CronJobs alongside existing beat container
Monitor CronJob execution for one to two weeks
Verify tasks execute correctly and at expected intervals
Compare resource usage between approaches

Phase 2: Beat Container Removal

Remove deployment-beat.yaml from kustomization
Clean up any database-stored periodic tasks (if desired)
Monitor for any missed functionality

CronJob Implementation

Key Design Decisions

Direct Task Execution: Instead of queuing tasks through Celery, each CronJob calls the task function directly from the Django management shell, so the task runs synchronously inside the job's own pod
Resource Optimization: Each job is allotted modest resources (50-100m CPU, 128-256Mi memory) and holds them only while it runs
Security: Same security context as other BookWyrm containers (non-root, dropped capabilities)
Scheduling: Uses standard cron expressions for predictable timing
Job Management: Configures history limits and TTL for automatic cleanup

CronJob Specifications

Automod CronJob

Schedule: 0 */6 * * * (every 6 hours)
Command: Direct Python execution of automod_task()
Resources: 50m CPU, 128Mi memory
Concurrency: Forbid (prevent overlapping executions)

Update Check CronJob

Schedule: 0 3 * * * (daily at 3:00 AM UTC)
Command: Direct Python execution of check_for_updates_task()
Resources: 50m CPU, 128Mi memory
Concurrency: Forbid (prevent overlapping executions)

Database Cleanup CronJob (Bonus)

Schedule: 0 2 * * 0 (weekly on Sunday at 2:00 AM UTC)
Command: Django shell script to clean expired sessions and old notifications
Resources: 100m CPU, 256Mi memory
Purpose: Maintain database health (not part of original beat functionality)
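
To make the specification above more concrete, here is a minimal sketch that assembles the Automod CronJob as a Python dict and prints it as JSON. The namespace, schedule, concurrency policy, resource figures, and 600-second starting deadline come from this document. The image reference, history limits, TTL value, restart policy, the choice to express the resources as requests, and the exact `manage.py shell -c` invocation are assumptions for illustration, not values taken from the actual cronjobs.yaml. Environment variables and secrets are omitted.

```python
import json


def build_cronjob(name, schedule, python_snippet, cpu="50m", memory="128Mi"):
    """Return a batch/v1 CronJob manifest as a plain dict (illustrative helper)."""
    container = {
        "name": name,
        # Hypothetical image reference; substitute the image your deployment uses.
        "image": "example.invalid/bookwyrm:latest",
        # Assumed invocation: run the task function synchronously via the Django shell.
        "command": ["python", "manage.py", "shell", "-c", python_snippet],
        # Assumption: the documented figures are applied as resource requests.
        "resources": {"requests": {"cpu": cpu, "memory": memory}},
        # Non-root with dropped capabilities, as described under Key Design Decisions.
        "securityContext": {
            "runAsNonRoot": True,
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
        },
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name, "namespace": "bookwyrm-application"},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",  # prevent overlapping executions
            "startingDeadlineSeconds": 600,  # value quoted under Troubleshooting
            "successfulJobsHistoryLimit": 3,  # assumed value
            "failedJobsHistoryLimit": 3,  # assumed value
            "jobTemplate": {
                "spec": {
                    "ttlSecondsAfterFinished": 86400,  # assumed value (one day)
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",  # assumed value
                            "containers": [container],
                        }
                    },
                }
            },
        },
    }


automod = build_cronjob(
    "bookwyrm-automod",
    "0 */6 * * *",
    "from bookwyrm.models.antispam import automod_task; automod_task()",
)
print(json.dumps(automod, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped to `kubectl apply -f -` for a quick experiment. The Update Check job would differ only in its name, schedule, and the task it imports.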

Benefits of Migration

Resource Efficiency

Before: Beat container runs 24/7, consuming ~100m CPU and 256Mi memory
After: CronJobs consume resources only while a job runs, typically under 1 minute per execution
Savings: ~99% reduction in resource usage for periodic tasks, since the jobs are active for only a few minutes per day in total
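
As a rough sanity check on the ~99% figure, the sketch below compares the resource-time held per week by the beat container against the three CronJobs described in this document. It assumes every run takes a full 60 seconds (an upper bound taken from the "<1 minute" estimate) and ignores pod startup and image-pull overhead; the variable names are illustrative only.

```python
WEEK_SECONDS = 7 * 24 * 60 * 60
RUN_SECONDS = 60  # assumed upper bound per run, based on the "<1 minute" estimate

# (runs per week, CPU in millicores, memory in MiB) for each CronJob
jobs = {
    "automod": (4 * 7, 50, 128),      # every 6 hours
    "update-check": (7, 50, 128),     # daily
    "db-cleanup": (1, 100, 256),      # weekly
}

# Beat container holds 100m CPU and 256Mi memory around the clock.
beat_cpu = 100 * WEEK_SECONDS
beat_mem = 256 * WEEK_SECONDS

# CronJobs hold their resources only for the duration of each run.
cron_cpu = sum(runs * RUN_SECONDS * cpu for runs, cpu, _ in jobs.values())
cron_mem = sum(runs * RUN_SECONDS * mem for runs, _, mem in jobs.values())

cpu_reduction = 1 - cron_cpu / beat_cpu
mem_reduction = 1 - cron_mem / beat_mem

print(f"CPU reduction:    {cpu_reduction:.2%}")
print(f"Memory reduction: {mem_reduction:.2%}")
```

Under these assumptions both reductions come out above 99%, so the estimate holds even if individual runs take several times longer than a minute.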

Operational Benefits

Kubernetes Native: Leverage built-in CronJob features (history, TTL, concurrency control)
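Observability: Each run is a separate Job with its own status and logs, which makes failures easier to spot
Resilience: Scheduling is handled by the Kubernetes CronJob controller, so it no longer depends on a single beat pod staying healthy
Maintenance: Schedules can be changed by editing the CronJob manifest, without redeploying the beat container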

Simplified Architecture

Removes dependency on Celery beat scheduler
Reduces Redis usage (periodic tasks are no longer queued through the broker)
Eliminates one running container (reduced complexity)

Migration Steps

1. Deploy CronJobs

# Apply the new CronJob manifests
kubectl apply -f manifests/applications/bookwyrm/cronjobs.yaml

2. Verify CronJob Creation

# Check CronJobs are created
kubectl get cronjobs -n bookwyrm-application

# Check for any immediate execution (if testing)
kubectl get jobs -n bookwyrm-application

3. Monitor Execution (Run for 1-2 weeks)

# Watch job execution
kubectl get jobs -n bookwyrm-application -w

# Check job logs
kubectl logs job/bookwyrm-automod-<timestamp> -n bookwyrm-application
kubectl logs job/bookwyrm-update-check-<timestamp> -n bookwyrm-application

4. Optional: Disable Beat Container (Testing)

# Scale down beat deployment temporarily
kubectl scale deployment bookwyrm-beat --replicas=0 -n bookwyrm-application

# Monitor for any issues for several days

5. Permanent Migration

# Remove beat from kustomization.yaml
# Comment out or remove: - deployment-beat.yaml

# Apply changes
kubectl apply -k manifests/applications/bookwyrm/

6. Cleanup (Optional)

# Remove beat deployment entirely
kubectl delete deployment bookwyrm-beat -n bookwyrm-application

# Clean up database periodic tasks (if desired)
# This requires connecting to the BookWyrm admin panel or to the database directly

Schedule Customization

Automod Schedule Adjustment

If your instance has high activity, you might want more frequent automod checks:

# For every 2 hours instead of 6:
schedule: "0 */2 * * *"

# For hourly:
schedule: "0 * * * *"

Update Check Frequency

For development instances, you might want more frequent update checks:

# For twice daily:
schedule: "0 3,15 * * *"

# For weekly instead of daily:
schedule: "0 3 * * 0"
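
Before changing a schedule, it can help to preview when an expression will actually fire. The following is a small, self-contained sketch (not part of BookWyrm or Kubernetes) that expands the cron expressions used in this document and lists upcoming run times, assuming UTC as in the schedules above. It handles only `*`, `*/n`, single numbers, and comma lists; `parse_field` and `next_runs` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone


def parse_field(field, lo, hi):
    """Expand one cron field into the set of integers it matches.

    Supports "*", "*/n", single numbers, and comma lists only.
    Ranges (1-5) and names (SUN) are deliberately not handled.
    """
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(lo, hi + 1))
        elif part.startswith("*/"):
            values.update(range(lo, hi + 1, int(part[2:])))
        else:
            values.add(int(part))
    return values


def next_runs(expr, start, count=3):
    """Return up to `count` run times after `start`, searching one year ahead."""
    minute, hour, dom, month, dow = expr.split()
    minutes = parse_field(minute, 0, 59)
    hours = parse_field(hour, 0, 23)
    doms = parse_field(dom, 1, 31)
    months = parse_field(month, 1, 12)
    dows = parse_field(dow, 0, 6)  # cron convention: 0 = Sunday

    runs = []
    t = start.replace(second=0, microsecond=0)
    for _ in range(366 * 24 * 60):
        t += timedelta(minutes=1)
        cron_dow = (t.weekday() + 1) % 7  # Python uses 0 = Monday
        if dom != "*" and dow != "*":
            # Standard cron rule: when both day fields are restricted, either may match.
            day_ok = t.day in doms or cron_dow in dows
        else:
            day_ok = t.day in doms and cron_dow in dows
        if t.minute in minutes and t.hour in hours and t.month in months and day_ok:
            runs.append(t)
            if len(runs) == count:
                break
    return runs


start = datetime(2025, 1, 1, tzinfo=timezone.utc)
for expr in ("0 */6 * * *", "0 */2 * * *", "0 3,15 * * *", "0 3 * * 0"):
    upcoming = ", ".join(t.strftime("%a %d %H:%M") for t in next_runs(expr, start))
    print(f"{expr:<14} -> {upcoming}")
```

A minute-by-minute scan is slow compared with a real cron library, but it keeps the matching rules easy to read and is fast enough for previewing a few runs.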

Troubleshooting

CronJob Not Executing

# Check CronJob status
kubectl describe cronjob bookwyrm-automod -n bookwyrm-application

# Check for suspended jobs
kubectl get cronjobs -n bookwyrm-application -o wide

Job Failures

# Check failed job logs
kubectl logs job/bookwyrm-automod-<timestamp> -n bookwyrm-application

# Common issues:
# - Database connection problems
# - Missing environment variables
# - Redis connectivity issues

Missed Executions

# Check for node resource constraints (requires metrics-server)
kubectl top nodes

# Verify startingDeadlineSeconds is appropriate
# Current setting: 600 seconds (10 minutes)

Rollback Plan

If issues arise, rollback is straightforward:

1. Scale up beat container:

kubectl scale deployment bookwyrm-beat --replicas=1 -n bookwyrm-application

2. Remove CronJobs:

kubectl delete cronjobs bookwyrm-automod bookwyrm-update-check -n bookwyrm-application

3. Restore original kustomization.yaml

Monitoring and Alerting

Consider setting up monitoring for:

CronJob execution failures
Job duration anomalies
Missing job executions
Resource usage patterns

Example Prometheus alert:

- alert: BookWyrmCronJobFailed
  expr: kube_job_status_failed{namespace="bookwyrm-application"} > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "BookWyrm CronJob failed"
    description: "Job {{ $labels.job_name }} failed in namespace {{ $labels.namespace }}"

Conclusion

This migration replaces the continuously running Celery beat container with Kubernetes CronJobs that run the same periodic tasks with significantly lower resource consumption, plus built-in job history and concurrency control. Because the CronJobs can run alongside the beat container first, and the beat deployment can be scaled back up if needed, the migration can be done gradually with minimal risk.
