# BookWyrm Celery Beat to Kubernetes CronJob Migration

## Overview

This document outlines the migration from BookWyrm's Celery beat container to Kubernetes CronJobs. The beat container currently runs continuously just to schedule a handful of periodic tasks; Kubernetes-native CronJobs can do the same work while consuming resources only when a job runs.

## Current Beat Container Analysis

### What Celery Beat Does
The current `deployment-beat.yaml` runs a Celery beat scheduler that:
- Uses `django_celery_beat.schedulers:DatabaseScheduler` to store schedules in the database
- Manages periodic task execution by queuing tasks to Redis for workers to pick up
- Runs continuously, consuming resources (100m CPU, 256Mi memory)

### Scheduled Tasks Identified

Through analysis of the BookWyrm source code, we identified two main periodic tasks:

1. **Automod Task** (`bookwyrm.models.antispam.automod_task`)
   - **Function**: Scans users and statuses for moderation flags based on AutoMod rules
   - **Purpose**: Automatically flags suspicious content and users for moderator review
   - **Trigger**: Only runs when AutoMod rules exist in the database
   - **Recommended Schedule**: Every 6 hours (adjustable based on community size)

2. **Update Check Task** (`bookwyrm.models.site.check_for_updates_task`)
   - **Function**: Checks the GitHub API for new BookWyrm releases
   - **Purpose**: Notifies administrators when updates are available
   - **Trigger**: Makes an HTTP request to the GitHub releases API
   - **Recommended Schedule**: Daily at 3:00 AM UTC

## Migration Strategy

### Phase 1: Parallel Operation (Recommended)
1. Deploy CronJobs alongside the existing beat container
2. Monitor CronJob execution for several days
3. Verify tasks execute correctly and at expected intervals
4. Compare resource usage between approaches

### Phase 2: Beat Container Removal
1. Remove `deployment-beat.yaml` from kustomization
2. Clean up any database-stored periodic tasks (if desired)
3. Monitor for any missed functionality

## CronJob Implementation

### Key Design Decisions

1. **Direct Task Execution**: Instead of going through Celery, CronJobs execute tasks directly using the Django management shell
2. **Resource Optimization**: Each job uses minimal resources (50-100m CPU, 128-256Mi memory) and only when running
3. **Security**: Same security context as other BookWyrm containers (non-root, dropped capabilities)
4. **Scheduling**: Uses standard cron expressions for predictable timing
5. **Job Management**: Configures history limits and TTL for automatic cleanup

### CronJob Specifications

#### Automod CronJob
- **Schedule**: `0 */6 * * *` (every 6 hours)
- **Command**: Direct Python execution of `automod_task()`
- **Resources**: 50m CPU, 128Mi memory
- **Concurrency**: Forbid (prevent overlapping executions)

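The Automod specification above could translate into a manifest along the following lines. This is a minimal sketch, not the contents of `cronjobs.yaml`: the image reference, the `bookwyrm-env` Secret name, the history limits, the TTL, the backoff limit, and the choice to express 50m/128Mi as requests are all assumptions made for illustration, and the import path is inferred from the dotted task path listed under Scheduled Tasks Identified.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: bookwyrm-automod
  namespace: bookwyrm-application
spec:
  schedule: "0 */6 * * *"          # every 6 hours
  concurrencyPolicy: Forbid        # never start a run while the previous one is active
  startingDeadlineSeconds: 600     # value quoted in the Troubleshooting section
  successfulJobsHistoryLimit: 3    # assumed value
  failedJobsHistoryLimit: 3        # assumed value
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 86400   # assumed value: delete finished Jobs after a day
      backoffLimit: 2                  # assumed value
      template:
        spec:
          restartPolicy: Never
          securityContext:
            runAsNonRoot: true
          containers:
            - name: automod
              image: bookwyrm:example   # placeholder: use the image of the other BookWyrm pods
              command:
                - "python"
                - "manage.py"
                - "shell"
                - "-c"
                # Calls the task function synchronously instead of queuing it through Celery
                - "from bookwyrm.models.antispam import automod_task; automod_task()"
              envFrom:
                - secretRef:
                    name: bookwyrm-env  # hypothetical Secret holding database/Redis settings
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              resources:
                requests:               # assumed to be requests; the source gives only the totals
                  cpu: 50m
                  memory: 128Mi
```

The update check job would differ only in its name, schedule, and the function it imports and calls.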
#### Update Check CronJob
- **Schedule**: `0 3 * * *` (daily at 3:00 AM UTC)
- **Command**: Direct Python execution of `check_for_updates_task()`
- **Resources**: 50m CPU, 128Mi memory
- **Concurrency**: Forbid (prevent overlapping executions)

#### Database Cleanup CronJob (Bonus)
- **Schedule**: `0 2 * * 0` (weekly on Sunday at 2:00 AM UTC)
- **Command**: Django shell script to clean expired sessions and old notifications
- **Resources**: 100m CPU, 256Mi memory
- **Purpose**: Maintain database health (not part of original beat functionality)

## Benefits of Migration

### Resource Efficiency
- **Before**: Beat container runs 24/7 consuming ~100m CPU and 256Mi memory
- **After**: CronJobs run only when needed, typically <1 minute execution time
- **Savings**: ~99% reduction in resource usage for periodic tasks

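The ~99% figure can be sanity-checked with a rough calculation. This is an estimate under stated assumptions, not a measurement: it treats the listed CPU figures as reservations, assumes every run takes the full minute, and counts four Automod runs plus one update check per day at 50m each.

$$
\begin{aligned}
\text{Beat:}\quad & 100\,\text{m} \times 24\,\text{h} = 2400\ \text{millicore-hours per day} \\
\text{CronJobs:}\quad & (4 + 1)\ \text{runs} \times 50\,\text{m} \times \tfrac{1}{60}\,\text{h} \approx 4.2\ \text{millicore-hours per day} \\
\text{Reduction:}\quad & 1 - \frac{4.2}{2400} \approx 99.8\%
\end{aligned}
$$

The weekly cleanup job (100m for about a minute) adds well under one millicore-hour per day, so it does not change the result. The same reasoning applies to memory.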
### Operational Benefits
- **Kubernetes Native**: Leverage built-in CronJob features (history, TTL, concurrency control)
- **Observability**: Better visibility into job execution and failures
- **Resilience**: Scheduling no longer depends on a single long-running beat pod
- **Maintenance**: Schedules can be changed by editing the CronJob manifest, without redeploying a beat container

### Simplified Architecture
- Removes dependency on Celery beat scheduler
- Reduces Redis traffic (periodic tasks run directly instead of being queued through the broker)
- Eliminates one running container (reduced complexity)

## Migration Steps

### 1. Deploy CronJobs
```bash
# Apply the new CronJob manifests
kubectl apply -f manifests/applications/bookwyrm/cronjobs.yaml
```

### 2. Verify CronJob Creation
```bash
# Check CronJobs are created
kubectl get cronjobs -n bookwyrm-application

# Check for any immediate execution (if testing)
kubectl get jobs -n bookwyrm-application
```

### 3. Monitor Execution (Run for 1-2 weeks)
```bash
# Watch job execution
kubectl get jobs -n bookwyrm-application -w

# Check job logs
kubectl logs job/bookwyrm-automod-<timestamp> -n bookwyrm-application
kubectl logs job/bookwyrm-update-check-<timestamp> -n bookwyrm-application
```

### 4. Optional: Disable Beat Container (Testing)
```bash
# Scale down beat deployment temporarily
kubectl scale deployment bookwyrm-beat --replicas=0 -n bookwyrm-application

# Monitor for any issues for several days
```

### 5. Permanent Migration
```bash
# Remove beat from kustomization.yaml
# Comment out or remove: - deployment-beat.yaml

# Apply changes
kubectl apply -k manifests/applications/bookwyrm/
```

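After the edit described in the comments above, the `resources` list in `kustomization.yaml` might look like the sketch below. The other entries are elided because this document does not list them, and including `cronjobs.yaml` in the kustomization (rather than applying it separately as in step 1) is an assumption.

```yaml
# manifests/applications/bookwyrm/kustomization.yaml (sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # ...existing BookWyrm resources stay as they are...
  # - deployment-beat.yaml   # removed: periodic tasks now run as CronJobs
  - cronjobs.yaml            # assumption: managed through kustomize as well
```

Note that `kubectl apply -k` only creates and updates resources. Dropping the entry does not delete a beat Deployment that is already running, so it has to be removed explicitly with `kubectl delete`.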
### 6. Cleanup (Optional)
```bash
# Remove beat deployment entirely
kubectl delete deployment bookwyrm-beat -n bookwyrm-application

# Clean up database periodic tasks (if desired)
# This requires connecting to the BookWyrm admin panel or the database directly
```

## Schedule Customization

### Automod Schedule Adjustment
If your instance has high activity, you might want more frequent automod checks:
```yaml
# For every 2 hours instead of 6:
schedule: "0 */2 * * *"

# For hourly:
schedule: "0 * * * *"
```

### Update Check Frequency
The update check can run more often (for example on development instances) or less often:
```yaml
# For twice daily:
schedule: "0 3,15 * * *"

# For weekly instead of daily:
schedule: "0 3 * * 0"
```

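The schedules in this document are stated in UTC, but a CronJob schedule is interpreted in the time zone of the kube-controller-manager unless the `timeZone` field is set. On clusters running Kubernetes 1.27 or later, where that field is stable, the zone can be pinned explicitly. A sketch showing only the relevant fields:

```yaml
spec:
  schedule: "0 3 * * *"
  timeZone: "Etc/UTC"   # keeps "3:00 AM UTC" true regardless of the controller's local time zone
```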
## Troubleshooting

### CronJob Not Executing
```bash
# Check CronJob status
kubectl describe cronjob bookwyrm-automod -n bookwyrm-application

# Check for suspended jobs
kubectl get cronjobs -n bookwyrm-application -o wide
```

### Job Failures
```bash
# Check failed job logs
kubectl logs job/bookwyrm-automod-<timestamp> -n bookwyrm-application

# Common issues:
# - Database connection problems
# - Missing environment variables
# - Redis connectivity issues
```

### Missed Executions
```bash
# Check for node resource constraints
kubectl top nodes

# Verify startingDeadlineSeconds is appropriate
# Current setting: 600 seconds (10 minutes)
```

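For reference, this is where the deadline sits in a CronJob spec, together with the concurrency policy that can also cause a slot to be skipped. The values are the ones quoted in this document; the rest of the manifest is omitted.

```yaml
spec:
  schedule: "0 */6 * * *"
  # If the Job cannot be started within 600 seconds of its scheduled time
  # (for example because the controller was unavailable), that run is
  # skipped and counted as missed.
  startingDeadlineSeconds: 600
  # With Forbid, a scheduled run is skipped while the previous Job is still running.
  concurrencyPolicy: Forbid
```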
## Rollback Plan

If issues arise, rollback is straightforward:

1. **Scale up beat container**:
   ```bash
   kubectl scale deployment bookwyrm-beat --replicas=1 -n bookwyrm-application
   ```

2. **Remove CronJobs**:
   ```bash
   kubectl delete cronjobs bookwyrm-automod bookwyrm-update-check -n bookwyrm-application
   ```

3. **Restore original kustomization.yaml**

## Monitoring and Alerting

Consider setting up monitoring for:
- CronJob execution failures
- Job duration anomalies
- Missing job executions
- Resource usage patterns

Example Prometheus alert:
```yaml
- alert: BookWyrmCronJobFailed
  expr: kube_job_status_failed{namespace="bookwyrm-application"} > 0
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: "BookWyrm CronJob failed"
    description: "Job {{ $labels.job_name }} failed in namespace {{ $labels.namespace }}"
```

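The failure alert above does not cover the "missing job executions" case, because a run that never starts produces no failed Job. A complementary rule could watch the age of the last successful run instead. This is a sketch that assumes a kube-state-metrics version exposing `kube_cronjob_status_last_successful_time`; the alert name and the 7-hour threshold (the 6-hour schedule plus a margin) are illustrative choices.

```yaml
- alert: BookWyrmAutomodNotRunning
  # Fires when the last successful run is older than the schedule interval plus a 1-hour margin
  expr: time() - kube_cronjob_status_last_successful_time{namespace="bookwyrm-application", cronjob="bookwyrm-automod"} > 7 * 3600
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "BookWyrm automod CronJob has not succeeded recently"
    description: "CronJob {{ $labels.cronjob }} in namespace {{ $labels.namespace }} has had no successful run for more than 7 hours"
```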
## Conclusion

This migration replaces the continuously running Celery beat container with efficient Kubernetes CronJobs, providing the same functionality with significantly reduced resource consumption and improved operational characteristics. The migration can be done gradually with minimal risk.