Runbooks
This document provides runbooks for common operational procedures including releases, incidents, backups, and upgrades.
Release Train Cut
We follow a structured release train approach:
- Trigger Release Please
  - Merge main into release branch.
  - Let Release Please generate changelog + bump version.
- Tag & Artifacts
  - CI tags repo with version (e.g. v1.2.3).
  - CI builds container images and pushes to registry.
- Docs Versioning
  - Use mike to version and publish docs:

        mike deploy 1.2 latest
        mike set-default latest
        git push origin gh-pages
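
The docs version passed to mike can be derived from the release tag instead of being typed by hand. The following is a minimal sketch, assuming tags follow the vMAJOR.MINOR.PATCH form shown above and that docs are versioned per minor release; the helper name docs_version_from_tag is hypothetical and not part of mike or Release Please.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: turn a release tag such as v1.2.3 into the
# MAJOR.MINOR docs version (1.2). Assumes tags look like vX.Y.Z.
docs_version_from_tag() {
  tag="$1"
  case "$tag" in
    v[0-9]*.[0-9]*.[0-9]*) ;;
    *) echo "unexpected tag format: $tag" >&2; return 1 ;;
  esac
  version="${tag#v}"     # strip the leading "v": 1.2.3
  echo "${version%.*}"   # strip the patch part:   1.2
}

# Usage (commented out so the sketch has no side effects):
# mike deploy "$(docs_version_from_tag v1.2.3)" latest
```

The format check is deliberately loose; a stricter pattern may be preferable if pre-release tags are in use.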
Incident Response
SSE Stream Stuck
- Check API logs for errors around /stream.
- Restart affected API pods.
- Validate NATS connection health.
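
The three steps above could be scripted roughly as follows. This is a sketch under assumptions: the namespace (prod), the deployment name (api), and the NATS monitoring URL are placeholders that do not come from this runbook, and the run helper is a hypothetical dry-run wrapper. It assumes the NATS server has its HTTP monitoring port enabled.

```shell
#!/bin/sh
set -eu

# Assumed names, not taken from this runbook: namespace "prod",
# deployment "api", and a NATS monitoring endpoint on port 8222.
NAMESPACE="${NAMESPACE:-prod}"
DEPLOYMENT="${DEPLOYMENT:-api}"
NATS_MONITOR_URL="${NATS_MONITOR_URL:-http://nats:8222}"

# Print each command instead of running it while DRY_RUN is 1 (the default).
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

sse_stream_stuck() {
  # 1. Fetch recent API logs; search them for errors around /stream.
  run kubectl -n "$NAMESPACE" logs "deployment/$DEPLOYMENT" --since=15m
  # 2. Restart the API pods with a rolling restart.
  run kubectl -n "$NAMESPACE" rollout restart "deployment/$DEPLOYMENT"
  # 3. Check NATS health through its HTTP monitoring endpoint.
  run curl -fsS "$NATS_MONITOR_URL/healthz"
}
```

Calling sse_stream_stuck prints the commands for review; setting DRY_RUN=0 executes them.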
API Server Crash
- Inspect server pod/container logs.
- Restart pod in Kubernetes.
- Verify database connectivity.
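
For the database connectivity step, pg_isready reports reachability through its exit status. The sketch below turns that status into a readable message; the function names and the default host db.internal are hypothetical, and the kubectl lines are shown only as commented examples with placeholder names.

```shell
#!/bin/sh
set -eu

# Map a pg_isready exit status to a short description.
describe_pg_isready() {
  case "$1" in
    0) echo "accepting connections" ;;
    1) echo "rejecting connections (for example, during startup)" ;;
    2) echo "no response" ;;
    3) echo "no attempt made (for example, invalid parameters)" ;;
    *) echo "unknown status $1" ;;
  esac
}

# Hypothetical host and port; replace with the real database endpoint.
check_database() {
  status=0
  pg_isready -h "${PGHOST:-db.internal}" -p "${PGPORT:-5432}" -t 5 \
    >/dev/null 2>&1 || status=$?
  echo "database: $(describe_pg_isready "$status")"
  return "$status"
}

# Related commands for the first two steps (placeholder names):
# kubectl -n <namespace> logs <pod> --previous   # logs of the crashed container
# kubectl -n <namespace> delete pod <pod>        # the controller recreates the pod
```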
Event Queue Backlog
- Monitor NATS JetStream metrics.
- Check outbox relay worker logs.
- Scale API server instances if needed.
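
One way to make "scale if needed" concrete is to derive a replica count from the consumer backlog. The sketch below is illustrative only: the thresholds, the function name suggest_replicas, and the stream, consumer, and deployment names in the usage comment are all assumptions, not values from this runbook.

```shell
#!/bin/sh
set -eu

# Suggest an API replica count from the consumer backlog: one extra replica
# per PER_REPLICA pending messages, capped at MAX_REPLICAS, and never lower
# than the current count. Both defaults are illustrative.
suggest_replicas() {
  pending="$1"
  current="$2"
  per_replica="${PER_REPLICA:-1000}"
  max_replicas="${MAX_REPLICAS:-10}"
  target=$(( current + pending / per_replica ))
  if [ "$target" -gt "$max_replicas" ]; then
    target="$max_replicas"
  fi
  if [ "$target" -lt "$current" ]; then
    target="$current"
  fi
  echo "$target"
}

# Usage sketch (stream, consumer, and deployment names are hypothetical):
# pending="$(nats consumer info EVENTS outbox-relay --json | jq '.num_pending')"
# kubectl -n prod scale deployment/api --replicas="$(suggest_replicas "$pending" 3)"
```

Scaling only helps if the API servers are the bottleneck; the outbox relay worker logs should be checked first, as listed above.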
Backups & Restores
PostgreSQL
- Backups: cron pg_dump to S3.
- Restore: psql < dump.sql into new instance.
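
A backup and restore script along these lines might look like the sketch below. The bucket name, key layout, and function names are hypothetical, and it assumes the AWS CLI and PostgreSQL client tools are installed and configured on the host that runs the cron job.

```shell
#!/bin/sh
set -eu

# Hypothetical bucket; replace with the real backup location.
BACKUP_BUCKET="${BACKUP_BUCKET:-example-db-backups}"

# Build the S3 object key for a dump: postgres/<db>/<UTC timestamp>.sql.gz
backup_key() {
  db="$1"
  stamp="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  echo "postgres/$db/$stamp.sql.gz"
}

# Stream a compressed dump straight to S3 without a local temp file.
# Caveat: plain sh has no pipefail, so a pg_dump failure mid-stream is
# not detected here; verify the uploaded object separately.
backup_database() {
  db="$1"
  pg_dump --dbname="$db" | gzip |
    aws s3 cp - "s3://$BACKUP_BUCKET/$(backup_key "$db")"
}

# Load a downloaded dump into a new instance, stopping at the first error.
restore_database() {
  dump_file="$1"
  target_db="$2"
  gunzip -c "$dump_file" | psql --dbname="$target_db" -v ON_ERROR_STOP=1
}
```

A cron entry would then call backup_database with the database name or connection string on whatever schedule the retention policy requires.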
Event Store
- Event store is append-only; back it up with regular pg_dump.
- Point-in-time recovery via PostgreSQL WAL archiving.
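
Point-in-time recovery replays archived WAL on top of a physical base backup (for example one taken with pg_basebackup), so it complements the pg_dump backups rather than building on them. The sketch below prepares a restored data directory for recovery using the PostgreSQL 12+ layout; the function name prepare_pitr and the archive bucket are hypothetical, and it assumes WAL segments were archived to that bucket under their original file names.

```shell
#!/bin/sh
set -eu

# Write point-in-time recovery settings into a restored data directory
# (PostgreSQL 12+: settings in postgresql.auto.conf plus an empty
# recovery.signal file). The archive location is a hypothetical bucket.
prepare_pitr() {
  data_dir="$1"
  target_time="$2"
  archive="${WAL_ARCHIVE:-s3://example-wal-archive}"
  cat >> "$data_dir/postgresql.auto.conf" <<EOF
restore_command = 'aws s3 cp $archive/%f %p'
recovery_target_time = '$target_time'
recovery_target_action = 'promote'
EOF
  touch "$data_dir/recovery.signal"
}

# Usage (data directory path is a placeholder):
# prepare_pitr /var/lib/postgresql/data "2024-01-01 02:00:00+00"
```

Starting PostgreSQL on that data directory then replays WAL up to the target time and promotes the instance.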
Upgrade Checklist
- Run database migrations before deploying new code.
- Test migrations on staging environment first.
- Validate API compatibility with existing clients.
- Monitor error rates after deployment.
- Keep rollback plan ready.
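
The last two checklist items can be tied together with a simple threshold check. This is a sketch only: the 5% default threshold, the function name error_rate_exceeded, and the names in the usage comment are assumptions, and the error and request counts would come from whatever monitoring system is in use.

```shell
#!/bin/sh
set -eu

# Succeed (exit status 0) when errors/total exceeds threshold_pct percent.
# Uses integer arithmetic only, so it works in plain sh.
error_rate_exceeded() {
  errors="$1"
  total="$2"
  threshold_pct="${3:-5}"
  [ "$total" -gt 0 ] || return 1   # no traffic yet: nothing to judge
  [ $(( errors * 100 )) -gt $(( total * threshold_pct )) ]
}

# Usage sketch (namespace and deployment names are hypothetical):
# if error_rate_exceeded "$errors" "$requests" 5; then
#   kubectl -n prod rollout undo deployment/api
# fi
```

Note that kubectl rollout undo reverts only the deployment; a database migration that is not backward compatible needs its own rollback step.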