satoru/plan/backup-service.md

# Satoru Backup Service Plan
## Scope
Build a Linux-over-SSH backup system where Satoru pulls edge data locally, snapshots it into a local restic repo, and syncs that repo to B2.
## Locked Decisions
1. Pull model only: edge hosts never push to B2 directly.
2. Directory targets use `rsync`.
3. SQLite targets run a remote `sqlite3` `.backup`, compress the result, pull it, and clean up the remote artifact.
4. Staging path: `./backups/<site_uuid>/<target_hash>/` (single persistent path per target).
5. Site runs are background jobs; each site job is serialized, but multiple sites can run concurrently.
6. Partial target failure does not stop the whole site job; site health becomes `warning`.
7. Retention is restic-only (`forget --prune`), no tar archive layer.
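
Decision 6 can be illustrated with a small status rollup. This is a sketch only: the plan defines the partial-failure case (`warning`), while treating "all targets failed" and "no targets" as `failed` is an assumption, and `rollup_site_status` is a hypothetical helper name.

```python
from enum import Enum


class Status(str, Enum):
    SUCCESS = "success"
    WARNING = "warning"
    FAILED = "failed"


def rollup_site_status(target_ok: dict[str, bool]) -> Status:
    """Collapse per-target outcomes into one site-level status."""
    if not target_ok:
        # Assumption: a site job with no targets is reported as failed.
        return Status.FAILED
    failed = [name for name, ok in target_ok.items() if not ok]
    if not failed:
        return Status.SUCCESS
    if len(failed) == len(target_ok):
        # Assumption: the plan only specifies partial failure; all-failed maps to "failed".
        return Status.FAILED
    # Partial failure: the site job continues and health becomes "warning".
    return Status.WARNING
```

For example, `rollup_site_status({"db": True, "uploads": False})` yields `Status.WARNING`, so one broken target does not mask the targets that were backed up.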
## Pipeline
1. Preflight job:
   - SSH connectivity/auth.
   - Remote tool/path checks (`rsync`/`sqlite3` as needed).
   - Local tool checks (`ssh`, `rsync`, `restic`, `gzip`).
   - SQLite preflight validates access/temp write capability only.
2. Backup job:
   - Pull sqlite artifacts.
   - Pull directory targets with rsync.
   - `restic backup` against local staging.
   - Update health and job status (`success|warning|failed`).
3. Retention job:
   - `restic forget --prune` per policy.
4. Sync job:
   - restic-native sync/copy to B2 repo on schedule.
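
The pipeline steps map roughly onto the command lines below. This is a hedged sketch, not the implementation: the function names are hypothetical, `--delete` on rsync and the `keep-*` policy keys are assumptions, and the `copy --from-repo` form assumes restic 0.14 or later with B2 credentials supplied through the `B2_ACCOUNT_ID` and `B2_ACCOUNT_KEY` environment variables. The builders only return argument lists; running them (for example with `subprocess.run`) is left to the job worker.

```python
import shlex


def sqlite_backup_cmd(host: str, db_path: str, remote_tmp: str) -> list[str]:
    """SSH command: snapshot the DB with `.backup`, then gzip the copy.

    Assumes remote_tmp contains no single quotes.
    """
    dot_cmd = f".backup '{remote_tmp}'"
    remote = (
        f"sqlite3 {shlex.quote(db_path)} {shlex.quote(dot_cmd)}"
        f" && gzip -f {shlex.quote(remote_tmp)}"
    )
    return ["ssh", host, remote]


def sqlite_cleanup_cmd(host: str, remote_tmp: str) -> list[str]:
    """SSH command: remove the compressed artifact after it has been pulled."""
    return ["ssh", host, f"rm -f {shlex.quote(remote_tmp + '.gz')}"]


def rsync_pull_cmd(host: str, remote_dir: str, staging_dir: str) -> list[str]:
    """Pull a directory target into its staging path (mirror semantics assumed)."""
    src = f"{host}:{remote_dir.rstrip('/')}/"
    dst = f"{staging_dir.rstrip('/')}/"
    return ["rsync", "-a", "--delete", "-e", "ssh", src, dst]


def restic_backup_cmd(repo: str, password_file: str, staging_root: str) -> list[str]:
    """Snapshot local staging into the local restic repo."""
    return ["restic", "-r", repo, "--password-file", password_file,
            "backup", staging_root]


def restic_forget_cmd(repo: str, password_file: str, policy: dict[str, int]) -> list[str]:
    """Apply retention; policy keys such as 'keep-daily' become restic flags."""
    cmd = ["restic", "-r", repo, "--password-file", password_file,
           "forget", "--prune"]
    for flag, count in policy.items():
        cmd += [f"--{flag}", str(count)]
    return cmd


def restic_copy_cmd(b2_repo: str, local_repo: str, password_file: str) -> list[str]:
    """Copy snapshots from the local repo to the B2 repo (restic >= 0.14 assumed)."""
    return ["restic", "-r", b2_repo, "--password-file", password_file,
            "copy", "--from-repo", local_repo,
            "--from-password-file", password_file]
```

Returning lists rather than shell strings keeps local execution free of shell quoting; only the remote SSH payload needs `shlex.quote`.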
## Minimal Data Model
1. `sites`: `site_uuid`, health fields, last preflight/scan.
2. `site_targets`: mode (`directory|sqlite_dump`), path/hash, last scan metadata.
3. `jobs`: type (`preflight|backup|restic_sync`), status, timing, attempts.
4. `job_events`: structured logs per step.
5. `sync_state`: last sync status/timestamp/error.
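
One possible shape for these tables, sketched with Python's built-in `sqlite3` module. Only the table names, `site_uuid`, the `mode` values, and the job `type` values come from the plan; every other column name, the `queued` default, and the single-row `sync_state` layout are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sites (
    site_uuid         TEXT PRIMARY KEY,
    health            TEXT NOT NULL DEFAULT 'unknown',
    last_preflight_at TEXT,
    last_scan_at      TEXT
);
CREATE TABLE site_targets (
    id             INTEGER PRIMARY KEY,
    site_uuid      TEXT NOT NULL REFERENCES sites(site_uuid),
    mode           TEXT NOT NULL CHECK (mode IN ('directory', 'sqlite_dump')),
    path           TEXT NOT NULL,
    target_hash    TEXT NOT NULL,
    last_scan_meta TEXT,
    UNIQUE (site_uuid, target_hash)
);
CREATE TABLE jobs (
    id          INTEGER PRIMARY KEY,
    site_uuid   TEXT REFERENCES sites(site_uuid),
    type        TEXT NOT NULL CHECK (type IN ('preflight', 'backup', 'restic_sync')),
    status      TEXT NOT NULL DEFAULT 'queued',
    attempts    INTEGER NOT NULL DEFAULT 0,
    started_at  TEXT,
    finished_at TEXT
);
CREATE TABLE job_events (
    id         INTEGER PRIMARY KEY,
    job_id     INTEGER NOT NULL REFERENCES jobs(id),
    step       TEXT NOT NULL,
    level      TEXT NOT NULL,
    message    TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Assumption: sync state is global, so a single row with id = 1 is enough.
CREATE TABLE sync_state (
    id           INTEGER PRIMARY KEY CHECK (id = 1),
    last_status  TEXT,
    last_sync_at TEXT,
    last_error   TEXT
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all five tables on an empty database."""
    conn.executescript(SCHEMA)
```

The `CHECK` constraints keep `mode` and `type` limited to the values listed above, so a typo in a worker surfaces as an `IntegrityError` instead of a silently unprocessed row.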
## Runtime Paths
1. Staging: `./backups/<site_uuid>/<target_hash>/`
2. Local restic repo: `./repos/restic`
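
A minimal sketch of how the staging path could be derived. The plan does not say how `target_hash` is computed, so hashing `mode:path` with SHA-256 and truncating to 16 hex characters is purely an assumption, as are the helper names.

```python
import hashlib
from pathlib import Path


def target_hash(mode: str, remote_path: str) -> str:
    """Stable identifier for a target (assumed: SHA-256 of 'mode:path', 16 hex chars)."""
    digest = hashlib.sha256(f"{mode}:{remote_path}".encode("utf-8")).hexdigest()
    return digest[:16]


def staging_dir(staging_root: str, site_uuid: str, t_hash: str) -> Path:
    """Single persistent staging path for one target: <root>/<site_uuid>/<target_hash>."""
    return Path(staging_root) / site_uuid / t_hash
```

Because the hash depends only on the target definition, repeated runs land in the same directory, which is what lets rsync transfer only the changes.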
## Security Defaults
Recommended: `0700` directories, `0600` files, dedicated `satoru` system user.
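
A sketch of applying these permissions from Python, assuming a POSIX filesystem. The helper names are hypothetical, and creating the `satoru` user is left to the host's provisioning.

```python
import os
from pathlib import Path


def ensure_private_dir(path: str) -> Path:
    """Create the directory if needed and set the leaf directory to 0700."""
    p = Path(path)
    p.mkdir(parents=True, exist_ok=True)
    # Explicit chmod: the mode passed to mkdir is filtered by the process umask,
    # and a pre-existing directory may have looser permissions.
    os.chmod(p, 0o700)
    return p


def write_private_file(path: str, data: bytes) -> None:
    """Write a file that is created with 0600 rather than chmod-ed after the fact."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    # Tighten the mode in case the file already existed with other permissions.
    os.chmod(path, 0o600)
```

Opening with mode `0600` up front avoids a window in which a secret such as the restic password file is briefly world-readable.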
## Required Config
1. `staging_root`
2. `restic_repo_path`
3. `restic_password_file` or secret source
4. `restic_retention_policy`
5. `restic_sync_interval_hours`
6. `restic_b2_repository`
7. `restic_b2_account_id` / `restic_b2_account_key` secret source
8. `job_worker_concurrency`
9. `site_scan_interval_hours` (default 24)
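
The required keys could be loaded into a validated structure along these lines. This is a sketch under assumptions: the retention policy is modeled as a mapping of restic `keep-*` flags to counts, the B2 account ID and key are expected to come from a secret source and are deliberately not stored on the object, and the validation rules are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupConfig:
    staging_root: str
    restic_repo_path: str
    restic_password_file: str
    restic_retention_policy: dict[str, int]  # assumed shape, e.g. {"keep-daily": 7}
    restic_sync_interval_hours: int
    restic_b2_repository: str
    job_worker_concurrency: int
    site_scan_interval_hours: int = 24  # default stated in the plan

    def __post_init__(self) -> None:
        # Reject empty path-like settings early, before any job is queued.
        for name in ("staging_root", "restic_repo_path",
                     "restic_password_file", "restic_b2_repository"):
            if not getattr(self, name):
                raise ValueError(f"{name} must not be empty")
        # Intervals and concurrency only make sense as positive integers.
        for name in ("restic_sync_interval_hours", "job_worker_concurrency",
                     "site_scan_interval_hours"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")
        if not self.restic_retention_policy:
            raise ValueError("restic_retention_policy must define at least one rule")
```

Failing at load time keeps a misconfigured retention policy or a zero worker count from showing up later as a stuck queue.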
## Build Order
1. Phase 1: queue tables + workers + Run->background + preflight-only.
2. Phase 2: sqlite pull + rsync pull + local restic backup.
3. Phase 3: restic retention + scheduled B2 sync + sync health UI.
4. Phase 4: restore UX + retries/backoff + alerts/observability.
## Operational Risks
1. Disk pressure from staging + restic repo -> enforce headroom checks.
2. SSH/command variability -> clear per-target errors and preflight gating.
3. Long-running jobs -> heartbeat, timeout, retry state.
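
Two of these mitigations reduce to small checks, sketched below. The 10% free-space floor, the timeout value passed by the caller, and the function names are all assumptions; the plan only calls for headroom checks and heartbeats without fixing thresholds.

```python
import shutil


def headroom_ok(total: int, free: int, required: int,
                min_free_fraction: float = 0.10) -> bool:
    """True if, after writing `required` bytes, at least the given fraction stays free."""
    return (free - required) >= total * min_free_fraction


def path_has_headroom(path: str, required: int,
                      min_free_fraction: float = 0.10) -> bool:
    """Apply headroom_ok to the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return headroom_ok(usage.total, usage.free, required, min_free_fraction)


def is_stale(last_heartbeat: float, now: float, timeout_seconds: float) -> bool:
    """True if a running job has not sent a heartbeat within the timeout."""
    return (now - last_heartbeat) > timeout_seconds
```

A worker might call `path_has_headroom` on both the staging root and the restic repo path before starting a backup job, and a supervisor might use `is_stale` to move silent jobs into a retry state.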