# Satoru Backup Service Plan

## Scope
Build a backup system for Linux hosts reached over SSH: Satoru pulls data from edge hosts into local staging, snapshots it into a local restic repository, and syncs that repository to Backblaze B2.

## Locked Decisions
1. Pull model only: edge hosts never push to B2 directly.
2. Directory targets use `rsync`.
3. SQLite targets run a remote `.backup`, compress the result, pull it, and clean up the remote temporary file.
4. Staging path: `./backups/<site_uuid>/<target_hash>/` (single persistent path per target).
5. Site runs are background jobs; each site job is serialized, but multiple sites can run concurrently.
6. Partial target failure does not stop the whole site job; site health becomes `warning`.
7. Retention is restic-only (`forget --prune`); there is no tar archive layer.

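Decisions 2 to 4 can be sketched as plain command builders. This is a minimal sketch, not the planned implementation: the `target_hash` derivation, the `--delete` flag, the remote temp path, and running `gzip` on the edge host are all assumptions, and every function name here is hypothetical.

```python
import hashlib
import shlex
from pathlib import Path


def target_hash(mode: str, remote_path: str) -> str:
    # Hypothetical derivation: the plan does not say how target_hash is computed.
    return hashlib.sha256(f"{mode}:{remote_path}".encode()).hexdigest()[:16]


def staging_dir(staging_root: Path, site_uuid: str, t_hash: str) -> Path:
    # Decision 4: one persistent staging path per target.
    return staging_root / site_uuid / t_hash


def rsync_pull_cmd(ssh_target: str, remote_dir: str, dest: Path) -> list[str]:
    # Decision 2: the trailing slash on the source copies the directory contents.
    # --delete (an assumption) keeps staging a mirror of the remote directory.
    src = f"{ssh_target}:{remote_dir.rstrip('/')}/"
    return ["rsync", "-a", "--delete", "-e", "ssh", src, f"{dest}/"]


def sqlite_pull_cmds(
    ssh_target: str, db_path: str, remote_tmp: str, dest: Path
) -> list[list[str]]:
    # Decision 3: remote .backup, compress, pull, clean up (in that order).
    # remote_tmp is assumed to be a path we generate, without spaces or quotes.
    if any(c in remote_tmp for c in " '\""):
        raise ValueError("remote_tmp must not contain spaces or quotes")
    dump = f"sqlite3 {shlex.quote(db_path)} {shlex.quote('.backup ' + remote_tmp)}"
    gz = f"{remote_tmp}.gz"
    return [
        ["ssh", ssh_target, f"{dump} && gzip -f {shlex.quote(remote_tmp)}"],
        ["rsync", "-a", "-e", "ssh", f"{ssh_target}:{gz}", f"{dest}/"],
        ["ssh", ssh_target, f"rm -f {shlex.quote(gz)}"],
    ]
```

Returning argument lists instead of shell strings keeps local execution free of shell quoting issues; only the remote command string passed to `ssh` needs `shlex.quote`.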
## Pipeline
1. Preflight job:
   - SSH connectivity/auth.
   - Remote tool/path checks (`rsync`/`sqlite3` as needed).
   - Local tool checks (`ssh`, `rsync`, `restic`, `gzip`).
   - SQLite preflight validates access/temp write capability only.
2. Backup job:
   - Pull SQLite artifacts.
   - Pull directory targets with `rsync`.
   - `restic backup` against local staging.
   - Update health and job status (`success|warning|failed`).
3. Retention job:
   - `restic forget --prune` per policy.
4. Sync job:
   - restic-native sync/copy to the B2 repo on schedule.

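The status update at the end of the backup job can be illustrated with a small roll-up function. The plan only states that a partial target failure yields `warning`; treating "no target pulled" or a failed `restic backup` as `failed` is an assumption of this sketch, and `TargetResult` and `job_status` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TargetResult:
    target_hash: str
    ok: bool
    error: str = ""


def job_status(results: list[TargetResult], restic_ok: bool) -> str:
    """Roll per-target results up into success|warning|failed.

    Assumed rule: the job fails if `restic backup` failed or no target was
    pulled successfully; any other partial failure is a warning.
    """
    if not restic_ok or not any(r.ok for r in results):
        return "failed"
    if all(r.ok for r in results):
        return "success"
    return "warning"
```

For example, one failed target out of two with a successful `restic backup` gives `warning`, which matches locked decision 6.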
## Minimal Data Model
1. `sites`: `site_uuid`, health fields, last preflight/scan.
2. `site_targets`: mode (`directory|sqlite_dump`), path/hash, last scan metadata.
3. `jobs`: type (`preflight|backup|restic_sync`), status, timing, attempts.
4. `job_events`: structured logs per step.
5. `sync_state`: last sync status/timestamp/error.

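One possible shape for these tables, sketched with Python's built-in `sqlite3` module. The table names, `site_uuid`, and the enumerated `mode` and `type` values come from the list above; every other column name, type, and default is an assumption made for illustration.

```python
import sqlite3

# Column names beyond those listed in the plan are illustrative assumptions.
SCHEMA = """
CREATE TABLE sites (
    site_uuid         TEXT PRIMARY KEY,
    health            TEXT NOT NULL DEFAULT 'unknown',
    last_preflight_at TEXT,
    last_scan_at      TEXT
);
CREATE TABLE site_targets (
    id             INTEGER PRIMARY KEY,
    site_uuid      TEXT NOT NULL REFERENCES sites(site_uuid),
    mode           TEXT NOT NULL CHECK (mode IN ('directory', 'sqlite_dump')),
    path           TEXT NOT NULL,
    target_hash    TEXT NOT NULL,
    last_scan_meta TEXT,
    UNIQUE (site_uuid, target_hash)
);
CREATE TABLE jobs (
    id          INTEGER PRIMARY KEY,
    site_uuid   TEXT REFERENCES sites(site_uuid),
    type        TEXT NOT NULL CHECK (type IN ('preflight', 'backup', 'restic_sync')),
    status      TEXT NOT NULL DEFAULT 'queued',
    attempts    INTEGER NOT NULL DEFAULT 0,
    started_at  TEXT,
    finished_at TEXT
);
CREATE TABLE job_events (
    id         INTEGER PRIMARY KEY,
    job_id     INTEGER NOT NULL REFERENCES jobs(id),
    step       TEXT NOT NULL,
    message    TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE sync_state (
    id           INTEGER PRIMARY KEY CHECK (id = 1),  -- single-row table
    last_status  TEXT,
    last_sync_at TEXT,
    last_error   TEXT
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database, enable foreign keys, and create the tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

The `CHECK` constraints keep the enumerated values from the plan enforceable at the database level, so an unknown target mode or job type fails on insert.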
## Runtime Paths
1. Staging: `./backups/<site_uuid>/<target_hash>/`
2. Local restic repo: `./repos/restic`

## Security Defaults
Recommended: `0700` directories, `0600` files, dedicated `satoru` system user.

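A minimal sketch of applying these permission defaults on Linux, using only the standard library. The helper names are hypothetical, and creating the `satoru` system user is left to the host's provisioning.

```python
import os
from pathlib import Path


def ensure_private_dir(path: Path) -> Path:
    """Create a directory and force mode 0700 on the leaf.

    Intermediate parents created by parents=True keep default permissions.
    """
    path.mkdir(mode=0o700, parents=True, exist_ok=True)
    os.chmod(path, 0o700)  # mkdir's mode is filtered by the umask; chmod is not
    return path


def write_private_file(path: Path, data: bytes) -> None:
    """Create or truncate a file and set mode 0600 before writing any data."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        os.fchmod(f.fileno(), 0o600)  # also tightens a pre-existing file
        f.write(data)
```

Setting the mode at creation time, instead of calling `chmod` after writing, avoids a window in which a secret such as the restic password file is readable by other users.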
## Required Config
1. `staging_root`
2. `restic_repo_path`
3. `restic_password_file` or secret source
4. `restic_retention_policy`
5. `restic_sync_interval_hours`
6. `restic_b2_repository`
7. `restic_b2_account_id` / `restic_b2_account_key` secret source
8. `job_worker_concurrency`
9. `site_scan_interval_hours` (default 24)

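These settings could be carried in a small validated config object. The sketch below is hedged: the shape of `restic_retention_policy` (a mapping of restic `--keep-*` flag names to counts) is an assumption, the B2 credentials are deliberately left out because the plan sources them from a secret store, and `SatoruConfig` and `forget_args` are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SatoruConfig:
    staging_root: Path
    restic_repo_path: Path
    restic_password_file: Path
    restic_retention_policy: dict[str, int]  # assumed shape, e.g. {"keep-daily": 7}
    restic_sync_interval_hours: int
    restic_b2_repository: str  # restic B2 repository string, e.g. "b2:bucket:path"
    job_worker_concurrency: int
    site_scan_interval_hours: int = 24  # default stated in the plan

    def validate(self) -> None:
        for name in (
            "restic_sync_interval_hours",
            "job_worker_concurrency",
            "site_scan_interval_hours",
        ):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")
        if not self.restic_b2_repository.startswith("b2:"):
            raise ValueError("restic_b2_repository should look like b2:bucket:path")
        if not self.restic_retention_policy:
            raise ValueError("restic_retention_policy must not be empty")


def forget_args(cfg: SatoruConfig) -> list[str]:
    """Build the `restic forget --prune` command line from the retention policy."""
    args = [
        "restic", "-r", str(cfg.restic_repo_path),
        "--password-file", str(cfg.restic_password_file),
        "forget", "--prune",
    ]
    for flag, count in sorted(cfg.restic_retention_policy.items()):
        args += [f"--{flag}", str(count)]
    return args
```

restic reads B2 credentials from the `B2_ACCOUNT_ID` and `B2_ACCOUNT_KEY` environment variables, so the secret source only needs to populate the environment of the sync process.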
## Build Order
1. Phase 1: queue tables + workers + Run->background + preflight-only.
2. Phase 2: SQLite pull + rsync pull + local restic backup.
3. Phase 3: restic retention + scheduled B2 sync + sync health UI.
4. Phase 4: restore UX + retries/backoff + alerts/observability.

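For the Phase 1 workers, the rule that each site is serialized while different sites run concurrently can be sketched with a thread pool and one lock per site. This is an in-process illustration only; the plan's queue tables are not modeled, and `SiteWorkerPool` is a hypothetical name.

```python
import threading
from collections import defaultdict
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable


class SiteWorkerPool:
    """Runs jobs in the background; jobs for the same site never overlap."""

    def __init__(self, concurrency: int) -> None:
        self._pool = ThreadPoolExecutor(max_workers=concurrency)
        self._locks: dict[str, threading.Lock] = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def submit(self, site_uuid: str, job: Callable[[], object]) -> Future:
        with self._guard:
            lock = self._locks[site_uuid]

        def run() -> object:
            with lock:  # serialize jobs that belong to the same site
                return job()

        return self._pool.submit(run)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)
```

A job waiting on its site lock still occupies a worker thread, so a queue-backed implementation would more likely skip sites that already have a running job instead of blocking on them.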
## Operational Risks
1. Disk pressure from staging + restic repo -> enforce headroom checks.
2. SSH/command variability -> clear per-target errors and preflight gating.
3. Long-running jobs -> heartbeat, timeout, retry state.
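
The headroom, heartbeat, and retry mitigations can be sketched with three small helpers. The thresholds, the exponential backoff schedule, and all function names are assumptions chosen for illustration; the plan does not fix any of them.

```python
import shutil
import time
from typing import Optional


def has_headroom(path: str, min_free_bytes: int, min_free_fraction: float = 0.10) -> bool:
    """True if the filesystem holding `path` has enough free space (risk 1)."""
    usage = shutil.disk_usage(path)
    return usage.free >= min_free_bytes and usage.free / usage.total >= min_free_fraction


def is_stale(last_heartbeat: float, timeout_s: float, now: Optional[float] = None) -> bool:
    """True if a job has not sent a heartbeat within `timeout_s` seconds (risk 3)."""
    now = time.time() if now is None else now
    return now - last_heartbeat > timeout_s


def backoff_delay(attempts: int, base_s: float = 30.0, cap_s: float = 3600.0) -> float:
    """Exponential backoff before the next retry, capped at `cap_s` seconds."""
    return min(cap_s, base_s * (2 ** attempts))
```

A backup job would call `has_headroom` on both the staging root and the restic repository path before pulling any target, and a supervisor loop could mark jobs as timed out when `is_stale` returns true.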