Satoru Backup Service Plan

Scope

Build a Linux-over-SSH backup system where Satoru pulls edge data locally, snapshots it into a local restic repo, and syncs that repo to B2.

Locked Decisions

  1. Pull model only: edge hosts never push to B2 directly.
  2. Directory targets use rsync.
  3. SQLite targets run a remote .backup, compress the result, pull it, and clean up the remote temp file.
  4. Staging path: ./backups/<site_uuid>/<target_hash>/ (single persistent path per target).
  5. Site runs are background jobs; each site job is serialized, but multiple sites can run concurrently.
  6. Partial target failure does not stop the whole site job; site health becomes warning.
  7. Retention is restic-only (forget --prune), no tar archive layer.
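
Decisions 5 and 6 can be illustrated with a minimal sketch. The function names below are hypothetical, and two details are assumptions not stated in the plan: a site with every target failing is reported as failed, and a site with no targets is reported as success.

```python
from typing import Callable, Dict


def aggregate_health(results: Dict[str, bool]) -> str:
    """Map per-target outcomes (True = ok) to success|warning|failed."""
    if not results or all(results.values()):
        return "success"  # assumption: an empty site counts as success
    if any(results.values()):
        return "warning"  # partial failure: some targets succeeded
    return "failed"       # assumption: every target failed


def run_site_job(targets: Dict[str, Callable[[], None]]) -> str:
    """Run a site's targets one at a time; a failure does not stop the rest."""
    results: Dict[str, bool] = {}
    for name, pull in targets.items():
        try:
            pull()
            results[name] = True
        except Exception:
            # Record the failure and continue with the next target.
            results[name] = False
    return aggregate_health(results)
```

Running each site through a function like this inside its own worker would keep targets serialized within a site while letting separate sites proceed concurrently.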

Pipeline

  1. Preflight job:
    • SSH connectivity/auth.
    • Remote tool/path checks (rsync/sqlite3 as needed).
    • Local tool checks (ssh, rsync, restic, gzip).
    • SQLite preflight validates only database access and temp write capability.
  2. Backup job:
    • Pull sqlite artifacts.
    • Pull directory targets with rsync.
    • restic backup against local staging.
    • Update health and job status (success|warning|failed).
  3. Retention job:
    • restic forget --prune per policy.
  4. Sync job:
    • restic-native sync/copy to B2 repo on schedule.
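
The commands behind these steps might be assembled roughly as follows. This is a sketch under stated assumptions, not the plan's actual implementation: the helper names are hypothetical, rsync's --delete flag and the --keep-daily policy are illustrative choices, remote paths are assumed to contain no whitespace, and restic copy with --from-repo assumes restic 0.14 or newer. B2 credentials would be supplied to restic through the B2_ACCOUNT_ID and B2_ACCOUNT_KEY environment variables.

```python
import shlex
from typing import List


def sqlite_remote_command(db_path: str, tmp_path: str) -> str:
    """Remote shell line: online .backup to a temp file, then gzip it."""
    dot_cmd = f".backup {tmp_path}"  # assumes tmp_path has no whitespace
    return (
        f"sqlite3 {shlex.quote(db_path)} {shlex.quote(dot_cmd)}"
        f" && gzip -f {shlex.quote(tmp_path)}"
    )


def sqlite_cleanup_command(tmp_path: str) -> str:
    """Remote shell line: remove the compressed artifact after the pull."""
    return f"rm -f -- {shlex.quote(tmp_path + '.gz')}"


def rsync_pull(host: str, remote_dir: str, staging_dir: str) -> List[str]:
    """Mirror a remote directory into staging (trailing slashes copy contents)."""
    return ["rsync", "-a", "--delete", "-e", "ssh",
            f"{host}:{remote_dir.rstrip('/')}/", f"{staging_dir.rstrip('/')}/"]


def restic_backup(repo: str, password_file: str, staging_root: str) -> List[str]:
    return ["restic", "-r", repo, "--password-file", password_file,
            "backup", staging_root]


def restic_forget(repo: str, password_file: str, keep_daily: int) -> List[str]:
    # --keep-daily stands in for whatever the retention policy specifies.
    return ["restic", "-r", repo, "--password-file", password_file,
            "forget", "--prune", "--keep-daily", str(keep_daily)]


def restic_copy(b2_repo: str, b2_password_file: str,
                local_repo: str, local_password_file: str) -> List[str]:
    """Copy snapshots from the local repo into the B2 repo."""
    return ["restic", "-r", b2_repo, "--password-file", b2_password_file,
            "copy", "--from-repo", local_repo,
            "--from-password-file", local_password_file]
```

The list-returning helpers can be passed directly to subprocess.run, while the two string helpers are meant to be sent as the remote command of an ssh invocation.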

Minimal Data Model

  1. sites: site_uuid, health fields, last preflight/scan.
  2. site_targets: mode (directory|sqlite_dump), path/hash, last scan metadata.
  3. jobs: type (preflight|backup|restic_sync), status, timing, attempts.
  4. job_events: structured logs per step.
  5. sync_state: last sync status/timestamp/error.
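
One possible shape for these tables is sketched below. It assumes Satoru stores its state in SQLite, which the plan does not state, and every column name beyond those listed above (for example started_at, last_scan_meta) is a hypothetical choice.

```python
import sqlite3

SCHEMA = """
CREATE TABLE sites (
    site_uuid         TEXT PRIMARY KEY,
    health            TEXT NOT NULL DEFAULT 'unknown',
    last_preflight_at TEXT,
    last_scan_at      TEXT
);
CREATE TABLE site_targets (
    id             INTEGER PRIMARY KEY,
    site_uuid      TEXT NOT NULL REFERENCES sites(site_uuid),
    mode           TEXT NOT NULL CHECK (mode IN ('directory', 'sqlite_dump')),
    path           TEXT NOT NULL,
    target_hash    TEXT NOT NULL,
    last_scan_meta TEXT,
    UNIQUE (site_uuid, target_hash)
);
CREATE TABLE jobs (
    id          INTEGER PRIMARY KEY,
    site_uuid   TEXT REFERENCES sites(site_uuid),
    type        TEXT NOT NULL
                CHECK (type IN ('preflight', 'backup', 'restic_sync')),
    status      TEXT NOT NULL,
    started_at  TEXT,
    finished_at TEXT,
    attempts    INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE job_events (
    id         INTEGER PRIMARY KEY,
    job_id     INTEGER NOT NULL REFERENCES jobs(id),
    step       TEXT NOT NULL,
    message    TEXT,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- Single-row table: the CHECK pins the only allowed id to 1.
CREATE TABLE sync_state (
    id           INTEGER PRIMARY KEY CHECK (id = 1),
    last_status  TEXT,
    last_sync_at TEXT,
    last_error   TEXT
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the tables in a fresh database and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The CHECK constraints mirror the enumerations in the list above, so an unknown mode or job type is rejected at insert time.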

Runtime Paths

  1. Staging: ./backups/<site_uuid>/<target_hash>/
  2. Local restic repo: ./repos/restic
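
The staging layout could be resolved with a helper like the one below. The plan does not say how target_hash is derived, so the derivation here (a truncated SHA-256 over mode and path) is purely a hypothetical placeholder, as are the function names.

```python
import hashlib
from pathlib import Path


def target_hash(mode: str, remote_path: str) -> str:
    """Hypothetical derivation: first 16 hex chars of SHA-256 over mode:path."""
    digest = hashlib.sha256(f"{mode}:{remote_path}".encode("utf-8"))
    return digest.hexdigest()[:16]


def staging_dir(staging_root: str, site_uuid: str,
                mode: str, remote_path: str) -> Path:
    """Return <staging_root>/<site_uuid>/<target_hash> for one target."""
    return Path(staging_root) / site_uuid / target_hash(mode, remote_path)
```

Because the hash depends only on the target's mode and path, repeated runs resolve to the same directory, which matches the single persistent path per target.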

Security Defaults

Recommended: 0700 directories, 0600 files, dedicated satoru system user.
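
As a sketch of how these modes might be enforced from code (the helper names are hypothetical, and POSIX permissions are assumed):

```python
import os


def make_private_dir(path: str) -> None:
    """Create a directory and force it to mode 0700."""
    os.makedirs(path, mode=0o700, exist_ok=True)
    # makedirs honors the umask and leaves existing directories unchanged,
    # so set the mode on the leaf directory explicitly.
    os.chmod(path, 0o700)


def write_private_file(path: str, data: bytes) -> None:
    """Write a file that is readable and writable by its owner only."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as handle:
        os.fchmod(handle.fileno(), 0o600)  # also tightens a pre-existing file
        handle.write(data)
```

Only the leaf directory is tightened here; parent directories created along the way keep whatever mode the process umask produces.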

Required Config

  1. staging_root
  2. restic_repo_path
  3. restic_password_file or secret source
  4. restic_retention_policy
  5. restic_sync_interval_hours
  6. restic_b2_repository
  7. restic_b2_account_id / restic_b2_account_key secret source
  8. job_worker_concurrency
  9. site_scan_interval_hours (default 24)
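
These settings could be carried in a small validated structure such as the following. The class name, field types, and validation rules are assumptions for illustration; in particular the retention policy is treated as an opaque string, and the B2 credentials are assumed to be resolved from the secret source before construction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupConfig:
    staging_root: str
    restic_repo_path: str
    restic_password_file: str
    restic_retention_policy: str  # format is not defined by the plan
    restic_sync_interval_hours: int
    restic_b2_repository: str
    restic_b2_account_id: str
    restic_b2_account_key: str
    job_worker_concurrency: int
    site_scan_interval_hours: int = 24

    def __post_init__(self) -> None:
        # Every string setting must be non-empty.
        for name in ("staging_root", "restic_repo_path", "restic_password_file",
                     "restic_retention_policy", "restic_b2_repository",
                     "restic_b2_account_id", "restic_b2_account_key"):
            if not getattr(self, name):
                raise ValueError(f"{name} is required")
        # Intervals and concurrency must be positive.
        for name in ("restic_sync_interval_hours", "job_worker_concurrency",
                     "site_scan_interval_hours"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")
```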

Build Order

  1. Phase 1: queue tables + workers + Run->background + preflight-only.
  2. Phase 2: sqlite pull + rsync pull + local restic backup.
  3. Phase 3: restic retention + scheduled B2 sync + sync health UI.
  4. Phase 4: restore UX + retries/backoff + alerts/observability.

Operational Risks

  1. Disk pressure from staging + restic repo -> enforce headroom checks.
  2. SSH/command variability -> clear per-target errors and preflight gating.
  3. Long-running jobs -> heartbeat, timeout, retry state.
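
Two of these mitigations are simple enough to sketch. The helper names, the 10% default headroom, and the idea of estimating required_bytes up front are all assumptions rather than values taken from the plan.

```python
import shutil
import time
from typing import Optional


def has_headroom(path: str, required_bytes: int,
                 min_free_fraction: float = 0.10) -> bool:
    """True if writing required_bytes still leaves the minimum free fraction."""
    usage = shutil.disk_usage(path)
    free_after = usage.free - required_bytes
    return free_after >= usage.total * min_free_fraction


def is_stale(last_heartbeat: float, timeout_seconds: float,
             now: Optional[float] = None) -> bool:
    """True if a job's last heartbeat is older than the timeout."""
    current = time.time() if now is None else now
    return current - last_heartbeat > timeout_seconds
```

A worker might call has_headroom on the staging root before each pull, and a supervisor loop might use is_stale to mark jobs for timeout or retry.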