Operations Management (OPS)

Status: implemented (maintenance windows, vacuum all, vacuum org, dry-run, mutex safety, audit records).

Purpose

OPS provides operator-level controls for stack-wide maintenance windows and data vacuum operations. It supports three operations: scheduled maintenance that blocks API Gateway traffic to all other services, a full-stack data vacuum, and an org-scoped data vacuum. Vacuum operations run behind safety guards, and both maintenance and vacuum operations leave audit trails.

System-of-record boundaries

  • OPS owns maintenance window records, vacuum records, vacuum audit records, and the secret code singleton.
  • Domain data remains owned by each service; OPS vacuum operations delete derived copies and org-scoped data from those tables.

Core workflows

  • Maintenance schedule: operator schedules a future maintenance window with description and duration.
  • Maintenance start: scheduled maintenance is started manually or automatically by the maintenance sweep.
  • Maintenance end: operator ends the active maintenance window and restores normal traffic.
  • Maintenance cancel: operator cancels a scheduled (not yet started) maintenance.
  • Maintenance update: operator updates the description or duration of an active maintenance window, or adds progress updates to it.
  • Vacuum all: operator initiates a full data vacuum across all other service tables, S3 data buckets, event/changelog/usage buckets, and CloudWatch log groups. Protected resources (ops_main, audit logs, doc/mcp buckets) are never touched. Requires confirmation phrase.
  • Vacuum org: operator initiates an org-scoped data vacuum across applicable services. Skips UAS users, USM sessions, and CloudWatch.
  • Vacuum cancel: operator cancels a pending vacuum within the 5-minute pending window.
  • Vacuum status: read-only check on current vacuum state.
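
The maintenance workflows above imply a small lifecycle: a window is scheduled, then either started (manually or by the sweep) or cancelled, and an active window is eventually ended. A minimal sketch of that lifecycle follows. The status names and the transition function are assumptions made for illustration; the source names the workflows but not the stored status values.

```python
from enum import Enum


class MaintenanceStatus(Enum):
    # Hypothetical status names; the source does not list stored values.
    SCHEDULED = "scheduled"
    ACTIVE = "active"
    ENDED = "ended"
    CANCELLED = "cancelled"


# Allowed moves, derived from the workflow list:
# start (manual or sweep), end (active only), cancel (scheduled only).
ALLOWED_TRANSITIONS = {
    MaintenanceStatus.SCHEDULED: {MaintenanceStatus.ACTIVE, MaintenanceStatus.CANCELLED},
    MaintenanceStatus.ACTIVE: {MaintenanceStatus.ENDED},
    MaintenanceStatus.ENDED: set(),
    MaintenanceStatus.CANCELLED: set(),
}


def transition(current: MaintenanceStatus, target: MaintenanceStatus) -> MaintenanceStatus:
    """Return the new status, or raise if the workflows do not allow the move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Under this model, cancelling an active window raises an error, which matches the rule that only a scheduled (not yet started) maintenance can be cancelled.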

Data contracts

  • All destructive operations require a secret code validated against a scrypt hash stored in DynamoDB.
  • Vacuum all requires confirmation_phrase: "VACUUM ALL DATA PERMANENTLY".
  • Vacuum has a 5-minute pending window before execution (cancel-safe).
  • Only one vacuum (all or org) can run at a time. This is enforced by a DynamoDB mutex whose 24-hour TTL acts as a safety release.
  • Dry-run mode reports what would be deleted without deleting.
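
The first two contracts can be sketched as a pair of checks: verify the secret code against a stored scrypt hash, then compare the confirmation phrase. This is an illustration only. The function names, the salt handling, and the scrypt cost parameters are assumptions; the source states only that a scrypt hash is stored in DynamoDB and gives the required phrase.

```python
import hashlib
import hmac

CONFIRMATION_PHRASE = "VACUUM ALL DATA PERMANENTLY"  # from the data contract

# Assumed scrypt cost parameters; the source does not specify them.
SCRYPT_PARAMS = {"n": 2**14, "r": 8, "p": 1, "dklen": 32}


def hash_secret_code(code: str, salt: bytes) -> bytes:
    """Derive the scrypt hash that would be stored alongside its salt."""
    return hashlib.scrypt(code.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)


def verify_secret_code(code: str, salt: bytes, stored_hash: bytes) -> bool:
    """Re-derive the hash and compare in constant time."""
    candidate = hash_secret_code(code, salt)
    return hmac.compare_digest(candidate, stored_hash)


def authorize_vacuum_all(code: str, phrase: str, salt: bytes, stored_hash: bytes) -> bool:
    """Vacuum all needs both a valid secret code and the exact phrase."""
    return verify_secret_code(code, salt, stored_hash) and phrase == CONFIRMATION_PHRASE
```

In a real deployment the salt would be random (for example from os.urandom) and stored with the hash. The constant-time comparison avoids leaking how many leading bytes of the hash matched.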

Safety

  • ops_main is never vacuumed.
  • /g3nretailstack/ops/maintenance-audit is never deleted.
  • Nothing prefixed g3nmhsadmin is ever touched.
  • doc.g3nretailstack.com and mcp.g3nretailstack.com buckets are never touched.
  • Maintenance check fails open (OPS infra issues do not cause global outage).
  • Vacuum workers retry UnprocessedItems with exponential backoff (up to 7 attempts, 100ms to 5s cap).
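
The retry rule in the last bullet can be sketched as a capped exponential backoff loop. The doubling factor, the function names, and the send_batch callback are assumptions; the source gives only the attempt limit (7) and the delay range (100ms to 5s). The callback stands in for a batch delete call that returns the items it could not process.

```python
import time
from typing import Callable, List

MAX_ATTEMPTS = 7     # from the source
BASE_DELAY_S = 0.1   # 100ms, from the source
MAX_DELAY_S = 5.0    # 5s cap, from the source


def backoff_delay(attempt: int) -> float:
    """Delay after attempt number `attempt` (1-based): doubles each time, capped."""
    return min(BASE_DELAY_S * (2 ** (attempt - 1)), MAX_DELAY_S)


def delete_with_retry(
    items: List[str],
    send_batch: Callable[[List[str]], List[str]],
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Send items, re-sending whatever comes back unprocessed.

    Returns the items still unprocessed after MAX_ATTEMPTS (empty on success).
    """
    pending = list(items)
    for attempt in range(1, MAX_ATTEMPTS + 1):
        pending = send_batch(pending)
        if not pending:
            return []
        if attempt < MAX_ATTEMPTS:
            sleep(backoff_delay(attempt))
    return pending
```

With plain doubling from 100ms, the waits between seven attempts never reach the 5s cap, so the real workers may use a different multiplier or add jitter. The sleep parameter is injectable so the loop can be exercised without real delays.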

Configuration and defaults

  • Maintenance mode is read by all other services via the ACTIVE_MAINTENANCE singleton with a 20-second cache.
  • OPS is exempt from its own maintenance check.
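
A consuming service could combine the 20-second cache with the fail-open rule from the Safety section roughly as follows. The class name and the fetch_active callback (a stand-in for reading the ACTIVE_MAINTENANCE singleton) are hypothetical, and caching the fail-open result for the full 20 seconds is an assumption rather than documented behavior.

```python
import time
from typing import Callable, Optional

CACHE_TTL_S = 20.0  # cache lifetime from the configuration above


class MaintenanceCheck:
    """Cached, fail-open view of the active maintenance window."""

    def __init__(
        self,
        fetch_active: Callable[[], Optional[dict]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_active = fetch_active  # hypothetical singleton reader
        self._clock = clock
        self._cached: Optional[dict] = None
        self._expires_at = float("-inf")  # forces a read on first use

    def active_window(self) -> Optional[dict]:
        """Return the active window, or None when traffic should flow."""
        now = self._clock()
        if now < self._expires_at:
            return self._cached
        try:
            self._cached = self._fetch_active()
        except Exception:
            # Fail open: an OPS read failure must not block traffic.
            self._cached = None
        self._expires_at = now + CACHE_TTL_S
        return self._cached
```

A request handler would return a maintenance-mode response only when active_window() returns a record. Because of the cache, a change in maintenance state can take up to 20 seconds to reach every service.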

Performance posture

  • Ping and stat are public, no auth required.
  • Maintenance list and get require session auth.
  • All mutation operations are invoked directly as Lambda functions, not through API Gateway.

Governance and roles

  • Maintenance and vacuum operations require the secret code (operator-only).
  • Maintenance list and get are session-gated (any authenticated member can view).
  • Audit records are permanent and immutable.

Relationships and data flow

  • OPS reads and writes every other service's DynamoDB tables during vacuum operations.
  • All other services read the OPS ACTIVE_MAINTENANCE singleton for maintenance mode checks.
  • OPS uses USM and UAS for session auth on read endpoints.

Example scenarios and acceptance criteria

Scenario 1: Scheduled maintenance window

  • An operator schedules a maintenance window for a future time.
  • The maintenance sweep automatically starts it when the time arrives.
  • All other services begin returning maintenance-mode responses.
  • The operator ends the maintenance window and normal traffic resumes.
  • Acceptance: the maintenance audit log captures the schedule, start, and end events; services resume normal traffic within 20 seconds of the end, once their cached maintenance state expires.

Scenario 2: Org-scoped vacuum with cancel

  • An operator initiates a vacuum for a specific organization.
  • Within the 5-minute pending window, the operator cancels it.
  • Acceptance: no data is deleted; vacuum record shows cancelled status.
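
Scenario 2 depends on two rules from the data contracts: the single-vacuum mutex with its 24-hour TTL, and the 5-minute pending window. The sketch below models both in memory so it can run without AWS. The class, field names, and status strings are hypothetical, and releasing the mutex on cancel is an assumption. In DynamoDB the acquire step would typically be a conditional write on the mutex item.

```python
from dataclasses import dataclass
from typing import Optional

PENDING_WINDOW_S = 5 * 60     # 5-minute pending window, from the source
MUTEX_TTL_S = 24 * 60 * 60    # 24-hour safety release, from the source


class VacuumBusy(Exception):
    """Raised when another vacuum already holds the mutex."""


@dataclass
class VacuumCoordinator:
    """In-memory stand-in for the mutex item and the vacuum record."""

    lock: Optional[dict] = None

    def initiate(self, vacuum_id: str, scope: str, now: float) -> dict:
        # Mirrors a conditional write: succeed only if no live lock exists.
        if self.lock is not None and now < self.lock["expires_at"]:
            raise VacuumBusy(self.lock["vacuum_id"])
        self.lock = {
            "vacuum_id": vacuum_id,
            "scope": scope,  # "all" or "org"
            "status": "pending",
            "execute_at": now + PENDING_WINDOW_S,
            "expires_at": now + MUTEX_TTL_S,
        }
        return self.lock

    def cancel(self, vacuum_id: str, now: float) -> str:
        lock = self.lock
        if lock is None or lock["vacuum_id"] != vacuum_id:
            raise KeyError(vacuum_id)
        if lock["status"] != "pending" or now >= lock["execute_at"]:
            raise ValueError("pending window has closed")
        self.lock = None  # assumed: cancelling releases the mutex
        return "cancelled"
```

A second initiate call during the pending window raises VacuumBusy, and a cancel after the window has passed is rejected. A lock older than 24 hours no longer blocks new vacuums.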

Scenario 3: Full vacuum with dry-run

  • An operator runs a full vacuum in dry-run mode.
  • The system reports what would be deleted across all services without deleting anything.
  • The operator reviews and then runs the actual vacuum with the confirmation phrase.
  • Acceptance: dry-run report is accurate; actual vacuum deletes reported items; audit record is permanent.
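
Scenario 3 can be illustrated with one code path that serves both dry-run and real execution, so the dry-run report and the actual deletions cannot drift apart. The protected names and prefix come from the Safety section. The function names, the report shape, and the example resource names are hypothetical.

```python
from typing import Callable, Dict, Iterable, List

# Protected resources listed in the Safety section.
PROTECTED_NAMES = {
    "ops_main",
    "/g3nretailstack/ops/maintenance-audit",
    "doc.g3nretailstack.com",
    "mcp.g3nretailstack.com",
}
PROTECTED_PREFIXES = ("g3nmhsadmin",)


def is_protected(resource: str) -> bool:
    """True for any table, bucket, or log group that must never be touched."""
    return resource in PROTECTED_NAMES or resource.startswith(PROTECTED_PREFIXES)


def vacuum(
    resources: Iterable[str],
    delete: Callable[[str], None],
    dry_run: bool = True,
) -> Dict[str, object]:
    """Report the targets; delete them only when dry_run is False."""
    report: Dict[str, object] = {"dry_run": dry_run}
    targets: List[str] = []
    skipped: List[str] = []
    for resource in resources:
        if is_protected(resource):
            skipped.append(resource)
            continue
        targets.append(resource)
        if not dry_run:
            delete(resource)
    report["targets"] = targets
    report["skipped_protected"] = skipped
    return report
```

Calling vacuum with dry_run=True and then with dry_run=False over the same resource list yields the same targets list, which is one way to state the acceptance criterion that the dry-run report is accurate.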