Operations Management (OPS)
Status: implemented (maintenance windows, vacuum all, vacuum org, dry-run, mutex safety, audit records).
Purpose
OPS provides operator-level controls for stack-wide maintenance windows and data vacuum operations. It enables scheduled maintenance that blocks API Gateway traffic across all other services, full-stack data vacuum, and org-scoped data vacuum with safety guards and audit trails.
System-of-record boundaries
- OPS owns maintenance window records, vacuum records, vacuum audit records, and the secret code singleton.
- Domain data remains owned by each service; OPS vacuum operations delete derived copies and org-scoped data from those tables.
Core workflows
- Maintenance schedule: operator schedules a future maintenance window with description and duration.
- Maintenance start: scheduled maintenance is started manually or automatically by the maintenance sweep.
- Maintenance end: operator ends the active maintenance window and restores normal traffic.
- Maintenance cancel: operator cancels a scheduled (not yet started) maintenance.
- Maintenance update: operator updates description, duration, or adds progress updates to active maintenance.
- Vacuum all: operator initiates a full data vacuum across all other service tables, S3 data buckets, event/changelog/usage buckets, and CloudWatch log groups. Protected resources (ops_main, audit logs, doc/mcp buckets) are never touched. Requires confirmation phrase.
- Vacuum org: operator initiates an org-scoped data vacuum across applicable services. Skips UAS users, USM sessions, and CloudWatch.
- Vacuum cancel: operator cancels a pending vacuum within the 5-minute pending window.
- Vacuum status: read-only check on current vacuum state.
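The maintenance workflows above amount to a small state machine. The following is a minimal sketch of that lifecycle, not the OPS implementation: the status names (scheduled, active, ended, cancelled) and the apply_action helper are assumptions made for illustration.

```python
from enum import Enum


class MaintenanceStatus(Enum):
    # Status names are assumed; the source only describes the workflows.
    SCHEDULED = "scheduled"
    ACTIVE = "active"
    ENDED = "ended"
    CANCELLED = "cancelled"


# Allowed transitions, keyed by (action, current status).
ALLOWED = {
    ("start", MaintenanceStatus.SCHEDULED): MaintenanceStatus.ACTIVE,
    ("end", MaintenanceStatus.ACTIVE): MaintenanceStatus.ENDED,
    ("cancel", MaintenanceStatus.SCHEDULED): MaintenanceStatus.CANCELLED,
}


def apply_action(status, action):
    """Return the next status, or raise ValueError if the action is not allowed."""
    try:
        return ALLOWED[(action, status)]
    except KeyError:
        raise ValueError(
            f"cannot {action} a maintenance window in state {status.value}"
        ) from None
```

Under these assumptions, cancelling an active window is rejected, which matches the rule that only a scheduled (not yet started) maintenance can be cancelled.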
Data contracts
- All destructive operations require a secret code validated against a scrypt hash stored in DynamoDB.
- Vacuum all requires confirmation_phrase: "VACUUM ALL DATA PERMANENTLY".
- Vacuum has a 5-minute pending window before execution (cancel-safe).
- Only one vacuum (all or org) at a time via DynamoDB mutex with 24-hour TTL safety release.
- Dry-run mode reports what would be deleted without deleting.
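The single-vacuum rule with a 24-hour safety release can be illustrated with an in-memory stand-in for the lock record. This is a sketch only: the VacuumMutex class, its method names, and the vacuum ids are hypothetical, and a real DynamoDB-backed lock would presumably use a conditional write rather than Python attributes.

```python
import time

MUTEX_TTL_SECONDS = 24 * 60 * 60  # 24-hour safety release from the contract above


class VacuumMutex:
    """In-memory stand-in for a single lock record (hypothetical shape)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._holder = None
        self._expires_at = 0.0

    def acquire(self, vacuum_id):
        """Take the lock unless another vacuum holds an unexpired one."""
        now = self._clock()
        if self._holder is not None and now < self._expires_at:
            return False
        self._holder = vacuum_id
        self._expires_at = now + MUTEX_TTL_SECONDS
        return True

    def release(self, vacuum_id):
        """Release the lock only if the caller is the current holder."""
        if self._holder != vacuum_id:
            return False
        self._holder = None
        return True
```

Checking the expiry inside acquire means a crashed vacuum cannot block new vacuums for longer than the TTL, even if the stale record is never removed.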
Safety
- ops_main is never vacuumed.
- /g3nretailstack/ops/maintenance-audit is never deleted.
- Nothing prefixed g3nmhsadmin is ever touched.
- doc.g3nretailstack.com and mcp.g3nretailstack.com buckets are never touched.
- Maintenance check fails open (OPS infra issues do not cause global outage).
- Vacuum workers retry UnprocessedItems with exponential backoff (up to 7 attempts, 100ms to 5s cap).
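The retry rule for UnprocessedItems can be sketched as below. The attempt count, 100 ms base, and 5 s cap come from the bullet above; the doubling factor, the reading of "7 attempts" as seven total calls, and the send_batch callable (which returns whatever was left unprocessed) are assumptions.

```python
import time

MAX_ATTEMPTS = 7    # from the safety rule above
BASE_DELAY_S = 0.1  # 100 ms
MAX_DELAY_S = 5.0   # 5 s cap


def delay_for(attempt):
    """Delay after the given zero-based attempt: doubling (assumed), capped at 5 s."""
    return min(BASE_DELAY_S * 2 ** attempt, MAX_DELAY_S)


def write_with_retry(send_batch, items, sleep=time.sleep):
    """Call send_batch until nothing is left unprocessed or attempts run out.

    send_batch(items) is a hypothetical callable returning the unprocessed items.
    Returns the items that were still unprocessed after the final attempt.
    """
    pending = list(items)
    for attempt in range(MAX_ATTEMPTS):
        pending = send_batch(pending)
        if not pending:
            return []
        if attempt < MAX_ATTEMPTS - 1:
            sleep(delay_for(attempt))
    return pending
```

Returning the leftover items rather than raising lets the caller decide whether a partial vacuum should be recorded as failed or retried later.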
Configuration and defaults
- Maintenance mode is read by all other services via the ACTIVE_MAINTENANCE singleton with a 20-second cache.
- OPS is exempt from its own maintenance check.
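A consuming service's check might look roughly like the sketch below, which combines the 20-second cache with the fail-open rule. The MaintenanceCheck class and the fetch_active callable (returning the singleton record or None) are hypothetical; only the cache duration and the fail-open behaviour come from this document.

```python
import time

CACHE_TTL_SECONDS = 20  # cache duration stated above


class MaintenanceCheck:
    """Cached, fail-open reader for the ACTIVE_MAINTENANCE singleton (sketch)."""

    def __init__(self, fetch_active, clock=time.monotonic):
        self._fetch = fetch_active  # hypothetical: returns a record or None
        self._clock = clock
        self._cached = None
        self._fetched_at = None

    def in_maintenance(self):
        now = self._clock()
        stale = (
            self._fetched_at is None
            or now - self._fetched_at >= CACHE_TTL_SECONDS
        )
        if stale:
            try:
                self._cached = self._fetch()
            except Exception:
                # Fail open: an OPS read failure is treated as "no maintenance".
                self._cached = None
            self._fetched_at = now
        return self._cached is not None
```

With this shape, a service keeps serving its cached answer for up to 20 seconds after a window ends, which is why traffic resumes within that bound rather than instantly.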
Performance posture
- Ping and stat are public, no auth required.
- Maintenance list and get require session auth.
- All mutation operations are direct Lambda invocations (not API Gateway).
Governance and roles
- Maintenance and vacuum operations require the secret code (operator-only).
- Maintenance list and get are session-gated (any authenticated member can view).
- Audit records are permanent and immutable.
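As an illustration of the secret code gate, the sketch below derives and verifies a scrypt hash using Python's standard hashlib.scrypt. The cost parameters, the salt length, and the record shape (hex-encoded salt and hash) are assumptions; the source states only that the code is validated against a scrypt hash stored in DynamoDB.

```python
import hashlib
import hmac
import os

# Cost parameters are assumed for illustration; the source does not state them.
SCRYPT_PARAMS = {"n": 2 ** 14, "r": 8, "p": 1, "dklen": 32}


def hash_secret(secret, salt=None):
    """Derive a scrypt hash and return a storable record (hypothetical shape)."""
    salt = salt or os.urandom(16)
    digest = hashlib.scrypt(secret.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return {"salt": salt.hex(), "hash": digest.hex()}


def verify_secret(candidate, record):
    """Recompute the hash with the stored salt and compare in constant time."""
    salt = bytes.fromhex(record["salt"])
    digest = hashlib.scrypt(candidate.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(digest, bytes.fromhex(record["hash"]))
```

Using hmac.compare_digest avoids leaking how many leading bytes matched through timing differences.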
Relationships and data flow
- OPS reads and writes all other service DynamoDB tables during vacuum operations.
- All services read the OPS ACTIVE_MAINTENANCE singleton for maintenance mode checks.
- OPS uses USM and UAS for session auth on read endpoints.
Example scenarios and acceptance criteria
Scenario 1: Scheduled maintenance window
- An operator schedules a maintenance window for a future time.
- The maintenance sweep automatically starts it when the time arrives.
- All other services begin returning maintenance-mode responses.
- The operator ends the maintenance window and normal traffic resumes.
- Acceptance: maintenance audit log captures schedule, start, and end events; services resume within 20 seconds of end.
Scenario 2: Org-scoped vacuum with cancel
- An operator initiates a vacuum for a specific organization.
- Within the 5-minute pending window, the operator cancels it.
- Acceptance: no data is deleted; vacuum record shows cancelled status.
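Scenario 2 can be sketched as a time check against the pending window. The VacuumRecord fields, the status strings, and the helper names are assumptions for illustration; only the 5-minute window and the cancelled outcome come from this document.

```python
from dataclasses import dataclass

PENDING_WINDOW_SECONDS = 5 * 60  # 5-minute pending window


@dataclass
class VacuumRecord:
    # Field names and status strings are hypothetical.
    vacuum_id: str
    requested_at: float
    status: str = "pending"


def cancel(record, now):
    """Cancel only while the record is pending and still inside the window."""
    if record.status != "pending":
        return False
    if now - record.requested_at >= PENDING_WINDOW_SECONDS:
        return False
    record.status = "cancelled"
    return True


def ready_to_execute(record, now):
    """A vacuum may run once the window has elapsed without a cancel."""
    return (
        record.status == "pending"
        and now - record.requested_at >= PENDING_WINDOW_SECONDS
    )
```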
Scenario 3: Full vacuum with dry-run
- An operator runs a full vacuum in dry-run mode.
- The system reports what would be deleted across all services without deleting anything.
- The operator reviews and then runs the actual vacuum with the confirmation phrase.
- Acceptance: dry-run report is accurate; actual vacuum deletes reported items; audit record is permanent.
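Scenario 3 relies on the dry run and the real run selecting the same targets. A minimal sketch of that idea follows, with the protected names taken from the Safety section; the vacuum function, its report shape, and the delete callable are hypothetical, as is the orders_main table name used in the usage note.

```python
# Protected names from the Safety section of this document.
PROTECTED_EXACT = {
    "ops_main",
    "/g3nretailstack/ops/maintenance-audit",
    "doc.g3nretailstack.com",
    "mcp.g3nretailstack.com",
}
PROTECTED_PREFIXES = ("g3nmhsadmin",)


def is_protected(name):
    """True for resources a vacuum must never touch."""
    return name in PROTECTED_EXACT or name.startswith(PROTECTED_PREFIXES)


def vacuum(resources, delete, dry_run=True):
    """Select targets once; delete them only when dry_run is False.

    resources maps a resource name to an item count; delete is a
    hypothetical callable invoked once per target name.
    """
    targets = {n: c for n, c in resources.items() if not is_protected(n)}
    skipped = sorted(n for n in resources if is_protected(n))
    if not dry_run:
        for name in targets:
            delete(name)
    return {"dry_run": dry_run, "targets": targets, "skipped": skipped}
```

For example, calling vacuum({"ops_main": 3, "orders_main": 10}, delete) with the default dry_run=True reports orders_main as a target and ops_main as skipped without invoking delete. Because both modes share one selection step, the dry-run report and the actual deletion cannot diverge unless the underlying data changes between the two runs.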