Playbooks

Playbooks are standard operating procedures (SOPs) for OPS. Use calls.md for API Gateway payload shapes and RECAP.md for direct Lambda operations.

Surface availability (explicit)

  • API Gateway: Available (ping, stat, maintenance list/get).
  • Direct Lambda: Available (maintenance schedule/cancel/start/end/update, vacuum all/org/cancel/status).
  • CLI: Available (g3n ops ..., API Gateway + direct Lambdas).
  • MCP: Available.

Playbook: Maintenance window lifecycle

Goal: Schedule, execute, and end a maintenance window across all services.

Why this sequence:

  • Maintenance mode blocks API Gateway traffic across 14 services. The maintenance check fails open: if the check itself cannot complete, services keep serving traffic.
  • A controlled lifecycle ensures proper communication and recovery.

Preconditions

  • Secret code (set at initial deploy, stored as scrypt hash in DynamoDB).

SOP (happy path)

  1. Schedule maintenance (g3n ops maintenance-schedule --secret-code $CODE --description "..." --duration 3600 --start "2026-03-01T02:00:00Z").
    • Reason: creates a maintenance record in scheduled state.
  2. Start maintenance (g3n ops maintenance-start --secret-code $CODE --maintenance-id $ID).
    • Reason: activates maintenance mode; all services start returning 503.
  3. Post updates (g3n ops maintenance-update --secret-code $CODE --text "Phase 1 complete").
    • Reason: provides progress updates visible via ping and maintenance/get.
  4. End maintenance (g3n ops maintenance-end --secret-code $CODE --end-message "Complete").
    • Reason: deactivates maintenance mode; services recover once their cached maintenance state expires (20s cache TTL).
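
The four steps above can be wrapped in small shell functions so that the flags are typed once. This is a minimal sketch, not part of the g3n CLI: the function names are hypothetical, CODE is assumed to hold the secret code, and because this page does not show how maintenance-schedule reports the new maintenance ID, the operator passes the ID in by hand.

```shell
# Sketch of the maintenance lifecycle as shell functions.
# Assumptions: CODE holds the secret code; the function names are illustrative
# only; flags are copied from the SOP above.

maint_schedule() {
  # $1 = description, $2 = duration in seconds, $3 = ISO-8601 UTC start time
  g3n ops maintenance-schedule \
    --secret-code "${CODE:?set CODE to the OPS secret code}" \
    --description "$1" --duration "$2" --start "$3"
}

maint_start() {
  # $1 = maintenance ID, read by the operator from the schedule step
  g3n ops maintenance-start \
    --secret-code "${CODE:?set CODE to the OPS secret code}" \
    --maintenance-id "$1"
}

maint_update() {
  # $1 = progress text shown via ping and maintenance/get
  g3n ops maintenance-update \
    --secret-code "${CODE:?set CODE to the OPS secret code}" --text "$1"
}

maint_end() {
  # $1 = closing message
  g3n ops maintenance-end \
    --secret-code "${CODE:?set CODE to the OPS secret code}" --end-message "$1"
}
```

For example: maint_schedule "DB upgrade" 3600 "2026-03-01T02:00:00Z", then maint_start "$ID" when the window opens, maint_update "Phase 1 complete" as work progresses, and maint_end "Complete" to finish.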

Outputs

  • Maintenance record with full lifecycle history.
  • Audit log entry in /g3nretailstack/ops/maintenance-audit.

Failure modes / remediation

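  • secret-code-mismatch: verify that $CODE holds the correct secret code.
  • Services not recovering: allow the 20s cache TTL to elapse after maintenance-end. The maintenance check fails open, so if OPS infra is down, services continue operating rather than staying blocked.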
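
"Fails open" can be read as the following decision logic. This is only an illustration of the semantics and not the actual service code: maintenance_gate is a hypothetical name, and the status command passed to it is assumed to print "active" or "inactive" and to exit non-zero when OPS cannot be reached.

```shell
# Illustration of fail-open semantics; not the actual service implementation.
# "$@" = hypothetical status command that prints "active" or "inactive"
# and exits non-zero if the OPS infrastructure cannot be reached.
maintenance_gate() {
  if status=$("$@" 2>/dev/null); then
    if [ "$status" = "active" ]; then
      echo "503"      # maintenance is on: block the request
    else
      echo "allow"    # maintenance is off: serve normally
    fi
  else
    echo "allow"      # the check itself failed: fail open and keep serving
  fi
}
```

The last branch is the important one: a failed check is treated the same as "no maintenance", so an OPS outage cannot take the other services down with it.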

Playbook: Vacuum all (dry-run first)

Goal: Purge all data from the stack (development/testing reset).

Why this sequence:

  • Vacuum is destructive and irreversible. Always dry-run first.
  • A 5-minute pending window after a real run is requested allows cancellation.

Preconditions

  • Secret code.
  • No active vacuum (mutex enforced).

SOP (happy path)

  1. Dry-run (g3n ops vacuum-all --secret-code $CODE --reason "Reset" --confirmation-phrase "VACUUM ALL DATA PERMANENTLY" --dry-run).
    • Reason: reports what would be deleted without deleting.
  2. Review dry-run results (g3n ops vacuum-status --vacuum-id $VID).
  3. Execute (g3n ops vacuum-all --secret-code $CODE --reason "Reset" --confirmation-phrase "VACUUM ALL DATA PERMANENTLY").
    • Reason: starts the vacuum with a 5-minute pending window.
  4. Monitor (g3n ops vacuum-status --vacuum-id $VID).
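
The dry-run-first rule can be enforced by a wrapper that never sends the real command until the operator confirms. A minimal sketch under stated assumptions: vacuum_all and vacuum_all_guarded are hypothetical helper names, CODE holds the secret code, and the flags are the ones shown in the SOP above.

```shell
# Sketch: always dry-run first, then execute only after explicit confirmation.
# Assumptions: CODE holds the secret code; helper names are illustrative only.
PHRASE="VACUUM ALL DATA PERMANENTLY"

vacuum_all() {
  # $1 = reason; any further arguments (such as --dry-run) are passed through
  reason=$1
  shift
  g3n ops vacuum-all \
    --secret-code "${CODE:?set CODE to the OPS secret code}" \
    --reason "$reason" --confirmation-phrase "$PHRASE" "$@"
}

vacuum_all_guarded() {
  # $1 = reason
  vacuum_all "$1" --dry-run || return 1
  # The prompt goes to stderr so that stdout carries only command output.
  printf 'Review the dry-run with vacuum-status, then type "yes" to execute: ' >&2
  read -r answer
  if [ "$answer" = "yes" ]; then
    vacuum_all "$1"
  else
    echo "Aborted; nothing was deleted."
  fi
}
```

Calling vacuum_all_guarded "Reset" submits the dry-run, waits for the operator, and sends the destructive command only on an exact "yes".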

Outputs

  • Per-service deletion stats (items, objects, bytes).
  • Audit record in DynamoDB.

Failure modes / remediation

  • vacuum-mutex-locked: another vacuum is running; wait or cancel it.
  • Cancel during pending: g3n ops vacuum-cancel --secret-code $CODE --vacuum-id $VID.
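
To judge how much of the pending window is left, the arithmetic can be done locally. A small sketch, assuming the operator noted the epoch time at which the real vacuum was requested; pending_seconds_left is a hypothetical helper, and the server-side state reported by vacuum-status remains authoritative.

```shell
# Sketch: seconds remaining in the 5-minute pending window.
# Assumption: the start time was recorded by the operator (epoch seconds).
PENDING_WINDOW=300   # 5 minutes, per this playbook

pending_seconds_left() {
  # $1 = start time (epoch seconds), $2 = current time (epoch seconds)
  left=$(( $PENDING_WINDOW - ($2 - $1) ))
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}
```

For example, pending_seconds_left "$STARTED" "$(date +%s)" prints the seconds left, where STARTED is the recorded start time.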

Playbook: Vacuum org

Goal: Purge all data for a specific organization.

Preconditions

  • Secret code.
  • Valid orgcode.

SOP (happy path)

  1. Dry-run (g3n ops vacuum-org --secret-code $CODE --orgcode TESTORG --reason "Cleanup" --dry-run).
  2. Review the dry-run results, then run the same command without --dry-run to execute.
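
When several organizations need cleaning up, the dry-run step can be looped so that every report is available before any real purge. A minimal sketch: vacuum_org_dry_run is a hypothetical helper, CODE holds the secret code, and DEMOORG in the usage line is a made-up orgcode.

```shell
# Sketch: dry-run vacuum-org for one or more orgcodes; nothing is deleted.
# Assumptions: CODE holds the secret code; the helper name is illustrative only.
vacuum_org_dry_run() {
  # $1 = reason; remaining arguments = orgcodes
  reason=$1
  shift
  for org in "$@"; do
    g3n ops vacuum-org \
      --secret-code "${CODE:?set CODE to the OPS secret code}" \
      --orgcode "$org" --reason "$reason" --dry-run
  done
}
```

For example: vacuum_org_dry_run "Cleanup" TESTORG DEMOORG.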

Notes

  • Skips UAS users and USM sessions (user accounts are cross-org).
  • Skips CloudWatch log groups.

Cross-service relationships

  • 20 service tables: OPS coordinates vacuum across the entire stack.