Security & Secret Management Playbook

Last updated: 2025-11-10 · Owner: Security/DevOps

This is the single source of truth for security guidance for the Python/FastAPI + React stack.

1. Threat Model Snapshot

| Asset | Primary Risks | Mitigations |
| --- | --- | --- |
| Provider API keys (encrypted at rest) | Database theft, log leakage, memory scraping | AES-GCM encryption with out-of-band ENCRYPTION_KEY, envelope secrets backed by HSM/vault, structured logging with secrets redaction. |
| Evaluation results & datasets | Unauthorized access, tampering, deanonymization | Role-based access (future), signed exports, access logs, backups with integrity checks. |
| Control plane (FastAPI, Celery, CLI) | RCE via dependency bugs, SSRF via provider webhooks, leaking SSE channels | Keep uv.lock in sync after docs/toolchain bumps, pin dependencies, run services as non-root eval752 user, enforce request size/timeouts, restrict SSE origin via reverse proxy. |
| Infrastructure secrets | Drift between Compose/K8s, manual copy/paste errors | Store .env.prod only in vaults (AWS Secrets Manager, Doppler, etc.), inject via CI or orchestrator secret stores, never commit to repo. |
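
The "structured logging with secrets redaction" mitigation can be approximated with a standard-library logging filter. The following is a minimal sketch, not the project's actual logging setup: the names `redact` and `RedactSecretsFilter` and the regular expressions are illustrative assumptions and will need tuning to the key formats your providers issue.

```python
import logging
import re

# Hypothetical patterns: extend them to match the credential formats you handle.
SECRET_PATTERNS = [
    # NAME=value or NAME: value, where NAME contains a secret-like word.
    re.compile(
        r"(?i)([\w-]*(?:api[_-]?key|token|password|secret)[\w-]*)(\s*[=:]\s*)(\S+)"
    ),
    # HTTP bearer credentials.
    re.compile(r"(?i)\b(bearer)(\s+)([A-Za-z0-9._~+/=-]+)"),
]


def redact(text: str) -> str:
    """Replace the value part of anything that looks like a credential."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    return text


class RedactSecretsFilter(logging.Filter):
    """Logging filter that redacts the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = None  # the message is already formatted
        return True
```

Attaching the filter to each handler (`handler.addFilter(RedactSecretsFilter())`) rather than to a single logger means records propagated from child loggers are covered as well. Pattern-based redaction is a safety net and does not replace keeping secrets out of log calls in the first place.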

2. Provider Secrets Lifecycle

  1. Storage — Provider API keys are encrypted with AES-GCM before hitting Postgres. ENCRYPTION_KEY must be a 32-byte hex string shared across backend, Celery worker, and Celery beat.
  2. Rotation cadence — Rotate at least monthly or immediately after an incident. Use PATCH /providers/{id} to replace the encrypted payload atomically, then invalidate the old key at the upstream provider.
  3. Backups — Back up ENCRYPTION_KEY separately from database dumps. Store one copy in a hardware password manager and another inside your cloud secret vault. Document rotations in specs/4_notes.md.
  4. Access boundaries — Only backend and celery-worker containers should receive provider secrets. Frontend builds never embed API keys; use the in-app Provider manager instead.
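
A startup check can catch a malformed ENCRYPTION_KEY before any provider key is encrypted with it. The sketch below assumes that "32-byte hex string" means 32 bytes of key material encoded as 64 hexadecimal characters (an AES-256 key); if the backend expects a different encoding, adjust the length check. The function names are hypothetical.

```python
import secrets


def generate_encryption_key() -> str:
    """Return 32 random bytes as a 64-character hex string."""
    return secrets.token_hex(32)


def load_encryption_key(raw: str) -> bytes:
    """Decode and validate a hex-encoded key; raise ValueError if malformed.

    Error messages deliberately never include the key itself.
    """
    try:
        key = bytes.fromhex(raw.strip())
    except ValueError as exc:
        raise ValueError("ENCRYPTION_KEY is not valid hex") from exc
    if len(key) != 32:
        raise ValueError(f"ENCRYPTION_KEY must decode to 32 bytes, got {len(key)}")
    return key
```

Running the same validation in backend, Celery worker, and Celery beat makes a mismatched or truncated key fail fast at boot instead of surfacing later as a decryption error.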

3. Application Credentials Matrix

| Scope | Variable(s) | Hardening Notes |
| --- | --- | --- |
| Postgres | DATABASE_URL, POSTGRES_PASSWORD | Use strong passwords, enable TLS where available, restrict network to app subnets. |
| Redis / Celery | REDIS_URL | Prefer redis://:password@host:6379/0, disable anonymous access, rotate when running incident playbooks. |
| Hugging Face | HF_TOKEN | Inject via secrets manager; audit usage logs if leak suspected. |
| Workspace runtime policy | Settings → runtime controls | Leave provider credentials in DB only; timeout / retry / cancellation policy is now stored in the database and applied to future runs. |
| Prometheus scrape | PROMETHEUS_PASSWORD | Add basic auth or IP allow-lists when exposing /metrics beyond localhost. |
| Telemetry endpoints | SENTRY_DSN, OTEL_EXPORTER_* | Optional, but treat as secrets where applicable. |

Document all new variables in Configuration, update .env.example, and mention them in release notes.
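
The Redis guidance in the matrix (a password in the URL, no anonymous access) is easy to verify mechanically. This is a small sketch using only the standard library; `check_redis_url` is a hypothetical helper, and accepting the `rediss` (TLS) scheme is an assumption.

```python
from urllib.parse import urlsplit


def check_redis_url(url: str) -> list[str]:
    """Return a list of problems found in a REDIS_URL value (empty if none)."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("redis", "rediss"):
        problems.append(f"unexpected scheme {parts.scheme!r}")
    if not parts.password:
        # redis://host:6379/0 carries no credentials at all.
        problems.append("no password in REDIS_URL (anonymous access)")
    return problems
```

For example, `check_redis_url("redis://:password@host:6379/0")` returns an empty list, while a URL without the `:password@` part yields one finding.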

4. Network & Platform Controls

  • Segmentation — Expose FastAPI via reverse proxy (Traefik/Caddy/Nginx). Only allow inbound traffic on ports 80/443 (frontend/proxy) and 22 (SSH) where required. Block direct access to Postgres/Redis from the public internet.
  • TLS — Terminate HTTPS at the proxy or Kubernetes Ingress with managed certificates (ACM, Cert-Manager). Enforce HTTPS via 301 redirect and set HSTS (max-age ≥ 1 day during testing, ≥ 6 months in prod once stable).
  • Headers — Ensure Strict-Transport-Security, a Content-Security-Policy that limits sources to 'self' data: blob:, X-Frame-Options DENY, and Referrer-Policy strict-origin-when-cross-origin are enabled. Track header coverage in specs/4_notes.md.
  • Containers — All Docker images run as user eval752. Keep bind-mounted directories owned by this UID or writable via group perms. Enable read-only root filesystems for frontend containers if your orchestrator supports it.
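
These headers are normally set at the reverse proxy, but an application-level fallback helps when a service is reached without it. The following is a sketch of a pure ASGI middleware under stated assumptions: the class name is hypothetical, the CSP directive name is not given in the list above so `default-src` is assumed, and the HSTS value uses 15552000 seconds (180 days, roughly the six-month production target).

```python
# Assumption: default-src is the intended CSP directive; the text names only the sources.
SECURITY_HEADERS = {
    "strict-transport-security": "max-age=15552000",
    "content-security-policy": "default-src 'self' data: blob:",
    "x-frame-options": "DENY",
    "referrer-policy": "strict-origin-when-cross-origin",
}


class SecurityHeadersMiddleware:
    """Pure ASGI middleware that sets security headers on every HTTP response."""

    def __init__(self, app, headers=None):
        self.app = app
        self.headers = dict(SECURITY_HEADERS if headers is None else headers)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            # Pass websocket and lifespan traffic through untouched.
            await self.app(scope, receive, send)
            return

        async def send_with_headers(message):
            if message["type"] == "http.response.start":
                names = {name.encode("latin-1") for name in self.headers}
                # Drop any conflicting values set by the app, then append ours.
                kept = [
                    (k, v)
                    for k, v in message.get("headers", [])
                    if k.lower() not in names
                ]
                added = [
                    (k.encode("latin-1"), v.encode("latin-1"))
                    for k, v in self.headers.items()
                ]
                message = {**message, "headers": kept + added}
            await send(message)

        await self.app(scope, receive, send_with_headers)
```

Because it follows the plain ASGI interface, a class like this should be attachable to a FastAPI application with `app.add_middleware(SecurityHeadersMiddleware)`. During testing, pass a `headers` mapping with a shorter HSTS max-age (for example 86400 for one day).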

5. Operational Guardrails

  • Run uvx pre-commit run -a locally before pushing to ensure secret scanners and linters operate on the same tree CI will validate.
  • CI pipelines should retrieve all secrets from repository/environment secrets (never store plaintext values in workflow YAML).
  • Replace the placeholder .env values in any non-disposable environment before storing real provider keys.
  • Prefer infrastructure secret stores (AWS Secrets Manager, Azure Key Vault, Doppler, 1Password Connect) even when using Docker Compose; mount .env files generated by automation rather than hand-editing servers.
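
The placeholder rule above can be enforced with a small pre-deploy check. This sketch assumes a plain `NAME=value` .env format, and the marker strings are hypothetical; align them with whatever placeholder values .env.example actually ships.

```python
# Hypothetical placeholder markers; match them to the values in .env.example.
PLACEHOLDER_MARKERS = ("changeme", "change-me", "placeholder", "replace-me")


def find_placeholders(env_text: str) -> list[str]:
    """Return names of variables whose value still looks like a placeholder."""
    flagged = []
    for line in env_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        name, _, value = line.partition("=")
        value = value.strip().strip("'\"").lower()
        if any(marker in value for marker in PLACEHOLDER_MARKERS):
            flagged.append(name.strip())
    return flagged
```

Failing the deployment when the returned list is non-empty keeps default credentials out of any environment that will hold real provider keys.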

6. Incident Response Checklist

  1. Rotate ENCRYPTION_KEY (requires a maintenance window to re-encrypt provider keys). Pause Celery beat for the duration so scheduled tasks cannot read or write provider keys mid-rotation.
  2. Rotate affected provider credentials inside eval_752, then revoke upstream keys.
  3. Revoke HF/LiteLLM tokens and regenerate service accounts used in CI or automation.
  4. Rebuild Docker images, redeploy, and confirm no residual pods/containers reference the compromised secrets.
  5. Capture timelines and mitigations in specs/4_notes.md, then open/close follow-up tasks in specs/3_tasks.md.
  6. Run scripts/tests/run_docker_integration.sh --full-run to ensure baseline functionality post-incident.
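
The re-encryption in step 1 is safest as an all-or-nothing operation: compute every new ciphertext, verify it, and only then write anything back. The sketch below shows that shape only. The function name and the callables are hypothetical stand-ins for the project's real AES-GCM helpers and database access, which are not shown in this playbook.

```python
from typing import Callable, Dict


def reencrypt_provider_keys(
    rows: Dict[int, bytes],
    decrypt_old: Callable[[bytes], bytes],
    encrypt_new: Callable[[bytes], bytes],
    decrypt_new: Callable[[bytes], bytes],
) -> Dict[int, bytes]:
    """Return re-encrypted payloads keyed by provider id.

    Raises before returning anything if a single row fails, so the caller
    never persists a partially rotated set.
    """
    rotated = {}
    for provider_id, ciphertext in rows.items():
        plaintext = decrypt_old(ciphertext)
        new_ciphertext = encrypt_new(plaintext)
        # Round-trip check: the new key must recover the same plaintext.
        if decrypt_new(new_ciphertext) != plaintext:
            raise RuntimeError(f"round-trip check failed for provider {provider_id}")
        rotated[provider_id] = new_ciphertext
    return rotated
```

Writing the returned mapping back inside a single database transaction, while Celery beat and the workers are paused, keeps every row encrypted under exactly one key at any point in time.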

7. Audit Checklist (Quarterly)

  • Verify every environment uses unique Postgres + Redis credentials.
  • Confirm vault/backups contain the current ENCRYPTION_KEY.
  • Review CI configs for plaintext secrets.
  • Spot check logs to ensure secrets are redacted.
  • Run container and dependency scans (Grype/Trivy + pip-audit) and capture results.
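
For the CI review item, a rough first pass can be automated. This sketch assumes GitHub Actions-style workflows, where secrets are referenced as `${{ secrets.NAME }}`; the function name and the list of secret-like words are illustrative, and the check complements a dedicated secret scanner rather than replacing it.

```python
import re

# Assumption: GitHub Actions syntax, where secrets appear as ${{ secrets.NAME }}.
SECRET_NAME = re.compile(r"(?i)(password|token|secret|api[_-]?key|encryption_key)")
ASSIGNMENT = re.compile(r"^\s*-?\s*([\w.-]+)\s*:\s*(.+?)\s*$")


def find_plaintext_secrets(workflow_text: str) -> list[tuple[int, str]]:
    """Return (line number, key) pairs where a secret-like key has a literal value."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = ASSIGNMENT.match(line)
        if not match:
            continue  # not a "key: value" line
        key, value = match.groups()
        if SECRET_NAME.search(key) and "${{" not in value:
            findings.append((lineno, key))
    return findings
```

The function reports the key name and line number only, never the value, so its output can be pasted into audit notes without leaking the finding itself.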