Docker Integration Test Strategy

This document covers the Docker-backed integration suite for the full run lifecycle inside the containerized stack.

Objectives

  • prove API, worker, beat, PostgreSQL, Redis, and the local OpenAI-compatible test gateway cooperate correctly
  • validate run creation, execution, scoring, retries, and SSE updates
  • keep fixtures deterministic and reusable in CI and local debugging

Quick Start

scripts/tests/run_docker_integration.sh --full-run --fresh --with-beat

The helper:

  1. creates or reuses .env.integration
  2. starts postgres, redis, the integration-only local OpenAI-compatible gateway, celery-worker, and optional celery-beat
  3. runs migrations
  4. seeds deterministic provider and dataset fixtures
  5. executes pytest -m integration
  6. collects logs and coverage under .artifacts/docker-integration/

Topology

  • postgres
  • redis
  • local OpenAI-compatible test gateway from docker-compose.integration.yml (compose service: fake-openai)
  • celery-worker
  • optional celery-beat
  • backend test container

Scenario Matrix

IDScenarioNotes
INT-001smoke run completesprimary completion gate
INT-002provider failure surfacesuses an always-fail model name
INT-003SSE stream integritylimited by TestClient, covered mainly in unit and browser flows
INT-004retry + beatuses fail-once models and beat-driven retry dispatch

Test Gateway Notes

Integration runs do not depend on external API keys.

Instead, the stack seeds a local OpenAI-compatible test provider. The app still uses the normal LiteLLM client path; only the provider endpoint is local and deterministic.

Model-name conventions:

  • gpt-3.5-turbo succeeds
  • *-fail-once-* fails once, then succeeds
  • *-always-fail-* fails every time

Local Developer Workflow

scripts/tests/run_docker_integration.sh --up --with-beat
scripts/tests/run_docker_integration.sh --test integration
scripts/tests/run_docker_integration.sh --down

Use docker compose logs backend, docker compose logs celery-worker, and the artifact directory when debugging failures.