Browser Harness Development Notes

This page documents the Browser Harness v1 contracts that matter to backend, frontend, and QA contributors.

The frozen product decisions for this feature live in specs/6_browser_harness_v1.md.

Scope

Browser Harness v1 is intentionally narrow:

  • text-only
  • primary-only
  • prompt-only signed packs
  • browser capture imported as a normal run

The implementation goal is reproducible browser capture without leaking answers or checker logic to the third-party page.

Core Data Contracts

Provider

Provider now distinguishes between API-backed and browser-only targets:

API-backed provider:

{
  "surface": "api",
  "browser_target": null
}

Browser-only provider:

{
  "surface": "browser",
  "browser_target": {
    "preset": "chatgpt",
    "origin": "https://chatgpt.com",
    "display_name": "ChatGPT Web"
  }
}

Rules:

  • only surface="api" providers appear in provider management, smoke tests, the normal run launcher, and schedules
  • only Browser Harness import creates surface="browser" providers
  • browser-provider upsert is keyed by preset + origin
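
The upsert rule above can be sketched as follows. This is a minimal in-memory illustration, not the project's storage layer: the `Provider` dataclass, the `ProviderStore` class, the generated ID format, and the choice to refresh `display_name` on reuse are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Provider:
    id: str
    surface: str
    preset: str
    origin: str
    display_name: str


class ProviderStore:
    """Hypothetical in-memory stand-in for the real provider table."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], Provider] = {}

    def upsert_browser_provider(self, target: dict) -> Provider:
        # The upsert key is (preset, origin), as stated in the rules above.
        key = (target["preset"], target["origin"])
        existing = self._by_key.get(key)
        if existing is not None:
            # Assumption: reuse the row and refresh the display name.
            existing.display_name = target["display_name"]
            return existing
        provider = Provider(
            id=f"prov_browser_{len(self._by_key) + 1}",  # hypothetical ID scheme
            surface="browser",
            preset=target["preset"],
            origin=target["origin"],
            display_name=target["display_name"],
        )
        self._by_key[key] = provider
        return provider
```

Importing two captures for the same preset and origin yields one provider; changing either part of the key yields a new one.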

Run config

Run config now supports an explicit judge provider:

{
  "variation": {
    "enabled": false,
    "per_item": 0,
    "strategies": []
  },
  "judge": {
    "provider_id": "prov_judge",
    "model": "gpt-4o-mini",
    "prompt": "Return 0 or 1.",
    "source": "browser_harness"
  }
}

Worker behavior:

  • prefer config.judge.provider_id
  • fall back to the run provider for older runs
  • record judge failure details without deleting or invalidating the imported browser run
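
The worker behavior above might look roughly like this. It is a sketch under assumptions: the function names, the `judge_error` field, and the shape of the `run` dictionary are hypothetical and only illustrate the preference order and the non-destructive failure handling.

```python
from typing import Callable, List


def resolve_judge_provider_id(config: dict, run_provider_id: str) -> str:
    # Prefer config.judge.provider_id; older runs have no judge block,
    # so fall back to the run's own provider.
    judge = config.get("judge") or {}
    return judge.get("provider_id") or run_provider_id


def score_run(run: dict, judge_fn: Callable[[str, List[dict]], List[int]]) -> dict:
    provider_id = resolve_judge_provider_id(run.get("config", {}), run["provider_id"])
    try:
        run["scores"] = judge_fn(provider_id, run["items"])
    except Exception as exc:
        # Record the failure next to the run; items and status stay untouched.
        run["judge_error"] = {"provider_id": provider_id, "detail": str(exc)}
    return run
```

With this shape, a judge outage leaves the imported capture intact, so scoring can be retried later.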

REST Endpoints

POST /browser-harness/packs

Builds a prompt-only signed pack for the selected dataset scope.

The response includes:

  • dataset identity and version hash
  • filtered items with prompt_text
  • section metadata
  • scoring eligibility
  • dataset_token

The response excludes:

  • answers
  • checker logic
  • embedded assets
  • the full dataset package
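
One way to enforce the exclusions above is an allow-list projection, so that new sensitive fields are dropped by default. The sketch below is illustrative only: `prompt_text` and `dataset_item_id` appear elsewhere in these notes, while `section`, `answer`, and `checker` are hypothetical field names.

```python
# Only fields on this allow-list ever reach the pack.
PROMPT_ONLY_FIELDS = ("dataset_item_id", "prompt_text", "section")


def to_pack_item(item: dict) -> dict:
    """Project a full dataset item down to its prompt-only form."""
    return {key: item[key] for key in PROMPT_ONLY_FIELDS if key in item}


full_item = {
    "dataset_item_id": "item-1",
    "prompt_text": "What is the capital of France?",
    "section": "geography",
    "answer": "Paris",             # hypothetical field, must never leave the backend
    "checker": {"type": "exact"},  # hypothetical field, must never leave the backend
}
pack_item = to_pack_item(full_item)
```

An allow-list is chosen here over a deny-list because a field added to the dataset later is excluded until someone opts it in.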

POST /browser-harness/imports

Imports a Browser Harness capture from either:

  • an .eval752.zip archive
  • a JSON fallback payload

Import responsibilities:

  • verify dataset_token
  • verify that the dataset and its version hash still match the current workspace
  • create or reuse a browser-only provider
  • create a completed run with triggered_by = browser_harness
  • persist run items, browser metadata, and logs
  • queue runs.score
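
The first two responsibilities can be sketched as below. These notes do not specify how dataset_token is signed, so the HMAC-SHA256 scheme, the token layout, the `version_hash` payload key, and both function names are assumptions chosen purely for illustration.

```python
import base64
import hashlib
import hmac
import json


def sign_dataset_token(payload: dict, secret: bytes) -> str:
    # Assumed layout: base64url(JSON payload) + "." + hex HMAC-SHA256 signature.
    raw = json.dumps(payload, sort_keys=True).encode()
    body = base64.urlsafe_b64encode(raw).decode()
    sig = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{body}.{sig}"


def verify_dataset_token(token: str, secret: bytes, current_version_hash: str) -> dict:
    body, _, sig = token.rpartition(".")
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        raise ValueError("dataset_token signature mismatch")
    payload = json.loads(base64.urlsafe_b64decode(body.encode()))
    # Reject captures taken against a dataset version the workspace no longer has.
    if payload.get("version_hash") != current_version_hash:
        raise ValueError("dataset version no longer matches the workspace")
    return payload
```

Checking the signature before decoding the payload means untrusted bytes are never parsed unless they were produced by the pack endpoint.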

The response reports:

  • run_id
  • provider_id
  • provider_name
  • whether the dataset was reused
  • whether scoring was queued

ZIP Package Format

Runtime ZIP exports are store-only (uncompressed) archives containing:

  • manifest.json
  • browser_harness.json
  • results.jsonl
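
For tests or tooling that need to produce or check such an archive outside the browser, a store-only ZIP can be built with the Python standard library. The helper names below are hypothetical; only the three entry names come from the list above.

```python
import io
import json
import zipfile


def build_capture_zip(manifest: dict, harness: dict, results: list) -> bytes:
    """Write the three entries without compression (ZIP_STORED)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("manifest.json", json.dumps(manifest))
        zf.writestr("browser_harness.json", json.dumps(harness))
        # results.jsonl carries one JSON object per line.
        zf.writestr("results.jsonl", "".join(json.dumps(r) + "\n" for r in results))
    return buf.getvalue()


def is_store_only(data: bytes) -> bool:
    """Return True when every entry in the archive is stored uncompressed."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return all(info.compress_type == zipfile.ZIP_STORED for info in zf.infolist())
```

A check like `is_store_only` can serve as a cheap guard in import tests before the entries are parsed.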

manifest.json

{
  "format": "eval752.browser_harness.v1",
  "exported_at": "2026-03-12T18:10:00Z",
  "item_count": 12,
  "error_count": 1
}

browser_harness.json

{
  "dataset_token": "signed-token",
  "target": {
    "preset": "gemini",
    "origin": "https://gemini.google.com",
    "display_name": "Gemini Web"
  },
  "judge": {
    "provider_id": "prov_judge",
    "model": "gpt-4o-mini",
    "prompt": "Return 0 or 1."
  },
  "session": {
    "started_at": "2026-03-12T18:09:00Z",
    "finished_at": "2026-03-12T18:10:00Z"
  },
  "label": "Gemini Web Browser Harness import"
}

results.jsonl

One JSON object per captured item, one object per line (pretty-printed here for readability):

{
  "dataset_item_id": "item-1",
  "prompt_text": "What is the capital of France?",
  "response": "Paris",
  "error": null,
  "timing": {
    "started_at": "2026-03-12T18:09:02Z",
    "finished_at": "2026-03-12T18:09:05Z",
    "latency_ms": 3120
  },
  "model_label": "Gemini 2.5 Flash"
}
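
A reader for results.jsonl could look like the sketch below. It assumes that manifest.json's item_count is the number of lines and that error_count is the number of items whose error field is non-null; these notes do not state that relationship, and the function names are hypothetical.

```python
import json


def parse_results_jsonl(text: str) -> list:
    # One JSON object per non-empty line.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def summarize_results(results: list) -> dict:
    # Assumed semantics: an item counts as an error when "error" is not null.
    errors = sum(1 for r in results if r.get("error") is not None)
    return {"item_count": len(results), "error_count": errors}


sample = (
    '{"dataset_item_id": "item-1", "response": "Paris", "error": null}\n'
    '{"dataset_item_id": "item-2", "response": null, "error": "timeout"}\n'
)
summary = summarize_results(parse_results_jsonl(sample))
```

Comparing such a summary with manifest.json would let an importer reject a truncated or mismatched archive early.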

Frontend Runtime

The Browser Harness page generates:

  • a raw self-contained script
  • a bookmarklet wrapping that script

The runtime:

  • validates window.location.origin
  • shows a lightweight overlay
  • clicks new chat before each item
  • fills the composer and sends the prompt
  • waits for the response to settle, using pacing and selector rules
  • records timing, model label, and errors
  • downloads ZIP, with JSON fallback if ZIP creation fails
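
The origin check in the first step compares scheme, host, and port only. The runtime itself runs in the browser; the Python sketch below only illustrates the comparison (for example, for a server-side sanity check on target.origin). Both function names are hypothetical, and IPv6 hosts are not handled.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> str:
    """Approximate what window.location.origin reports for an http(s) URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # hostname is lowercased by urlsplit
    port = parts.port
    # Browsers omit the port from the origin when it is the scheme default.
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return f"{parts.scheme}://{host}"
    return f"{parts.scheme}://{host}:{port}"


def is_expected_origin(page_url: str, target_origin: str) -> bool:
    return origin_of(page_url) == origin_of(target_origin)
```

Paths and query strings are ignored, so any chat URL on the target site passes while a look-alike host does not.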

The last capture payload is also written to window.__EVAL752_LAST_BROWSER_HARNESS_RESULT__ for debugging.

Fixture Strategy

The repository ships three deterministic local fixtures:

  • chatgpt.html
  • gemini.html
  • custom.html

Every fixture must support:

  • new chat
  • send
  • busy / done signaling
  • assistant turn rendering
  • model label visibility

Use fixtures as the signoff target for:

  • Playwright E2E
  • playwright-interactive manual QA
  • selector debugging before touching real vendor pages

QA Expectations

Any Browser Harness change should cover:

  • backend API tests for pack/import/judge-provider selection
  • frontend tests for preflight blocking and selector validation
  • Playwright E2E for ChatGPT, Gemini, Custom, and viewport fit
  • manual browser QA using the fixture pages

The canonical manual checklist lives in docs/testing/browser-harness-signoff.md.