Browser Harness Development Notes

This page documents the Browser Harness v1 contracts that matter to backend, frontend, and QA contributors.

The frozen product decisions for this feature live in specs/6_browser_harness_v1.md.

Scope

Browser Harness v1 is intentionally narrow:

text-only
primary-only
prompt-only signed packs
browser capture imported as a normal run

The implementation goal is reproducible browser capture without leaking answers or checker logic to the third-party page.

Core Data Contracts

Provider

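Provider now distinguishes between API-backed and browser-only targets through surface and browser_target. The first example below is an API-backed provider; the second is a browser-only provider: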

{
  "surface": "api",
  "browser_target": null
}

{
  "surface": "browser",
  "browser_target": {
    "preset": "chatgpt",
    "origin": "https://chatgpt.com",
    "display_name": "ChatGPT Web"
  }
}

Rules:

only surface="api" providers appear in provider management, smoke tests, the normal run launcher, and schedules
only Browser Harness import creates surface="browser" providers
browser-provider upsert is keyed by preset + origin
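
The rules above can be sketched as a small in-memory store. This is an illustration only: ProviderStore, its method names, and the generated ids are hypothetical, and refreshing display_name when a provider is reused is an assumption rather than documented behavior.

```python
from typing import Dict, List, Optional, Tuple


class Provider:
    def __init__(self, provider_id: str, surface: str, browser_target: Optional[dict] = None):
        self.id = provider_id
        self.surface = surface  # "api" or "browser"
        self.browser_target = browser_target


class ProviderStore:
    """Hypothetical in-memory store standing in for the real persistence layer."""

    def __init__(self) -> None:
        self._providers: List[Provider] = []
        self._browser_index: Dict[Tuple[str, str], Provider] = {}

    def add_api_provider(self, provider_id: str) -> Provider:
        provider = Provider(provider_id, surface="api")
        self._providers.append(provider)
        return provider

    def upsert_browser_provider(self, target: dict) -> Provider:
        # The upsert key is preset + origin; display_name is not part of it.
        key = (target["preset"], target["origin"])
        existing = self._browser_index.get(key)
        if existing is not None:
            # Assumption: reuse keeps the id and refreshes the stored target.
            existing.browser_target = dict(target)
            return existing
        provider_id = f"prov_browser_{len(self._browser_index) + 1}"  # made-up id scheme
        provider = Provider(provider_id, "browser", dict(target))
        self._browser_index[key] = provider
        self._providers.append(provider)
        return provider

    def list_for_management(self) -> List[Provider]:
        # Management, smoke tests, the run launcher, and schedules see only API providers.
        return [p for p in self._providers if p.surface == "api"]
```

Importing two captures for the same preset and origin returns the same provider object, and neither import changes what list_for_management returns.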

Run config

Run config now supports an explicit judge provider:

{
  "variation": {
    "enabled": false,
    "per_item": 0,
    "strategies": []
  },
  "judge": {
    "provider_id": "prov_judge",
    "model": "gpt-4o-mini",
    "prompt": "Return 0 or 1.",
    "source": "browser_harness"
  }
}

Worker behavior:

prefer config.judge.provider_id
fall back to the run provider for older runs
record judge failure details without discarding the imported browser run
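
A minimal sketch of this worker behavior, assuming a run is a plain dict and the judge is an injected callable. The function names and the judge_error field are hypothetical; only config.judge.provider_id and the fallback order come from the notes above.

```python
from typing import Callable


def resolve_judge_provider_id(config: dict, run_provider_id: str) -> str:
    """Prefer config.judge.provider_id; fall back to the run provider for older runs."""
    judge = config.get("judge") or {}
    return judge.get("provider_id") or run_provider_id


def score_run(run: dict, call_judge: Callable[[str, dict], float]) -> dict:
    provider_id = resolve_judge_provider_id(run.get("config") or {}, run["provider_id"])
    try:
        run["score"] = call_judge(provider_id, run)
    except Exception as exc:
        # Keep the imported run and its items; only record what went wrong with judging.
        run["judge_error"] = f"{type(exc).__name__}: {exc}"  # hypothetical field name
    return run
```

Older runs have no judge block in their config, so the `or {}` guard sends them to the run provider instead of raising.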

REST Endpoints

`POST /browser-harness/packs`

Builds a prompt-only signed pack for the selected dataset scope.

The response includes:

dataset identity and version hash
filtered items with prompt_text
section metadata
scoring eligibility
dataset_token

The response excludes:

answers
checker logic
embedded assets
the full dataset package
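
One way to guarantee the exclusions is to build the pack from an allowlist of fields instead of deleting sensitive ones. The sketch below takes that approach. The token layout (base64url JSON claims, a dot, then a hex HMAC-SHA256) and the field names id, section, version_hash, answer, and checker are assumptions for illustration; these notes do not specify how dataset_token is signed.

```python
import base64
import hashlib
import hmac
import json

# Allowlist of per-item fields that may leave the server. prompt_text comes from
# the notes above; "id" and "section" are assumed item field names.
PACK_ITEM_FIELDS = ("id", "section", "prompt_text")


def sign_dataset_token(dataset_id: str, version_hash: str, secret: bytes) -> str:
    # Assumed layout: base64url(JSON claims) + "." + hex HMAC-SHA256 of the claims.
    claims = {"dataset_id": dataset_id, "version_hash": version_hash}
    body = json.dumps(claims, sort_keys=True).encode("utf-8")
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(body).decode("ascii") + "." + signature


def build_pack(dataset: dict, secret: bytes) -> dict:
    # Copy only allowlisted keys, so answers and checker logic cannot slip through.
    items = [
        {key: item[key] for key in PACK_ITEM_FIELDS if key in item}
        for item in dataset["items"]
    ]
    return {
        "dataset_id": dataset["id"],
        "version_hash": dataset["version_hash"],
        "items": items,
        "dataset_token": sign_dataset_token(dataset["id"], dataset["version_hash"], secret),
    }
```

With an allowlist, a field added to the dataset schema later stays out of the pack until someone explicitly adds it.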

`POST /browser-harness/imports`

Imports a Browser Harness capture from either:

.eval752.zip
JSON fallback payload

Import responsibilities:

verify dataset_token
verify dataset/version still matches the current workspace
create or reuse a browser-only provider
create a completed run with triggered_by = browser_harness
persist run items, browser metadata, and logs
queue runs.score
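
The first two responsibilities act as gates that run before anything is written. The sketch below assumes a token of the form base64url(JSON claims) + "." + hex HMAC-SHA256, with dataset_id and version_hash as claim names. That layout, the ImportRejected exception, and the current_versions mapping are all hypothetical.

```python
import base64
import hashlib
import hmac
import json


class ImportRejected(Exception):
    """Raised before any provider or run is created."""


def decode_dataset_token(token: str, secret: bytes) -> dict:
    try:
        body_b64, signature = token.split(".", 1)
        body = base64.urlsafe_b64decode(body_b64.encode("ascii"))
    except ValueError as exc:  # covers a missing ".", bad base64, and non-ASCII input
        raise ImportRejected("malformed dataset_token") from exc
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8")):
        raise ImportRejected("dataset_token signature mismatch")
    return json.loads(body)


def check_import(token: str, secret: bytes, current_versions: dict) -> dict:
    """Verify the token, then confirm the dataset version still matches the workspace."""
    claims = decode_dataset_token(token, secret)
    if current_versions.get(claims["dataset_id"]) != claims["version_hash"]:
        raise ImportRejected("dataset version no longer matches the workspace")
    return claims
```

Only after check_import returns would the import go on to upsert the provider, create the run, and queue scoring.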

The response reports:

run_id
provider_id
provider_name
whether the dataset was reused
whether scoring was queued

ZIP Package Format

Runtime ZIP exports are store-only archives (entries are stored without compression) containing:

manifest.json
browser_harness.json
results.jsonl
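
The real export is produced by the browser runtime. The Python sketch below builds an archive with the same three entries and no compression, which can be handy when preparing test input for the import endpoint. The function name write_capture_zip is hypothetical.

```python
import io
import json
import zipfile
from typing import Iterable


def write_capture_zip(manifest: dict, harness: dict, results: Iterable[dict]) -> bytes:
    buf = io.BytesIO()
    # ZIP_STORED writes entries without compression, i.e. a store-only archive.
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
        zf.writestr("browser_harness.json", json.dumps(harness, indent=2))
        # JSON Lines: one compact JSON object per line, newline-terminated.
        zf.writestr("results.jsonl", "".join(json.dumps(row) + "\n" for row in results))
    return buf.getvalue()
```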

`manifest.json`

{
  "format": "eval752.browser_harness.v1",
  "exported_at": "2026-03-12T18:10:00Z",
  "item_count": 12,
  "error_count": 1
}

`browser_harness.json`

{
  "dataset_token": "signed-token",
  "target": {
    "preset": "gemini",
    "origin": "https://gemini.google.com",
    "display_name": "Gemini Web"
  },
  "judge": {
    "provider_id": "prov_judge",
    "model": "gpt-4o-mini",
    "prompt": "Return 0 or 1."
  },
  "session": {
    "started_at": "2026-03-12T18:09:00Z",
    "finished_at": "2026-03-12T18:10:00Z"
  },
  "label": "Gemini Web Browser Harness import"
}

`results.jsonl`

One JSON object per captured item, one object per line in the file. The example is pretty-printed here for readability:

{
  "dataset_item_id": "item-1",
  "prompt_text": "What is the capital of France?",
  "response": "Paris",
  "error": null,
  "timing": {
    "started_at": "2026-03-12T18:09:02Z",
    "finished_at": "2026-03-12T18:09:05Z",
    "latency_ms": 3120
  },
  "model_label": "Gemini 2.5 Flash"
}
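
A sketch of reading a package back and cross-checking it against the manifest. It assumes that item_count equals the number of lines in results.jsonl and that error_count equals the number of items whose error is not null. Neither relationship is stated in these notes, and read_capture_zip is a hypothetical name.

```python
import io
import json
import zipfile

EXPECTED_FORMAT = "eval752.browser_harness.v1"


def read_capture_zip(data: bytes) -> dict:
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        harness = json.loads(zf.read("browser_harness.json"))
        raw = zf.read("results.jsonl").decode("utf-8")
    if manifest.get("format") != EXPECTED_FORMAT:
        raise ValueError(f"unexpected format: {manifest.get('format')!r}")
    # Split on "\n" only: str.splitlines() also breaks on characters such as
    # U+2028, which may appear unescaped inside JSON strings.
    results = [json.loads(line) for line in raw.split("\n") if line.strip()]
    error_count = sum(1 for row in results if row.get("error") is not None)
    # Assumed invariants between manifest.json and results.jsonl.
    if manifest.get("item_count") != len(results):
        raise ValueError("item_count does not match results.jsonl")
    if manifest.get("error_count") != error_count:
        raise ValueError("error_count does not match results.jsonl")
    return {"manifest": manifest, "harness": harness, "results": results}
```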

Frontend Runtime

The Browser Harness page generates:

a raw self-contained script
a bookmarklet wrapping that script

The runtime:

validates window.location.origin
shows a lightweight overlay
clicks new chat before each item
fills the composer and sends the prompt
waits for the response to settle, using pacing and selector rules
records timing, model label, and errors
downloads a ZIP, falling back to JSON if ZIP creation fails

The last capture payload is also written to window.__EVAL752_LAST_BROWSER_HARNESS_RESULT__ for debugging.

Fixture Strategy

The repository ships three deterministic local fixtures:

chatgpt.html
gemini.html
custom.html

Every fixture must support:

new chat
send
busy / done signaling
assistant turn rendering
model label visibility

Use fixtures as the signoff target for:

Playwright E2E
playwright-interactive manual QA
selector debugging before touching real vendor pages

QA Expectations

Any Browser Harness change should cover:

backend API tests for pack/import/judge-provider selection
frontend tests for preflight blocking and selector validation
Playwright E2E for ChatGPT, Gemini, Custom, and viewport fit
manual browser QA using the fixture pages

The canonical manual checklist lives in docs/testing/browser-harness-signoff.md.
