Browser Harness

Browser Harness lets you evaluate models that are only accessible through a web UI — like ChatGPT, Gemini, or any custom chat interface. It captures the model's responses from the browser and imports them into eval_752 as a normal scored run.

When to use: The model you want to test has no API. You only have access through a website.

When NOT to use: You already have an API key and provider set up. Use the normal Runs workflow instead — it's faster and more reliable.

How it works

  1. You select a dataset and target adapter in eval_752
  2. eval_752 generates a self-contained JavaScript script
  3. You run that script in the target chat page's browser console
  4. The script automates the conversation: it sends prompts one at a time, waits for each response, and records everything
  5. When done, the script downloads a .eval752.zip file
  6. You import that file back into eval_752
  7. The import creates a normal run that goes through standard scoring and comparison
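Steps 3–5 can be sketched as a simple capture loop. This is a hypothetical outline, not the actual generated script; `sendPrompt` and `waitForResponse` stand in for the adapter-specific DOM automation:

```javascript
// Minimal sketch of the harness loop (illustrative only).
// sendPrompt and waitForResponse are adapter-specific callbacks.
async function runItems(items, { sendPrompt, waitForResponse }) {
  const records = [];
  for (const item of items) {
    const startedAt = Date.now();
    await sendPrompt(item.prompt);                // type into the composer, click send
    const responseText = await waitForResponse(); // poll until the reply settles
    records.push({
      itemId: item.id,
      responseText,
      latencyMs: Date.now() - startedAt,
    });
  }
  return records; // later packaged into the .eval752.zip download
}
```

Everything stays in the page's own JavaScript context; the records array only leaves the browser as the downloaded zip.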

Nothing leaves the browser. The script runs entirely client-side and doesn't send data to any third party.

What gets exported (and what doesn't)

The generated script contains only what's needed to run prompts:

| ✅ Included | ❌ Not included |
| --- | --- |
| Dataset identity and version hash | Reference answers |
| Prompt text | Checker code |
| Item order and section metadata | Embedded assets |
| Scoring eligibility flags | Full dataset package |
| Signed dataset token | |

This means the target chat page never sees your answers or scoring logic.
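As a rough illustration, the data embedded in the generated script might look like the object below. The field names are assumptions for this sketch, not the actual eval_752 schema:

```javascript
// Hypothetical shape of the payload embedded in the generated script.
const exportPayload = {
  dataset: { id: "my-dataset", versionHash: "abc123" }, // identity + version hash
  token: "signed-dataset-token",                        // signed dataset token
  items: [
    { id: "item-1", section: "math", order: 0, scorable: true, prompt: "2 + 2 = ?" },
  ],
  // Deliberately absent: reference answers, checker code, embedded assets,
  // and the full dataset package.
};
```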

Step-by-step workflow

1. Configure the export

Open Browser Harness and set up:

  • Dataset — pick the dataset to evaluate
  • Section filter and Max items — optionally limit scope
  • Target adapter — ChatGPT, Gemini, or Custom
  • Judge provider/model — if the dataset uses LLM-as-judge scoring

Export is blocked when:

  • No items match the filter
  • Items contain embedded assets (text-only for now)
  • Judge scoring is required but no judge is configured
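The preflight rules above amount to a small check that must return no issues before export. A sketch, with illustrative names:

```javascript
// Sketch of the preflight rules that block export (names are illustrative).
function preflightIssues({ items, judgeRequired, judgeConfigured }) {
  const issues = [];
  if (items.length === 0) issues.push("No items match the filter");
  if (items.some((it) => it.hasAssets)) issues.push("Items contain embedded assets");
  if (judgeRequired && !judgeConfigured) issues.push("Judge scoring requires a configured judge");
  return issues; // export proceeds only when this list is empty
}
```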

2. Choose a target adapter

ChatGPT and Gemini have built-in selector recipes and pacing defaults. You can still override the target origin, display name, and timing.

Custom requires you to provide CSS selectors for:

  • New chat button
  • Composer text area
  • Send button
  • Assistant turn container
  • Response text

Optional selectors: model label, busy indicator, stop button.
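A custom adapter configuration might look like the sketch below. The selector values are made-up examples and must be replaced with selectors matching the target site's actual DOM; the validation helper is likewise illustrative:

```javascript
// Hypothetical custom-adapter config; selector values are examples only.
const customAdapter = {
  origin: "https://chat.example.com",
  selectors: {
    newChatButton: "button[data-testid='new-chat']",
    composer: "textarea#prompt",
    sendButton: "button[type='submit']",
    assistantTurn: "div.assistant-message",
    responseText: "div.assistant-message .markdown",
    // optional: modelLabel, busyIndicator, stopButton
  },
};

// The five required selectors; returns any that are missing.
const REQUIRED = ["newChatButton", "composer", "sendButton", "assistantTurn", "responseText"];
function missingSelectors(adapter) {
  return REQUIRED.filter((key) => !adapter.selectors[key]);
}
```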

Test with fixtures first

Use the local fixture pages linked from the Browser Harness page to validate your selectors before trying a real site. This saves time when the real site's DOM changes.

3. Generate and run the script

The page produces a raw script and a bookmarklet wrapper.

At runtime, the script:

  • Shows a lightweight overlay on the target page
  • Clicks "new chat" before each item
  • Sends one prompt at a time
  • Waits for the response to settle
  • Records timing, model label, origin, and errors
  • Downloads browser-harness-export.eval752.zip
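"Waits for the response to settle" can be understood as repeated sampling of the response text until it stops changing. A real harness would poll the DOM on an interval; this hypothetical helper only shows the settle test itself:

```javascript
// "Settled" here means: the last `quietSamples` snapshots of the response
// text are identical and non-empty. Snapshots would come from polling the
// response-text element on an interval.
function hasSettled(snapshots, quietSamples) {
  if (snapshots.length < quietSamples) return false;
  const tail = snapshots.slice(-quietSamples);
  return tail[0] !== "" && tail.every((s) => s === tail[0]);
}
```

A busy-indicator or stop-button selector, when available, gives a more direct signal than text stability alone.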

4. Import the capture

Back in eval_752, upload the .eval752.zip through the import flow. The import creates a completed run with:

  • triggered_by = browser_harness
  • Browser target metadata on the provider record
  • All captured responses stored as normal run items
  • Scoring queued through the standard worker

After import, the run appears in Runs and (once scored) in Comparison — just like any API-based run.
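Put together, the imported run might be represented roughly like this. All field names besides `triggered_by = browser_harness` are assumptions for illustration:

```javascript
// Hypothetical shape of the run record created on import (illustrative names).
const importedRun = {
  triggered_by: "browser_harness",
  provider: {
    kind: "browser_target",
    origin: "https://chatgpt.com",  // captured from the target page
    modelLabel: "example-model",    // as read from the page, if available
  },
  items: [{ itemId: "item-1", responseText: "4", latencyMs: 4200 }],
  scoringStatus: "queued",          // scored asynchronously by the standard worker
};
```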

Troubleshooting

| Problem | Solution |
| --- | --- |
| Export blocked before script generation | Read the preflight issue list — usually assets in items or missing judge config |
| "Target origin does not match" | Set the exact origin (e.g., https://chatgpt.com) |
| Script can't find page elements | Test with local fixtures first. If fixtures work but the real site doesn't, the site's DOM changed — update your selectors |
| Import fails: dataset doesn't match | The source dataset must exist in the current workspace. Re-import the dataset and try again |
| Run imported but not scored | Scoring is async. Check Runs → inspect logs. Judge provider failures are logged without discarding the imported run |