Browser Harness

Browser Harness lets you evaluate models that are only accessible through a web UI — like ChatGPT, Gemini, or any custom chat interface. It captures the model's responses from the browser and imports them into eval_752 as a normal scored run.

When to use: The model you want to test has no API. You only have access through a website.

When NOT to use: You already have an API key and provider set up. Use the normal Runs workflow instead — it's faster and more reliable.

How it works

  1. You select a dataset and target adapter in eval_752
  2. eval_752 generates a self-contained JavaScript script
  3. You run that script in the target chat page's browser console
  4. The script automates the conversation: it sends prompts one at a time, waits for each response, and records everything
  5. When done, the script downloads a .eval752.zip file
  6. You import that file back into eval_752
  7. The import creates a normal run that goes through standard scoring and comparison
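Steps 3–5 can be sketched as a simple capture loop. This is a hypothetical outline, not the actual generated script; `sendPrompt` and `waitForResponse` stand in for the adapter-specific DOM automation:

```javascript
// Minimal sketch of the harness loop (illustrative only).
// sendPrompt and waitForResponse are adapter-specific callbacks.
async function runItems(items, { sendPrompt, waitForResponse }) {
  const records = [];
  for (const item of items) {
    const startedAt = Date.now();
    await sendPrompt(item.prompt);                // type into the composer, click send
    const responseText = await waitForResponse(); // poll until the reply settles
    records.push({
      itemId: item.id,
      responseText,
      latencyMs: Date.now() - startedAt,
    });
  }
  return records; // later packaged into the .eval752.zip download
}
```

Everything stays in the page's own JavaScript context; the records array only leaves the browser as the downloaded zip.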

Nothing leaves the browser. The script runs entirely client-side and doesn't send data to any third party.

What gets exported (and what doesn't)

The generated script contains only what's needed to run prompts:

| ✅ Included | ❌ Not included |
| --- | --- |
| Dataset identity and version hash | Reference answers |
| Prompt text | Checker code |
| Item order and section metadata | Embedded assets |
| Scoring eligibility flags | Full dataset package |
| Signed dataset token | |

This means the target chat page never sees your answers or scoring logic.
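As a rough illustration, the data embedded in the generated script might look like the object below. The field names are assumptions for this sketch, not the actual eval_752 schema:

```javascript
// Hypothetical shape of the payload embedded in the generated script.
const exportPayload = {
  dataset: { id: "my-dataset", versionHash: "abc123" }, // identity + version hash
  token: "signed-dataset-token",                        // signed dataset token
  items: [
    { id: "item-1", section: "math", order: 0, scorable: true, prompt: "2 + 2 = ?" },
  ],
  // Deliberately absent: reference answers, checker code, embedded assets,
  // and the full dataset package.
};
```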

Step-by-step workflow

1. Configure the export

Open Browser Harness and set up:

  • Dataset — pick the dataset to evaluate
  • Section filter and Max items — optionally limit scope
  • Target adapter — ChatGPT, Gemini, or Custom
  • Judge provider/model — if the dataset uses LLM-as-judge scoring

Export is blocked when:

  • No items match the filter
  • Items contain embedded assets (text-only for now)
  • Judge scoring is required but no judge is configured
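The preflight rules above amount to a small check that must return no issues before export. A sketch, with illustrative names:

```javascript
// Sketch of the preflight rules that block export (names are illustrative).
function preflightIssues({ items, judgeRequired, judgeConfigured }) {
  const issues = [];
  if (items.length === 0) issues.push("No items match the filter");
  if (items.some((it) => it.hasAssets)) issues.push("Items contain embedded assets");
  if (judgeRequired && !judgeConfigured) issues.push("Judge scoring requires a configured judge");
  return issues; // export proceeds only when this list is empty
}
```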

2. Choose a target adapter

ChatGPT and Gemini have built-in selector recipes and pacing defaults. You can still override the target origin, display name, and timing.

Custom requires you to provide CSS selectors for:

  • New chat button
  • Composer text area
  • Send button
  • Assistant turn container
  • Response text

Optional selectors: model label, busy indicator, stop button.
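A custom adapter configuration might look like the sketch below. The selector values are made-up examples and must be replaced with selectors matching the target site's actual DOM; the validation helper is likewise illustrative:

```javascript
// Hypothetical custom-adapter config; selector values are examples only.
const customAdapter = {
  origin: "https://chat.example.com",
  selectors: {
    newChatButton: "button[data-testid='new-chat']",
    composer: "textarea#prompt",
    sendButton: "button[type='submit']",
    assistantTurn: "div.assistant-message",
    responseText: "div.assistant-message .markdown",
    // optional: modelLabel, busyIndicator, stopButton
  },
};

// The five required selectors; returns any that are missing.
const REQUIRED = ["newChatButton", "composer", "sendButton", "assistantTurn", "responseText"];
function missingSelectors(adapter) {
  return REQUIRED.filter((key) => !adapter.selectors[key]);
}
```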

Test with fixtures first

Use the local fixture pages linked from the Browser Harness page to validate your selectors before trying a real site. This saves time when the real site's DOM changes.

3. Generate and run the script

The page produces a raw script and a bookmarklet wrapper.

At runtime, the script:

  • Shows a lightweight overlay on the target page
  • Clicks "new chat" before each item
  • Sends one prompt at a time
  • Waits for the response to settle
  • Records timing, model label, origin, and errors
  • Downloads browser-harness-export.eval752.zip
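"Waits for the response to settle" can be understood as repeated sampling of the response text until it stops changing. A real harness would poll the DOM on an interval; this hypothetical helper only shows the settle test itself:

```javascript
// "Settled" here means: the last `quietSamples` snapshots of the response
// text are identical and non-empty. Snapshots would come from polling the
// response-text element on an interval.
function hasSettled(snapshots, quietSamples) {
  if (snapshots.length < quietSamples) return false;
  const tail = snapshots.slice(-quietSamples);
  return tail[0] !== "" && tail.every((s) => s === tail[0]);
}
```

A busy-indicator or stop-button selector, when available, gives a more direct signal than text stability alone.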

4. Import the capture

Back in eval_752, upload the .eval752.zip through the import flow. The import creates a completed run with:

  • triggered_by = browser_harness
  • Browser target metadata on the provider record
  • All captured responses stored as normal run items
  • Scoring queued through the standard worker

After import, the run appears in Runs and (once scored) in Comparison — just like any API-based run.
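Put together, the imported run might be represented roughly like this. All field names besides `triggered_by = browser_harness` are assumptions for illustration:

```javascript
// Hypothetical shape of the run record created on import (illustrative names).
const importedRun = {
  triggered_by: "browser_harness",
  provider: {
    kind: "browser_target",
    origin: "https://chatgpt.com",  // captured from the target page
    modelLabel: "example-model",    // as read from the page, if available
  },
  items: [{ itemId: "item-1", responseText: "4", latencyMs: 4200 }],
  scoringStatus: "queued",          // scored asynchronously by the standard worker
};
```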

Troubleshooting

| Problem | Solution |
| --- | --- |
| Export blocked before script generation | Read the preflight issue list — usually assets in items or missing judge config |
| "Target origin does not match" | Set the exact origin (e.g., https://chatgpt.com) |
| Script can't find page elements | Test with local fixtures first. If fixtures work but the real site doesn't, the site's DOM changed — update your selectors |
| Import fails: dataset doesn't match | The source dataset must exist in the current workspace. Re-import the dataset and try again |
| Run imported but not scored | Scoring is async. Check Runs → inspect logs. Judge provider failures are logged without discarding the imported run |