Browser Harness
Browser Harness lets you evaluate models that are only accessible through a web UI — like ChatGPT, Gemini, or any custom chat interface. It captures the model's responses from the browser and imports them into eval_752 as a normal scored run.
When to use: The model you want to test has no API. You only have access through a website.
When NOT to use: You already have an API key and provider set up. Use the normal Runs workflow instead — it's faster and more reliable.
How it works
- You select a dataset and target adapter in eval_752
- eval_752 generates a self-contained JavaScript script
- You run that script in the target chat page's browser console
- The script automates the chat loop: send one prompt at a time → wait for the response → record everything
- When done, the script downloads a .eval752.zip file
- You import that file back into eval_752
- The import creates a normal run that goes through standard scoring and comparison
Nothing leaves the browser. The script runs entirely client-side and doesn't send data to any third party.
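The loop described above can be sketched as follows. The `io` object and its hooks (`newChat`, `send`, `waitForResponse`) are hypothetical stand-ins for the DOM automation the real generated script performs; this is a sketch of the control flow, not the actual generated code:

```javascript
// Sketch of the harness loop. `io` abstracts the page-specific DOM work
// (clicking "new chat", typing into the composer, watching for output).
async function runHarness(items, io) {
  const records = [];
  for (const item of items) {
    await io.newChat();                      // fresh conversation per item
    const started = Date.now();
    await io.send(item.prompt);              // type prompt, click send
    const text = await io.waitForResponse(); // resolves when output settles
    records.push({ id: item.id, text, ms: Date.now() - started });
  }
  return records; // zipped and downloaded when all items are done
}
```

Keeping the DOM work behind a small interface like `io` is also how the built-in ChatGPT and Gemini recipes can share one loop while differing only in selectors and pacing.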
What gets exported (and what doesn't)
The generated script contains only what's needed to run prompts; expected answers and scoring logic stay behind in eval_752.
This means the target chat page never sees your answers or scoring logic.
Step-by-step workflow
1. Configure the export
Open Browser Harness and set up:
- Dataset — pick the dataset to evaluate
- Section filter and Max items — optionally limit scope
- Target adapter — ChatGPT, Gemini, or Custom
- Judge provider/model — if the dataset uses LLM-as-judge scoring
Export is blocked when:
- No items match the filter
- Items contain embedded assets (text-only for now)
- Judge scoring is required but no judge is configured
2. Choose a target adapter
ChatGPT and Gemini have built-in selector recipes and pacing defaults. You can still override the target origin, display name, and timing.
Custom requires you to provide CSS selectors for:
- New chat button
- Composer text area
- Send button
- Assistant turn container
- Response text
Optional selectors: model label, busy indicator, stop button.
Use the local fixture pages linked from the Browser Harness page to validate your selectors before trying a real site. This saves time when the real site's DOM changes.
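A Custom recipe might look like the following sketch. Every selector value here is a made-up example (the field names are illustrative too); you must supply selectors that actually match your target page's DOM:

```javascript
// Hypothetical Custom adapter recipe. Replace each selector with one
// that matches the real page -- validate against the fixture pages first.
const customRecipe = {
  newChatButton: 'button[data-testid="new-chat"]', // starts a fresh conversation
  composer: 'textarea#prompt-input',               // where prompts are typed
  sendButton: 'button[type="submit"]',             // submits the prompt
  assistantTurn: 'div.assistant-message',          // container for one reply
  responseText: 'div.assistant-message .markdown', // the reply text itself
  // Optional:
  modelLabel: 'span.model-name',                   // which model answered
  busyIndicator: '.typing-indicator',              // visible while generating
  stopButton: 'button[aria-label="Stop"]',         // halts generation
};
```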
3. Generate and run the script
The page produces a raw script and a bookmarklet wrapper.
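For reference, a bookmarklet wrapper is generally just the raw script URL-encoded into a `javascript:` URL that you save as a bookmark. The helper below is a generic sketch of that idea, not eval_752's actual generator:

```javascript
// Generic sketch: wrap a raw script body into a javascript: bookmarklet URL.
// The IIFE keeps the script's variables off the page's global scope.
function toBookmarklet(rawScript) {
  return 'javascript:' + encodeURIComponent(`(function(){${rawScript}})()`);
}
```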
At runtime, the script:
- Shows a lightweight overlay on the target page
- Clicks "new chat" before each item
- Sends one prompt at a time
- Waits for the response to settle
- Records timing, model label, origin, and errors
- Downloads browser-harness-export.eval752.zip
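"Waits for the response to settle" typically means watching the reply text until it stops changing for a quiet window. The function below is an illustrative sketch of that pattern (the names, timings, and polling approach are assumptions, not the generated script's actual internals):

```javascript
// Resolve with the response text once it has stopped changing for
// `quietMs`; reject if nothing settles before `timeoutMs`.
// `readText` is a caller-supplied function that returns the current text.
function waitForSettle(readText, { quietMs = 2000, pollMs = 250, timeoutMs = 120000 } = {}) {
  return new Promise((resolve, reject) => {
    const started = Date.now();
    let last = readText();
    let lastChange = started;
    const timer = setInterval(() => {
      const now = Date.now();
      const text = readText();
      if (text !== last) {           // still streaming: reset the quiet window
        last = text;
        lastChange = now;
      }
      if (text && now - lastChange >= quietMs) {
        clearInterval(timer);
        resolve(text);               // settled: no change for quietMs
      } else if (now - started >= timeoutMs) {
        clearInterval(timer);
        reject(new Error('response timed out'));
      }
    }, pollMs);
  });
}
```

A quiet-window check like this is why streaming replies are recorded whole rather than mid-generation; the busy-indicator selector, when provided, can end the wait more precisely.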
4. Import the capture
Back in eval_752, upload the .eval752.zip through the import flow. The import creates a completed run with:
- triggered_by = browser_harness
- Browser target metadata on the provider record
- All captured responses stored as normal run items
- Scoring queued through the standard worker
After import, the run appears in Runs and (once scored) in Comparison — just like any API-based run.
