Running and Comparing Evaluations
This is the core workflow: launch a run, watch it execute, and review the results.
If the model you're evaluating is only available through a website (not an API), use Browser Harness instead.
What's a run?
A run is one evaluation job: one provider + one model + one dataset. It stores everything — the config, each item's prompt and response, scores, latency, and any errors.
Launching a run
- Go to Runs
- Click Launch run
- Pick a provider and enter the exact model name
- Pick a dataset
- Optionally add an alias, label, or judge configuration
- Click Launch run
For your first run, keep it simple: one provider, one small dataset, no variations, no custom judge.
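If it helps to think of the form as data, the fields above map onto a config object roughly like this. The keys are hypothetical, mirroring the UI labels rather than any documented API:

```python
def build_launch_config(provider, model, dataset,
                        alias=None, label=None, judge=None):
    """Assemble a launch config mirroring the form fields.

    provider/model/dataset are required; alias, label, and judge are
    optional, matching the form. Key names are illustrative only.
    """
    config = {"provider": provider, "model": model, "dataset": dataset}
    # Include optional fields only when set, as the form does.
    for key, value in (("alias", alias), ("label", label), ("judge", judge)):
        if value is not None:
            config[key] = value
    return config
```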
The backend queues the run and the worker starts processing items. Progress streams in via server-sent events (SSE).
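SSE is a plain-text protocol: each event is one or more `data:` lines followed by a blank line. A minimal parser sketch, assuming the progress payloads are JSON (the payload shape is an assumption, not documented here):

```python
import json
from typing import Iterator

def parse_sse(lines) -> Iterator[dict]:
    """Yield one parsed JSON payload per SSE event.

    `lines` is any iterable of text lines from the event stream.
    """
    buf = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            # Accumulate data lines; one event may span several.
            buf.append(line[5:].lstrip())
        elif line == "" and buf:
            # Blank line terminates the event.
            yield json.loads("\n".join(buf))
            buf = []
```

In practice you would feed this from the response body of the run's SSE endpoint and update the progress grid from each event.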
What you'll see during a run
The Runs page has three layers:
- Active runs board — currently running and recently completed runs
- Launch sheet — form for creating new runs
- Archive — older completed runs
Each active run card shows:
- Provider, model, and dataset
- Status and elapsed time
- A progress grid showing completed items
- The current item being processed (with its question and response activity)
If items contain images and the provider supports vision, those images show up directly in the card.
Run statuses
When things go wrong
Provider timeouts, rate limits, and malformed responses are handled gracefully:
- Every failed attempt is logged on the run
- Retry metadata stays visible
- Failed items settle clearly instead of leaving the board stuck
- Worker crashes are treated as recoverable when possible
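The retry behavior described above can be sketched as a loop that records every failed attempt and settles the item one way or the other, rather than leaving it hanging. Function and field names are hypothetical:

```python
import time

def process_with_retries(call, max_attempts=3, base_delay=1.0):
    """Run `call`, retrying on failure with exponential backoff.

    Every failed attempt is logged so retry metadata stays visible,
    and the item always settles as completed or failed.
    """
    attempts = []
    for n in range(1, max_attempts + 1):
        try:
            result = call()
            return {"status": "completed", "result": result, "attempts": attempts}
        except Exception as exc:
            attempts.append({"attempt": n, "error": str(exc)})
            if n < max_attempts:
                # Back off before the next try: 1x, 2x, 4x, ...
                time.sleep(base_delay * 2 ** (n - 1))
    return {"status": "failed", "attempts": attempts}
```

Either way the caller gets a terminal status plus the full attempt history, which is what keeps the board from getting stuck.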
Viewing results
eval_752 has three surfaces for looking at results, each optimized for a different question:
Dashboard
The workspace landing page. Shows:
- Active run previews
- Recent completion metrics
- Comparison readiness (do you have enough completed runs to compare?)
- Quick actions for creating or reviewing runs
Run Inspector
Click any run to open its detail view. Each item shows:
- Prompt text
- Choices and reference answer
- Model response
- Score, latency, and error details
- Any embedded assets
This is where you debug individual items and export runs.
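If you script against exported runs, a small helper can pull out the items worth debugging. This sketch assumes exports are JSON with an `items` array, which is an assumption, not a documented format:

```python
def failed_items(export: dict) -> list[dict]:
    """Return the items from an exported run that recorded an error."""
    return [item for item in export.get("items", []) if item.get("error")]
```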
Comparison
For side-by-side analysis of completed runs. Open it when you have at least two finished runs and a concrete question, like:
- "Did provider A score better than provider B?"
- "Did the same model degrade between Monday and today?"
- "Did one section regress while the overall score stayed flat?"
Comparison shows:
- Accuracy and score summaries
- Latency distributions
- Section-level breakdowns
- Alias-based grouping
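The arithmetic behind those summaries can be sketched directly; the data shapes here are assumptions, not eval_752's storage format:

```python
def summarize(scores, latencies):
    """Accuracy and median latency for one completed run."""
    accuracy = sum(scores) / len(scores)
    ordered = sorted(latencies)
    p50 = ordered[len(ordered) // 2]
    return {"accuracy": accuracy, "p50_latency_ms": p50}

def compare(run_a, run_b):
    """Side-by-side summary of two runs, each a (scores, latencies) pair."""
    a = summarize(*run_a)
    b = summarize(*run_b)
    return {"accuracy_delta": a["accuracy"] - b["accuracy"], "a": a, "b": b}
```

A positive `accuracy_delta` answers the first question above: run A scored better than run B on this dataset.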
