Running and Comparing Evaluations
This is the core workflow: launch a run, watch it execute, and review the results.
If the model you're evaluating is only available through a website (not an API), use Browser Harness instead.
What's a run?
A run is one evaluation job: one provider + one model + one dataset. It stores everything — the config, each item's prompt and response, scores, latency, and any errors.
Launching a run
- Go to Runs
- Click Launch run
- Pick a provider and enter the exact model name
- Pick a dataset
- Optionally add an alias, label, or judge configuration
- Click Launch run
For your first run, keep it simple: one provider, one small dataset, no variations, no custom judge.
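If it helps to think of the form as data, the fields above map onto a config object roughly like this. The keys are hypothetical, mirroring the UI labels rather than any documented API:

```python
def build_launch_config(provider, model, dataset,
                        alias=None, label=None, judge=None):
    """Assemble a launch config mirroring the form fields.

    provider/model/dataset are required; alias, label, and judge are
    optional, matching the form. Key names are illustrative only.
    """
    config = {"provider": provider, "model": model, "dataset": dataset}
    # Include optional fields only when set, as the form does.
    for key, value in (("alias", alias), ("label", label), ("judge", judge)):
        if value is not None:
            config[key] = value
    return config
```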
The backend queues the run and the worker starts processing items. Progress streams in via server-sent events (SSE).
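SSE is a plain-text protocol: each event is one or more `data:` lines followed by a blank line. A minimal parser sketch, assuming the progress payloads are JSON (the payload shape is an assumption, not documented here):

```python
import json
from typing import Iterator

def parse_sse(lines) -> Iterator[dict]:
    """Yield one parsed JSON payload per SSE event.

    `lines` is any iterable of text lines from the event stream.
    """
    buf = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("data:"):
            # Accumulate data lines; one event may span several.
            buf.append(line[5:].lstrip())
        elif line == "" and buf:
            # Blank line terminates the event.
            yield json.loads("\n".join(buf))
            buf = []
```

In practice you would feed this from the response body of the run's SSE endpoint and update the progress grid from each event.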
What you'll see during a run
The Runs page has three layers:
- Active runs board — currently running and recently completed runs
- Launch sheet — form for creating new runs
- Archive — older completed runs
Each active run card shows:
- Provider, model, and dataset
- Status and elapsed time
- A progress grid showing completed items
- The current item being processed (with its question and response activity)
If items contain images and the provider supports vision, those images show up directly in the card.
Run statuses
When things go wrong
Provider timeouts, rate limits, and malformed responses are handled gracefully:
- Every failed attempt is logged on the run
- Retry metadata stays visible
- Failed items settle clearly instead of leaving the board stuck
- Worker crashes are treated as recoverable when possible
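The retry behavior described above can be sketched as a loop that records every failed attempt and settles the item one way or the other, rather than leaving it hanging. Function and field names are hypothetical:

```python
import time

def process_with_retries(call, max_attempts=3, base_delay=1.0):
    """Run `call`, retrying on failure with exponential backoff.

    Every failed attempt is logged so retry metadata stays visible,
    and the item always settles as completed or failed.
    """
    attempts = []
    for n in range(1, max_attempts + 1):
        try:
            result = call()
            return {"status": "completed", "result": result, "attempts": attempts}
        except Exception as exc:
            attempts.append({"attempt": n, "error": str(exc)})
            if n < max_attempts:
                # Back off before the next try: 1x, 2x, 4x, ...
                time.sleep(base_delay * 2 ** (n - 1))
    return {"status": "failed", "attempts": attempts}
```

Either way the caller gets a terminal status plus the full attempt history, which is what keeps the board from getting stuck.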
Viewing results
eval_752 has three surfaces for looking at results, each optimized for a different question:
Dashboard
The workspace landing page. Shows:
- Active run previews
- Recent completion metrics
- Comparison readiness (do you have enough completed runs to compare?)
- Quick actions for creating or reviewing runs
Run Inspector
Click any run to open its detail view. Each item shows:
- Prompt text
- Choices and reference answer
- Model response
- Score, latency, and error details
- Any embedded assets
This is where you debug individual items and export runs.
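If you script against exported runs, a small helper can pull out the items worth debugging. This sketch assumes exports are JSON with an `items` array, which is an assumption, not a documented format:

```python
def failed_items(export: dict) -> list[dict]:
    """Return the items from an exported run that recorded an error."""
    return [item for item in export.get("items", []) if item.get("error")]
```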
Comparison
For side-by-side analysis of completed runs. Open it when you have at least two finished runs and a concrete question, like:
- "Did provider A score better than provider B?"
- "Did the same model degrade between Monday and today?"
- "Did one section regress while the overall score stayed flat?"
Comparison shows:
- Accuracy and score summaries
- Latency distributions
- Section-level breakdowns
- Alias-based grouping
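The arithmetic behind those summaries can be sketched directly; the data shapes here are assumptions, not eval_752's storage format:

```python
def summarize(scores, latencies):
    """Accuracy and median latency for one completed run."""
    accuracy = sum(scores) / len(scores)
    ordered = sorted(latencies)
    p50 = ordered[len(ordered) // 2]
    return {"accuracy": accuracy, "p50_latency_ms": p50}

def compare(run_a, run_b):
    """Side-by-side summary of two runs, each a (scores, latencies) pair."""
    a = summarize(*run_a)
    b = summarize(*run_b)
    return {"accuracy_delta": a["accuracy"] - b["accuracy"], "a": a, "b": b}
```

A positive `accuracy_delta` answers the first question above: run A scored better than run B on this dataset.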
