Running and Comparing Evaluations

This is the core workflow: launch a run, watch it execute, and review the results.

If the model you're evaluating is only available through a website (not an API), use Browser Harness instead.

What's a run?

A run is one evaluation job: one provider + one model + one dataset. It stores everything — the config, each item's prompt and response, scores, latency, and any errors.

Launching a run

  1. Go to Runs
  2. Click Launch run
  3. Pick a provider and enter the exact model name
  4. Pick a dataset
  5. Optionally add an alias, label, or judge configuration
  6. Click Launch run

For your first run, keep it simple: one provider, one small dataset, no variations, no custom judge.

The backend queues the run and the worker starts processing items. Progress streams in via SSE.
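SSE streams are newline-delimited frames whose payloads sit on `data:` lines. A minimal client-side parser might look like this — the JSON event shape shown in the test is an assumption, not eval_752's actual wire format:

```python
import json

def parse_sse_events(stream_text: str) -> list[dict]:
    """Extract JSON payloads from Server-Sent Events text.

    SSE frames are separated by blank lines; each frame's payload
    lives on one or more lines prefixed with "data:".
    """
    events = []
    for frame in stream_text.split("\n\n"):
        data_lines = [
            line[len("data:"):].strip()
            for line in frame.splitlines()
            if line.startswith("data:")
        ]
        if data_lines:
            events.append(json.loads("\n".join(data_lines)))
    return events
```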

What you'll see during a run

The Runs page has three layers:

  • Active runs board — currently running and recently completed runs
  • Launch sheet — form for creating new runs
  • Archive — older completed runs

Each active run card shows:

  • Provider, model, and dataset
  • Status and elapsed time
  • A progress grid showing completed items
  • The current item being processed (with its question and response activity)

If items contain images and the provider supports vision, those images show up directly in the card.

Run statuses

Status      Meaning
pending     Queued, waiting for a worker to pick it up
running     Actively processing items
completed   Finished successfully
failed      Hit an unrecoverable error
canceled    Stopped before finishing
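The lifecycle above maps naturally onto an enum with a terminal-state check. This is a sketch; eval_752's internal representation may differ:

```python
from enum import Enum

class RunStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

    @property
    def is_terminal(self) -> bool:
        """completed, failed, and canceled runs never change state again."""
        return self in (RunStatus.COMPLETED, RunStatus.FAILED, RunStatus.CANCELED)
```

The terminal check is what distinguishes the active runs board (pending, running) from the archive (everything else).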

When things go wrong

Provider timeouts, rate limits, and malformed responses are handled gracefully:

  • Every failed attempt is logged on the run
  • Retry metadata stays visible
  • Failed items settle into a terminal failed state instead of leaving the board stuck
  • Worker crashes are treated as recoverable when possible
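A worker loop along these lines would produce the behavior described: every attempt is logged, and only once the retry budget is exhausted does the item settle as failed. The function names and retry policy here are illustrative, not eval_752's actual implementation:

```python
import time

def run_item_with_retries(call_provider, max_attempts=3, base_delay=1.0):
    """Call a provider function, retrying failures with exponential backoff.

    Returns (response, attempt_log); every failed attempt is recorded
    so retry metadata stays visible on the run.
    """
    attempt_log = []
    for attempt in range(1, max_attempts + 1):
        try:
            return call_provider(), attempt_log
        except Exception as exc:  # timeouts, rate limits, malformed responses
            attempt_log.append({"attempt": attempt, "error": str(exc)})
            if attempt < max_attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    return None, attempt_log  # item settles as failed; the board is not stuck
```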

Viewing results

eval_752 has three surfaces for looking at results, each optimized for a different question:

Surface      Question it answers
Dashboard    "What's happening in this workspace right now?"
Runs         "What is this specific run doing (or what did it do)?"
Comparison   "How do these completed runs differ?"

Dashboard

The workspace landing page. Shows:

  • Active run previews
  • Recent completion metrics
  • Comparison readiness (do you have enough completed runs to compare?)
  • Quick actions for creating or reviewing runs

Run Inspector

Click any run to open its detail view. Each item shows:

  • Prompt text
  • Choices and reference answer
  • Model response
  • Score, latency, and error details
  • Any embedded assets

This is where you debug individual items and export runs.
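An exported run is essentially the per-item records above serialized to a file. A hypothetical export helper, assuming a JSON format (eval_752's real export format is not specified here):

```python
import json

def export_run(items: list[dict]) -> str:
    """Serialize a run's item records to pretty-printed JSON for export."""
    return json.dumps({"items": items}, indent=2, sort_keys=True)
```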

Comparison

For side-by-side analysis of completed runs. Open it when you have at least two finished runs and a concrete question, like:

  • "Did provider A score better than provider B?"
  • "Did the same model degrade between Monday and today?"
  • "Did one section regress while the overall score stayed flat?"

Comparison shows:

  • Accuracy and score summaries
  • Latency distributions
  • Section-level breakdowns
  • Alias-based grouping
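The section-level breakdown reduces to computing per-section score deltas between two completed runs. A sketch, assuming scores are already aggregated per section (the structures here are illustrative):

```python
def compare_runs(scores_a: dict[str, float], scores_b: dict[str, float]) -> dict[str, float]:
    """Per-section score deltas (run B minus run A) for sections both runs cover.

    A positive delta means run B scored higher on that section; a mix of
    positive and negative deltas is how a section-level regression can hide
    behind a flat overall score.
    """
    shared = scores_a.keys() & scores_b.keys()
    return {section: scores_b[section] - scores_a[section] for section in shared}
```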