Viewing Results

eval_752 has three surfaces for reviewing evaluation results. Each one answers a different question.

Which page should I use?

Page         When to use it
Dashboard    "What's happening right now?" — active runs, recent completions, workspace health
Runs         "What happened in this run?" — item-level detail, logs, export
Comparison   "How do these runs differ?" — side-by-side metrics, scores, and latency

Dashboard

The workspace landing page. Designed as a control room — glance at it and know the state of things.

What it shows:

  • Active run previews — compact cards for runs in progress
  • Recent completions — latest run metrics at a glance
  • Comparison readiness — tells you when you have enough completed runs to start comparing
  • Quick actions — shortcuts to create runs or examine recent results

If the dashboard says comparison isn't ready yet, stay on the Runs page and let another run finish first.

Run Inspector

Click any run (from the active board or archive) to open the detail view.

Each item shows:

  • Prompt text — exactly what was sent to the model
  • Choices and reference answer — for MCQ items
  • Model response — what came back
  • Score — pass/fail and scoring details
  • Latency — response time for this item
  • Error details — if something went wrong

This is your debugging surface. Use it to spot patterns: Are failures clustered in one section? Is one provider consistently slower? Are responses getting truncated?
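If you pull item-level results out of a run (for example, from an exported bundle), the pattern-spotting above can be automated. This is a minimal sketch, assuming a hypothetical item schema with "section", "provider", and "passed" keys — the real export schema may differ:

```python
from collections import Counter

def failure_hotspots(items):
    """Count failures per section and per provider.

    `items` is a list of dicts with hypothetical keys "section",
    "provider", and "passed" -- adjust to the actual export schema.
    """
    by_section = Counter()
    by_provider = Counter()
    for item in items:
        if not item["passed"]:
            by_section[item["section"]] += 1
            by_provider[item["provider"]] += 1
    return by_section, by_provider
```

A section or provider that dominates both counters is a good first place to open in the inspector.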

The inspector is also where you export runs as .eval752.zip.

Comparison

Use Comparison for post-run analysis — comparing two or more completed runs side by side.

Open Comparison when you have a concrete question:

  • "Did model A score better than model B on this dataset?"
  • "Did the same provider get worse between Monday and Friday?"
  • "Did one section regress while the overall score stayed flat?"

What you'll see:

  • Accuracy summaries — overall and per-section
  • Latency distributions — how fast each provider responded
  • Score breakdowns — which sections improved or regressed
  • Alias grouping — group runs by saved model aliases
  • Retry metadata — which runs had more retried items
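The per-section score breakdown amounts to computing a pass rate for each section in each run and diffing them. A minimal sketch of that calculation, again assuming a hypothetical item schema with "section" and "passed" keys:

```python
def section_accuracy(items):
    """Per-section pass rate for one run's items.

    Assumes each item dict has hypothetical "section" and "passed"
    keys -- adjust to the actual export schema.
    """
    totals, passes = {}, {}
    for item in items:
        sec = item["section"]
        totals[sec] = totals.get(sec, 0) + 1
        passes[sec] = passes.get(sec, 0) + (1 if item["passed"] else 0)
    return {sec: passes[sec] / totals[sec] for sec in totals}

def section_deltas(run_a, run_b):
    """Accuracy change per section from run_a to run_b.

    Sections present in only one run are skipped.
    """
    a, b = section_accuracy(run_a), section_accuracy(run_b)
    return {sec: b[sec] - a[sec] for sec in a if sec in b}
```

A section with a large negative delta has regressed even if the overall score stayed flat — exactly the question the Comparison page is built to answer.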

Export

The current export format is .eval752.zip — a complete, reproducible bundle of the run and its dataset. See Exporting Results for details.
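Because the bundle is a standard zip archive, you can inspect it with stock tooling. This sketch only assumes zip packaging — it makes no claim about the bundle's internal layout, which is documented in Exporting Results:

```python
import zipfile

def bundle_members(path):
    """List the files inside a .eval752.zip bundle.

    Only assumes the bundle is a valid zip archive; the internal
    layout is not assumed here, so this just enumerates members.
    """
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```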

CSV, JSON, and direct Hugging Face Hub export are planned but not yet available.