Viewing Results

eval_752 has three surfaces for reviewing evaluation results. Each one answers a different question.

Which page should I use?

Page	When to use it
Dashboard	"What's happening right now?" — active runs, recent completions, workspace health
Runs	"What happened in this run?" — item-level detail, logs, export
Comparison	"How do these runs differ?" — side-by-side metrics, scores, and latency

Dashboard

The Dashboard is the workspace landing page. It is designed as a control room: glance at it and you know the state of your workspace.

What it shows:

Active run previews — compact cards for runs in progress
Recent completions — latest run metrics at a glance
Comparison readiness — tells you when you have enough completed runs to start comparing
Quick actions — shortcuts to create runs or examine recent results

If the dashboard says comparison isn't ready yet, stay in Runs until another run has completed.

Run Inspector

Click any run (from the active board or archive) to open the detail view.

Each item shows:

Prompt text — exactly what was sent to the model
Choices and reference answer — for MCQ items
Model response — what came back
Score — pass/fail and scoring details
Latency — response time for this item
Error details — if something went wrong

This is your debugging surface. Use it to spot patterns: Are failures clustered in one section? Is one provider consistently slower? Are responses getting truncated?
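To make the "are failures clustered in one section?" question concrete, here is a minimal sketch of the same check done by hand. The item records and their field names (`section`, `passed`) are assumptions for illustration only; they are not eval_752's actual data schema.

```python
from collections import Counter

# Hypothetical item records. The field names ("section", "passed") are
# assumptions for illustration, not eval_752's actual schema.
items = [
    {"section": "algebra", "passed": True},
    {"section": "algebra", "passed": False},
    {"section": "geometry", "passed": False},
    {"section": "geometry", "passed": False},
    {"section": "logic", "passed": True},
]

def failure_rate_by_section(items):
    """Return {section: fraction of items that failed} for one run."""
    totals = Counter()
    failures = Counter()
    for item in items:
        totals[item["section"]] += 1
        if not item["passed"]:
            failures[item["section"]] += 1
    return {section: failures[section] / totals[section] for section in totals}

rates = failure_rate_by_section(items)
# Sort so the sections with the highest failure rate come first.
for section, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{section}: {rate:.0%} failed")
```

A section whose failure rate sits well above the others is a good place to start reading individual prompts and responses in the inspector.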

The inspector is also where you export runs as .eval752.zip.

Comparison

Use Comparison for post-run analysis, when you want to look at two or more completed runs side by side.

Open Comparison when you have a concrete question:

"Did model A score better than model B on this dataset?"
"Did the same provider get worse between Monday and Friday?"
"Did one section regress while the overall score stayed flat?"

What you'll see:

Accuracy summaries — overall and per-section
Latency distributions — how fast each provider responded
Score breakdowns — which sections improved or regressed
Alias grouping — group runs by saved model aliases
Retry metadata — which runs had more retried items
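The question "did one section regress while the overall score stayed flat?" can be illustrated with a small sketch of per-section accuracy deltas. The record layout and the function names below are hypothetical, chosen only to show the arithmetic; this is not how eval_752 computes its breakdowns.

```python
from collections import defaultdict

def accuracy_by_section(items):
    """Return {section: fraction of items that passed}."""
    counts = defaultdict(lambda: [0, 0])  # section -> [passed, total]
    for item in items:
        counts[item["section"]][0] += int(item["passed"])
        counts[item["section"]][1] += 1
    return {section: p / t for section, (p, t) in counts.items()}

def overall_accuracy(items):
    """Return the fraction of all items that passed."""
    return sum(item["passed"] for item in items) / len(items)

def section_deltas(baseline, candidate):
    """Return {section: candidate accuracy minus baseline accuracy}
    for sections present in both runs."""
    base = accuracy_by_section(baseline)
    cand = accuracy_by_section(candidate)
    return {section: cand[section] - base[section] for section in base if section in cand}

# Hypothetical runs: field names are assumptions, not eval_752's schema.
run_a = [
    {"section": "algebra", "passed": True},
    {"section": "algebra", "passed": True},
    {"section": "geometry", "passed": True},
    {"section": "geometry", "passed": False},
]
run_b = [
    {"section": "algebra", "passed": True},
    {"section": "algebra", "passed": False},
    {"section": "geometry", "passed": True},
    {"section": "geometry", "passed": True},
]

deltas = section_deltas(run_a, run_b)
```

In this example both runs have the same overall accuracy, yet one section dropped and another improved by the same amount. That is the kind of shift a per-section breakdown surfaces and an overall score hides.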

Export

The current export format is .eval752.zip — a complete, reproducible bundle of the run and its dataset. See Exporting Results for details.

CSV, JSON, and direct Hugging Face Hub export are planned but not yet available.
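Because the export is a zip archive, standard tooling can inspect it. The sketch below only lists the entry names using Python's standard `zipfile` module. It makes no assumption about what files the bundle contains, since the internal layout is not described on this page; the helper name `list_bundle_entries` is hypothetical.

```python
import zipfile

def list_bundle_entries(bundle):
    """Return the sorted entry names in a zip bundle.

    `bundle` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(bundle) as archive:
        return sorted(archive.namelist())
```

For example, calling `list_bundle_entries("my-run.eval752.zip")` (a placeholder file name) returns the names of the files inside that export.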
