Viewing Results
eval_752 has three surfaces for reviewing evaluation results. Each one answers a different question.
Which page should I use?
Dashboard
The workspace landing page. Designed as a control room — glance at it and know the state of things.
What it shows:
- Active run previews — compact cards for runs in progress
- Recent completions — latest run metrics at a glance
- Comparison readiness — tells you when you have enough completed runs to start comparing
- Quick actions — shortcuts to create runs or examine recent results
If the dashboard says comparison isn't ready yet, stay in Runs and wait for another run to complete first.
Run Inspector
Click any run (from the active board or archive) to open the detail view.
The detail view lists every item in the run. Each item shows:
- Prompt text — exactly what was sent to the model
- Choices and reference answer — for MCQ items
- Model response — what came back
- Score — pass/fail and scoring details
- Latency — response time for this item
- Error details — if something went wrong
This is your debugging surface. Use it to spot patterns: Are failures clustered in one section? Is one provider consistently slower? Are responses getting truncated?
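The pattern-spotting questions above can also be asked of item data outside the UI. The following is a minimal sketch, assuming each item is available as a dictionary; the field names (`section`, `provider`, `passed`, `latency_ms`) and the sample values are hypothetical and do not reflect a documented eval_752 schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical item records; field names are assumptions, not the eval_752 schema.
items = [
    {"section": "algebra", "provider": "provider_a", "passed": True, "latency_ms": 420},
    {"section": "algebra", "provider": "provider_a", "passed": False, "latency_ms": 510},
    {"section": "geometry", "provider": "provider_b", "passed": False, "latency_ms": 1380},
    {"section": "geometry", "provider": "provider_b", "passed": False, "latency_ms": 1240},
]


def failures_by_section(records):
    """Count failed items per section to see whether failures cluster."""
    counts = defaultdict(int)
    for record in records:
        if not record["passed"]:
            counts[record["section"]] += 1
    return dict(counts)


def mean_latency_by_provider(records):
    """Average latency in milliseconds for each provider."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["provider"]].append(record["latency_ms"])
    return {provider: mean(values) for provider, values in buckets.items()}


print(failures_by_section(items))
print(mean_latency_by_provider(items))
```

A section with a disproportionate failure count, or a provider whose mean latency stands well apart from the others, is a reasonable place to start reading individual items in the inspector.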
The inspector is also where you export runs as .eval752.zip.
Comparison
Use Comparison for post-run analysis, when you want to view two or more completed runs side by side.
Open Comparison when you have a concrete question:
- "Did model A score better than model B on this dataset?"
- "Did the same provider get worse between Monday and Friday?"
- "Did one section regress while the overall score stayed flat?"
What you'll see:
- Accuracy summaries — overall and per-section
- Latency distributions — how fast each provider responded
- Score breakdowns — which sections improved or regressed
- Alias grouping — group runs by saved model aliases
- Retry metadata — which runs had more retried items
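To make the "one section regressed while the overall score stayed flat" case concrete, here is a small sketch of the arithmetic behind a per-section breakdown. It is an illustration only, not how eval_752 computes its summaries; the record layout and the two sample runs are hypothetical.

```python
from collections import defaultdict


def section_accuracy(records):
    """Return {section: fraction of items that passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["section"]] += 1
        passed[record["section"]] += int(record["passed"])
    return {section: passed[section] / total[section] for section in total}


def overall_accuracy(records):
    """Fraction of all items that passed."""
    return sum(int(record["passed"]) for record in records) / len(records)


def section_deltas(baseline, candidate):
    """Accuracy change per section (candidate minus baseline), shared sections only."""
    base = section_accuracy(baseline)
    cand = section_accuracy(candidate)
    return {section: cand[section] - base[section] for section in base if section in cand}


# Hypothetical runs over the same four items.
run_a = [
    {"section": "algebra", "passed": True},
    {"section": "algebra", "passed": True},
    {"section": "geometry", "passed": False},
    {"section": "geometry", "passed": False},
]
run_b = [
    {"section": "algebra", "passed": True},
    {"section": "algebra", "passed": False},
    {"section": "geometry", "passed": True},
    {"section": "geometry", "passed": False},
]

print(overall_accuracy(run_a), overall_accuracy(run_b))
print(section_deltas(run_a, run_b))
```

In this sample both runs have the same overall accuracy, yet one section drops and another rises by the same amount, which is exactly the kind of movement an overall number hides and a per-section view exposes.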
Export
The current export format is .eval752.zip — a complete, reproducible bundle of the run and its dataset. See Exporting Results for details.
CSV, JSON, and direct Hugging Face Hub export are planned but not yet available.
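If you want to check what an exported bundle contains before sharing it, a short script can list its entries. This sketch assumes the .eval752.zip file is a standard ZIP archive, as the extension suggests; it makes no assumption about the file names or layout inside the bundle, and the function name `list_bundle` is hypothetical.

```python
import zipfile


def list_bundle(path):
    """Return (entry name, uncompressed size in bytes) for every entry in the archive."""
    with zipfile.ZipFile(path) as bundle:
        return [(info.filename, info.file_size) for info in bundle.infolist()]
```

Calling `list_bundle("example_run.eval752.zip")` (a placeholder file name) returns one tuple per entry, which you can print or compare against another export.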
