User Guide

This guide covers the tasks you'll do repeatedly in eval_752: setting up providers, importing datasets, running evaluations, and comparing results.

If you're new, work through these in order — each step builds on the previous one:

  1. Providers — Connect your API endpoints and verify they work
  2. Datasets — Import or build the question sets you'll evaluate against
  3. Running Evaluations — Launch runs and monitor progress
  4. Viewing Results — Understand what the Dashboard, Runs, and Comparison pages show
  5. Exporting Results — Save and share your evidence
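The five steps above can be pictured as one loop: a provider answers questions, a dataset supplies the questions and expected answers, a run scores the replies, and comparison puts two runs side by side. The sketch below is illustrative only — eval_752's actual API is not shown in this guide, so every name here (run_eval, exact_match, the dict shapes) is hypothetical.

```python
# Illustrative sketch only -- not eval_752's real API. All names are hypothetical.

def exact_match(expected: str, actual: str) -> float:
    """Score 1.0 if the answers match exactly, else 0.0."""
    return 1.0 if expected.strip() == actual.strip() else 0.0

def run_eval(provider, dataset):
    """Step 3: send each question to a provider and score each reply."""
    scores = [exact_match(row["answer"], provider(row["question"]))
              for row in dataset]
    return sum(scores) / len(scores)  # mean accuracy for the run

# Step 1: a "provider" is anything that maps a question to an answer.
provider_a = lambda q: "4" if q == "2 + 2?" else "unknown"
provider_b = lambda q: "unknown"

# Step 2: a dataset is a list of question / expected-answer pairs.
dataset = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
]

# Steps 3-4: launch a run per provider, then compare the results.
results = {
    "provider_a": run_eval(provider_a, dataset),
    "provider_b": run_eval(provider_b, dataset),
}
print(results)
```

Step 5 (exporting) would then just be serializing `results` to a file. The point of the sketch is the shape of the loop, not the scoring: real evaluations swap in richer scorers and real endpoints.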

Once the core workflow feels solid, explore:

  1. Browser Harness — Evaluate models that only exist behind a web UI
  2. Scheduled Evaluations — Automate recurring checks
  3. Advanced Features — Judge scoring, variants, and what's coming next

Tip: Don't jump into schedules or the Browser Harness until you can run and compare a normal evaluation end-to-end. Debugging too many moving parts at once is frustrating.