Core Concepts

Before you start running evaluations, it helps to understand a few ideas that come up repeatedly.

These pages are not required reading to use eval_752 — the User Guide is task-oriented and gets you running immediately. But when you want to understand why something works a certain way, this is where to look.

In This Section

  • Evaluation Types — What kinds of tasks can eval_752 score, and when does each type make sense?
  • Scoring Methods — Programmatic matching vs. LLM-as-judge vs. pairwise arena — tradeoffs and when to pick each.
  • Dataset Format — How prompts, answers, and metadata are structured inside eval_752.
  • Reproducibility — What .eval752.zip bundles capture, what they don't, and how to think about evidence quality.
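
The tradeoff between programmatic matching and LLM-as-judge scoring, mentioned above, can be sketched in a few lines. This is a generic illustration, not eval_752's actual API — the function names and the stub judge are invented for this example:

```python
# Hypothetical sketch (NOT eval_752's API): the core difference between
# programmatic matching and LLM-as-judge scoring.

def exact_match(prediction: str, reference: str) -> float:
    """Programmatic matching: cheap and deterministic, but brittle to phrasing."""
    return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0

def judge_score(prediction: str, reference: str, judge) -> float:
    """LLM-as-judge: tolerant of phrasing, but slower, costlier, and noisier.
    `judge` is any callable returning a score in [0, 1]; here it is stubbed."""
    return judge(prediction, reference)

# Trivial stand-in judge: credits any answer that contains the reference.
def stub_judge(pred: str, ref: str) -> float:
    return 1.0 if ref.lower() in pred.lower() else 0.0

print(exact_match("Paris", "paris"))                              # 1.0
print(exact_match("The capital is Paris.", "Paris"))              # 0.0 — brittle
print(judge_score("The capital is Paris.", "Paris", stub_judge))  # 1.0
```

The brittleness shown on the second line is exactly why the Scoring Methods page weighs these approaches against each other rather than recommending one universally.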