Evaluation Types

eval_752 supports several question formats. Each type determines how items are structured in a dataset and how they are scored.

At a Glance

| Type | What it tests | Scoring | Example benchmarks |
|---|---|---|---|
| MCQ (single) | Pick one correct answer from a list | Exact match (A/B/C/D) | MMLU, ARC, HellaSwag |
| MCQ (multiple) | Pick all correct answers | Partial credit (set F1) | Custom knowledge tests |
| Freeform text | Open-ended generation | LLM-as-judge or regex | Writing tasks, explanations |
| Code generation | Write working code | Test execution (pass/fail) | HumanEval, MBPP |
| Pairwise (arena) | Compare two responses | Bradley-Terry / Elo ranking | Quality comparisons |

MCQ — Single Choice

The most common type. The model sees a question and a list of choices (A/B/C/D). Scoring extracts the model's letter choice and compares it to the reference answer.

When to use it: Standard benchmarks like MMLU, ARC, HellaSwag — anything where there's exactly one right answer from a fixed set.

How scoring works: The scorer extracts a letter (A, B, C, D, etc.) from the model output and checks it against the reference. Parsing handles common formatting variations like "The answer is B" or "(B)".

MCQ — Multiple Choice

Same as single-choice, but there can be more than one correct answer. Scoring gives partial credit based on how many correct answers were selected.

When to use it: Knowledge tests where multiple options can be correct simultaneously. Less common in standard benchmarks, but useful for custom evaluations.

How scoring works: The scorer extracts all selected letters and computes set-level F1 against the reference set. Selecting two of three correct answers scores higher than selecting one, but wrong selections are also penalized through lower precision.
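Set-level F1 balances recall (how many correct answers were found) against precision (how many selections were correct). A minimal sketch, not the exact implementation:

```python
def set_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 between predicted and reference answer sets: each correct
    selection helps recall, each wrong one hurts precision."""
    if not predicted or not reference:
        return 0.0
    tp = len(predicted & reference)   # correctly selected answers
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For a reference set {A, B, C}, predicting {A, B} scores 0.8, while predicting just {A} scores 0.5 and predicting {A, D} drops to 0.4.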

Freeform Text

The model generates an open-ended response. There are no predefined choices.

When to use it: Creative writing, explanations, summaries, translation — any task where the "right answer" isn't a fixed string.

How scoring works: Two options:

  • Regex / keyword matching: For responses that need to contain specific content
  • LLM-as-judge: A separate LLM evaluates whether the response meets the criteria. The judge scores 0 (fail) or 1 (pass) against a rubric you provide

Judge scoring is more flexible but costs extra API tokens and is non-deterministic.
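The regex/keyword option can be sketched as a simple binary check. The function name and pass/fail convention here are illustrative assumptions, not eval_752's API:

```python
import re

def keyword_score(response: str,
                  required: list[str],
                  forbidden: list[str] = ()) -> int:
    """Binary pass/fail: every required pattern must appear in the
    response and no forbidden pattern may appear. Patterns are
    regexes, matched case-insensitively."""
    def hit(pattern: str) -> bool:
        return re.search(pattern, response, re.IGNORECASE) is not None

    if all(hit(p) for p in required) and not any(hit(p) for p in forbidden):
        return 1
    return 0
```

Because matching is deterministic and free, this option is preferable whenever the pass criterion can be expressed as "the response must mention X (and must not mention Y)".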

Code Generation

The model writes code, and scoring checks whether that code meets the task's specification.

When to use it: Programming benchmarks like HumanEval or MBPP, or custom tasks where the deliverable is runnable code.

How scoring works: Currently, code scoring works through the judge pathway — an LLM evaluates whether the generated code meets the specification. Full sandboxed execution (running the code and checking test cases) is planned for a future release.
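A hypothetical sketch of that judge pathway: a prompt template pairs the specification with the candidate code, and the judge's reply is mapped onto the binary 0/1 score. The template wording and the PASS/FAIL protocol below are assumptions for illustration, not eval_752's actual prompts.

```python
# Hypothetical judge prompt; "---" fences the candidate code.
JUDGE_TEMPLATE = """You are grading generated code against a specification.

Specification:
{spec}

Candidate code:
---
{code}
---

Does the code satisfy the specification? Reply with exactly PASS or FAIL,
then a one-line justification."""

def build_judge_prompt(spec: str, code: str) -> str:
    return JUDGE_TEMPLATE.format(spec=spec, code=code)

def parse_verdict(judge_output: str) -> int:
    """Map the judge's reply onto the binary 0/1 score."""
    return 1 if judge_output.strip().upper().startswith("PASS") else 0
```

Since the judge never runs the code, it can miss runtime bugs that sandboxed test execution would catch, which is why execution-based scoring is the planned replacement.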

Pairwise — Arena Mode

Two models respond to the same prompt, and a judge picks the better response. Over many comparisons, a ranking emerges.

When to use it: Measuring relative quality when there's no single "correct" answer. Good for subjective tasks like helpfulness, writing quality, or instruction following.

How scoring works: A judge LLM (or human) picks A, B, or TIE. Results aggregate into Bradley-Terry ratings, displayed as Elo-style scores with confidence intervals.
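The aggregation step can be sketched with a standard Bradley-Terry fit over pairwise win counts, converted to an Elo-style scale. This is a generic minorization-maximization implementation under assumed conventions (ties recorded by the caller as half a win to each side; ratings anchored at 1000), not eval_752's exact math:

```python
import math

def bradley_terry(wins: dict[tuple[str, str], float],
                  iters: int = 200) -> dict[str, float]:
    """Fit Bradley-Terry strengths from pairwise results, where
    wins[(a, b)] is how many times a beat b, then map them onto an
    Elo-style scale (400-point logistic, anchored at 1000)."""
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        new = {}
        for p in players:
            total_wins = sum(w for (a, _), w in wins.items() if a == p)
            denom = sum(w / (strength[a] + strength[b])
                        for (a, b), w in wins.items() if p in (a, b))
            new[p] = total_wins / denom if denom else strength[p]
        mean = sum(new.values()) / len(new)          # keep scale fixed
        strength = {p: s / mean for p, s in new.items()}
    return {p: 1000 + 400 * math.log10(s) for p, s in strength.items()}
```

With more comparisons per pair, the confidence intervals around these ratings tighten, which is why arena rankings need many head-to-head samples to stabilize.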

Arena is coming soon

Arena mode is under active development. The evaluation infrastructure exists, but the full UI for pairwise ranking and leaderboard publication has not shipped yet. Use standard runs and comparisons for now.

Choosing the Right Type

  • Have a clear right answer? → MCQ (single or multiple)
  • Open-ended but with concrete criteria? → Freeform + LLM judge
  • Comparing two models head-to-head? → Pairwise arena
  • Testing code output? → Code generation

For how to structure these items in a dataset file, see Dataset Format.