Scoring Methods

eval_752 uses three scoring approaches. Each has different tradeoffs in cost, determinism, and flexibility.

Quick Comparison

Method             Cost         Deterministic?   Best for
Programmatic       Free         ✅ Yes           MCQ, structured answers
LLM-as-judge       API tokens   ❌ No            Freeform text, nuanced criteria
Arena (pairwise)   API tokens   ❌ No            Relative quality, subjective tasks

Programmatic Scoring

Rule-based scoring that runs locally. No API calls needed.

How it works:

  • Exact match: Compares extracted answer to reference (case-insensitive, whitespace-normalized)
  • Set match / F1: For multi-select MCQ — compares answer sets and calculates partial credit
  • Regex patterns: For freeform responses that need to match a specific pattern
  • Numeric tolerance: For math tasks where 3.14 and 3.141 should both pass
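
The matching rules above can be sketched in a few lines of Python. This is a minimal illustration; the function names are hypothetical, not eval_752's actual API:

```python
import math

def exact_match(answer: str, reference: str) -> bool:
    # Case-insensitive, whitespace-normalized comparison
    norm = lambda s: " ".join(s.lower().split())
    return norm(answer) == norm(reference)

def set_f1(answer: set, reference: set) -> float:
    # Partial credit for multi-select MCQ via F1 overlap
    tp = len(answer & reference)
    if not tp:
        return 0.0
    precision = tp / len(answer)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

def numeric_match(answer: str, reference: float, tol: float = 1e-2) -> bool:
    # Accept answers within a relative tolerance, e.g. 3.14 vs 3.141
    try:
        return math.isclose(float(answer), reference, rel_tol=tol)
    except ValueError:
        return False
```

For example, `set_f1({"A", "B"}, {"A", "C"})` gives 0.5: one correct selection out of two chosen, one missed.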

When to use it: Whenever the "right answer" can be defined precisely. MCQ benchmarks, math tasks, factual lookups. This is the default and fastest scoring path.

Tradeoffs: Can't handle nuance. If the model gives a correct answer in an unexpected format, programmatic scoring might mark it wrong.

LLM-as-Judge

Uses another LLM to evaluate whether a response meets a given rubric.

How it works:

  1. eval_752 sends the question, the model's response, and a rubric to a judge LLM
  2. The judge returns a binary score: 1 (pass) or 0 (fail)
  3. The rubric can be a reference answer, a set of criteria, or both

The judge prompt is strict by design — it outputs only 1 or 0, not explanations. This makes results easier to aggregate and compare.
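
The prompt-and-parse cycle can be sketched as follows. The prompt wording and helper names here are illustrative assumptions, not eval_752's actual internals; the LLM call itself is omitted:

```python
def build_judge_prompt(question: str, response: str, rubric: str) -> str:
    # Strict rubric-based prompt: the judge must output only "1" or "0"
    return (
        "You are grading a model response.\n"
        f"Question: {question}\n"
        f"Response: {response}\n"
        f"Rubric: {rubric}\n"
        "Output 1 if the response satisfies the rubric, else 0. "
        "Output only the digit, with no explanation."
    )

def parse_judge_score(raw: str) -> int:
    # Tolerate stray whitespace; anything other than a clean 1/0 is rejected
    text = raw.strip()
    if text == "1":
        return 1
    if text == "0":
        return 0
    raise ValueError(f"Unparseable judge output: {raw!r}")
```

Failing loudly on malformed judge output (rather than guessing) keeps aggregate scores trustworthy.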

When to use it: Freeform text tasks where there's no single correct string — summaries, explanations, translations, creative writing. Also useful when "correct" means "meets these criteria" rather than "matches this exact answer."

Tradeoffs:

  • Costs money: Each judged item makes an API call to the judge provider
  • Non-deterministic: The same response might get scored differently on different runs (though in practice, well-written rubrics give consistent results)
  • Judge bias: The judge model has its own biases. Using a different judge model can change scores

You can configure a judge provider separately from the model being evaluated. This lets you use (for example) GPT-4o as a judge while evaluating a different model.

Arena (Pairwise Comparison)

Compares two model responses side by side. A judge picks the better one or calls it a tie.

How it works:

  1. Two models receive the same prompt
  2. A judge LLM sees both responses (anonymized as "Answer A" and "Answer B")
  3. The judge picks [[A]], [[B]], or [[TIE]]
  4. Over many comparisons, scores aggregate into Bradley-Terry (Elo-style) rankings
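
The verdict step can be sketched like this. The `[[A]]`/`[[B]]`/`[[TIE]]` markers match the format described above; the function names are illustrative:

```python
import re

def parse_verdict(raw: str) -> str:
    # Extract the last [[A]], [[B]], or [[TIE]] marker from the judge output
    matches = re.findall(r"\[\[(A|B|TIE)\]\]", raw)
    if not matches:
        raise ValueError(f"No verdict marker in judge output: {raw!r}")
    return matches[-1]

def score_pair(verdict: str) -> tuple[float, float]:
    # Convert a verdict into per-model scores (win = 1, tie = 0.5, loss = 0)
    return {"A": (1.0, 0.0), "B": (0.0, 1.0), "TIE": (0.5, 0.5)}[verdict]
```

Taking the last marker means a judge that reasons aloud before committing ("Answer A is clearer... [[B]]") is still parsed by its final verdict.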

When to use it: When you care about relative quality rather than absolute correctness. "Is model A better than model B at writing emails?" can't be answered with exact match — but pairwise comparison can.

Tradeoffs:

  • Expensive: Each comparison needs responses from two models plus one judge call, roughly triple the API usage of a single-model run
  • Needs volume: A handful of comparisons isn't statistically meaningful. You need enough items for confidence intervals to tighten
  • Position bias: Judges may prefer whichever answer appears first. The Bradley-Terry aggregation mitigates this over many items
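
The aggregation step can be illustrated with a minimal Elo-style rating update; randomizing which answer is shown first is one common mitigation for position bias. This is a sketch under those assumptions, not eval_752's exact algorithm:

```python
import random

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 16.0):
    # score_a: 1 for a model-A win, 0.5 for a tie, 0 for a loss
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def run_arena(items, judge, seed: int = 0):
    # items: list of (response_a, response_b) pairs for the same prompts
    # judge: callable(first, second) -> "A", "B", or "TIE"
    rng = random.Random(seed)
    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    for resp_a, resp_b in items:
        # Randomize presentation order per item to mitigate position bias
        flipped = rng.random() < 0.5
        first, second = (resp_b, resp_a) if flipped else (resp_a, resp_b)
        verdict = judge(first, second)
        score_first = {"A": 1.0, "B": 0.0, "TIE": 0.5}[verdict]
        score_a = 1.0 - score_first if flipped else score_first
        ratings["model_a"], ratings["model_b"] = elo_update(
            ratings["model_a"], ratings["model_b"], score_a
        )
    return ratings
```

Because each update moves the two ratings by equal and opposite amounts, the rating gap (not the absolute numbers) is what carries meaning, and it only stabilizes once enough comparisons have accumulated.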

Coming soon

Arena mode is under active development. The pairwise scoring pipeline exists, but the full leaderboard UI is not yet available.

Which Method Should I Use?

Is there a single correct answer?
├── Yes → Programmatic scoring
└── No
    ├── Am I comparing the same prompt across models? → LLM-as-judge
    └── Am I comparing two models head-to-head? → Arena (pairwise)

For most users starting out: programmatic scoring with MCQ datasets is the fastest, cheapest, and most deterministic way to compare providers.

See Running Evaluations for how to configure scoring when you launch a run.