Scoring Methods
eval_752 uses three scoring approaches. Each has different tradeoffs in cost, determinism, and flexibility.
Quick Comparison

| Method | Cost | Deterministic | Best for |
| --- | --- | --- | --- |
| Programmatic | Free (runs locally) | Yes | MCQ, math, exact answers |
| LLM-as-Judge | One judge API call per item | No | Freeform text scored against a rubric |
| Arena (pairwise) | ~3x a normal scored run | No | Relative rankings between models |
Programmatic Scoring
Rule-based scoring that runs locally. No API calls needed.
How it works:
- Exact match: Compares extracted answer to reference (case-insensitive, whitespace-normalized)
- Set match / F1: For multi-select MCQ — compares answer sets and calculates partial credit
- Regex patterns: For freeform responses that need to match a specific pattern
- Numeric tolerance: For math tasks where `3.14` and `3.141` should both pass
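The matching rules above can be sketched in a few lines of Python. This is an illustrative sketch, not eval_752's actual API — the function names, normalization, and tolerance default are assumptions:

```python
import re

def exact_match(answer: str, reference: str) -> bool:
    # Case-insensitive, whitespace-normalized comparison
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return norm(answer) == norm(reference)

def set_f1(answer: set, reference: set) -> float:
    # Partial credit for multi-select MCQ: F1 over the two answer sets
    tp = len(answer & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(answer)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

def numeric_match(answer: str, reference: float, tol: float = 1e-2) -> bool:
    # Pull the first number out of the response, compare within tolerance
    m = re.search(r"-?\d+(?:\.\d+)?", answer)
    return m is not None and abs(float(m.group()) - reference) <= tol
```

With these rules, `numeric_match("The answer is 3.141", 3.14)` passes because the extracted value is within tolerance, while an answer that is off by more than `tol` fails.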
When to use it: Whenever the "right answer" can be defined precisely. MCQ benchmarks, math tasks, factual lookups. This is the default and fastest scoring path.
Tradeoffs: Can't handle nuance. If the model gives a correct answer in an unexpected format, programmatic scoring might mark it wrong.
LLM-as-Judge
Uses another LLM to evaluate whether a response meets a given rubric.
How it works:
- eval_752 sends the question, the model's response, and a rubric to a judge LLM
- The judge returns a binary score: `1` (pass) or `0` (fail)
- The rubric can be a reference answer, a set of criteria, or both
The judge prompt is strict by design — it outputs only 1 or 0, not explanations. This makes results easier to aggregate and compare.
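A minimal sketch of what a strict binary judge flow might look like. The prompt template and parser below are hypothetical, not eval_752's actual implementation — the point is that a verdict-only output format makes parsing trivial and failures explicit:

```python
# Hypothetical judge prompt: verdict only, no explanation
JUDGE_PROMPT = """You are a strict evaluator.
Question: {question}
Response: {response}
Rubric: {rubric}
Output only 1 (pass) or 0 (fail). Output nothing else."""

def parse_judge_score(raw: str) -> int:
    # Anything other than a bare "1" or "0" is a scoring failure,
    # not a guess -- strictness keeps aggregation honest.
    verdict = raw.strip()
    if verdict not in ("0", "1"):
        raise ValueError(f"unexpected judge output: {raw!r}")
    return int(verdict)
```

Raising on malformed output (rather than defaulting to a fail score) keeps judge errors visible in the results instead of silently skewing them.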
When to use it: Freeform text tasks where there's no single correct string — summaries, explanations, translations, creative writing. Also useful when "correct" means "meets these criteria" rather than "matches this exact answer."
Tradeoffs:
- Costs money: Each judged item makes an API call to the judge provider
- Non-deterministic: The same response might get scored differently on different runs (though in practice, well-written rubrics give consistent results)
- Judge bias: The judge model has its own biases. Using a different judge model can change scores
You can configure a judge provider separately from the model being evaluated. This lets you use (for example) GPT-4o as a judge while evaluating a different model.
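As an example of separating the two roles, a run configuration might look like this. The key names and structure are a hypothetical sketch, not eval_752's actual schema:

```python
# Hypothetical run config: the judge provider is independent
# of the model under evaluation
run_config = {
    "model": {"provider": "anthropic", "name": "claude-sonnet"},
    "judge": {"provider": "openai", "name": "gpt-4o"},
}
```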
Arena — Pairwise Comparison
Compares two model responses side by side. A judge picks the better one or calls it a tie.
How it works:
- Two models receive the same prompt
- A judge LLM sees both responses (anonymized as "Answer A" and "Answer B")
- The judge picks `[[A]]`, `[[B]]`, or `[[TIE]]`
- Over many comparisons, scores aggregate into Bradley-Terry (Elo-style) rankings
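Extracting the verdict token from the judge's output could look like this (a sketch; the function name and regex are assumptions, not eval_752's code):

```python
import re

def parse_arena_verdict(raw: str) -> str:
    # Find [[A]], [[B]], or [[TIE]] anywhere in the judge output
    m = re.search(r"\[\[(A|B|TIE)\]\]", raw)
    if m is None:
        raise ValueError(f"no verdict found in: {raw!r}")
    return m.group(1)
```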
When to use it: When you care about relative quality rather than absolute correctness. "Is model A better than model B at writing emails?" can't be answered with exact match — but pairwise comparison can.
Tradeoffs:
- Expensive: Each comparison needs responses from two models plus one judge call — 3x the cost of a normal scored run
- Needs volume: A handful of comparisons isn't statistically meaningful. You need enough items for confidence intervals to tighten
- Position bias: Judges may prefer whichever answer appears first. The Bradley-Terry aggregation mitigates this over many items
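To illustrate the Elo-style aggregation behind these rankings, here is a minimal rating update for one comparison. The K-factor and starting ratings are illustrative, not eval_752's actual parameters:

```python
def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """Update both ratings from one pairwise comparison.

    outcome: 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    """
    # Expected score for A given the current rating gap
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (outcome - expected_a)
    new_b = r_b + k * ((1.0 - outcome) - (1.0 - expected_a))
    return new_a, new_b
```

A tie between equally rated models leaves both ratings unchanged; an upset win moves ratings more than an expected one. Randomizing which model is presented as "Answer A" on each item is a common additional mitigation for position bias.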
Arena mode is under active development. The pairwise scoring pipeline exists, but the full leaderboard UI is not yet available.
Which Method Should I Use?
For most users starting out: programmatic scoring with MCQ datasets is the fastest, cheapest, and most deterministic way to compare providers.
See Running Evaluations for how to configure scoring when you launch a run.
