Advanced Features
This page covers capabilities beyond the basic provider → dataset → run workflow.
Available now
Variant Testing
Test how sensitive a model is to prompt wording.
- Open Runs → Launch run
- Increase Variations per item
- Let the run complete
- Review scores in Comparison — look for items where the original passes but variants fail
Use cases: measuring prompt sensitivity, detecting wording-specific regressions, and confirming that a model gives consistent answers regardless of phrasing.
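The review step above can also be done offline on exported scores. The following is a minimal sketch, not an eval_752 API: the row layout (`item_id`, `variant_index`, `passed`) and the convention that `variant_index` 0 is the original wording are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical export: one row per (item, variant) with a pass/fail result.
# Assumption: variant_index 0 is the original wording of the item.
rows = [
    {"item_id": "q1", "variant_index": 0, "passed": True},
    {"item_id": "q1", "variant_index": 1, "passed": True},
    {"item_id": "q1", "variant_index": 2, "passed": False},
    {"item_id": "q2", "variant_index": 0, "passed": True},
    {"item_id": "q2", "variant_index": 1, "passed": True},
    {"item_id": "q3", "variant_index": 0, "passed": False},
    {"item_id": "q3", "variant_index": 1, "passed": False},
]


def wording_sensitive_items(rows):
    """Map item_id -> fraction of failing variants, for items whose original passes."""
    original_passed = {}
    variant_results = defaultdict(list)
    for row in rows:
        if row["variant_index"] == 0:
            original_passed[row["item_id"]] = row["passed"]
        else:
            variant_results[row["item_id"]].append(row["passed"])

    sensitive = {}
    for item_id, results in variant_results.items():
        # Keep only items where the original passes but at least one variant fails.
        if original_passed.get(item_id) and not all(results):
            sensitive[item_id] = results.count(False) / len(results)
    return sensitive


print(wording_sensitive_items(rows))  # {'q1': 0.5}
```

Items that fail on the original wording (`q3` here) are excluded on purpose: they indicate a capability gap rather than sensitivity to phrasing.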
Scheduled Evaluations
Automate recurring runs to catch regressions over time.
See Scheduled Evaluations for the full workflow.
Browser Harness
Evaluate models that only exist behind a web UI (ChatGPT, Gemini, etc.).
See Browser Harness for the full workflow.
Independent Judge Providers
Score responses with a different LLM than the one being evaluated.
In the run launcher and Browser Harness importer, you can specify a separate judge provider and judge model. This is useful when:
- You want an impartial judge (e.g., GPT-4o judging a Claude response)
- You want to avoid self-evaluation bias
The run detail panel shows the effective judge provider, model, and prompt used for scoring.
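The idea of an "effective" judge can be illustrated with a small check that flags self-evaluation. This is a sketch under stated assumptions: `RunConfig` and its fields are hypothetical names, and the fallback to the evaluated provider and model when no judge is set is an assumption, not documented eval_752 behavior.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run settings; field names are illustrative only."""
    provider: str
    model: str
    judge_provider: Optional[str] = None
    judge_model: Optional[str] = None


def effective_judge(cfg: RunConfig) -> Tuple[str, str]:
    # Assumption: an unset judge field falls back to the evaluated provider/model.
    return (cfg.judge_provider or cfg.provider, cfg.judge_model or cfg.model)


def is_self_evaluation(cfg: RunConfig) -> bool:
    """True when the judge is the same provider and model as the one under test."""
    return effective_judge(cfg) == (cfg.provider, cfg.model)


independent = RunConfig("anthropic", "claude", judge_provider="openai", judge_model="gpt-4o")
default = RunConfig("anthropic", "claude")

print(is_self_evaluation(independent))  # False
print(is_self_evaluation(default))      # True
```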
Multi-Modal Datasets
Current support:
- ✅ Text prompts
- ✅ Embedded or remote images (sent to vision-capable providers)
- ✅ .eval752.zip export/import with assets
- 🔜 Audio and video inputs
- 🔜 Richer publishing workflows
See Dataset Format for how assets are structured.
Coming soon
Arena — Pairwise Ranking
Head-to-head model comparisons with Bradley-Terry / Elo rankings. The evaluation infrastructure exists, but the full leaderboard UI is not shipped yet.
For now, use standard runs and Comparison to compare models.
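For readers unfamiliar with Elo ratings, the standard update rule for a single pairwise result is sketched below. It illustrates the general method only and does not describe eval_752's implementation; the starting rating of 1500 and the K-factor of 32 are conventional defaults chosen for the example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    """
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two models start level; model A wins one head-to-head comparison.
print(update_elo(1500.0, 1500.0, score_a=1.0))  # (1516.0, 1484.0)
```

The update is zero-sum: whatever A gains, B loses, so the average rating across models stays constant.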
Custom Scoring Functions
Custom Python scoring hooks for domain-specific evaluation. Currently, eval_752 supports only built-in programmatic scoring and LLM-as-judge.
LLM Fingerprinting
Active model identity verification using the LLMmap method: detecting which model a provider is actually serving, regardless of what the provider claims.
Status: research stage.
