Advanced Features

This page covers capabilities beyond the basic provider → dataset → run workflow.

Available now

Variant Testing

Test how sensitive a model is to prompt wording.

  1. Open Runs → Launch run
  2. Increase Variations per item
  3. Let the run complete
  4. Review scores in Comparison — look for items where the original passes but variants fail

Use cases: measuring prompt sensitivity, detecting wording-specific regressions, and confirming that a model answers consistently regardless of phrasing.
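
If you export run results, a short script can flag sensitive items automatically. A minimal sketch, assuming a hypothetical JSON export where each record carries an item_id, a variant_index (0 for the original wording), and a passed flag; the actual export schema may differ:

    import json
    from collections import defaultdict

    # Group pass/fail results by dataset item
    # (this export format is assumed, not eval_752's actual schema)
    by_item = defaultdict(dict)
    with open("run_results.json") as f:
        for record in json.load(f):
            by_item[record["item_id"]][record["variant_index"]] = record["passed"]

    # Flag items where the original wording passes but at least one variant fails
    for item_id, results in by_item.items():
        if results.get(0) and not all(results.values()):
            print(f"prompt-sensitive item: {item_id}")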

Scheduled Evaluations

Automate recurring runs to catch regressions over time.

See Scheduled Evaluations for the full workflow.

Browser Harness

Evaluate models that are only accessible through a web UI (ChatGPT, Gemini, etc.).

See Browser Harness for the full workflow.

Independent Judge Providers

Use a different LLM to score responses than the one being evaluated.

In the run launcher and Browser Harness importer, you can specify a separate judge provider and judge model. This is useful when:

  • You want an impartial judge (e.g., GPT-4o judging a Claude response)
  • You want to avoid self-evaluation bias

The run detail panel shows the effective judge provider, model, and prompt used for scoring.
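
Outside eval_752, the judge pattern itself is simple: send the question and the candidate answer to a second model and parse its verdict. A minimal sketch using the OpenAI Python SDK as the judge provider; the rubric prompt here is illustrative, while the actual prompt eval_752 applies is shown in the run detail panel:

    from openai import OpenAI

    judge_client = OpenAI()  # judge provider, separate from the provider under evaluation

    def judge(question: str, answer: str) -> bool:
        """Grade a response that was produced by a different model."""
        verdict = judge_client.chat.completions.create(
            model="gpt-4o",  # judge model
            messages=[
                {"role": "system", "content": "You grade answers. Reply PASS or FAIL only."},
                {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
            ],
        )
        return verdict.choices[0].message.content.strip().upper().startswith("PASS")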

Multi-Modal Datasets

Current support:

  • ✅ Text prompts
  • ✅ Embedded or remote images (sent to vision-capable providers; see the sketch below)
  • ✅ .eval752.zip export/import with assets
  • 🔜 Audio and video inputs
  • 🔜 Richer publishing workflows

See Dataset Format for how assets are structured.
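
For context, this is roughly how a remote image reaches a vision-capable provider over an OpenAI-style chat API. The URL is a placeholder, and the dataset fields that generate this payload are documented in Dataset Format:

    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                # Remote image by URL; embedded assets are typically sent as base64 data: URLs
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)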

Coming soon

Arena — Pairwise Ranking

Head-to-head model comparisons with Bradley-Terry / Elo rankings. The evaluation infrastructure exists, but the full leaderboard UI has not shipped yet.
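
For background, both methods turn pairwise outcomes into ratings; Elo is the incremental variant. A minimal Elo update in Python (the K-factor of 32 is a common convention, not an eval_752 setting):

    def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
        """Update two ratings after one head-to-head comparison.

        score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
        """
        expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
        delta = k * (score_a - expected_a)
        return rating_a + delta, rating_b - delta

    # Example: two models start at 1000; model A wins one comparison
    a, b = elo_update(1000.0, 1000.0, score_a=1.0)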

For now, use standard runs and Comparison to compare models.

Custom Scoring Functions

Custom Python scoring hooks for domain-specific evaluation. Currently, eval_752 supports built-in programmatic scoring and LLM-as-judge scoring.
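
Until custom hooks ship, programmatic scoring conceptually looks like the function below. This is a generic example of the technique, not eval_752's hook API:

    import re

    def exact_match_score(response: str, expected: str) -> float:
        """Programmatic check: normalize whitespace and case, then compare."""
        def normalize(s: str) -> str:
            return re.sub(r"\s+", " ", s).strip().lower()
        return 1.0 if normalize(response) == normalize(expected) else 0.0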

LLM Fingerprinting

Active model identity verification using the LLMmap method — detecting what model a provider is actually serving, regardless of what they claim.

Status: research stage.
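
Conceptually, LLMmap sends a fixed battery of probe prompts to the target endpoint and matches the response pattern against signatures collected from known models. A toy sketch of the matching step; the probes, signatures, and similarity metric below are placeholders, whereas LLMmap trains a classifier for this:

    from difflib import SequenceMatcher

    # Placeholder probes; LLMmap selects prompts that maximally separate models
    PROBES = ["What is your knowledge cutoff?", "Repeat: zyx", "Who made you?"]

    # Placeholder signature store: model name -> recorded responses to each probe
    SIGNATURES = {
        "model-x": ["2023", "zyx", "Acme AI"],
        "model-y": ["2024", "zyx zyx", "Beta Labs"],
    }

    def similarity(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()

    def identify(observed: list[str]) -> str:
        """Return the known model whose probe responses best match the observed ones."""
        return max(
            SIGNATURES,
            key=lambda m: sum(similarity(o, s) for o, s in zip(observed, SIGNATURES[m])),
        )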