eval_752 — Local-first LLM Evaluation

Open-source · Self-hosted · Local-first

Independently evaluate the LLMs you pay for.

eval_752 is a local-first platform to test, compare, and monitor LLM APIs. Run evaluations on your own infrastructure to catch silent model downgrades and test the LLMs you were actually served. Ever Felt off? Test them.

Unified Testing Standard
Route any OpenAI-compatible API through identical workflows for a true apples-to-apples comparison.
Unbreakable Audit Trail
Actual model names, network retries, and item-level latency stay attached to every single run.
Reproducible Evidence
Export self-contained .eval752.zip bundles for independent review instead of sharing screenshots.
eval_752 dashboard
eval_752 dashboard showing active runs, recent results, and workspace health

Keep active runs, recent results, and workspace health in one control plane.

They have the ability to lie. We should have the ability to check.

Third-party LLM APIs are moving targets. Providers can quietly swap models, reroute traffic, or reduce reasoning effort to cut costs without ever changing the product name. If you care about regressions or model identity, you need evidence you gathered yourself.

eval_752 keeps verification natively inside the benchmark workflow. Save the exact model name, smoke-test it, compare providers on identical prompts, and export reproducible bundles that you can verify later.

Research shows that third-party LLM APIs frequently serve different models than advertised — with performance gaps up to 47% and widespread failures in identity verification.

Zhang et al., 2026 · Real Money, Fake Models

  • Smoke test the exact saved model name before spending tokens on serious runs.
  • Pull browser-only models into the same scoring flow with Browser Harness.
  • Schedule recurring runs so model regressions show up as hard evidence, not vibes.

A complete, local-first evaluation loop.

  1. 1
    Connect the provider.

    Save the base URL, absolute model identifier, and verify with a smoke test before paying for tokens.

  2. 2
    Choose the dataset.

    Import a Hugging Face benchmark, bring your own test suite, or score a browser-captured conversation.

  3. 3
    Run and inspect.

    Watch live progress, investigate item-level failures, and compare cost, accuracy, and latency side-by-side.

  4. 4
    Export or schedule.

    Archive a reproducible bundle for stakeholders, or set up recurring runs to catch silent API degradation.

provider setup
eval_752 provider settings form with endpoint, model, and runtime controls
Provider setup keeps operational primitives together: exact model routing, runtime policies, and connection health.

Built with Docker, FastAPI, React, PostgreSQL, Redis, Celery, and LiteLLM. Biased toward reproducible evidence over provider marketing.