Dataset Format

eval_752 stores evaluation items in JSONL (one JSON object per line). This page describes the format so you can understand what's in an import, build datasets by hand, or debug column mappings.

You don't need to memorize this

If you're building datasets through the UI (Dataset Builder or Hugging Face import), eval_752 handles the format for you. This page is a reference for when you want to inspect or manually construct items.

Item Structure

Each line in the JSONL file represents one evaluation item:

{
  "id": "q-001",
  "type": "mcq_single",
  "inputs": {
    "text": "What is the capital of France?",
    "images": [],
    "audio": [],
    "video": []
  },
  "choices": ["London", "Paris", "Berlin", "Madrid"],
  "answer": "Paris",
  "checker": null,
  "criteria": null,
  "meta": {
    "source": "geography-basics",
    "difficulty": "easy",
    "section": "Europe"
  }
}
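Because each line is an independent JSON object, a dataset file can be loaded with a few lines of standard-library Python. A minimal sketch (the function name is illustrative, not part of eval_752):

```python
import json

def load_items(path):
    """Read a JSONL dataset: one evaluation item per non-empty line."""
    items = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                items.append(json.loads(line))
    return items
```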

Field Reference

| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier for this item. Must be unique within the dataset. |
| `type` | Yes | One of: `mcq_single`, `mcq_multi`, `freeform`, `code`, `judge_pairwise`. |
| `inputs.text` | Yes | The prompt text shown to the model. |
| `inputs.images` | No | Array of relative paths to image assets (e.g., `["assets/diagram.png"]`). |
| `inputs.audio` | No | Array of relative paths to audio assets (planned). |
| `inputs.video` | No | Array of relative paths to video assets (planned). |
| `choices` | MCQ only | Array of answer options. Required for MCQ types. |
| `answer` | Yes | The correct answer: a string for single-choice, an array for multi-choice. |
| `checker` | No | Custom checker specification (reserved for future use). |
| `criteria` | No | Rubric or criteria for LLM-as-judge scoring. |
| `meta` | No | Arbitrary metadata — source, difficulty, section, tags, etc. |
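The required-field rules above can be checked mechanically before running an eval. A sketch of a validator based only on the rules in this table (the function name is my own, not an eval_752 API):

```python
VALID_TYPES = {"mcq_single", "mcq_multi", "freeform", "code", "judge_pairwise"}

def validate_item(item):
    """Return a list of problems found in one item, per the field rules above."""
    problems = []
    for field in ("id", "type", "answer"):
        if field not in item:
            problems.append(f"missing required field: {field}")
    if item.get("type") not in VALID_TYPES:
        problems.append(f"unknown type: {item.get('type')!r}")
    if "text" not in item.get("inputs", {}):
        problems.append("missing required field: inputs.text")
    # choices is required only for the MCQ item types
    if item.get("type") in ("mcq_single", "mcq_multi") and not item.get("choices"):
        problems.append("choices is required for MCQ types")
    return problems
```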

Supported Item Types

  • mcq_single — One correct answer from choices. answer is a string matching one choice.
  • mcq_multi — Multiple correct answers. answer is an array of strings.
  • freeform — Open-ended text response. answer is the reference text (used by judge or regex scoring).
  • code — Code generation task. answer contains the reference solution.
  • judge_pairwise — Two-response comparison for arena mode.
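The practical difference between the two MCQ types is the shape of `answer`. Two illustrative items (contents are made up for the example):

```python
# mcq_single: answer is one string matching a choice
single = {
    "id": "q-010",
    "type": "mcq_single",
    "inputs": {"text": "Which planet is closest to the Sun?"},
    "choices": ["Venus", "Mercury", "Mars"],
    "answer": "Mercury",
}

# mcq_multi: answer is an array of strings, each matching a choice
multi = {
    "id": "q-011",
    "type": "mcq_multi",
    "inputs": {"text": "Which of these are prime?"},
    "choices": ["2", "3", "4", "5"],
    "answer": ["2", "3", "5"],
}
```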

Multi-Modal Items

For items that include images, reference them with relative paths under assets/:

{
  "id": "img-001",
  "type": "freeform",
  "inputs": {
    "text": "Describe what you see in this image.",
    "images": ["assets/photo-001.jpg"]
  },
  "answer": "A cat sitting on a windowsill.",
  "meta": { "requires": "vision" }
}

When packaged in an .eval752.zip, image files live under the assets/ directory. During a run, eval_752 sends these images to the model if the provider supports vision inputs.
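Since image paths are relative to the bundle root, a consumer of the format has to resolve them against the dataset directory before doing anything with the files. A minimal sketch, assuming the bundle has been extracted and its root path is known:

```python
from pathlib import Path

def resolve_assets(item, dataset_root):
    """Map an item's relative image paths to paths under the bundle root."""
    root = Path(dataset_root)
    return [root / rel for rel in item.get("inputs", {}).get("images", [])]
```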

The .eval752.zip Package

An .eval752.zip bundle is a self-contained dataset (or result) package:

bundle.eval752.zip
├── manifest.json        # Package format version and type
├── meta.json            # Dataset metadata (name, description, version hash)
├── sections/
│   ├── section-1.jsonl  # Items for each section
│   └── section-2.jsonl
├── assets/              # Referenced images and files
│   └── photo-001.jpg
├── run_config.json      # (Optional) Run configuration snapshot
├── results.jsonl        # (Optional) Per-item results
└── checkers/            # (Optional) Custom checker scripts
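An .eval752.zip is an ordinary zip archive, so its contents can be inspected with standard tooling. A sketch that lists the section files and assets in a bundle (the `meta.json` key used in the test is an assumption for illustration, not a documented schema):

```python
import json
import zipfile

def inspect_bundle(path):
    """Return (meta, section files, asset files) from an .eval752.zip archive."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        meta = json.loads(zf.read("meta.json"))  # dataset name, description, version hash
        sections = [n for n in names if n.startswith("sections/") and n.endswith(".jsonl")]
        assets = [n for n in names if n.startswith("assets/") and not n.endswith("/")]
    return meta, sections, assets
```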

This format is used for:

  • Sharing datasets between eval_752 instances
  • Exporting and archiving run results
  • Importing Browser Harness captures

For import workflows, see Working with Datasets.