Dataset Ingestion (Phase 2 Baseline)

Last updated: 2025-11-09 · Owner: Codex

The FastAPI backend exposes dataset ingestion utilities. This document captures the backend behaviour, API contracts, and operational guidelines required for Phase 2 work.

Overview

The /datasets router implements three key endpoints:

  • POST /datasets/hf/preview — fetches Hugging Face metadata (features, info, sample rows) without persisting anything.
  • POST /datasets/hf/import — materialises an HF dataset into the relational store after applying a column mapping.
  • POST /datasets/upload — imports an .eval752.zip package (generated by v2 tooling) and stores all sections/items.

All endpoints return a DatasetSummary payload that front-end consumers can use to update local caches.

Hugging Face workflow

Preview

POST /datasets/hf/preview
Content-Type: application/json
{
  "path": "lmsys/chatbot_arena",
  "split": "train[:50]",
  "limit": 10,
  "token": "hf_xxxx" // optional
}

Response shape:

{
  "features": [{"name": "prompt", "dtype": "string"}],
  "samples": [{"prompt": "…"}],
  "num_rows": 5000,
  "info": {"description": "Arena prompts", "citation": "…"}
}

limit constrains the preview sample; slicing (e.g. train[:100]) is delegated to the Hugging Face dataset loader.
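The division of labour above — the HF loader handles split slicing, the backend only caps samples and reshapes features — can be sketched as a small pure function. This is an illustrative helper, not the actual backend code; `build_preview` and its argument names are assumptions.

```python
from itertools import islice

def build_preview(rows, features, limit=10):
    """Assemble a preview payload (hypothetical helper).

    `rows` is any iterable of dict rows already sliced by the HF loader
    (e.g. split="train[:50]"); `features` maps column name -> dtype string.
    """
    return {
        "features": [{"name": name, "dtype": dtype} for name, dtype in features.items()],
        # `limit` only caps the sample size returned to the client;
        # nothing is persisted at preview time.
        "samples": list(islice(rows, limit)),
    }

preview = build_preview(
    rows=[{"prompt": "2+2?"}, {"prompt": "Capital of France?"}, {"prompt": "Name a prime."}],
    features={"prompt": "string"},
    limit=2,
)
```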

Import

POST /datasets/hf/import
Content-Type: application/json
{
  "name": "chatbot-arena-2024",
  "description": "Arena prompts (Oct 2024 snapshot)",
  "config": {
    "path": "lmsys/chatbot_arena",
    "split": "train[:100]",
    "revision": "refs/heads/main"
  },
  "mapping": {
    "prompt_column": "prompt",
    "answer_column": "reference_answer",
    "default_type": "freeform",
    "meta_columns": {"source": "dataset", "difficulty": "level"},
    "criteria_columns": ["grading_rubric"],
    "choices_column": null
  }
}

The import call loads the HF dataset with limit=None, transforms each row into JSONL v2.3 format, computes a deterministic SHA-256 version_hash, and persists a single Section named default that contains all Item objects. The generated Dataset.schema contains:

{
  "source": "hf",
  "config": {"path": "…", "split": "…"},
  "features": {"prompt": "string"},
  "info": {"description": "…"},
  "mapping": {"prompt_column": "prompt", }
}
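The "deterministic SHA-256" property implies the hash must be insensitive to incidental differences such as key order. A minimal sketch of that idea, assuming canonical compact JSON per item (the real implementation may canonicalise differently):

```python
import hashlib
import json

def version_hash(sections):
    """Sketch of a deterministic dataset hash (illustrative, not the
    backend's actual function). Items are serialised as compact JSON
    with sorted keys so equivalent datasets hash identically."""
    h = hashlib.sha256()
    for section_name in sorted(sections):  # stable section order
        h.update(section_name.encode("utf-8"))
        for item in sections[section_name]:
            canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
            h.update(canonical.encode("utf-8"))
    return h.hexdigest()

# Key order in the source rows must not change the hash:
a = version_hash({"default": [{"prompt": "hi", "answer": "yo"}]})
b = version_hash({"default": [{"answer": "yo", "prompt": "hi"}]})
```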

Validation rules:

  • Missing columns produce 400 responses (e.g. Column 'prompt' not found in dataset row).
  • choices_column accepts JSON arrays or primitive values.
  • assets_column must decode to an object.
  • Asset entries are normalized to a single internal shape:
    • mime_type is canonical (legacy content_type is still accepted on input)
    • embedded archive files hydrate into encoding=base64 + data
    • remote image URLs remain as url metadata and can still flow into multimodal providers
  • The HF preview panel can show sample-row image previews when the returned samples expose usable image URLs or embedded asset metadata.
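The mapping and validation rules above can be condensed into a per-row transform. The sketch below is hedged: `apply_mapping` and the output item shape are assumptions for illustration, and a `ValueError` stands in for the endpoint's 400 response.

```python
import json

def apply_mapping(row, mapping):
    """Hypothetical per-row transform mirroring the rules above."""
    prompt_col = mapping["prompt_column"]
    if prompt_col not in row:
        # the API would translate this into a 400 response
        raise ValueError(f"Column '{prompt_col}' not found in dataset row")
    item = {"prompt": row[prompt_col], "type": mapping.get("default_type", "freeform")}
    answer_col = mapping.get("answer_column")
    if answer_col and answer_col in row:
        item["answer"] = row[answer_col]
    # meta_columns maps target meta key -> source column name
    item["meta"] = {k: row[c] for k, c in mapping.get("meta_columns", {}).items() if c in row}
    # choices_column accepts JSON arrays or primitive values
    choices_col = mapping.get("choices_column")
    if choices_col and choices_col in row:
        raw = row[choices_col]
        item["choices"] = json.loads(raw) if isinstance(raw, str) and raw.startswith("[") else raw
    return item

item = apply_mapping(
    {"prompt": "2+2?", "reference_answer": "4", "dataset": "math", "opts": '["3", "4"]'},
    {"prompt_column": "prompt", "answer_column": "reference_answer",
     "meta_columns": {"source": "dataset"}, "choices_column": "opts"},
)
```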

.eval752.zip upload

The upload endpoint accepts multipart form data:

POST /datasets/upload
Content-Type: multipart/form-data
name=imported-eval752
file=@dataset.zip

The archive must contain at least one sections/*.jsonl file with JSON Lines following the v2.3 schema. Optional files include manifest.json (name/version hash) and meta.json (description, default judge prompts, etc.). The importer:

  1. Parses every section file and converts each entry into DatasetItem.
  2. Computes a version_hash across all sections/items.
  3. Persists sections in lexical order, preserving the JSON lines content.
  4. Stores raw manifest/meta blobs under Dataset.schema ({"source": "eval752_zip", "package": {…}}).

Invalid packages (missing section files, malformed JSON) return 400 with a descriptive message. If a package contains assets/ files and section JSONL entries reference them, the importer re-hydrates those files back into embedded asset metadata so the uploaded dataset can still send images to providers and preview them in the UI.
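The structural checks (at least one sections/*.jsonl file, well-formed JSON Lines, lexical section ordering) can be sketched with the standard library. Names here are illustrative; a `ValueError` stands in for the 400 response.

```python
import io
import json
import zipfile

def validate_package(blob: bytes):
    """Sketch of the archive checks described above (hypothetical helper)."""
    sections = {}
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        names = [n for n in zf.namelist() if n.startswith("sections/") and n.endswith(".jsonl")]
        if not names:
            raise ValueError("package must contain at least one sections/*.jsonl file")
        for name in sorted(names):  # lexical order, matching the importer
            lines = zf.read(name).decode("utf-8").splitlines()
            try:
                sections[name] = [json.loads(line) for line in lines if line.strip()]
            except json.JSONDecodeError as exc:
                raise ValueError(f"malformed JSON in {name}: {exc}") from exc
    return sections

# Build a minimal in-memory package for demonstration:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("sections/default.jsonl", '{"prompt": "2+2?", "answer": "4"}\n')
sections = validate_package(buf.getvalue())
```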

Exporting .eval752.zip

Phase 2 introduces a dedicated packaging service (DatasetPackagingService) that assembles datasets—and optionally a completed run—into the portable .eval752.zip archive described in specs/1_specs.md. The exporter writes:

  • manifest.json with dataset identity, section inventory, generator version, and optional run metadata.
  • meta.json capturing dataset notes/schema (and run summary when present).
  • sections/*.jsonl preserving every item in the JSONL v2.3 format.
  • run_config.json and results.jsonl when a specific run is requested, bundling the sanitized LightEval configuration and scored outputs for offline replay.
  • assets/ and checkers/ files resolved from dataset metadata.
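The write side of the package can be sketched the same way. This is a minimal, hedged outline of the exporter's layout only; the real DatasetPackagingService also bundles run artefacts, assets, and checkers, and its actual signature is not shown here.

```python
import io
import json
import zipfile

def export_package(name, sections, meta=None):
    """Illustrative sketch: write manifest.json, meta.json and
    sections/*.jsonl into an in-memory .eval752.zip archive."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("manifest.json", json.dumps({
            "name": name,
            "sections": sorted(sections),  # section inventory
        }))
        zf.writestr("meta.json", json.dumps(meta or {}))
        for section, items in sections.items():
            payload = "\n".join(json.dumps(item, ensure_ascii=False) for item in items)
            zf.writestr(f"sections/{section}.jsonl", payload + "\n")
    return buf.getvalue()

blob = export_package("demo", {"default": [{"prompt": "2+2?", "answer": "4"}]})
```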

CLI helper for archive conversion

A conversion utility is available under scripts/:

python scripts/convert_llmset_to_eval752.py source.llmset.zip --name "Converted Logic Set"

The script reads a .llmset.zip file, resolves embedded prompts/checkers/assets, and emits source.llmset.eval752.zip by default. The converter retains the original manifest/meta blobs inside the new package (meta.json.schema) for traceability, while normalising items to the JSONL v2.3 structure (system prompts become inputs.system_prompt, judge prompts land in meta.judge_prompt, assets gain stable paths plus decoded blobs).
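The per-item normalisation described above (system prompt into inputs.system_prompt, judge prompt into meta.judge_prompt) can be sketched as follows. The legacy field names on the input side are assumptions; only the target locations come from the text.

```python
def normalise_item(legacy):
    """Hypothetical sketch of the converter's per-item step.

    Legacy keys ("system_prompt", "judge_prompt") are assumed names;
    the v2.3 target paths match the conversion described above."""
    item = {"inputs": {"prompt": legacy["prompt"]}, "meta": {}}
    if legacy.get("system_prompt"):
        # system prompts become inputs.system_prompt
        item["inputs"]["system_prompt"] = legacy["system_prompt"]
    if legacy.get("judge_prompt"):
        # judge prompts land in meta.judge_prompt
        item["meta"]["judge_prompt"] = legacy["judge_prompt"]
    return item

item = normalise_item({"prompt": "2+2?", "system_prompt": "Be terse.", "judge_prompt": "Exact match."})
```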

Future API endpoints will expose the same service so the SPA can offer "Download eval752 package" directly from dataset and run detail screens.

Storage layout recap

  • datasets — high level metadata (name, version_hash, schema, notes).
  • sections — ordered subsets of a dataset (order_index starts at 0).
  • items — individual JSONL entries (payload, answer, checker, assets).

When building Playwright scenarios, prefer querying /datasets first to prime TanStack Query caches before navigating to dataset detail views.

Next steps / TODO

  • Extend importer to stream large HF datasets instead of loading into memory (Phase 2 stretch).
  • Support multiple sections during HF import (derive from column or manual grouping).
  • Validate .eval752.zip cryptographic signatures once packaging flow is finalised.
  • Surface dataset summaries (item count, tags) in /datasets response for UI cards.