# Dataset Ingestion (Phase 2 Baseline)
Last updated: 2025-11-09 · Owner: Codex
The FastAPI backend exposes dataset ingestion utilities. This document captures the backend behaviour, API contracts, and operational guidelines required for Phase 2 work.
## Overview
The `/datasets` router implements three key endpoints:

- `POST /datasets/hf/preview` — fetches Hugging Face metadata (features, info, sample rows) without persisting anything.
- `POST /datasets/hf/import` — materialises an HF dataset into the relational store after applying a column mapping.
- `POST /datasets/upload` — imports an `.eval752.zip` package (generated by v2 tooling) and stores all sections/items.
All endpoints return a `DatasetSummary` payload that front-end consumers can use to update local caches.
## Hugging Face workflow

### Preview
Response shape:
`limit` constrains the preview sample; slicing (e.g. `train[:100]`) is delegated to the Hugging Face dataset loader.
### Import

The import call loads the HF dataset with `limit=None`, transforms each row into JSONL v2.3 format, computes a deterministic SHA-256 `version_hash`, and persists a single `Section` named `default` that contains all `Item` objects. The generated `Dataset.schema` contains:
Validation rules:
- Missing columns produce `400` responses (e.g. `Column 'prompt' not found in dataset row`).
- `choices_column` accepts JSON arrays or primitive values.
- `assets_column` must decode to an object.
- Asset entries are normalized to a single internal shape:
  - `mime_type` is canonical (legacy `content_type` is still accepted on input).
  - Embedded archive files hydrate into `encoding=base64` + `data`.
  - Remote image URLs remain as `url` metadata and can still flow into multimodal providers.
- The HF preview panel can show sample-row image previews when the returned samples expose usable image URLs or embedded asset metadata.
## .eval752.zip upload
The upload endpoint accepts multipart form data:
The archive must contain at least one `sections/*.jsonl` file with JSON Lines following the v2.3 schema. Optional files include `manifest.json` (name/version hash) and `meta.json` (description, default judge prompts, etc.). The importer:
- Parses every section file and converts each entry into `DatasetItem`.
- Computes a `version_hash` across all sections/items.
- Persists sections in lexical order, preserving the JSON Lines content.
- Stores raw manifest/meta blobs under `Dataset.schema` (`{"source": "eval752_zip", "package": {…}}`).
Invalid packages (missing section files, malformed JSON) return `400` with a descriptive message.
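Both importers compute a deterministic SHA-256 `version_hash` over all sections/items. A minimal sketch of such a scheme (the exact canonicalisation the backend uses is an assumption; the point is that section order and JSON key order must not affect the digest):

```python
import hashlib
import json


def version_hash(sections: dict[str, list[dict]]) -> str:
    """Hash all sections/items deterministically.

    Sections are visited in lexical order and each item is serialised
    with sorted keys, so dict insertion order never changes the hash.
    """
    h = hashlib.sha256()
    for name in sorted(sections):
        h.update(name.encode("utf-8"))
        for item in sections[name]:
            h.update(json.dumps(item, sort_keys=True, ensure_ascii=False).encode("utf-8"))
    return h.hexdigest()
```

Two imports of the same content then agree on `version_hash` even if the rows were built in a different key order.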
If a package contains `assets/` files and section JSONL entries reference them, the importer re-hydrates those files back into embedded asset metadata so the uploaded dataset can still send images to providers and preview them in the UI.
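The section-file handling above can be sketched with the standard library. This is a simplification: the error text is illustrative, and the real endpoint wraps failures in a `400` response rather than raising:

```python
import io
import json
import zipfile


def read_sections(zip_bytes: bytes) -> dict[str, list[dict]]:
    """Parse sections/*.jsonl from an .eval752.zip, in lexical order."""
    sections: dict[str, list[dict]] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        names = sorted(
            n for n in zf.namelist()
            if n.startswith("sections/") and n.endswith(".jsonl")
        )
        if not names:
            # The real endpoint turns this into a 400 with a descriptive message.
            raise ValueError("package contains no sections/*.jsonl files")
        for name in names:
            lines = zf.read(name).decode("utf-8").splitlines()
            sections[name] = [json.loads(line) for line in lines if line.strip()]
    return sections
```

Sorting `namelist()` before iterating is what gives the "sections persisted in lexical order" guarantee.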
## Exporting .eval752.zip
Phase 2 introduces a dedicated packaging service (`DatasetPackagingService`) that assembles datasets—and optionally a completed run—into the portable `.eval752.zip` archive described in `specs/1_specs.md`. The exporter writes:
- `manifest.json` with dataset identity, section inventory, generator version, and optional run metadata.
- `meta.json` capturing dataset notes/schema (and run summary when present).
- `sections/*.jsonl` preserving every item in the JSONL v2.3 format.
- `run_config.json` and `results.jsonl` when a specific run is requested, bundling the sanitized LightEval configuration and scored outputs for offline replay.
- `assets/` and `checkers/` files resolved from dataset metadata.
## CLI helper for archive conversion

A conversion utility is available under `scripts/`:
The script reads a `.llmset.zip` file, resolves embedded prompts/checkers/assets, and emits `source.llmset.eval752.zip` by default. The converter retains the original manifest/meta blobs inside the new package (`meta.json.schema`) for traceability, while normalising items to the JSONL v2.3 structure (system prompts become `inputs.system_prompt`, judge prompts land in `meta.judge_prompt`, assets gain stable paths plus decoded blobs).
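The per-item normalisation the converter performs can be sketched as follows. The two target paths (`inputs.system_prompt`, `meta.judge_prompt`) come from the description above; the legacy top-level field names on the input side are assumptions:

```python
def to_v23_item(item: dict) -> dict:
    """Normalise a legacy .llmset item toward the JSONL v2.3 shape.

    System prompts move under inputs.system_prompt and judge prompts
    under meta.judge_prompt; all other fields pass through untouched.
    """
    out = dict(item)
    inputs = dict(out.get("inputs", {}))
    meta = dict(out.get("meta", {}))
    if "system_prompt" in out:          # assumed legacy top-level field
        inputs["system_prompt"] = out.pop("system_prompt")
    if "judge_prompt" in out:           # assumed legacy top-level field
        meta["judge_prompt"] = out.pop("judge_prompt")
    if inputs:
        out["inputs"] = inputs
    if meta:
        out["meta"] = meta
    return out
```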
Future API endpoints will expose the same service so the SPA can offer "Download eval752 package" directly from dataset and run detail screens.
## Storage layout recap

- `datasets` — high-level metadata (`name`, `version_hash`, `schema`, `notes`).
- `sections` — ordered subsets of a dataset (`order_index` starts at 0).
- `items` — individual JSONL entries (`payload`, `answer`, `checker`, `assets`).
When building Playwright scenarios, prefer querying `/datasets` first to prime TanStack Query caches before navigating to dataset detail views.
## Next steps / TODO
- Extend importer to stream large HF datasets instead of loading into memory (Phase 2 stretch).
- Support multiple sections during HF import (derive from column or manual grouping).
- Validate `.eval752.zip` cryptographic signatures once the packaging flow is finalised.
- Surface dataset summaries (item count, tags) in the `/datasets` response for UI cards.
