LiteLLM Python SDK Integration Blueprint

Status: draft · Owner: Codex

This blueprint describes how the LiteLLM Python SDK fits into the FastAPI + Celery architecture. It keeps provider configuration, run execution, scoring, and LightEval interoperability aligned around one integration surface.

Objectives

  • Centralize LiteLLM client initialization so FastAPI, Celery workers, and LightEval reuse the same defaults.
  • Keep deterministic test coverage possible through local OpenAI-compatible gateways.
  • Expose runtime controls without leaking secrets into the wrong layer.
  • Keep the adapter thin enough that a future SDK swap remains feasible.

Current Baseline

  • eval_752.services.litellm_client.LiteLLMClient wraps litellm.completion.
  • Celery tasks obtain clients through worker-side helper code that decrypts provider keys.
  • Provider creation stores encrypted API keys and optional rate_limit JSON on the providers table.
  • Unit tests use mocks or fakes; integration tests use a local OpenAI-compatible gateway.

Design Overview

FastAPI endpoints ─┐
                   ├──▶ ProviderService ───▶ LiteLLMClientFactory
Celery workers  ───┘                              │
                                                  ├──▶ direct run execution
                                                  └──▶ LightEval config + execution path

Component Plan

Factory and caching

Introduce LiteLLMClientFactory under eval_752.services:

  • accepts LiteLLMSettings
  • caches sync clients per provider/key combination
  • leaves room for future async clients as streaming requirements grow
  • exposes a reset() hook for tests
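A minimal sketch of the factory shape, assuming `LiteLLMSettings` fields and a constructor-injection style that are illustrative, not the real `eval_752` implementation. The client constructor is injected so unit tests can stub it without touching LiteLLM:

```python
# Hypothetical sketch of LiteLLMClientFactory; field names and the injected
# client constructor are assumptions, not the real eval_752 code.
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class LiteLLMSettings:
    # Assumed runtime defaults shared by FastAPI, Celery, and LightEval.
    request_timeout_s: float = 60.0
    max_retries: int = 3


class LiteLLMClientFactory:
    """Caches one sync client per (provider_id, api_key) pair."""

    def __init__(self, settings: LiteLLMSettings,
                 make_client: Callable[..., Any]):
        self._settings = settings
        self._make_client = make_client  # injected so tests can stub it
        self._cache: Dict[Tuple[str, str], Any] = {}

    def get(self, provider_id: str, api_key: str) -> Any:
        key = (provider_id, api_key)
        if key not in self._cache:
            # Build lazily; the same client is reused for repeated calls.
            self._cache[key] = self._make_client(
                provider_id=provider_id,
                api_key=api_key,
                timeout=self._settings.request_timeout_s,
            )
        return self._cache[key]

    def reset(self) -> None:
        # Test hook: drop cached clients between test cases.
        self._cache.clear()
```

Caching per provider/key pair (rather than per provider only) keeps key rotation safe: a rotated key produces a fresh client instead of reusing a stale one.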

Settings split

  • bootstrap-only engine selection stays in backend/src/eval_752/app/config.py
  • runtime request timeout and retry policy are persisted through workspace settings
  • the env/UI split must stay documented in Configuration
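The split above can be sketched as follows; the field names, the allowed-keys set, and the `EVAL752_ENGINE` variable are illustrative assumptions, not the real config module:

```python
# Sketch of the env/UI split; names are hypothetical.
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RuntimeSettings:
    # UI-editable values persisted through workspace settings.
    request_timeout_s: float = 60.0
    max_retries: int = 3


def bootstrap_engine() -> str:
    # Bootstrap-only: read once from the environment at startup,
    # never from workspace settings. Variable name is assumed.
    return os.environ.get("EVAL752_ENGINE", "litellm")


def effective_settings(workspace_overrides: dict) -> RuntimeSettings:
    # Workspace overrides win over defaults; unknown keys are ignored
    # so a stale settings row cannot break startup.
    allowed = {k: v for k, v in workspace_overrides.items()
               if k in {"request_timeout_s", "max_retries"}}
    return replace(RuntimeSettings(), **allowed)
```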

Retry and rate-limit strategy

  • provider metadata can describe concurrency and other rate-related fields
  • retry behavior should wrap LiteLLM client calls with explicit backoff and structured logging
  • logs should record provider_id, model, retry_count, latency_ms, and token usage when available
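A minimal retry wrapper under these assumptions; a real implementation would classify LiteLLM exception types (rate limit vs. auth failure) instead of retrying every `Exception`, and would add token usage to the log record when the response exposes it:

```python
# Hypothetical retry wrapper with exponential backoff and structured logging.
import logging
import time
from typing import Any, Callable

log = logging.getLogger("litellm.invocations")


def call_with_retry(fn: Callable[[], Any], *, provider_id: str, model: str,
                    max_retries: int = 3, base_delay_s: float = 0.5) -> Any:
    for attempt in range(max_retries + 1):
        start = time.monotonic()
        try:
            result = fn()
            # Structured fields mirror the blueprint's required log keys.
            log.info("litellm_call_ok", extra={
                "provider_id": provider_id,
                "model": model,
                "retry_count": attempt,
                "latency_ms": round((time.monotonic() - start) * 1000, 1),
            })
            return result
        except Exception:
            if attempt == max_retries:
                raise  # exhausted: surface the original error
            time.sleep(base_delay_s * 2 ** attempt)  # exponential backoff
```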

Test hooks

  • production keeps a single standard LiteLLM path
  • unit tests may still stub or fake clients
  • integration tests should continue using a local OpenAI-compatible gateway with deterministic model names
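A fake client along these lines keeps unit tests deterministic; the response shape below loosely mirrors `litellm.completion` output and is an assumption, not the real schema:

```python
# Minimal fake client for unit tests; response shape is illustrative.
class FakeLiteLLMClient:
    def __init__(self, canned_text: str = "ok"):
        self.calls = []          # records every invocation for assertions
        self._text = canned_text

    def completion(self, model, messages, **kwargs):
        self.calls.append({"model": model, "messages": messages})
        return {"choices": [{"message": {"content": self._text}}]}
```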

FastAPI usage

The /providers/{id}/smoke-test endpoint should use the shared factory so smoke tests and real runs observe the same client defaults.
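The check behind the endpoint can stay framework-agnostic, which keeps it unit-testable; the factory interface and model default below are assumptions:

```python
# Sketch of the smoke-test logic; endpoint wiring is omitted. The factory's
# get(provider_id, api_key) interface is assumed.
def run_smoke_test(factory, provider_id: str, api_key: str,
                   model: str = "gpt-3.5-turbo") -> dict:
    # Same factory as real runs, so defaults (timeout, retries) match.
    client = factory.get(provider_id, api_key)
    try:
        client.completion(model=model,
                          messages=[{"role": "user", "content": "ping"}])
        return {"status": "ok", "provider_id": provider_id}
    except Exception as exc:
        return {"status": "error", "provider_id": provider_id,
                "detail": str(exc)}
```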

Celery usage

  • worker execution should go through the same factory
  • streaming-first behavior should remain preferred, with buffered fallback only when the provider rejects streaming
  • structured invocation logs should be attached to the run log stream
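The streaming-first rule can be sketched like this; the `StreamingNotSupported` exception and the chunk shape (simplified to plain strings here) are assumptions, since real LiteLLM streaming yields delta objects:

```python
# Hypothetical streaming-first execution with buffered fallback.
class StreamingNotSupported(RuntimeError):
    """Assumed to be raised when a provider rejects stream=True."""


def execute_completion(client, model: str, messages: list) -> str:
    try:
        # Preferred path: stream chunks as they arrive.
        chunks = client.completion(model=model, messages=messages, stream=True)
        return "".join(chunks)
    except StreamingNotSupported:
        # Fallback: one buffered response when streaming is rejected.
        resp = client.completion(model=model, messages=messages, stream=False)
        return resp["choices"][0]["message"]["content"]
```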

LightEval alignment

  • LightEval config generation should pull the same provider base URL and key selection logic
  • when use_lighteval_executor is enabled, the LightEval path should stay configuration-compatible with the direct LiteLLM path
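One way to keep the two paths configuration-compatible is to derive both from a single provider record; the config keys below are illustrative, not LightEval's actual schema:

```python
# Illustrative config derivation; keys are assumptions, not LightEval's schema.
def build_lighteval_model_config(provider: dict) -> dict:
    # Reuses the same base_url/api_key selection as the direct LiteLLM path,
    # so enabling use_lighteval_executor does not change which endpoint is hit.
    return {
        "model_name": provider["model"],
        "base_url": provider["base_url"],
        "api_key": provider["api_key"],
    }
```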

Observability

Recommended metrics:

  • litellm_requests_total
  • litellm_request_latency_seconds
  • litellm_tokens_total

Recommended labels:

  • provider_id
  • model
  • status
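A dependency-free sketch of how the metrics and labels fit together; a real deployment would use `prometheus_client` counters and histograms instead of these in-process dicts:

```python
# Dependency-free stand-in for the recommended metrics.
from collections import defaultdict


class LiteLLMMetrics:
    def __init__(self):
        # Label tuple for every series: (provider_id, model, status).
        self.requests_total = defaultdict(int)    # litellm_requests_total
        self.tokens_total = defaultdict(int)      # litellm_tokens_total
        self.latencies_s = defaultdict(list)      # litellm_request_latency_seconds

    def observe(self, *, provider_id: str, model: str, status: str,
                latency_s: float, tokens: int = 0) -> None:
        labels = (provider_id, model, status)
        self.requests_total[labels] += 1
        self.tokens_total[labels] += tokens
        self.latencies_s[labels].append(latency_s)
```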

Open Questions

  • whether day-level token budgets are needed before the Phase 3 horizon
  • how provider-specific headers should be stored if richer proxy compatibility becomes necessary
  • how streaming adapters should evolve without duplicating logic

Next Actions

  • keep runtime controls and secret templates aligned
  • continue factory hardening and integration coverage
  • keep this blueprint synchronized with implementation and acceptance docs