LiteLLM Python SDK Integration Blueprint

Status: draft · Owner: Codex

This blueprint describes how the LiteLLM Python SDK fits into the FastAPI + Celery architecture. It keeps provider configuration, run execution, scoring, and LightEval interoperability aligned around one integration surface.

Objectives

  • Centralize LiteLLM client initialization so FastAPI, Celery workers, and LightEval reuse the same defaults.
  • Keep deterministic test coverage possible through local OpenAI-compatible gateways.
  • Expose runtime controls without leaking secrets into the wrong layer.
  • Keep the adapter thin enough that a future SDK swap remains feasible.

Current Baseline

  • eval_752.services.litellm_client.LiteLLMClient wraps litellm.completion.
  • Celery tasks obtain clients through worker-side helper code that decrypts provider keys.
  • Provider creation stores encrypted API keys and optional rate_limit JSON on the providers table.
  • Unit tests use mocks or fakes; integration tests use a local OpenAI-compatible gateway.

Design Overview

FastAPI endpoints ─┐
                   ├──▶ ProviderService ───▶ LiteLLMClientFactory
Celery workers  ───┘                              │
                                                  ├──▶ direct run execution
                                                  └──▶ LightEval config + execution path

Component Plan

Factory and caching

Introduce LiteLLMClientFactory under eval_752.services:

  • accepts LiteLLMSettings
  • caches sync clients per provider/key combination
  • leaves room for future async clients as streaming requirements grow
  • exposes a reset() hook for tests
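A minimal sketch of the factory shape, assuming `LiteLLMSettings` fields and a constructor-injection style that are illustrative, not the real `eval_752` implementation. The client constructor is injected so unit tests can stub it without touching LiteLLM:

```python
# Hypothetical sketch of LiteLLMClientFactory; field names and the injected
# client constructor are assumptions, not the real eval_752 code.
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class LiteLLMSettings:
    # Assumed runtime defaults shared by FastAPI, Celery, and LightEval.
    request_timeout_s: float = 60.0
    max_retries: int = 3


class LiteLLMClientFactory:
    """Caches one sync client per (provider_id, api_key) pair."""

    def __init__(self, settings: LiteLLMSettings,
                 make_client: Callable[..., Any]):
        self._settings = settings
        self._make_client = make_client  # injected so tests can stub it
        self._cache: Dict[Tuple[str, str], Any] = {}

    def get(self, provider_id: str, api_key: str) -> Any:
        key = (provider_id, api_key)
        if key not in self._cache:
            # Build lazily; the same client is reused for repeated calls.
            self._cache[key] = self._make_client(
                provider_id=provider_id,
                api_key=api_key,
                timeout=self._settings.request_timeout_s,
            )
        return self._cache[key]

    def reset(self) -> None:
        # Test hook: drop cached clients between test cases.
        self._cache.clear()
```

Caching per provider/key pair (rather than per provider only) keeps key rotation safe: a rotated key produces a fresh client instead of reusing a stale one.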

Settings split

  • bootstrap-only engine selection stays in backend/src/eval_752/app/config.py
  • runtime request timeout and retry policy are persisted through workspace settings
  • the env/UI split must stay documented in Configuration
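The split above can be sketched as follows; the field names, the allowed-keys set, and the `EVAL752_ENGINE` variable are illustrative assumptions, not the real config module:

```python
# Sketch of the env/UI split; names are hypothetical.
import os
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RuntimeSettings:
    # UI-editable values persisted through workspace settings.
    request_timeout_s: float = 60.0
    max_retries: int = 3


def bootstrap_engine() -> str:
    # Bootstrap-only: read once from the environment at startup,
    # never from workspace settings. Variable name is assumed.
    return os.environ.get("EVAL752_ENGINE", "litellm")


def effective_settings(workspace_overrides: dict) -> RuntimeSettings:
    # Workspace overrides win over defaults; unknown keys are ignored
    # so a stale settings row cannot break startup.
    allowed = {k: v for k, v in workspace_overrides.items()
               if k in {"request_timeout_s", "max_retries"}}
    return replace(RuntimeSettings(), **allowed)
```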

Retry and rate-limit strategy

  • provider metadata can describe concurrency and other rate-related fields
  • retry behavior should wrap LiteLLM client calls with explicit backoff and structured logging
  • logs should record provider_id, model, retry_count, latency_ms, and token usage when available
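A minimal retry wrapper under these assumptions; a real implementation would classify LiteLLM exception types (rate limit vs. auth failure) instead of retrying every `Exception`, and would add token usage to the log record when the response exposes it:

```python
# Hypothetical retry wrapper with exponential backoff and structured logging.
import logging
import time
from typing import Any, Callable

log = logging.getLogger("litellm.invocations")


def call_with_retry(fn: Callable[[], Any], *, provider_id: str, model: str,
                    max_retries: int = 3, base_delay_s: float = 0.5) -> Any:
    for attempt in range(max_retries + 1):
        start = time.monotonic()
        try:
            result = fn()
            # Structured fields mirror the blueprint's required log keys.
            log.info("litellm_call_ok", extra={
                "provider_id": provider_id,
                "model": model,
                "retry_count": attempt,
                "latency_ms": round((time.monotonic() - start) * 1000, 1),
            })
            return result
        except Exception:
            if attempt == max_retries:
                raise  # exhausted: surface the original error
            time.sleep(base_delay_s * 2 ** attempt)  # exponential backoff
```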

Test hooks

  • production keeps a single standard LiteLLM path
  • unit tests may still stub or fake clients
  • integration tests should continue using a local OpenAI-compatible gateway with deterministic model names
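A fake client along these lines keeps unit tests deterministic; the response shape below loosely mirrors `litellm.completion` output and is an assumption, not the real schema:

```python
# Minimal fake client for unit tests; response shape is illustrative.
class FakeLiteLLMClient:
    def __init__(self, canned_text: str = "ok"):
        self.calls = []          # records every invocation for assertions
        self._text = canned_text

    def completion(self, model, messages, **kwargs):
        self.calls.append({"model": model, "messages": messages})
        return {"choices": [{"message": {"content": self._text}}]}
```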

FastAPI usage

The /providers/{id}/smoke-test endpoint should use the shared factory so smoke tests and real runs observe the same client defaults.
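The check behind the endpoint can stay framework-agnostic, which keeps it unit-testable; the factory interface and model default below are assumptions:

```python
# Sketch of the smoke-test logic; endpoint wiring is omitted. The factory's
# get(provider_id, api_key) interface is assumed.
def run_smoke_test(factory, provider_id: str, api_key: str,
                   model: str = "gpt-3.5-turbo") -> dict:
    # Same factory as real runs, so defaults (timeout, retries) match.
    client = factory.get(provider_id, api_key)
    try:
        client.completion(model=model,
                          messages=[{"role": "user", "content": "ping"}])
        return {"status": "ok", "provider_id": provider_id}
    except Exception as exc:
        return {"status": "error", "provider_id": provider_id,
                "detail": str(exc)}
```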

Celery usage

  • worker execution should go through the same factory
  • streaming-first behavior should remain preferred, with buffered fallback only when the provider rejects streaming
  • structured invocation logs should be attached to the run log stream
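The streaming-first rule can be sketched like this; the `StreamingNotSupported` exception and the chunk shape (simplified to plain strings here) are assumptions, since real LiteLLM streaming yields delta objects:

```python
# Hypothetical streaming-first execution with buffered fallback.
class StreamingNotSupported(RuntimeError):
    """Assumed to be raised when a provider rejects stream=True."""


def execute_completion(client, model: str, messages: list) -> str:
    try:
        # Preferred path: stream chunks as they arrive.
        chunks = client.completion(model=model, messages=messages, stream=True)
        return "".join(chunks)
    except StreamingNotSupported:
        # Fallback: one buffered response when streaming is rejected.
        resp = client.completion(model=model, messages=messages, stream=False)
        return resp["choices"][0]["message"]["content"]
```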

LightEval alignment

  • LightEval config generation should pull the same provider base URL and key selection logic
  • when use_lighteval_executor is enabled, the LightEval path should stay configuration-compatible with the direct LiteLLM path
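One way to keep the two paths configuration-compatible is to derive both from a single provider record; the config keys below are illustrative, not LightEval's actual schema:

```python
# Illustrative config derivation; keys are assumptions, not LightEval's schema.
def build_lighteval_model_config(provider: dict) -> dict:
    # Reuses the same base_url/api_key selection as the direct LiteLLM path,
    # so enabling use_lighteval_executor does not change which endpoint is hit.
    return {
        "model_name": provider["model"],
        "base_url": provider["base_url"],
        "api_key": provider["api_key"],
    }
```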

Observability

Recommended metrics:

  • litellm_requests_total
  • litellm_request_latency_seconds
  • litellm_tokens_total

Recommended labels:

  • provider_id
  • model
  • status
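A dependency-free sketch of how the metrics and labels fit together; a real deployment would use `prometheus_client` counters and histograms instead of these in-process dicts:

```python
# Dependency-free stand-in for the recommended metrics.
from collections import defaultdict


class LiteLLMMetrics:
    def __init__(self):
        # Label tuple for every series: (provider_id, model, status).
        self.requests_total = defaultdict(int)    # litellm_requests_total
        self.tokens_total = defaultdict(int)      # litellm_tokens_total
        self.latencies_s = defaultdict(list)      # litellm_request_latency_seconds

    def observe(self, *, provider_id: str, model: str, status: str,
                latency_s: float, tokens: int = 0) -> None:
        labels = (provider_id, model, status)
        self.requests_total[labels] += 1
        self.tokens_total[labels] += tokens
        self.latencies_s[labels].append(latency_s)
```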

Open Questions

  • whether day-level token budgets are needed before the Phase 3 horizon
  • how provider-specific headers should be stored if richer proxy compatibility becomes necessary
  • how streaming adapters should evolve without duplicating logic

Next Actions

  • keep runtime controls and secret templates aligned
  • continue factory hardening and integration coverage
  • keep this blueprint synchronized with implementation and acceptance docs