Trace LLM workflows at your app's semantic level, not at the OpenAI API boundary
The data architecture for provider-agnostic reproducibility and experiments with LLM agents and workflows
"Stop Prompting, Start Engineering: 15 Principles to Deliver Your AI Agent to Production" by Vladyslav Chekryzhov deserves far more attention than it has received. As a practicing AI engineer, I can tell the article is born from hard-won experience in production. The advice and checklists in that article are very worth following.
In this post, I want to discuss more concretely the implementation of some of the principles from Chekryzhov's article, and the resulting data architecture required for LLM agent/workflow apps to embody these principles. And consequently, my thoughts on AI tracing/observability SaaSes such as LangSmith, Helicone, Langfuse, Arize, etc. TLDR: I think they are ultimately not suited for provider-agnostic reproducibility and experiments that rapid AI agents and workflow development demans.
For predictable latency and reliability, workflows have to support many providers and models and be able to switch them at any step
Here I quote the "3. Model as Config" section from Chekryzhov's article:
Problem: LLMs are rapidly evolving; Google, Anthropic, OpenAI, etc. constantly release updates, racing against each other across different benchmarks. This is a feast for us as engineers, and we want to make the most of it. Our agent should be able to easily switch to a better (or conversely, cheaper) model seamlessly.
Checklist:
Model replacement doesn't affect the rest of the code and doesn't impact agent functionality, orchestration, memory, or tools
Adding a new model requires only configuration and, optionally, an adapter (a simple layer that brings the new model to the required interface)
You can easily and quickly switch models. Ideally—any models, at minimum—switching within a model family
I agree with this. What's more, if you develop a user-facing application (such as a chat bot or a support bot) for which latency is important, going fully multi-model, multi-provider is a must.
Here's an incomplete list of ways a specific model or provider can fail in the specific agent/workflow step:
The provider has an outage. (Hello, Anthropic!)
The provider has blocked/censored/refused your request because it got triggered by something in your dialogue, oftentimes actually benign. (Hello, Gemini!)
The model fails to return a valid response schema or a correctly formatted tool call. (Everyone, including OpenAI, but especially everyone else.) Sometimes the response is seemingly cut half-way, in which case retry to the same provider usually will help. But often, the response just has a wrong schema/format, in which case retries may not help, even with non-zero temperature. In theory, non-zero temperature setting should make model’s outputs more variable, but in my experience, it sometimes doesn’t.
The model has randomly failed tool call formatting and instead outputted calls as code or XML tags in the content. (Hello again, Gemini.)
The model randomly hallucinated specific details that must be right, such as IDs of something that should have been tool call parameters. LLM should have picked up from the context, but didn't. Of course, models hallucinate all the time, but in these cases it could be easily verifiable (e.g., the ID doesn't appear in the context) and hence the request should be retried, preferably with a different model.
The provider updated something small about the exact response format, structured outputs/tool calls schema processing, thinking/reasoning, or the like, and this caused one of the insufficiently flexible layers between your app's semantics and the provider to break. These layers can include: proxy services like OpenRouter, client library like LiteLLM/OpenAI client, AI agent/tool call harness like PydanticAI, or your internal LLM call integration layer.
When using proxy service such as OpenRouter, their updates can also silently break with certain providers or models in certain cases (structured outputs, tool calls, streaming, thinking) or their combinations.
You get a burst of usage and start hitting the throughput limits through one of your insufficiently "beefy" accounts. The remedy for this is proxying everything though OpenRouter, but there are many reasons why this probably makes more harm than good: OpenRouter's own bugs (sometimes not circumventable), extra latency, the single point of failure, etc.
Even when the provider doesn't have an outage overall, they sometimes put certain requests "on hold", in which case it may take them minutes even to start streaming you tokens back. Everyone who programmed with AI have experienced these lags.
If the agent step uses reasoning, it may randomly "overthink" a request, thinking for minutes on end on a relatively simple task. This could be almost as bad as a failure for latency-sensitive AI apps.
Exact circuit breakers won't help because the agent's thinking is not exactly repeated in a loop, merely highly repeated. "Fuzzy reasoning circuit breakers" would help, but this requires the next level of harness sophistication: an LLM monitoring the reasoning stream. It is very hard to implement, adds to costs, streaming LLM responses are their own can of bugs and worms especially with OpenRouter and LiteLLM, etc.
Setting hard thinking/reasoning token limits (supported only by Anthropic at the moment, AFAIR?) or
"reasoning_effort": "low"
usually helps, but is often undesirable: we want to let LLM (provider) to decide how much to think somewhat on their own, based on the difficulty of the request. Requests that genuinely require longer thinking do happen.
I used to think that a primary and a single fallback model (for any specific workflow/agent step) were sufficient. In practice, we found that a chain of at least six(!) model configs was the minimum for >99% success rate and latency below threshold, to safeguard against both period-of-time and stray, one-off issues across different providers, accounts, and models.
For example, in the application that I develop, the core reasoning/chat step currently has the following chain of model configs for fallback:
openrouter/google/gemini-2.5-pro-preview
openrouter/google/gemini-2.5-flash-preview:thinking
openrouter/openai/o4-mini
bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0 -- lower than Gemini Flash and o4-mini because it's often slow, and the response latency is important for our application.
gemini/gemini-2.5-pro -- same as the first model config in the chain, but directly though Gemini API rather than through OpenRouter.
gemini/gemini-2.5-flash
Implementation challenge: providers' LLM interfaces are just different and hardly convertible between each other
While I criticise OpenRouter and LiteLLM above, and they have very convoluted code (OpenRouter is wisely closed source), I don't think it's because these proxy layers are built by bad engineers.
Rather, it reflects that there is a ton of essential complexity in the task of conforming dozens of models and providers to OpenAI API which is a quasi-standard (and itself is a moving target, e.g. reasoning no longer available in API responses). FWIW, Standard Completions would largely remedy this, but it's a very distant future at the moment, if happens at all.
When providers (rather than proxies such as OpenRouter who focus on this challenge) try to provide their own "OpenAI compatible" APIs, they often add bugs on their own. (Hello, Grok!) I'm yet to encounter a provider who would be truly compatible with OpenAI API.
The above challenges of transforming OpenAI-format requests into provider/model-specific requests, and then provider/model-specific responses into OpenAI-format responses are hard enough, but transforming between provider/model-specific formats (OpenAI, Anthropic, Gemini, Bedrock), across the feature matrix: (chat structure, structured outputs/response schemas, tool calls, reasoning, streaming) is practically impossible. We've encountered:
Some models don't support a dedicated
"system"
prompt, the chat has to start with a"user"
message.Gemini requires message's content to have separate parts for efficient caching, while other "OpenAI compatible" providers are confused by multiple message parts.
Some providers prohibiting consecutive
"assistant"
messages in a chat, requiring inserting dummy empty"user"
messages between them.Some providers prohibiting
"assistant"
response (content) and tool_call to be in the same message, requiring to artificially splitting such responses with bothcontent
andtool_calls
from other providers. (Obviously, this directly conflicts with the previous point.)Response/structured outputs schemas: what in the ever living f*** is this shambles Google...?
Some otherwise good providers and models don't support "native" reasoning and therefore chain-of-thought reasoning has to be specifically prompted, and then
<thinking>...</thinking>
tags from the response cut out (see official AWS user guide). But wait, if you need structured outputs, it couldn't be<thinking>...</thinking>
, you will need to modify your schema to insert a"general_thinking":
field...
Even more generally, different models work best with different prompts (and less capable models outright require more specific prompting), potentially leading to the matrix (agent/workflow step, provider/model) of system prompts and context formats (such as, for the chat summarisation step, keeping the message list as is vs. condensing the message list in its entirety into a single prompt.)
Trace AI workflows at the application's semantic level
All the incompatibilities between and specialisations for different LLM providers and models mean that to build for reproducibility and experiments with model- and provider-agnostic AI agent and workflows with heterogeneous steps (tool calls, structured outputs, reasoning, and streaming) we must trace workflows at the application's semantic level, not at the lower-level "OpenAI-ish" API boundary, and shape the request for the specific provider and model at runtime. We can see this as "late binding with provider's APIs and specific model's damands".
Application's data model can include abstractions like "message" and "role" as in the LLM chat APIs, if the application is genuinely a chat (e.g., in a support chat bot), but it shouldn't be limited to them, and shouldn't be constrained to them.
For example, if some entities are pulled into the LLM context by IDs (e.g., via search), for LLMs to see them the data of these entities should appear somewhere in the system prompt or chat messages's content as plain text. If the workflow trace is captured at the "OpenAI-ish API/formats boundary", as is the case for most LLM tracing and observability services, this contextual data is "fossilized" within the trace. When debugging such a trace, we couldn't easily tell if it was a data quality problem that may have been resolved independently from the AI workflow logic after the workflow took place in production, or it was an actual LLM hallucination that failed the workflow.
Another benefit of this "late binding" approach to tracing of AI applications over LLM observability SaaS is that it doesn't need to store requests for every LLM response because requests can always be re-created from the data. For chat-like applications, the storage overhead for a long-running chats becomes quadratic, as every new turn requires storing the entire history again.
Use immutable data schema/design to unify application's persistence and tracing
If the application's trace should be kept at the same (semantic) data level at which the core application's logic operates, it becomes clear that that they shouldn't be stored stored separately: the "trace database" can be just the production database(s) that use immutable data schema/designs.
Immutable data designs are quite simple to implement with all types of databases:
Relational OLTP databases like PostgreSQL and MySQL have temporal extensions or built-in temporal tables features. Depending on the scale of the application (and how long in the past you would need to keep traces), this may already be good enough. Otherwise, it's possible to setup change data capture into a database in the next category, or better yet, just pick one at the primary storage from the start. Remember that LLM applications would never notice millisecond difference in point query latency between OLTP and OLAP databases.
In OLAP, time-series, and streaming databases, such as ClickHouse, Databend, Apache Doris, StarRocks, Firebolt, TimescaleDB/TigerData, QuestDB, or RisingWave, the combination of the native
time
column + entity (record) ID identifies the immutable version of the entity's data at the point of time that can be linked from other tables and databases.Graph and document databases like Neo4j and ScyllaDB naturally lend themselves to graph- or chain-like data versioning with copy-on-write "head" entity updates, a la Git.
Preprocessed tabular data is stored in Hive catalogs or the modern alternatives: Apache Iceberg, Apache Hudi, or Delta Lake, that have built-in table versions that can simultaneously serve as the versions for the entities stored these tables.
The most elegant solution, however, would be to use the actual "database inside out" -- Rama, or "immutable databases" like XTDB for the core of the AI workflow logic. I would recommend these for greenfield AI workflow applications if they are not ruled out by organisation's technical strategy that may prescribe selecting from a certain list of databases that are already in use in the organisation (cf. magnitudes of exploration).
In our chat application, we are store the user interaction session's data in a single value (keyed by the session ID) in DynamoDB, with atomic updates to prevent races with async messages from the user. To simulate immutability, all changes to the session are stored in the same document, in a separate "revision_history" field. Other pieces of the semantic data in our application are also stored in DynamoDB and in MySQL.
Connecting the semantic data traces with LLM responses
If an OLAP database is already used to store the semantic trace of the application, it's best to store LLM responses in a separate table (or multiple tables, one per workflow step type/kind) in the same database, to simplify coding your custom evals interfaces.
Otherwise, I think VictoriaLogs is optimal due to its efficiency, operational and configuration simplicity (no need to set up search indexes for every column! don't need an "ingestion pipeline"!), automatic "flattening" of LLM responses (JSONs) built-in, and the built-in analytics analytics console.
Since all providers already send responses with request IDs, there is no need to re-define these IDs for the tracing tables.
In addition to the raw LLM response fields from the provider, the table includes the workflow ID (which is the same as the "trace" ID) and the entity IDs (pointers to immutable versions of these entities) in the semantic data model that were used to construct the context (request) for this LLM call.
The request IDs whose responses have directly contributed to the creation of a version of application's semantic entity can be added into array column(s) in the table row (or fields in the document in a NoSQL db) that represents this entity version.
By LLM requests "contribute to the creation of the version of the entity" not only by literally generating chat messages, structured outputs, and whatnot that end up constituting the entity's data, but also by directing conditional logic:
Pre-generation: fast LLM classification and/or pre-filtering of the user input or external events
Post-generation: guardrails, checking that the LLM didn't "forget its role" in the conversation.
If a post-generation guardrail rejects an output of a model and the workflow step is retried (and succeeds) with a different provider or model, that original rejected LLM response also contributes to the new entity version, being the input for the LLM guardrail call that gave way for the eventually successful LLM response.
Storing LLM request IDs in the semantic data tables as "denormalised" metadata is much easier to program than doing the opposite, storing the IDs of the data that the LLM requests have contributed to. Denormalisation is not an issue because both the LLM responses storage and entities are immutable.
Still, collecting IDs of all contributing LLM calls at the point of writing down the entity version can become quite burdensome without contextvars (in Python) or their equivalents in other programming languages. To pass LLM request ID "bags" in contextvars
between threads (rather than coroutines) in Python is possible with a thread pool executor wrapper that should be used throughout the application code. This reinforces the importance of the keystone principle from Chekryzhov's article: "Own the Execution Path".
In a follow-up post, I'll describe my experience building a custom trace reproducibility/debugging/evals interface using marimo.