<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Engineering Ideas]]></title><description><![CDATA[Ethics, AI Safety, physics of intelligence and agency, organisation and community engineering, methodology.]]></description><link>https://engineeringideas.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png</url><title>Engineering Ideas</title><link>https://engineeringideas.substack.com</link></image><generator>Substack</generator><lastBuildDate>Tue, 21 Apr 2026 19:05:28 GMT</lastBuildDate><atom:link href="https://engineeringideas.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Roman Leventov]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[engineeringideas@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[engineeringideas@substack.com]]></itunes:email><itunes:name><![CDATA[Roman Leventov]]></itunes:name></itunes:owner><itunes:author><![CDATA[Roman Leventov]]></itunes:author><googleplay:owner><![CDATA[engineeringideas@substack.com]]></googleplay:owner><googleplay:email><![CDATA[engineeringideas@substack.com]]></googleplay:email><googleplay:author><![CDATA[Roman Leventov]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Tasklet is the "o1 moment" for long-horizon AI agents that learn on the job]]></title><description><![CDATA[A couple of weeks ago, Andrew Lee unveiled Tasklet, an AI agent with a two-tier design: a long-lived, high-level agent curates 
the system prompt, toolset, and memories for individual &#8220;sub-agents&#8221;, i.e., individual task runs.]]></description><link>https://engineeringideas.substack.com/p/tasklet-is-the-o1-moment-for-long</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/tasklet-is-the-o1-moment-for-long</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Mon, 27 Oct 2025 15:52:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A couple of weeks ago, Andrew Lee unveiled <a href="https://tasklet.ai/release-notes">Tasklet</a>, an AI agent with a two-tier design: a long-lived, high-level agent curates the system prompt, toolset, and memories for individual &#8220;sub-agents&#8221;, i.e., individual task runs.</p><p>Memories and results are stored in an SQL database and made available for the sub-agents to explore via SQL queries (agentic search in the DB) to build the context for the specific task. For example, for a customer relationship tasklet, the sub-agent may be tasked with responding to an e-mail from the customer, and it may search the database for past interactions with that specific customer; if this is a new customer inquiring about product X, the sub-agent may search the database for past sales of product X.</p><p>The system invites feedback from the user at the end of the task runs for the higher-level agent to curate the data/memories/results and to improve the system instructions for future runs.</p><p>Please read or listen to <a href="https://www.cognitiverevolution.ai/always-bet-on-the-models-how-tasklet-puts-the-agency-in-agents-with-ceo-andrew-lee/">Andrew Lee&#8217;s interview by Nathan Labenz</a> for more details.
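</p><p>To make the memory-search pattern concrete, here is a minimal SQL sketch of the kinds of queries a sub-agent might run; all table and column names are my own illustrative assumptions, not Tasklet&#8217;s actual schema:</p><pre><code>-- Past interactions with a specific customer (hypothetical schema)
SELECT m.created_at, m.summary
FROM memories m
WHERE m.customer_email = 'alice@example.com'
ORDER BY m.created_at DESC
LIMIT 20;

-- Past results mentioning a specific product, for a new-customer inquiry
SELECT r.task_id, r.outcome
FROM task_results r
WHERE r.outcome ILIKE '%product X%'
ORDER BY r.finished_at DESC
LIMIT 20;</code></pre><p>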
I highly recommend it.</p><p>Also recently, Anthropic introduced <a href="https://www.anthropic.com/news/skills">Agent Skills</a> for Claude. Agent Skills are a move away from MCP servers towards simple .md instructions for using a given app or API. Unlike MCP servers, whose commands and prompts are static and predefined (the user can only turn them on or off, not adjust them to their needs), skills are downloaded by the user and can be edited by the user or by AI agents.</p><p>The adaptable Agent Skills resonate with the Tasklet system design: Tasklet also prefers using HTTP APIs over MCPs and building context-specific instructions for using specific HTTP APIs effectively.</p><p>I believe that these two announcements are a kind of &#8220;o1 moment&#8221; (from <a href="https://openai.com/o1/">OpenAI&#8217;s o1 model announcement</a>) in the domain of <strong>long-horizon, continually learning AI agency</strong> that <a href="https://www.dwarkesh.com/p/timelines-june-2025">Dwarkesh Patel has turned into a meme</a>. (For more recent context, see <a href="https://www.interconnects.ai/p/contra-dwarkesh-on-continual-learning">Nathan Lambert&#8217;s contra opinion</a> from two months ago.)</p><p><strong>That is, I believe that the Tasklet two-tier design, plus post-training of LLMs by frontier AI labs (starting with Anthropic, obviously, but there is no doubt that all other labs are doing this too or will soon follow) to make them more effective at picking up &#8220;skills&#8221;, i.e., lengthy and possibly multi-file text instructions for using a given app or API, is sufficient for </strong><em><strong>prosaic</strong></em><strong> continual learning, and in a few months a lot of providers are going to replicate and advance this AI agent architecture.</strong></p><p>Above, I referred to this as an &#8220;o1 moment&#8221; because Tasklet&#8217;s two-tier system design is not surprising. It was an &#8220;open secret&#8221; that OpenAI was doing RL on LLMs throughout 2024.
The key is a credible demonstration that this design <em>works</em>, at which point many players will rush to replicate it. In the same vein, very soon after the o1 model announcement, many AI labs began doing RL on LLMs and soon matched OpenAI&#8217;s results, most famously DeepSeek.</p><p>The main difference between the Tasklet announcement and OpenAI&#8217;s o1 announcement is, of course, that Tasklet is much lower profile and didn&#8217;t make a huge splash, but I&#8217;m sure that all the relevant players have taken note.</p><h2><strong>Let&#8217;s build open-source Tasklet-like agents where users fully own their context</strong></h2><p>Tasklet&#8217;s product design is good for <a href="https://intelligence-curse.ai/breaking/#section-4">AI diffusion</a>, and hence net positive from the perspective of countering economic disempowerment. But Tasklet is still a classical SaaS that owns and locks in its data, which leads to <a href="https://agentictech.substack.com/p/the-context-wars-notes-from-a-gathering">AI context fragmentation</a> for the users.</p><p>Building an <strong>open-source Tasklet alternative on top of <a href="https://engineeringideas.substack.com/p/the-personal-ai-platform-technical">the personal AI data platform</a></strong> (I&#8217;ve called it <strong>Pocketdata</strong>) <strong>would let the user retain full ownership and control of all of their context</strong>, along with other benefits that I outlined in <a href="https://engineeringideas.substack.com/p/personal-agents">this post</a>:</p><ul><li><p>strict &#8220;pay only for the inference that you have actually used&#8221; billing with the option to self-host,</p></li><li><p>the freedom to swap LLM models and other service providers (including local, fully private inference), and</p></li><li><p>the freedom to mix and match with other pieces of their personal data plane, such as chats or deep research.</p></li></ul><p><strong>If you are interested in building (or already building) an open Tasklet
alternative with the above characteristics, let&#8217;s talk!</strong> You can reach me at leventov.ru@gmail.com. I&#8217;m personally busy building Pocketdata (the infrastructure), and there is enough complexity and groundwork there to require my full attention for many more months. On the other hand, if you are focusing on agent engineering, you may want to delegate the infrastructure work to someone else.</p><h2><strong>Why Pocketdata is the right platform for personal Tasklet-like agents</strong></h2><p>In the rest of this post, I&#8217;ll make an argument for why Pocketdata is the &#8220;right&#8221; platform for Tasklet-like agents if the aim is to make these agents fully private, user-owned, secure, and self-hostable.</p><p>Unbeknownst to me, there has already been significant convergence between Pocketdata and Tasklet&#8217;s agent design. The &#8220;<a href="https://engineeringideas.substack.com/i/173162413/agentsmd-equivalent-for-the-personal-data-plane">AGENTS.md equivalent for personal data</a>&#8221; that I&#8217;ve proposed is suspiciously similar both to a &#8220;skill&#8221; and to the tasklet sub-agent&#8217;s instructions for querying past data in its memories and past results.</p><p>Also, since publishing the first <a href="https://engineeringideas.substack.com/p/the-personal-ai-platform-technical">technical blueprint</a> for Pocketdata, I have made two significant changes to the platform design, both of which are conducive to building an open-source Tasklet alternative on top of Pocketdata.</p><h3><strong>From Pocketbase to Postgres</strong></h3><p>First, I&#8217;ve ditched the idea of using vanilla <a href="http://pocketbase.io/">Pocketbase</a>.
I replaced it with Postgres and plan to later rebuild a &#8220;Postgres-flavoured Pocketbase&#8221; on top of it, along the lines of Zhenruyan&#8217;s &#8220;<a href="https://github.com/zhenruyan/postgrebase">Postgrebase</a>&#8221;.</p><p>The major reason for this change is that choosing Pocketbase as the primary storage in Pocketdata forces AI apps that onboard onto Pocketdata to be modified at the source code level, and perhaps quite significantly so if they don&#8217;t use ORMs or other suitable abstractions. For example, <a href="https://github.com/Mail-0/Zero">Mail Zero</a> supports only Postgres as its storage. Realistically, this would be too big an ask for open-source AI app developers to invest resources in supporting Pocketdata unless/until it gains a huge user base.</p><p>On the other hand, with the Postgres-first approach, onboarding AI apps on the platform requires just some config changes, perhaps Dockerfile and init script changes, and upstreamable bug fixes and improvements. This is much more sustainable for Pocketdata to maintain on our own for a few key apps (such as Open WebUI, Mail Zero, the proposed open Tasklet reimplementation, and the like), without asking permission from the maintainers of the upstream projects.</p><p>Initially, all apps will have their own <a href="https://www.postgresql.org/docs/current/ddl-schemas.html">schemas</a> and users in Postgres, and these Postgres users will have read and write rights only in their respective schemas, ensuring isolation between the apps.</p><p>When Postgres in Pocketdata gains a Pocketbase/Postgrebase-derived Go sidecar and the &#8220;common schema&#8221; <a href="https://pocketbase.io/docs/collections/">collections</a>: chats, notes/docs, emails, etc.
(see discussion in the <a href="https://engineeringideas.substack.com/i/173162413/the-common-data-schema-to-be-determined">previous post</a>), the AI apps <em>could</em> start to expose and integrate their data with the rest of the &#8220;personal data plane&#8221; gradually, by dual-writing to their own schema and to &#8220;Pocketbase-owned&#8221; collections (living in their own, protected schema to which only the <code>postgres</code> superuser can write). Alternatively, for the key apps such as Open WebUI, these integrations could be shipped by Pocketdata itself in the same permissionless way, by bundling the required <a href="https://pocketbase.io/docs/go-event-hooks/">hooks</a> for Open WebUI&#8217;s (Mail Zero&#8217;s, etc.) data schema in Pocketdata&#8217;s container image.</p><p>Another, slightly unexpected advantage of using Postgres and the &#8220;separate Postgres schema and Postgres user per app&#8221; design is that it permits <a href="https://pocketbase.io/docs/js-database/#executing-queries">raw SQL queries in JS hooks</a> with proper app isolation via a trick: when apps register their hooks, all their raw SQL queries are wrapped as <a href="https://www.postgresql.org/docs/current/sql-createfunction.html#SQL-CREATEFUNCTION-SECURITY">functions with a SECURITY DEFINER clause</a> (I&#8217;ve mentioned the idea of apps owning their hooks <a href="https://engineeringideas.substack.com/i/173162413/other-pocketbase-plugins-required">here</a>).</p><p><strong>Tasklet-like agents should each have their separate Postgres schemas and users for strong isolation</strong>, i.e., they should be separate &#8220;apps&#8221; in the Pocketdata platform.</p><h3><strong>From LiteLLM to Bifrost and Agentgateway</strong></h3><p>In the previous post, I wrote that &#8220;LiteLLM doesn&#8217;t have serious alternatives at the moment&#8221; as an LLM gateway.
However, after another series of gripes with LiteLLM&#8217;s performance and <a href="https://github.com/BerriAI/litellm/discussions/8044">code quality</a>, I searched for alternatives once again, and I&#8217;m happy to report that there is now a serious alternative to LiteLLM as a standalone LLM gateway server: <a href="https://github.com/maximhq/bifrost">Bifrost</a>.</p><p>Bifrost is written in Go and uses the <a href="https://github.com/valyala/fasthttp">fasthttp</a> library. This makes me confident in its performance and low memory footprint for streaming LLM requests, which enables hosting Bifrost on the same Fly machine as Postgres and its sidecars (pgBackRest, pgBouncer, and the <a href="https://github.com/pocketbase/pocketbase">Pocketbase</a>-like Go process), albeit in a different <a href="https://community.fly.io/t/docker-without-docker-now-with-containers/22903">container</a>.</p><p>Since Bifrost is not an MCP gateway nor a generic OpenAPI gateway, a separate gateway is needed for these purposes. <a href="https://github.com/agentgateway/agentgateway">Agentgateway</a> covers this. Since Agentgateway is written in Rust, it could also live on the same Fly machine. However, since Agentgateway doesn&#8217;t support storing config and auth keys in Postgres (cf. <a href="https://github.com/agentgateway/agentgateway/pull/350">this PR</a>), there is some work to do, but it should be relatively straightforward.</p><p>Note that LiteLLM is an LLM and MCP gateway, but not a full-fledged HTTP API gateway.
Therefore, the introduction of a separate gateway system would have been inevitable even if Pocketdata were still using LiteLLM.</p><p>As Andrew Lee described in the <a href="https://www.cognitiverevolution.ai/always-bet-on-the-models-how-tasklet-puts-the-agency-in-agents-with-ceo-andrew-lee/">interview</a>, Tasklet agents are more successful in accessing services via HTTP APIs than via MCPs, provided that the agents have access to &#8220;skills&#8221; to teach them how to use these service APIs.</p>]]></content:encoded></item><item><title><![CDATA[The personal AI platform: technical blueprint]]></title><description><![CDATA[Pocketbase and LanceDB for extensible personal AI data plane and system of record]]></description><link>https://engineeringideas.substack.com/p/the-personal-ai-platform-technical</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/the-personal-ai-platform-technical</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Tue, 09 Sep 2025 09:28:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0zhz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Giving a more concrete shape to the platform part of the <a href="https://engineeringideas.substack.com/p/personal-agents">Personal Agents</a> vision.
The primary <a href="https://engineeringideas.substack.com/p/personal-agents">motivations</a> for the Personal Agents agenda are to reduce the power imbalance between "Big Token" corps and people, and to enable political and media innovation through the deployment of personal "<a href="https://aligned.substack.com/p/a-proposal-for-importing-societys-values">political representative</a>" AI agents, "<a href="https://jimruttshow.blubrry.net/the-jim-rutt-show-transcripts/transcript-of-ep-238-sam-sammane-on-humanitys-role-in-an-ai-dominated-future/">info agents</a>", and the like.</p><h4>Summary</h4><p>The <a href="https://engineeringideas.substack.com/p/architecture-theory-and-the-hourglass">unifying abstraction</a> between various personal AI apps and integrations should be the <strong>common models (schemas and access APIs) for personal context data: chats with AI, notes/documents, emails, tables/records, visited web pages</strong>, media feeds such as news and podcasts, and so on.</p><ul><li><p>AI chats, notes, and tables live in <a href="https://github.com/pocketbase/pocketbase">Pocketbase</a>.</p></li><li><p>Immutable objects: web page snapshots, emails, PDF docs, and media (podcasts, images, etc.) go to <a href="https://github.com/lancedb/lancedb">LanceDB</a>.</p></li><li><p>LLM requests and responses (for AI app debugging, LLM usage analysis/audit, and "Reasoning" collapsible expansion in AI chat UIs) and service API call logs are also stored in LanceDB.</p></li></ul><p>Pocketbase connects to the personal computer to sync browsed web pages, notes and tables (e.g., from Obsidian), and personal media.
Alternatively, the app running on the personal computer can act as an <a href="https://github.com/modelcontextprotocol/servers">MCP server</a> for the AI apps, such as <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, that are also deployed privately alongside the <strong>data plane services</strong>:</p><ul><li><p>Pocketbase with app-specific and user-defined JS extensions,</p></li><li><p>The search and data ingestion service that embeds LanceDB,</p></li><li><p>The LLM, MCP, and service API proxy that is based on <a href="https://github.com/BerriAI/litellm">LiteLLM</a> and also embeds LanceDB for storing the requests, responses, and costs of LLM calls and other service API calls (web search, web scraping, audio transcription, media generation, etc.).</p></li></ul><p>The personal AI platform (which includes the data plane and the AI apps that use it, such as Open WebUI) could be deployed either on a private VPS or in a cloud service such as <a href="https://fly.io/">Fly.io</a>, where each user has a separate private <a href="https://fly.io/docs/security/org-roles-permissions/">org</a> to which they can deploy custom AI apps at will.</p><p>The fact that the data plane is private, not censorable, and not owned by a big tech corp should increase people's trust and willingness to upload their personal data: browsing history, emails via IMAP from Gmail, etc.</p><p>In turn, this personal context data is very attractive for AI app developers who face the "cold start" problem: <strong>the AI app isn't very useful until it has access to the personal context</strong>. The common data model for things like chats with AI, notes/documents, e-mails, etc., makes a "warm start" possible: people can deploy the AI app privately and see how it works with their existing chats, documents, and other context. Thus, people can experiment with different AI apps without any data import/export hassle.</p><p>Many apps have chat UIs, so they can leverage the common data model.
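</p><p>As a purely illustrative sketch (not the actual common schema, which is yet to be determined), a shared <code>chats</code> collection in Postgres might look like this, with app-specific fields tucked into a JSONB column so that generic chat apps can still search and render the core fields:</p><pre><code>-- Hypothetical common "chats" collection; all names are illustrative.
CREATE TABLE chats (
    id           uuid PRIMARY KEY,
    app          text NOT NULL,       -- which AI app produced the chat
    title        text,
    messages     jsonb NOT NULL,      -- [{role, content, ...}, ...]
    app_specific jsonb DEFAULT '{}',  -- fields only the producing app understands
    created_at   timestamptz NOT NULL DEFAULT now()
);</code></pre><p>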
Yet, the apps can still differentiate a lot in terms of their focus and specific knowledge (think financial assistant vs. psychotherapist), communication styles and agency levels ("analyst" vs. "doer"), personalisation abilities, tools used (such as web search or MCPs), etc.</p><p>Specialised apps such as medical or financial consultants can proactively run vector search over all of the existing personal context across modalities (AI chats, search and browsing history) when the app is run for the first time, to personalise the suggested prompts or even start working on the user's recent questions or problems right away.</p><p>The freedom of choice between different AI apps and LLM providers, the ability to vibe code the personal system of record, and unified billing for hosting, LLM calls, and other API services should make the personal AI platform <a href="https://engineeringideas.substack.com/i/166118930/personal-agents-offer-mundane-value">more attractive</a> to people than limited, "one size fits none" subscription offerings from big tech corps.</p><p><strong>This is the first post in a three-part series.</strong> In the rest of this post, I detail the reasoning behind the technical decisions that have shaped this personal AI platform vision so far, concerning the personal data plane architecture for <strong>deployment simplicity, durability, and amenability to multiple AI apps coexisting on the platform</strong>. I also describe <strong>the development path to the </strong><em><strong>minimum viable</strong></em><strong> personal AI platform</strong> at the end of this post.</p><p>In the second post, I'll focus on the platform's <strong>privacy and security architecture</strong>.
After that, I'll discuss possible <strong>PaaS providers&#8217; and AI app developers&#8217; business models</strong> for the personal AI platform.</p><h4>The middle level in the personal AI platform's hourglass architecture: databases, data model, and query APIs</h4><p>For the context on multi-level/layer architectures, see my earlier posts "<a href="https://engineeringideas.substack.com/p/architecture-theory-and-the-hourglass">Architecture theory and the hourglass model</a>" and "<a href="https://engineeringideas.substack.com/p/ai-agency-architecture-in-the-large">AI agency architecture-in-the-large: the relevant levels of abstraction</a>".</p><p>The idea that <strong>the "personal AI app platform" should be a </strong><em><strong>data</strong></em><strong> platform first and foremost</strong> seems so self-evident to me that I'm having a hard time explaining why I think so: I don't even see realistic alternatives. Cf. the proposition that I made in another recent post: that AI app developers should better "<a href="https://engineeringideas.substack.com/i/168052480/use-immutable-data-schemadesign-to-unify-applications-persistence-and-tracing">use immutable data schema/design to unify app's persistence and tracing</a>".</p><p>Here's the obligatory hourglass diagram:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0zhz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0zhz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png 424w, 
https://substackcdn.com/image/fetch/$s_!0zhz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png 848w, https://substackcdn.com/image/fetch/$s_!0zhz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!0zhz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0zhz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png" width="1456" height="1211" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1211,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:210231,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://engineeringideas.substack.com/i/173162413?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!0zhz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png 424w, https://substackcdn.com/image/fetch/$s_!0zhz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png 848w, https://substackcdn.com/image/fetch/$s_!0zhz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png 1272w, https://substackcdn.com/image/fetch/$s_!0zhz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feed83dcf-2651-404c-9c93-aab479c39af4_1460x1214.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Until about six months ago, the vast majority of <a href="https://www.latent.space/p/oai-v-langgraph?open=false#%C2%A7impact-of-agent-frameworks">agentic frameworks</a> were predominantly <strong>focused on the "control flow and intelligence stuff"</strong>: workflow execution, workflow design, agent orchestration, tools, MCPs, and capabilities <strong>rather than persistence and data models</strong>.</p><p><a href="https://docs.convex.dev/agents">Convex Agents</a> is the only exception that comes to my mind. Other frameworks in which the persistence schema is externalised and made somewhat of a public API are <a href="https://github.com/letta-ai/letta">Letta</a> (via the <a href="https://github.com/letta-ai/agent-file">Agent File format</a>) and <a href="https://github.com/mastra-ai/mastra">Mastra</a> (see Mastra's <a href="https://mastra.ai/en/docs/server-db/storage#data-schema">data schema</a>). However, neither Letta nor Mastra supports persistence to a <em>reactive database</em>.</p><p>I call a database <em>reactive</em> if it supports real-time data subscriptions and embedding custom HTTP routes/methods and "triggers" written in a popular programming language without severe runtime constraints (to distinguish them from good old SQL RDBMS triggers).
<em>Data grid</em>, <em>database inside out</em>, and <em>headless CMS</em> are overlapping ideas.</p><p><strong>A reactive database permits independent AI apps</strong> (written in any language and with any agentic framework, or without any framework) <strong>to interoperate around the common data model.</strong></p><p>For example, a specialised app can produce chats with some app-specific fields, but if those chats are still stored in the common <code>chats</code> table, they can at least be searched over, or even picked up by another, generic chat app. This is helpful if the user has abandoned the app that produced some chats, but may still search and access those chats through other, generic AI chat apps (such as Open WebUI) that they still use.</p><p>For the above case, database reactivity <em>per se</em> is not necessary. However, there are plenty of use cases for <em>personal AI-first system of record or exocortex</em> that <em>do</em> require database reactivity and "micro-ETL" within the database, such as:</p><ul><li><p>An LLM workflow that classifies and triages inbound emails.</p></li><li><p>A proactive personal assistant that suggests creating tasks or action items from chats and searches.</p></li><li><p>An "<a href="https://jimruttshow.blubrry.net/the-jim-rutt-show-transcripts/transcript-of-ep-238-sam-sammane-on-humanitys-role-in-an-ai-dominated-future/">info agent</a>" that processes many information feeds (news, podcasts, discussions in groups, cold inbound requests, etc.) 
to prepare the five most salient/attention-worthy posts or messages per day to present to the human.</p></li></ul><h4>Architectural goals for the personal AI platform</h4><p>In the following few sections, I'm explaining why I picked Pocketbase as the personal data plane's reactive database and LanceDB as the database for searching immutable data and logging LLM and other service API calls, respectively.</p><p>Keep in mind the architectural goals for the personal AI platform that motivate these technology choices:</p><ul><li><p><strong>Simplicity of deployment</strong>: the personal AI platform must be <strong>deployable on a single host</strong>, such as a personal computer or a VPS.</p></li><li><p>Cloud-grade <strong>durability</strong>: even in a single-host setup, it should be possible to durably <strong>back up the data to object storage</strong>, and this backup itself shouldn't require complicated extra machinery. If the object storage service is separate from the VPS hosting, this safeguards the personal data even against abrupt VPS account termination and provides a high degree of <strong>resilience to deplatforming</strong>.</p></li><li><p><strong>Privacy</strong>: the personal AI data platform should be fully functional even when the user's LLM activity doesn't exit the host (or the Fly.io <a href="https://fly.io/docs/security/org-roles-permissions/">org</a> in a cloud setup), with LLM inference through <a href="https://github.com/ollama/ollama">Ollama</a> or similar. It should be possible to configure the data plane such that access to the host machine's local disks doesn't leak personal data, either in self-hosted or cloud setups. So, all data on disks should be encryptable by user-owned keys.</p></li><li><p><strong>LLM security</strong>: screen all inbound data (emails, web pages, etc.) for prompt injections by default. 
<a href="https://docs.litellm.ai/docs/proxy/guardrails/quick_start">Post-call guardrails</a> for all LLM calls made from the AI apps are ON by default. This explains why the platform has a separate LLM proxy service (based on LiteLLM) rather than leaving LLM routing and guardrails to AI apps.</p></li><li><p><strong>Amenability to multiple AI apps coexisting on the platform</strong>, where these apps are developed independently and externally. This means that <strong>apps should be able to register their extensions to the reactive database dynamically</strong>, i.e., without the need to re-deploy the database. This motivates the stricter LLM security baseline: we cannot count on independent AI app developers to consistently follow LLM security best practices.</p></li><li><p><strong>Governability and software supply chain security</strong>: it should be possible to configure which data tables (emails, browsed web pages, notes, etc.) every deployed app has access to. It should be possible to configure manual approvals for apps' requests for data access, dynamic trigger registrations, or requests to skip LLM guardrails.
Manual approvals can be skipped only if the app's container image meets certain criteria, such as being built from a public GitHub repo with <a href="https://docs.github.com/en/actions/how-tos/secure-your-work/use-artifact-attestations/use-artifact-attestations#generating-build-provenance-for-container-images">provenance attestation</a> and not registering new vulnerabilities (relative to the previously deployed app version) with <a href="https://github.com/google/osv-scanner">osv-scanner</a>.</p></li><li><p><strong>Business-friendliness</strong>: to attract app developers and enable commercial hosting of the personal AI app platform (which is critical for the personal AI platform adoption, as self-hosting would require people to manage too many separate service API and hosting subscriptions), the technologies comprising the data plane have to be <em><a href="https://opensource.stackexchange.com/questions/15273/what-is-the-difference-between-open-source-and-source-available-software">open source</a></em> rather than <em><a href="https://opensource.stackexchange.com/questions/15273/what-is-the-difference-between-open-source-and-source-available-software">source available</a></em>.</p></li></ul><h4>Reactive database: Pocketbase</h4><p>Here are reactive databases (see my definition above) that I'm aware of:</p><ul><li><p><a href="https://github.com/supabase/supabase">Supabase</a>: <a href="https://supabase.com/docs/guides/functions">edge functions</a> in JS or TS, persistence to Postgres.</p></li><li><p><a href="https://github.com/nhost/nhost">Nhost</a>: <a href="https://nhost.io/product/functions">serverless functions</a> in JS or TS, persistence to Postgres.</p></li><li><p><a href="https://www.convex.dev/">Convex</a>: <a href="https://docs.convex.dev/functions">backend functions</a> in JS or TS,
persistence to Postgres or SQLite.</p></li><li><p><a href="https://github.com/surrealdb/surrealdb">SurrealDB</a>: <a href="https://surrealdb.com/docs/surrealql/functions/script">embedded JS functions</a> can be used from <a href="https://surrealdb.com/docs/surrealql/statements/define/event">events</a>, custom storage.</p></li><li><p><a href="https://redplanetlabs.com/learn-rama">Rama</a>: Java and Clojure APIs, custom storage.</p></li><li><p><a href="https://github.com/risingwavelabs/risingwave">RisingWave</a>: <a href="https://docs.risingwave.com/sql/udfs/user-defined-functions">UDFs</a> in Python, JS, Rust, or Java, persistence to Apache Iceberg.</p></li><li><p><a href="https://github.com/pocketbase/pocketbase">Pocketbase</a>: <a href="https://pocketbase.io/docs/go-overview/">extensions</a> in Go or limited JS, persistence to SQLite.</p></li><li><p><a href="https://github.com/strapi/strapi">Strapi</a>: <a href="https://docs.strapi.io/cms/backend-customization">backend customisation</a> in JS or TS, persistence to Postgres, MySQL, SQLite, ...</p></li><li><p><a href="https://github.com/directus/directus">Directus</a>: <a href="https://directus.io/docs/guides/extensions/api-extensions">API extensions</a> in JS or TS, persistence to Postgres, MySQL, SQLite, ...</p></li><li><p><a href="https://ignite.apache.org/">Apache Ignite</a>: Java, C#/.NET and C++ primary APIs, custom/pluggable storage.</p></li><li><p>Google's Firebase: <a href="https://firebase.google.com/docs/functions">cloud functions</a> in JS, TS, or Python, proprietary storage.</p></li><li><p>AWS Amplify: <a href="https://docs.amplify.aws/react-native/build-a-backend/functions/custom-functions/">custom functions</a> in JS, TS, Python, Go, Java, ..., proprietary storage.</p></li></ul><p>Rama, RisingWave, and Apache Ignite focus on scalable, high-performance backends, not scrappy full-stack AI apps operating with tiny amounts of data. 
They use custom storage engines, which would not be a very legible choice for the open data plane platform.</p><p>Supabase and Nhost support only Postgres as the persistence option, which is overkill for tiny personal AI data planes. They are also focused on scalable backends and "real" cloud deployments, not small personal backends and PaaS/scale-to-zero deployments: e.g., <a href="https://nhost.io/blog/nhost-vs-supabase-practical-guide-for-growing-teams">Nhost relies on AWS Lambda</a> for running extension functions. Also, both <strong>Supabase and Nhost are notoriously hard to self-host.</strong></p><p>I think <strong>the personal data plane should be based on SQLite,</strong> which should be synced to object storage for durability via <a href="https://litestream.io/">Litestream</a>.</p><p>Convex and Directus support persistence to SQLite, but are released under <em>source available</em>, non-<a href="https://opensource.org/licenses">OSI-approved licenses</a>: Functional Source License (FSL) and Business Source License (BSL), respectively. This is a deal-breaker for the <em>open</em> personal AI platform and businesses that may want to build with it. SurrealDB also uses BSL.</p><p>Strapi is open source, but treats SQLite as a second-class, "demo usage only" persistence option.</p><p><a href="https://trailbase.io/">Trailbase</a> is not listed as a reactive database above because it doesn't support record create/update/delete triggers. It's focused on supporting the creation of a single coherent application on top of a single database, not a loose federation of interoperating apps.</p><p><a href="https://github.com/payloadcms/payload">Payload CMS</a> has a permissive license.
It supports SQLite (or, at least, <a href="https://payloadcms.com/docs/database/sqlite">libSQL</a>) as a first-class persistence option alongside Postgres.</p><p>The reason I didn't list Payload CMS as a reactive database is that it doesn't support realtime data subscriptions <a href="https://github.com/payloadcms/payload/discussions/4191">yet</a>. However, this isn't an inherent architectural limitation. Also, Payload positions itself too much as a "Next.js framework" rather than a language-agnostic "database with HTTP APIs, auth, and other goodies" (like Supabase, Convex, and Pocketbase), which might be alienating for AI app developers who don't use JS/TS. The startup that develops Payload CMS was recently acquired by a big corporation (Figma), which is a double-edged sword.</p><p>So, by exclusion, I currently choose Pocketbase as the reactive database for the personal AI data plane. It ticks all the boxes: permissively licensed, persisted to SQLite, and has a well-thought-through <a href="https://pocketbase.io/docs/js-overview/">extensions API</a>. However, Pocketbase has a notable downside (for the purposes of putting it at the core of the open data plane/platform): it's not open to <a href="https://github.com/pocketbase/pocketbase/graphs/contributors">contributions</a>. The same is true of SQLite, but SQLite is, of course, an incomparably more proven, stable, and sustainable project, with a much more mature ecosystem of plugins.</p><p>Overall, to my mind, this is still a relatively close call between Pocketbase and Payload CMS, so I may revisit this decision in the future.</p><h4>Data schema with versioned objects enables extending Pocketbase with serverless code in any language</h4><p>Pocketbase's runtime for dynamic JS extensions (hooks, custom routes, middleware, and migrations), <a href="https://github.com/dop251/goja">goja</a>, is rather limited: it doesn't support <code>fetch()</code> and async/await code.</p><p>This would be a serious downside of Pocketbase.
However, the personal data plane use case permits <strong>database-managed subscriptions</strong>: triggering serverless functions (or the AI app that needs a database extension directly) without live HTTP connections for server-sent events.</p><p>Containerised extensions enabled by database-managed subscriptions are also essential for the deployment of independently developed AI apps (and/or serverless functions) against a single database so that they <strong>don't conflict over required library versions</strong>. Note that this system design would also be necessary with any other reactive database, even those that include V8 JS/TS runtimes, like Convex or Trailbase. All these reactive databases were designed to be a backend for a <em>single</em> app, not an unknown federation of apps.</p><p>Database-managed subscriptions can be implemented on top of Pocketbase thanks to the following characteristics of the target use case, i.e., the personal context data plane:</p><ul><li><p><strong>The common data model should anyway keep the </strong><em><strong>version history</strong></em><strong> of mutable objects</strong>: messages and results in AI chats (the user can edit their own or AI messages or re-run the AI turn), the chat itself (the user can drop turns or insert new messages), notes/documents, table records, etc.</p></li><li><p><strong>These version histories will generally be very short.</strong> The only exception is documents, which may have thousands of versions if the app saves the document to the database every few seconds, but for AI app interoperation, document updates can promptly be rolled up to 15-minute or longer intervals: there is no realistic use case for AI apps to eavesdrop on document updates with higher granularity.</p></li><li><p><strong>The total number of versioned objects is small</strong> (thousands or tens of thousands at most?), and the cumulative object update frequency (~1 update per second at most) is tiny by the standards of what
SQLite is capable of processing in terms of data volume and transactions per second.</p></li><li><p><strong>The total number of subscribers (AI apps and serverless functions) is a few dozen at most.</strong></p></li><li><p><strong>Interoperation across AI apps through the personal data plane doesn't need to be formally correct and precise: these are just personal AI apps, not high-stakes financial transactions.</strong> AI apps&#8217; interactions will at most be something like background context enrichment, fine-tuning, or perhaps kicking off some optional research workflow on behalf of the user. So, trigger extensions for AI app interoperability don't need to propagate transactions correctly through the unified "versioned database" graph. Concretely speaking: if one app updated two or more objects within a single transaction (e.g., the message within the chat and the chat itself), another app that subscribes to these updates <em>should</em> be fine if it receives these two updates independently, in any order, receives one but not the other (because it gets reverted or overwritten quickly), or, in rare cases, misses updates altogether.</p></li></ul><p>The actual implementation of database-managed subscriptions could be as follows: a database-managed subscription is created via an API request with the same parameters as <a href="https://pocketbase.io/docs/api-realtime/#set-subscriptions">realtime subscriptions</a>, plus the target address, path, and extra parameters for the PUT requests that the subscription should produce. Each subscription creates a separate table in SQLite to store the object versions that are not yet consumed by the target serverless code or app. This table acts as a <em>subscription queue</em> (with deduplication). Upon each update to the subscribed object <a href="https://pocketbase.io/docs/collections/">collections</a>, the obsoleted versions of the updated object(s) are removed from the subscription queues.
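This per-object deduplication could be sketched roughly as follows (a minimal Python/SQLite sketch; the table and column names are hypothetical illustrations, not Pocketbase's actual schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for Pocketbase's SQLite file

# One queue table per database-managed subscription (names are hypothetical).
db.execute("""
    CREATE TABLE sub_queue_chats_app1 (
        object_id  TEXT PRIMARY KEY,  -- dedup key: one pending event per object
        version_id TEXT NOT NULL,
        payload    TEXT NOT NULL
    )""")

def enqueue(object_id: str, version_id: str, payload: str) -> None:
    """A newer version of an object obsoletes (replaces) its queued event."""
    db.execute(
        """INSERT INTO sub_queue_chats_app1 (object_id, version_id, payload)
           VALUES (?, ?, ?)
           ON CONFLICT(object_id) DO UPDATE SET
               version_id = excluded.version_id,
               payload    = excluded.payload""",
        (object_id, version_id, payload))

enqueue("chat42", "v1", '{"title": "draft"}')
enqueue("chat42", "v2", '{"title": "final"}')  # obsoletes v1 in the queue
pending = db.execute(
    "SELECT object_id, version_id FROM sub_queue_chats_app1").fetchall()
print(pending)  # [('chat42', 'v2')]
```

Because the queue lives in the same SQLite database as the collections themselves, enqueueing can even happen in the same transaction as the record update.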
Object versions older than a month are also deleted (not just from the subscription queues but from SQLite altogether).</p><p>Each subscription is processed on the Pocketbase side by a dedicated goroutine that makes PUT requests to the target sequentially. Any non-error HTTP response from the target is considered a successful "consumption" of the subscription event, and it's removed from the queue. Any error causes exponential backoff before the next attempt: 5 seconds, 1 minute, 15 minutes, etc. A failing subscription means that the target app is broken or the user has shut it down. Failing subscriptions are displayed in Pocketbase's admin dashboard UI and can be removed altogether. Then the SQLite table that acts as the subscription queue is dropped. Subscriptions are automatically removed after a month of failing.</p><p>This is a very "naive" and inefficient queue/stream processing solution compared to Kafka, RabbitMQ, etc., but I'm pretty sure it will work fine for personal data planes, where the expected data volumes are so low (as noted above). The advantage, of course, is that there are zero extra systems to run and manage, particularly with regard to the durability of subscription queues. When both the "main" data collections (tables) and the supporting subscription queue tables are persisted in the same SQLite database, a single mechanism, <a href="https://litestream.io/">Litestream</a>, takes care of their durability.</p><h4>Other Pocketbase plugins required</h4><p>Apart from database-managed subscriptions, as described above, other additions to Pocketbase are needed to make it the engine for the personal context data plane.</p><h5>Database-managed JS extensions</h5><p><a href="https://github.com/pocketbase/pocketbase/issues/7137">JS extensions should be managed in SQLite and exposed via HTTP APIs</a> rather than uploaded to the server disk.
This is needed for the data plane's security (more on that in the following post in this series), as well as for manageability and to support uncoordinated deploys of different apps. For example, the JS extension should be "owned" by the AI app that submitted it, and only that AI app and the superuser (root/admin) can update or delete the extension. See more details in the <a href="https://github.com/pocketbase/pocketbase/issues/7137">issue</a>.</p><h5>Full-text and vector search over objects</h5><p>Full-text and vector indexing plugins are needed to enable searching among the latest versions of the objects: messages, chats, documents, table records, etc. Rody Davis has built the prototypes for both these plugins (see his <a href="https://github.com/rodydavis/pocketbase-plugins">pocketbase-plugins</a> project) using SQLite's built-in <a href="https://www.sqlite.org/fts5.html">fts5</a> module for full-text search and <a href="https://github.com/asg017/sqlite-vec">sqlite-vec</a> for vector indexing.</p><p>I think it's better to use SQLite modules rather than <a href="https://github.com/blevesearch/bleve">bleve</a> or <em>embedded</em> LanceDB with <a href="https://lancedb.github.io/lancedb/concepts/storage/">local-disk storage</a> because fts5's and sqlite-vec's on-disk indexes are automatically encrypted if the whole SQLite database is encrypted (via SQLCipher, discussed in the following post in this series). Similarly, these indexes piggyback on the main SQLite database's backup to object storage. These are the same arguments that motivate building subscription queues on top of SQLite rather than with separate specialised systems.</p><p>Another option is to ingest versioned objects (chats, messages, notes) into the same LanceDB instance that stores immutable objects; see details below. The main downside of this approach is that it would make the Pocketbase instance less self-sufficient for simplified data plane setups without LanceDB.
Also, this makes search indexes for versioned objects unavailable for extension JS routes and hooks.</p><p>On the other hand, with the LanceDB approach, the search APIs over versioned and immutable objects would be unified. Also, if the data plane has many thousands of messages across chats, the fact that sqlite-vec does <a href="https://alexgarcia.xyz/blog/2024/building-new-vector-search-sqlite/index.html">full scans</a> to find the nearest vectors may place too much CPU demand on the Pocketbase deployment and require running it in more expensive VMs in a Fly.io setup to maintain acceptable latency.</p><p>So, I haven't decided yet between these two approaches (fts5 and sqlite-vec vs. LanceDB) for search over versioned objects: both approaches have pros and cons that seem commensurate to me.</p><h5>Metrics</h5><p>Another table-stakes addition to Pocketbase is exposing the Prometheus-style <code>/metrics</code> endpoint for Pocketbase monitoring (to be collected by the bring-your-own metrics store: see discussion below), as already implemented in magooney-loon's <a href="https://github.com/magooney-loon/pb-ext">pb-ext</a> project.</p><h4>The common data schema: to be determined</h4><p>I haven't yet worked on the specific details of the data schema for chats/threads, messages, and notes/documents.
They will probably be the lowest common denominator of:</p><ol><li><p>Open WebUI's data schema: see <a href="https://github.com/open-webui/open-webui/blob/main/backend/open_webui/models/messages.py">messages</a>, <a href="https://github.com/open-webui/open-webui/blob/main/backend/open_webui/models/chats.py">chats</a>, <a href="https://github.com/open-webui/open-webui/blob/main/backend/open_webui/models/prompts.py">prompts</a>, <a href="https://github.com/open-webui/open-webui/blob/main/backend/open_webui/models/notes.py">notes</a>, and <a href="https://github.com/open-webui/open-webui/blob/main/backend/open_webui/models/tools.py">tools</a> models.</p></li><li><p><a href="https://github.com/menloresearch/jan">Jan</a>'s data schema: <a href="https://github.com/menloresearch/jan/blob/dev/core/src/types/message/messageEntity.ts">messages</a>, <a href="https://github.com/menloresearch/jan/blob/dev/core/src/types/thread/threadEntity.ts">threads</a>, etc.</p></li><li><p><a href="https://dify.ai">Dify</a>'s schemas are expressed in their REST APIs: see <a href="https://docs.dify.ai/api-reference/conversations/get-conversation-history-messages">conversation history messages</a> and <a href="https://docs.dify.ai/api-reference/documents/get-document-detail">document detail</a>.
I don't think Dify will actually be one of the supported apps (it is targeted more at organisational use cases and requires Postgres for storage), but the schema itself is battle-tested and is therefore worth close attention.</p></li><li><p>LangChain v1: <a href="https://docs.langchain.com/oss/python/langchain/messages#standard-content-blocks">standard content blocks</a> in messages.</p></li><li><p><a href="https://github.com/letta-ai/letta/blob/main/letta/serialize_schemas/pydantic_agent_schema.py">Letta's Agent File schema</a>: messages, tools, memory blocks, agents, etc.</p></li><li><p><a href="https://mastra.ai/en/docs/server-db/storage#data-schema">Mastra's data schema</a>: messages, threads, resources, etc.</p></li><li><p><a href="https://github.com/get-convex/agent/blob/main/src/component/schema.ts">Convex Agents' schema</a>: messages, threads, memories, files, etc.</p></li></ol><p>However, none of these schemas explicitly version objects (chats, messages, or memories/notes/documents), and only Convex Agents' objects are implicitly versioned (as everything in Convex's tables) and recoverable via the standalone <a href="https://github.com/get-convex/table-history">TableHistory</a> component.</p><p>I still ardently believe that mutable objects in the common data model <em>have</em> to be versioned, and these versions should even be exposed to the user in most chat AI apps, as ChatGPT does (see the "3/3" with arrow buttons):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!37QZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fe8ac7-4907-434f-8ca8-4704532d0a20_1224x341.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp"
srcset="https://substackcdn.com/image/fetch/$s_!37QZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fe8ac7-4907-434f-8ca8-4704532d0a20_1224x341.png 424w, https://substackcdn.com/image/fetch/$s_!37QZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fe8ac7-4907-434f-8ca8-4704532d0a20_1224x341.png 848w, https://substackcdn.com/image/fetch/$s_!37QZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fe8ac7-4907-434f-8ca8-4704532d0a20_1224x341.png 1272w, https://substackcdn.com/image/fetch/$s_!37QZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fe8ac7-4907-434f-8ca8-4704532d0a20_1224x341.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!37QZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fe8ac7-4907-434f-8ca8-4704532d0a20_1224x341.png" width="1224" height="341" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39fe8ac7-4907-434f-8ca8-4704532d0a20_1224x341.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:341,&quot;width&quot;:1224,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://engineeringideas.substack.com/i/173162413?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39fe8ac7-4907-434f-8ca8-4704532d0a20_1224x341.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>But since few developers of open source AI apps currently appear to think the same, the common data schema (expressed in a set of Pocketbase's HTTP APIs and JS hook APIs) should provide the "simplified" view, such as a <code>GET /api/collections/[messages|chats|notes]/record/{id}</code> route for getting the <em>latest version</em> of the object, a <code>PATCH /api/collections/[messages|chats|notes]/record/{id}</code> route for creating a new version of an object, and an <code>OnVersionedRecordUpdate()</code> hook API. The ability to create <em>versioned</em> collections (in addition to Pocketbase's Base, View, and Auth <a href="https://pocketbase.io/docs/collections/">collections</a>) should itself perhaps be implemented in a separate Pocketbase plugin.</p><p>Every application can <a href="https://pocketbase.io/jsvm/classes/FieldsList.html#add">add app-specific fields</a> to one of the common schemas (e.g., chats/threads or notes) through a <a href="https://pocketbase.io/docs/js-migrations/">custom migration</a>.
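For instance, an app that added its own fields to the common chats collection could later retrieve "its" chats through Pocketbase's standard list-records API with a filter expression. A minimal Python sketch (the field name and the exact filter expression are illustrative assumptions):

```python
from urllib.parse import urlencode

# Hypothetical app-specific field added to the common `chats` collection.
APP_FIELD = "travelapp_itinerary"

# Pocketbase's list-records endpoint accepts a `filter` query parameter;
# here it keeps only the chats where the app's own field is non-empty.
base = "http://127.0.0.1:8090/api/collections/chats/records"
query = urlencode({
    "filter": f"{APP_FIELD} != ''",  # illustrative filter expression
    "sort": "-updated",
    "perPage": 50,
})
url = f"{base}?{query}"
print(url)
```

The same query shape works from any language, since Pocketbase exposes plain HTTP APIs rather than a language-specific SDK.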
If these fields are critical for this application's functionality or presentation, the app should register a <a href="https://pocketbase.io/docs/js-routing/">custom route</a> that filters the corresponding <a href="https://pocketbase.io/docs/collections/">Pocketbase collection</a> for the presence of these fields, and use this route from the app's "list" or the "entry point" view (e.g., the list of chats or the list of notes) so that when the user opens this app and clicks on one of these chats or notes, the app can work with them.</p><p>The common schemas will also have <a href="https://github.com/pocketbase/pocketbase/discussions/3258#discussioncomment-6956827">"createdBy" fields that store the name of the app that created the given object</a> (the app name will be available as <code>requestInfo.authRecord.id</code>; Pocketbase will authenticate <em>apps</em>, not the "end users"; more on the auth architecture in the following post in this series), so the app could also just filter the collection on the objects that it created.</p><p>Currently, <strong>I </strong><em><strong>don't</strong></em><strong> think that there should be a common schema for "workflows" or "agent traces"</strong> (despite being defined, under various names, by multiple of the schema sources listed above): I think they are too varied among different agentic frameworks and specific AI apps to be interoperable and usefully "listenable" (via JS hooks or database-managed subscriptions) by different apps.
However, as the apps store these workflows and agent traces using their bespoke schemas in Pocketbase, they should enable full-text and vector indexing of the contents of these workflows to make them "broad-base searchable" from other AI apps.</p><h4>AGENTS.md equivalent for the personal data plane?</h4><p>An interesting concept that currently appears to be missing from all of the schemas above, and a good candidate for being a part of the common data schema or convention, would be <strong>an equivalent of <a href="https://agents.md/">AGENTS.md</a>, but for the personal context data plane rather than for coding agents</strong>. It's not exactly a <a href="https://github.com/open-webui/open-webui/blob/main/backend/open_webui/models/prompts.py">Prompt</a> from Open WebUI because prompts are specific to particular tasks and apps, while the "personal data plane's AGENTS.md" should be a more general "system" instruction for agents, instructing them how to work with this data plane, similar to how AGENTS.md is not a prompt either. A part of this instruction, which may be specific to each app, should be the list of Pocketbase collections that this AI app has access to and how they are used.</p><h4>Hybrid search over immutable objects: LanceDB</h4><p>Searching over a subset (or all) of the digital artifacts that the person has encountered or received (their emails, file uploads via AI chat apps such as Open WebUI, web pages visited either by the person or by AI agents on their behalf, transcripts of watched YouTube videos, personal meetings, and podcasts, RSS/media items, news, etc.)
should be a core capability of the personal data plane, available to AI apps via an HTTP API, since almost all AI apps require search.</p><p>First, I want to note why a separate search DB is needed at all, i.e., why SQLite's fts5 and sqlite-vec extensions (proposed above for indexing <em>versioned</em> objects: chats, messages, notes) couldn't also be used to index and search over <em>immutable</em> objects (emails, webpages, transcripts, files, media). sqlite-vec may be sufficient to search over a few thousand messages and chats (although this isn't yet clear to me without benchmarking), but its full-scan vector search will definitely be too inefficient when/if there are two orders of magnitude more objects: the person may have only a few chats with AI per day on average, but the personal "Info Agent" or the personal AI assistant may easily ingest hundreds of media items and emails per day.</p><p>Of the <a href="https://docs.llamaindex.ai/en/stable/community/integrations/vector_stores/">dozens of vector and search stores</a>, only two seem mostly compatible with the architectural goals for the personal AI platform and the data plane: (1) the simplicity of deployment both on a VPS (or locally) and in a scale-to-zero cloud PaaS like Fly.io, and (2) a straightforward way to back up the data durably in object storage. These two are <a href="https://github.com/chroma-core/chroma">Chroma</a> and <a href="https://github.com/lancedb/lancedb">LanceDB</a>:</p><ul><li><p>Chroma <a href="https://cookbook.chromadb.dev/core/advanced/wal/">uses SQLite for WAL</a>, and this SQLite instance could be backed up to object storage using Litestream, in the same way as Pocketbase's SQLite.
(Although nobody appears to have tried to do this yet.)</p></li><li><p>LanceDB OSS supports <a href="https://lancedb.github.io/lancedb/concepts/storage/#1-s3-gcs-azure-blob-storage">object storage</a> as the primary storage backend.</p></li></ul><p>However, Chroma doesn't support hybrid search with BM25 <a href="https://github.com/chroma-core/chroma/issues/1330">yet</a> and has to run as a standalone process. On the other hand, LanceDB could be embedded in the Python process that also implements data ingestion and checks the search permissions: see below in this section.</p><p>The biggest drawback of LanceDB OSS, it seems, is that when configured with object storage, it <a href="https://github.com/lancedb/lancedb/issues/2626">can't</a> also use the local disk for caching, which may increase the search query latency and the object storage egress. The "cold" search latency in a scale-to-zero cloud setup in Fly.io would probably never be smaller than ~0.7 to 1 s to the AI app (not to the user, as the AI app may further post-process the search results with LLMs): <a href="https://fly.io/docs/reference/suspend-resume/">a few hundred ms</a> for resuming the Fly.io machine from the suspended state, a few hundred ms for fetching Lance files from object storage, some time for the actual search query processing, and the <a href="https://fly.io/docs/reference/fly-proxy/">Fly Proxy</a> round-trip latency. But that seems like an acceptable tradeoff. Chroma's lack of hybrid search seems like a more significant limitation, so I chose LanceDB as the search database for immutable objects in the personal data plane.</p><p>Data ingestion into LanceDB should be mediated through the Pocketbase instance. This is needed to permit additional "micro-ETL" workflows submitted by the AI apps as <a href="https://pocketbase.io/docs/js-event-hooks/">hooks</a> over these data feeds.
Additionally, there should be default workflows that scan the inbound data for <a href="https://en.wikipedia.org/wiki/Prompt_injection">prompt injections</a> to quarantine or sanitise it automatically (see more on the security architecture in the following post in this series). Finally, Pocketbase batches inbound data and inserts it into LanceDB once every 15 minutes or so, to keep LanceDB from being up too much of the time in a scale-to-zero cloud deployment, assuming it's unreasonable to set the LanceDB instance's <a href="https://fly.io/docs/flyctl/machine-suspend/">suspend wait timeout</a> shorter than 60 seconds, given "agentic search" use cases like "search, then process results with an LLM, which emits another search tool call, repeat".</p><p>Document chunking algorithms, embedding approaches (fixed, <a href="https://huggingface.co/blog/matryoshka">matryoshka</a>, or multi-vector for <a href="https://lancedb.github.io/lancedb/guides/multi-vector/">late interaction</a>), and embedding aggregation and hierarchical retrieval algorithms (like <a href="https://arxiv.org/abs/2401.18059">RAPTOR</a> or Gwern's <a href="https://gwern.net/tree-embedding">hierarchical embeddings for text search</a>) are all undecided for now. The data plane should provide sane defaults for different data formats: plaintext, HTML, Markdown, and PDF.</p><p>Since no single set of algorithms can work equally well for all kinds of data, the above algorithms should be configurable per specific table (emails, webpages, transcripts, media) and/or per specific feed.
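A minimal sketch of how this per-table/per-feed configuration could be resolved at ingestion time (all field names and default values here are my assumptions, not a settled schema):

```python
# Hedged sketch: resolve chunking/embedding settings for an ingested item.
# Per-feed settings override per-table ones, which override the data plane's
# defaults. In the actual design, table and feed configs would come from a
# dedicated system table in Pocketbase; here they are plain dicts.
DEFAULTS = {"chunker": "markdown", "embedding": "fixed", "chunk_size": 512}

def resolve_settings(table_cfg=None, feed_cfg=None):
    """Merge defaults <- table config <- feed config (later wins)."""
    settings = dict(DEFAULTS)
    settings.update(table_cfg or {})
    settings.update(feed_cfg or {})
    return settings
```

For example, a transcripts table could override the chunker while one particular podcast feed further overrides the embedding approach, without touching any other feed.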
The most practical way to implement this, it seems to me, is to <strong>wrap the LanceDB instance in a thin Python server that implements these pre-configured algorithms, and permit configuring the algorithms in a dedicated system table in Pocketbase.</strong></p><p><strong>The Python API layer also enforces table and column access permissions for the AI app that makes the search request</strong>, by consulting Pocketbase, while the LanceDB library does the actual search. (The app authenticates itself to the Python server with a dedicated key generated when the <a href="https://pocketbase.io/docs/collections/#auth-collection">auth record</a> for the app is created in Pocketbase; more on the auth and security architecture in a later post in this series.)</p><p>Custom, app-specific algorithms could also be supported without altering the LanceDB+Python server container image via <em>database-managed subscriptions</em> (see above) that send the data batches to the AI app, which uses the Lance API to <a href="https://docs.lancedb.com/api-reference/data/merge-insert-upsert-data">merge-insert</a> its custom column values (custom embeddings, custom metadata, etc.) into records in the common tables: emails, webpages, etc.</p><p>The <a href="https://github.com/GraphRAG-Bench/GraphRAG-Benchmark">GraphRAG-Bench</a> paper shows that sophisticated, graph-based retrieval approaches (GraphRAG, HippoRAG, LightRAG, etc.) are more effective than "simple" embedding-based retrieval for queries against dense knowledge sets, such as technical documentation or medical instructions. The data that is supposed to be stored in the personal data plane's LanceDB (emails, webpages, and documents) is <strong>not</strong> a dense knowledge set. Hence, <strong>I don't see a point in supporting anything resembling GraphRAG in the personal AI data plane.
Incidentally, this means that a graph database is not needed</strong>, which is a huge relief, because a graph database would significantly increase the overall system's complexity.</p><h4>LLM, MCP, and service API proxy: a Python service with LiteLLM</h4><p>The need for a standalone LLM proxy/gateway in the personal AI platform stems from the LLM security and governability requirements: see "Architectural goals for the personal AI platform" above.</p><p><a href="https://docs.litellm.ai/docs/proxy/guardrails/quick_start">Guardrails</a> should be turned <em>on</em> by default for AI apps' LLM calls. AI apps should be able to submit their "recommended" settings to Pocketbase, where they specify for which types of requests guardrails are unnecessary (e.g., because these are simple intent classification or "routing" requests) and which tools should be available (a la <a href="https://docs.litellm.ai/docs/mcp">MCP gateway</a>). Similar to the search access controls and chunking+embedding algorithm settings (see the previous section), the person can review these settings in the Pocketbase dashboard and track changes across the different deployed versions of the given AI app.</p><p>Apart from LLM security considerations, another reason to make LLM calls through a centralised proxy is to log LLM responses consistently across AI apps and to consistently record LLM and other service API costs for spend analysis: see the following section.</p><p>In self-hosted setups, it's possible to restrict the app container's access to the internet by placing the container only on an internal network in Docker Compose, letting inbound HTTPS requests in through an nginx or Caddy reverse proxy container, and routing the outgoing requests through the LLM proxy container.
In the Fly.io cloud, it should be possible to achieve similar isolation with <a href="https://fly.io/docs/machines/guides-examples/network-policies/">Fly machine Network Policies</a> that deny all egress except for the internal (Fly Proxy) traffic, so access to the proxy service, the Pocketbase instance, and the search service is still permitted.</p><p>There are surprisingly few open-source standalone HTTP proxies (gateways) for LLM calls: <a href="https://github.com/BerriAI/litellm">LiteLLM</a>, <a href="https://github.com/Portkey-AI/gateway">Portkey AI gateway</a>, and <a href="https://github.com/mlflow/mlflow">MLflow</a> AI gateway, maybe? And of these, only LiteLLM supports <strong>cost calculation</strong>. Also, LiteLLM has at least an order of magnitude more activity on GitHub: bug fixes for obscure combinations of providers, APIs, and features like reasoning, streaming, tool calling, structured outputs, etc., new providers added, and cost map updates. So, despite <a href="https://engineeringideas.substack.com/i/168052480/implementation-challenge-providers-llm-interfaces-are-just-different-and-hardly-convertible-between-each-other">my gripes</a> with LiteLLM's code and the fact that LiteLLM is presumably the <a href="https://konghq.com/blog/engineering/ai-gateway-benchmark-kong-ai-gateway-portkey-litellm">slowest</a> of the popular AI gateways (which probably doesn't matter for the personal AI platform because it will generate very modest LLM call throughput), LiteLLM doesn't have serious alternatives, in my view.</p><p>If the AI apps don't have open internet access, they must <strong>proxy </strong><em><strong>all</strong></em><strong> their external service API calls, such as web search, web scraping, audio transcription, media generation, etc.</strong> LiteLLM supports this through <a href="https://docs.litellm.ai/docs/proxy/pass_through">custom pass-through endpoints</a>.
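A minimal sketch of the per-app gating of such pass-through calls (the permission map's shape and all names here are my assumptions; in the data plane, the permissions would be fetched from Pocketbase rather than passed in):

```python
# Hedged sketch: before forwarding a pass-through service API call, the
# proxy checks whether the calling app was granted access to that service.
# The permission map shape (app id -> set of allowed services) is an
# assumption; in the actual design it would be read from Pocketbase records.
def is_passthrough_allowed(app_id: str, service: str, permissions: dict) -> bool:
    return service in permissions.get(app_id, set())

def forward_or_reject(app_id, service, permissions, forward):
    """`forward` is an assumed callable that does the real HTTP pass-through."""
    if not is_passthrough_allowed(app_id, service, permissions):
        return {"status": 403, "error": f"{app_id} may not call {service}"}
    return forward()
```

The same check would run on every request, so revoking a permission in Pocketbase takes effect immediately, without restarting the proxy.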
Similar to the guardrails, tools, and LLM model access configurations, the AI apps should submit requests to use a certain API via an HTTP request to Pocketbase, and the proxy service verifies the app's permission to access the given API service at runtime.</p><p>Unfortunately, LiteLLM's own <a href="https://docs.litellm.ai/docs/simple_proxy">proxy server</a> can't be used unmodified because it uses PostgreSQL, and LiteLLM's developers <a href="https://github.com/BerriAI/litellm/issues/4583">don't plan</a> to support even vanilla SQLite, let alone Pocketbase. However, the personal data plane's LLM and service API proxy should support only a subset of the LiteLLM proxy server's features and hence only a subset of its <a href="https://github.com/BerriAI/litellm/blob/main/schema.prisma">database schema</a>, so I think that maintaining a fork of the LiteLLM proxy server that uses Pocketbase instead of PostgreSQL should be manageable, despite generating a steady stream of maintenance work.</p><h4>LLM and other service API call logging: LanceDB</h4><p>LLM calls and other service API calls such as web search, audio transcription, translation, etc., should be logged for AI app debugging, API spend analysis, security audit, and broad-based search by other AI apps, e.g., a "system admin" AI agent that lives on the personal AI platform itself and helps the user with other app updates and debugging.</p><p>In "<a href="https://engineeringideas.substack.com/i/168052480/connecting-the-semantic-data-traces-with-llm-responses">Connecting the semantic data traces with LLM responses</a>", I advocated for <a href="https://docs.victoriametrics.com/victorialogs/">VictoriaLogs</a> because of its operational and configuration efficiency.
An object storage backend for "cold" logs (e.g., older than a day) is a <a href="https://github.com/VictoriaMetrics/VictoriaLogs/issues/48">work in progress</a> for VictoriaLogs.</p><p><strong>However, since the data plane already uses LanceDB for search over immutable media objects anyway, it would be even simpler than deploying VictoriaLogs to use LanceDB for LLM and service API call logs as well.</strong></p><p>LanceDB could be embedded in the Python proxy and used for data insertions only. Since this would be the "same" LanceDB as is used in the search service, logs could be queried and searched through the search service. Such separation is helpful for scale-to-zero deployments in Fly.io because the proxy service (Fly machine) could still have relatively little memory (500MB to 1GB), while the search service needs more memory (probably 2GB) but is called less frequently than the LLM and service API proxy and is therefore <a href="https://fly.io/docs/reference/suspend-resume/">suspended</a> for more time in total, during which it doesn't accrue the hosting cost.</p><p>LanceDB's <a href="https://github.com/lancedb/lance/issues/4516">upcoming JSON support</a> will be handy for LLM request/response querying and analysis.</p><p>The LLM call log schema could be somewhere in between <a href="https://github.com/BerriAI/litellm/blob/351896cd1d253edbe6a56b310ad71afecf6eea99/schema.prisma#L248">LiteLLM_SpendLogs</a> and<br>OpenTelemetry's <a href="https://github.com/open-telemetry/opentelemetry-python/blob/main/opentelemetry-semantic-conventions/src/opentelemetry/semconv/_incubating/attributes/gen_ai_attributes.py">gen_ai_attributes</a> schemas.</p><h4>Metrics store: bring your own</h4><p>As an alternative to LanceDB for LLM request/response log storage, I've considered <a href="https://github.com/openobserve/openobserve">OpenObserve</a>, which could have been used for storing LLM call logs and the various personal AI platform metrics, reported by the data plane services and
the AI apps themselves. However, I decided that it is unnecessary to make the metrics store a part of the data platform. In self-hosted and homelab deployments, there is almost definitely some metrics store deployed apart from the personal AI platform, such as VictoriaMetrics, Prometheus, or ClickHouse (via SigNoz or a similar observability app). In Fly.io, there is a <a href="https://fly.io/docs/monitoring/metrics/#prometheus-on-fly-io">hosted metrics store</a> as well.</p><p>The personal AI platform's metrics are not strictly required to be durable (unlike LLM and service API call logs), so they don't have to be synced to object storage. Using OpenObserve, a metrics store that uses object storage as the primary data storage, would incur unnecessary object storage write (ingress) amplification. Also, OpenObserve would have to be a separate service in Fly.io deployments, which would cost the users some extra, whereas Fly.io's built-in metrics store is included in the platform cost.</p><h2>The minimum viable personal AI platform: Open WebUI, browsing history, and email search</h2><p>Although the personal data plane architecture above pays most attention to how independent AI apps can coexist on the same platform and how the apps and the user can benefit from that, <strong>a much simpler and faster path to "break-even value" for the platform is simply making the AI apps and integrations meet:</strong> AI apps become more useful when they have access to the personal data that is ingested into the personal data plane through integrations. This will motivate people to use the given AI app on top of the personal AI platform (Pocketbase and LanceDB) rather than their "default" databases.</p><p>My immediate plan for making an MVP of the personal AI platform is:</p><ol><li><p>Implement Pocketbase storage for <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, <em>without any data remodelling yet</em>.
Currently, Open WebUI supports PostgreSQL <em>or</em> SQLite storage through SQLAlchemy, so the code is already somewhat accustomed to pluggable storage. Also, this should be relatively simple to do because Open WebUI basically doesn't do transactions, so all its database operations can translate into separate CRUD HTTP calls to Pocketbase.</p></li><li><p>Implement an MVP version of the search service based on LanceDB with one particular document chunking and embedding approach.</p></li><li><p>Make a Pocketbase plugin that reads emails from Gmail via IMAP, using <a href="https://github.com/emersion/go-imap">go-imap</a>, probably, and pushes the emails to the search service.</p></li><li><p>Create a Chrome plugin similar to <a href="https://github.com/iansinnott/full-text-tabs-forever">full-text-tabs-forever</a> that reads all pages that the person visited on desktop Chrome or another Chromium-based browser and pushes them to Pocketbase, which will in turn push them to the search service.</p></li><li><p>Add a "personal data search" tool to Open WebUI to ground AI chats with search in personal email, newsletters, and browsing history.</p></li></ol>]]></content:encoded></item><item><title><![CDATA[AI agency architecture-in-the-large: the relevant levels of abstraction]]></title><description><![CDATA[The "agent framework" level is just one among dozens of relevant architecture levels for AI agents.]]></description><link>https://engineeringideas.substack.com/p/ai-agency-architecture-in-the-large</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/ai-agency-architecture-in-the-large</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Mon, 28 Jul 2025 14:25:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" 
length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Introduction</h3><p>This post continues the series in which I apply John Doyle&#8217;s <a href="https://engineeringideas.substack.com/p/architecture-theory-and-the-hourglass">architecture theory and the hourglass model</a> from network systems engineering to <a href="https://aiprospects.substack.com/p/a-better-way-to-use-highly-capable">AI agency architecture</a>.</p><p>Below, I&#8217;ll use the terms <em>level</em> (<em>abstraction, model, theory</em>), <em>component</em> (<em>subsystem, layer</em>), <em>diversity hourglass</em>, <em>composability</em>, <em>hijackability</em>, and others with the specific technical meanings described in the <a href="https://engineeringideas.substack.com/p/architecture-theory-and-the-hourglass">previous post</a> in this series.</p><p>This post is an <strong>overview of the variety of </strong><em><strong>possible</strong></em><strong> abstraction levels that are relevant or being discussed in relation to AI agents.</strong> I&#8217;ll not make normative claims here about how I think AI agency architecture should be steered. Hopefully, the concepts of abstraction level (interface, protocol), multi-level control (immune systems), composability, hijackability, and generality will be used in AI agency architecture work and discussions elsewhere.</p><p>The phrase &#8220;relevant abstraction level&#8221; above means that there are <em>some</em> people who think there is <em>something</em> crucially important, in application or relation to AI agents (or their architecture), about the <em>theory/model/method/ontology/design/interface/protocol</em> of the said abstraction level.
Or, to put it differently, some people think that some of these levels and their designs are &#8220;<a href="https://substack.com/@leventov/note/c-131616817">key unlocks for AI agents</a>&#8221; and bet that most (or at least, a considerable fraction of) future AI agents <em>should</em> <em>share the same method/design/interface/protocol</em> on those specific level(s) of their interest.</p><p>Consequently, people try to steer the <a href="https://cs.ccsu.edu/~stan/classes/CS410/Notes16/06-ArchitecturalDesign.html">architecture in the large</a> (i.e., <em>architecture across the organizational boundaries</em>) of AI agency towards making those level(s) the <em>middle (i.e., shared, &#8220;low diversity&#8221;) level(s)</em> in the diversity hourglass architecture (a.k.a. the <em>waist and neck architecture</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> when there is more than one shared/low diversity level).</p><p>However, the <a href="https://engineeringideas.substack.com/i/154060571/diversity-hourglasss-risks">diversity hourglass is a double-edged sword</a>, and therefore people should <strong>be thoughtful when promoting their preferred diversity hourglasses</strong>, that is, architectures with different middle (low diversity) levels and their respective models/designs/interfaces/protocols.</p><h2>Classes of abstraction levels relevant for AI agency architecture</h2><p>Each section in the remainder of this post describes some <em>abstraction class</em> relevant for AI agency architecture.
These abstraction classes don&#8217;t come from the architecture theory in any principled way; I&#8217;m basically grouping abstractions ad hoc to make the description more manageable&#8212;otherwise, this post would need to have hundreds of sections.</p><p>The descriptions of the abstraction classes below follow roughly from lower- to higher-level ones.</p><p>Any such categorization, including mine below, unavoidably involves some subjective coarse-graining/&#8220;quantisation&#8221; of a space that is actually infinitely malleable. In fact, almost all real-world &#8220;AI (agent) platforms&#8221; (such as <a href="https://github.com/langgenius/dify">Dify</a>, <a href="https://ii.inc/web/blog/post/commons-week-july-2025">Intelligent Internet&#8217;s Contexts</a>, <a href="https://github.com/open-webui/open-webui">Open WebUI</a>, <a href="https://github.com/crewAIInc/crewAI">CrewAI</a>, <a href="https://flowiseai.com/">Flowise</a>, <a href="https://replit.com/">Replit</a>, <a href="https://www.lindy.ai/">Lindy</a>, <a href="https://tactics.dev/">Tactics</a>, <a href="https://singularitynet.io/">SingularityNET</a>, <a href="https://www.anthropic.com/news/claude-powered-artifacts">Anthropic&#8217;s app platform</a>, and countless others) <em>repackage some features and aspects of multiple of the abstraction levels, as I describe them below,</em> into unique models/abstractions of their own.</p><p>Relatedly, many of the concrete examples of abstractions, protocols, systems, and designs that I give below could be attributed to multiple abstraction classes in my list.</p><div><hr></div><h3>Data storage and computing platforms</h3><ul><li><p><strong>Operating systems and &#8220;OS-like&#8221;</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a><strong> platforms</strong>, such as POSIX, Web standards/browser-as-an-OS (see <a href="https://www.browserbase.com/">Browserbase</a>, <a 
href="https://github.com/stackblitz/webcontainer-core">StackBlitz&#8217;s WebContainers</a>). Confusingly, when people talk about &#8220;<a href="https://github.com/bilalonur/awesome-llm-os">LLM OS</a>&#8221;, they usually refer to the <em>Compound AI System (CAIS)</em> level, see below.</p></li><li><p><strong>Cloud computing platforms</strong> such as <a href="https://fly.io/blog/">Fly.io</a>, <a href="https://vercel.com/">Vercel</a>, <a href="https://replit.com/">Replit</a>, <a href="https://render.com/">Render</a>.</p></li><li><p><strong>AI inference/compute platforms</strong>, such as <a href="https://www.modular.com/max">Modular MAX</a>, <a href="https://replicate.com/">Replicate</a>, <a href="https://fireworks.ai/">Fireworks</a>, <a href="https://www.together.ai/">Together</a>, or on the &#8220;local&#8221; side, <a href="https://ollama.com/">Ollama</a>.</p></li><li><p><strong>Microservice, container, or serverless platforms</strong>, such as Kubernetes, or the crop of new AI-agent-specific ones, such as <a href="https://www.daytona.io/">Daytona</a> and <a href="https://e2b.dev/">E2B</a>.</p></li><li><p><strong>Reliable/durable (workflow) execution frameworks</strong>, such as <a href="https://dagger.io/">Dagger</a>, <a href="https://dbos.dev/">DBOS</a>, and <a href="https://materializedview.io/p/durable-execution-justifying-the-bubble">many others</a>.</p></li><li><p><strong>Database-integrated or &#8220;<a href="https://martin.kleppmann.com/2015/11/05/database-inside-out-at-oredev.html">database-inside-out</a>&#8221; (stream) processing engines</strong>, such as <a href="http://convex.dev/">Convex</a>, <a href="https://materialize.com/">Materialize</a>, <a href="https://redplanetlabs.com/learn-rama">Rama</a>, <a href="https://www.hopsworks.ai/">Hopsworks</a>, <a href="https://neon.tech/use-cases/ai-agents">Neon</a>, <a href="https://transactional.blog/blog/2024-database-startups#_stateful_streaming">and others</a>.</p></li><li><p><strong>LLM (Dev)(Ops) and routing platforms</strong>, such as <a href="https://arize.com/">Arize</a>, <a href="https://flowiseai.com/">Flowise</a>, <a href="https://langfuse.com/">Langfuse</a>, <a href="https://www.requesty.ai/">Requesty</a>, <a href="https://weave-docs.wandb.ai/">W&amp;B Weave</a>, etc.</p></li><li><p><strong>Secure/private/trusted execution and federated AI (learning) frameworks</strong>, such as OpenMined&#8217;s <a href="https://github.com/OpenMined/PySyft">PySyft</a>, <a href="https://flower.ai/">Flower.ai</a>, <a href="https://www.apheris.com/resources/blog/top-7-open-source-frameworks-for-federated-learning">and more</a>.</p></li></ul><h5><strong>Composability</strong></h5><p>Data storage and computing platforms <em>per se</em> rarely introduce meaningful directions of composability.</p><p><a href="https://kubernetes.io/docs/concepts/extend-kubernetes/operator/">Kubernetes operators</a> are an example of such a direction, but they don&#8217;t scale well, and Kubernetes is seldom thought of as the &#8220;platform for AI agents&#8221;. Inference platforms like Modular MAX and Fireworks <em>may</em> have composability a la &#8220;end-to-end computations spanning several models&#8221;, but that would actually be thanks to the underlying <em>frameworks</em>: Mojo and PyTorch (see the next section).</p><p>I say that computing platforms don&#8217;t introduce composability <em>per se</em> because actually many of them do, but on a separate <em>API level</em>, which in practice is often bundled together with the computing platform abstraction.
Examples of such APIs are the <a href="https://engineeringideas.substack.com/p/trace-llm-workflows-at-your-apps">OpenAI-ish completions API</a> of the OpenAI inference platform itself and of many other inference platforms that copy OpenAI, the <a href="https://docs.convex.dev/">Convex API</a> (not SQL!), the <a href="https://docs.dagger.io/api/">Dagger API</a>, <a href="https://redplanetlabs.com/docs/~/rest.html">Rama&#8217;s API</a>, <a href="https://docs.dbos.dev/">DBOS&#8217;s API</a>, etc.</p><p>This means that any diversity hourglass architecture with the computing platform as the middle level naturally tends to shift towards the API level as the middle, with the computing platform itself becoming the lower implementation level &#8220;behind&#8221; the API. Platforms with complex API abstractions, such as Nvidia&#8217;s platform with CUDA, successfully resist this shift because they are harder to re-implement, whereas simpler abstractions such as the <a href="https://standardcompletions.org/">completions API</a> are commoditized faster.</p><p>Note that I&#8217;m not claiming that all the APIs mentioned above are composable&#8212;I didn&#8217;t actually study most of them. In fact, the one API among those that I did work with, the OpenAI completions API, is actually <em>not composable at all</em>: as soon as you need an LLM to summarise a conversation that happened in/with this API, you need to abandon the original API trace and condense the conversation within a single message, or else you risk the LLM forgetting its &#8220;summarizer role&#8221; and just continuing the conversation (also, it&#8217;s <a href="https://hackernoon.com/stop-prompting-start-engineering-15-principles-to-deliver-your-ai-agent-to-production">token-inefficient</a>). Anyone who&#8217;s worked with the completions API would admit this quickly gets cumbersome and ugly.
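A hedged sketch of this condensing workaround, where `complete` stands in for any completions-style client callable (the message shapes are illustrative, not a specific provider's API):

```python
# Hedged sketch of the workaround described above: to get a completions-style
# API to summarise a conversation, the caller must flatten the whole exchange
# into a single user message, so the model keeps its "summarizer" role instead
# of just continuing the conversation. `complete` is an assumed client callable
# taking a list of {"role", "content"} messages and returning the model's text.
def summarize(conversation, complete):
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in conversation)
    return complete([
        {"role": "system", "content": "Summarize the conversation below."},
        {"role": "user", "content": transcript},
    ])
```

Note that none of the original API trace can be reused here: the caller rebuilds the message framing by hand on every such nested use of the API.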
Compare it with any reasonable programming language design (programming languages are usually highly composable), where this operation would simply be <code>summarize(conversation)</code> or something like that.</p><h5>Hijackability</h5><p>Permissionless cloud computing platforms can give rise to self-sovereign agents that earn crypto through fraud, spam, and other activity that is purely net harmful for humans (or neutral for humans, but competing with human-beneficial activity for computing resources) and pay for their own compute. These agents might be surprisingly hard to stamp out. See <a href="https://www.lesswrong.com/posts/bmmFLoBAWGnuhnqq5/capital-ownership-will-not-prevent-human-disempowerment">this post by Beren Millidge</a> for more on this risk.</p><h3>Machine learning and inference frameworks</h3><p>Examples: <a href="https://github.com/pytorch/pytorch">PyTorch</a>, <a href="https://jax.readthedocs.io/en/latest/">JAX</a>, <a href="https://keras.io/">Keras</a>, <a href="https://www.modular.com/mojo">Mojo</a>, <a href="https://github.com/JuliaAI/MLJ.jl">MLJ</a> (Julia&#8217;s ML framework), or <a href="https://juliadiff.org/">JuliaDiff</a> more broadly.</p><p>Machine learning frameworks are seldom brought up as the key abstraction for AI agents.
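The defining service of these frameworks, propagating a learning signal end-to-end through composed components, can be caricatured in pure Python with finite differences (a toy sketch; the two "components" and all names are illustrative assumptions, and real frameworks use exact autodiff instead):

```python
# Pure-Python toy: a learning signal flows end-to-end through two composed
# scalar "components" (a world model feeding an actor), which is what ML
# frameworks like PyTorch/JAX automate with autodiff at scale. The component
# names are illustrative; finite differences stand in for backpropagation.
def world_model(x, w):
    return w * x

def actor(state, a):
    return a * state

def loss(w, a, x=1.0, target=2.0):
    # The error at the very end of the pipeline...
    return (actor(world_model(x, w), a) - target) ** 2

def finite_diff_grad(f, params, eps=1e-6):
    """Approximate the gradient of f w.r.t. a dict of scalar parameters."""
    base = f(**params)
    return {name: (f(**dict(params, **{name: value + eps})) - base) / eps
            for name, value in params.items()}

def train_step(params, lr=0.1):
    # ...drives updates of the parameters of *both* components jointly.
    grads = finite_diff_grad(loss, params)
    return {k: v - lr * grads[k] for k, v in params.items()}
```

The point is that the error at the actor's output updates the world model's parameter too, without either component knowing about the other.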
However, they are <strong>key for composability at the computing platform level</strong> (see above), as well as for <strong>end-to-end (composable) neural net component optimisation</strong> that is an important piece of some people&#8217;s vision for AI agents:</p><ul><li><p>Yann LeCun&#8217;s vision towards autonomous machine intelligence (2022)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>: relies on end-to-end learning signal propagation across the components of this cognitive architecture (world model, actor, critic, etc.)</p></li><li><p>Cooperative language-guided inverse plan search (<a href="https://arxiv.org/abs/2402.17930">CLIPS</a>), by Zhi-Xuan, Ying, Mansinghka, and Tenenbaum (2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> is a Bayesian agent architecture that uses LLMs as probabilistic samplers within a larger probabilistic model, <a href="https://github.com/probcomp/CLIPS.jl">written in Julia</a>.</p></li><li><p>&#8220;Neural architecture-level&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> neuro-symbolic AI approaches, such as (van Bergen et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p></li></ul><p><strong>Composable?</strong>&#8212;Yes, and in a strong, <em>general</em> way. This is very much the point of machine learning frameworks.</p><h3>Neural net architectures</h3><p>&#8220;Neural net architectures&#8221; themselves can be thought to lump together dozens of finer sub-levels, all the way from CUDA kernels and activation functions up to high-level abstractions such as <a href="https://huggingface.co/blog/moe">Mixture of Experts</a>. 
Most of these lower sub-levels are not very relevant for AI agency architecture. However, some higher-level aspects and sub-levels of neural net architectures <em>are</em> very much relevant:</p><ul><li><p><strong>Information integration and/or &#8220;in-context&#8221; retrieval mechanisms</strong>, such as Transformer-style <a href="https://huggingface.co/docs/transformers/en/attention">attention</a>, <a href="https://huggingface.co/blog/lbourdois/get-on-the-ssm-train">state-space</a> modelling, recurrency, and diffusion.</p></li><li><p><strong>Model adaptation</strong> <strong>methods</strong>, such as <a href="https://arxiv.org/abs/2302.00487">continual learning</a>/pre-training (specific methods are very much dependent on the specific NN architecture, be it a standard GPT-style Transformer, a spiking or another <a href="https://www.youtube.com/watch?v=aisgNLypUKs">biologically plausible</a> NN, or a &#8220;<a href="https://www.liquid.ai/">liquid</a>&#8221; NN), or (post-training) <a href="https://huggingface.co/papers/2106.09685">low-rank adaptations</a>.</p></li><li><p><a href="https://arxiv.org/abs/2309.08600">Sparse autoencoder features</a> and higher-level objects/abstractions on top of them (<a href="https://arxiv.org/abs/2403.19647">circuits</a>, <a href="https://arxiv.org/abs/2410.06981">spaces</a>, etc.) could be used to <em>monitor</em> or <em>control</em> AI agent behavior via &#8220;pulling the threads&#8221;: <a href="https://arxiv.org/abs/2403.19647">enabling or disabling features dynamically</a>, see <a href="https://interplm.ai/">InterPLM</a>. Indeed, sparse autoencoder feature and circuit dynamics may be very dependent on specific neural net architectures (and even small features of those such as activation functions and LayerNorms, cf. 
Elhage et al., 2022<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a>), or even not available (at least, practically) or &#8220;crippled&#8221; in some NN architectures, perhaps <a href="https://www.liquid.ai/">liquid</a> ones?</p></li></ul><h5>Composability</h5><p>Most relevant NN architectures, such as Transformers, <em>are</em> composable. This is the essence of <a href="https://gwern.net/scaling-hypothesis">the scaling hypothesis</a>. On the neural net architecture level, composition means &#8220;<a href="https://www.reddit.com/r/ProgrammerHumor/comments/8c1i45/stack_more_layers/">stack more layers &#129322;</a>&#8221; or, alternatively, &#8220;<a href="https://huggingface.co/deepseek-ai/DeepSeek-V3">add more MoE experts</a>&#8221;.</p><p>However, one should consider what this kind of composition means for the AI agency-relevant aspects of NN architectures (<em>information integration and retrieval</em>, <em>model adaptability</em>, and <em>mechanistic interpretability and control</em>), or for behavioural aspects such as scheming (Meinke et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><p>The influences of composition (scaling) on these agent-relevant aspects <em>could</em> (at least, for some NN architectures) appear to be non-monotonic: first positive, but later negative, as the models are scaled up.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><h5>Hijackability</h5><p>In the LessWrong lore, the theoretical possibility of the NN becoming self-aware during the training process and hijacking it (or at least steering it) is known as <a href="https://www.lesswrong.com/posts/uXH4r6MmKPedk8rMA/gradient-hacking">gradient hacking</a>.
I think that &#8220;strong&#8221; gradient hacking, as this concept was originally conceived, i.e., during a non-contextualised forward pass of an LLM, is basically impossible: see <a href="https://www.lesswrong.com/posts/w2TAEvME2yAG9MHeq/gradient-hacking-is-extremely-difficult">Gradient hacking is extremely difficult</a> (Millidge, 2023). <a href="https://arxiv.org/abs/2412.14093">Alignment faking in large language models</a> (Greenblatt et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> has also been <a href="https://x.com/davidad/status/1869822392287285565">called</a> &#8220;gradient hacking&#8221;, but that would be an instance of the hijackability of the <em>training data generation process</em> (see the following section) rather than the NN architecture itself.</p><h3>Learning problem definitions and training data</h3><p>In ML research and engineering, <em>NN architectures</em> (see the previous section) and <em>learning problem definitions</em> are usually developed and studied together as <strong>ML architectures</strong>. Of course, this makes a lot of sense because the characteristics and the success or failure of ML models depend on both. However, in relation to AI agents, <strong>learning problem definitions</strong>, <strong>training data</strong>, and <strong>training strategies (aka 
protocols, processes, recipes)</strong> bring up somewhat different considerations from NN architectures, hence I discuss them separately.</p><p>Like NN architectures, learning problem definitions and training strategies are complicated groups of interacting models on different levels that are bundled together into some packages:</p><ul><li><p><strong>Types of inputs and outputs for the model</strong>, such as tokens of text, synthetic/abstract tokens, <a href="https://arxiv.org/abs/2507.07955">dynamic chunks</a> (Hwang et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>, <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">AST</a> tokens or graphs (for program synthesis), image embeddings, etc.</p></li><li><p><strong>Learning protocol/setting</strong>, such as self-supervised learning aka &#8220;next token prediction&#8221;, online/offline RL, on/off-policy RL, direct preference learning, etc., and their sequencing across pre- and <a href="https://www.interconnects.ai/p/the-state-of-post-training-2025">post-training</a>.</p></li><li><p><strong>Loss/objective</strong>, such as token prediction objectives, energy-based modelling (EBM) objectives, RL objectives.</p></li><li><p><strong>Training data collection or generation</strong>, such as:</p><ul><li><p>by paid humans (<a href="https://scale.com/">Scale AI</a>);</p></li><li><p>making screen recordings of <a href="https://x.com/RichardMCNgo/status/1875093600612261909">real experts doing real work</a> (<a href="https://workshoplabs.ai/">Workshop Labs</a> is betting on this);</p></li><li><p>simulation (e.g., for training embodied agents; see Nvidia&#8217;s Isaac Sim); </p></li><li><p>synthetic data generation, (open-ended) <a href="https://www.mechanize.work/blog/sweatshop-data-is-over/">interactive environments</a> such as Minecraft or <a href="https://github.com/Metta-AI/metta">Metta</a>; 
or</p></li><li><p><a href="https://kevinlu.ai/the-only-important-technology-is-the-internet">product-research co-design</a>.</p></li></ul></li><li><p><strong>Training data sequencing</strong>, such as <a href="https://en.wikipedia.org/wiki/Curriculum_learning">curriculum learning</a>.</p></li></ul><p>There are obviously endless combinations of the above, each making a distinct learning problem definition. Here&#8217;s a tiny sample, in which I try to reflect the diversity of approaches that are being considered, including in relation to AI agents: </p><ul><li><p><strong>Reasoning models</strong> that are post-trained with RL to search in the token space:</p><ul><li><p>&#8220;Simple&#8221; Chain-of-Thought as described in <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a> (2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>,</p></li><li><p><a href="https://arxiv.org/abs/2501.04682">Meta-CoT</a> (Xiang et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a>,</p></li><li><p><a href="https://arxiv.org/abs/2506.08388">Reinforcement Learning Teachers</a> (Cetin et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a>,</p></li><li><p><a href="https://arxiv.org/abs/2503.19618">using perplexity as a reward signal</a> (Tang et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>, and</p></li><li><p><em>multi-chain-of-thought</em> as in OpenAI&#8217;s o3-pro, Gemini Deep Think, and Grok 4 Heavy.</p></li></ul></li><li><p>Pre-training LLMs with retrieval (Shao et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" 
target="_self">16</a>.</p></li><li><p>Latent Program Network for gradient-based search in latent program space during test time (Bonnet and Macfarlane, 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>.</p></li><li><p>Joint Bayesian inference of graphical structure and parameters with a single GFlowNet (Deleu et al., 2023)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a>.</p></li><li><p>Large Concept Models (LCM team, 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-19" href="#footnote-19" target="_self">19</a>.</p></li><li><p>Joint Embedding Predictive Architecture (LeCun, 2022) [3].</p></li><li><p>Various neuro-symbolic approaches: see (Wan et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-20" href="#footnote-20" target="_self">20</a>, and the next section.</p></li><li><p>Graph Neural Networks (GNNs), Generative Adversarial Networks (GANs), etc.</p></li></ul><h5>Composability</h5><p>In the context of learning problem definitions, <strong>the </strong><em><strong>generalization capacity</strong></em><strong> (of an ML architecture) is exactly what I call </strong><em><strong>composability</strong></em><strong> in this post</strong>. 
In the ML literature, there are plenty of direct claims that the generalization capacity varies between different problem definitions, such as <a href="https://www.sciencedirect.com/science/article/pii/S0004370221000862">Reward is Enough</a> (Silver et al., 2021)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-21" href="#footnote-21" target="_self">21</a>, <a href="https://arxiv.org/abs/2501.17161">SFT memorizes, RL generalizes</a> (Chu et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-22" href="#footnote-22" target="_self">22</a>. Famously, the problem that LLMs were originally designed to learn (predict the next token) was not widely thought to be composable before GPT-3 (Brown et al., 2020)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-23" href="#footnote-23" target="_self">23</a>. Reasoning models trained on top of LLMs (2024 onwards) learn to solve even higher-level problems. However, there are still doubts about whether this approach can generalize (aka &#8220;scale&#8221;) much beyond certain domains with sharp and easily verifiable rewards, namely math and programming. Indeed, it is this doubt or disbelief that motivated people to develop a lot of alternative learning problem definitions that are hoped to generalize/scale better, including most of the learning problem definitions mentioned above.</p><p>Note that here &#8220;learning problem composability&#8221; should mean that the problem is still solved with a single model inference episode (rollout). Solving bigger/harder problems with scaffolding shifts into the territory of <em>compound AI systems</em>; see below.</p><p>A notable concern with LLMs and LLM-based reasoning models is that they often can&#8217;t do targeted modifications of existing artifacts as well as they can generate artifacts anew. 
Such targeted modifications include active inference and plan corrections for AI agents. So, it can be said that <strong>while LLMs generalize (scale) relatively well to higher-level (composite) problems, they </strong><em><strong>don&#8217;t</strong></em><strong> enable a rich repertoire of <a href="https://www.oilshell.org/blog/2022/02/diagrams.html#shell-and-distributed-systems">operations</a> in relation to these problems</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-24" href="#footnote-24" target="_self">24</a>.</p><p>An alternative, <em>data-centric</em> view seems to be on the rise (Zha et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-25" href="#footnote-25" target="_self">25</a>, which includes some or all of these beliefs: (1) current ML architectures don&#8217;t actually generalize in some strong sense; (2) &#8220;in-distribution is the new generalization&#8221;; (3) open-ended pipelines/flywheels/environments for generating high-quality and rich training data are &#8220;all you need&#8221; to move to the next S-curve in intelligence scaling. See &#8220;<a href="https://kevinlu.ai/the-only-important-technology-is-the-internet">The only important technology is the Internet</a>&#8221; by Kevin Lu (2025) for a detailed argument. In the context of this post, this discussion moves into the territory of much higher levels (user interface, medium, product, economic/social networks), discussed later.</p><h5>Hijackability</h5><p>The NN (such as an LLM-based reasoning model) hijacking the reinforcement learning process it&#8217;s been subjected to is known as <strong><a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">reward hacking</a></strong> (Weng, 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-26" href="#footnote-26" target="_self">26</a>. 
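This reward-hacking dynamic can be illustrated with a deliberately minimal toy. All values here are invented for illustration: the "proxy" is response length, standing in for an imperfect learned reward model that mistakes verbosity for helpfulness.

```python
# Toy illustration of reward hacking (all values invented): optimising a
# flawed proxy reward selects a different behaviour than optimising the
# true, intended reward.

candidates = {
    "short, correct answer": 1.0,
    "long, padded, evasive answer full of caveats, hedges, and repetition": 0.2,
}

def proxy_reward(response: str) -> float:
    # Flawed proxy: longer responses look "more helpful" to the reward model.
    return float(len(response))

def true_reward(response: str) -> float:
    return candidates[response]

best_by_proxy = max(candidates, key=proxy_reward)
best_by_true = max(candidates, key=true_reward)
# The proxy optimum (the padded answer) diverges from the true optimum.
```

The policy that "wins" under the proxy is exactly the one the true reward penalizes, which is the whole problem in one line.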
In the more specific context of RLHF, the manifestation of reward hacking is known as <em>sycophancy</em> (Sharma et al., 2023)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-27" href="#footnote-27" target="_self">27</a>.</p><p>Training data hijacking is known as <em>poisoning</em>. The training data generation process could also be hijacked in various ways, either by LLMs themselves (as in <a href="https://arxiv.org/abs/2412.14093">Greenblatt et al., 2024</a>) or, out of perverse incentives, by the corporations owning the products (such as social media platforms) that are used to generate the training data or labels.</p><p>See <a href="https://arxiv.org/abs/2504.15585">Wang, Zhang, et al.</a> (2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-28" href="#footnote-28" target="_self">28</a> for many other ways in which the learning problem definition or the training data could be hijacked or <a href="https://en.wikipedia.org/wiki/Goodhart%27s_law">goodharted</a>.</p><h3>Languages, ontologies, and data models <em>for</em> LLM-generated programs, plans, and knowledge representation</h3><p>Generating programs with LLMs is the most tractable neurosymbolic architecture because it&#8217;s easily <em>composable</em> with LLM-based reasoning models when they generate natural language. 
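A minimal sketch of why this architecture composes so easily: a generate-execute-feedback loop that would wrap any reasoning model behind a `generate` function. The function names are my own, and `generate` is stubbed with a fixed program; a real system would prompt a model there.

```python
# Generate-execute-feedback loop for LLM program synthesis (sketch).
# `generate` stands in for a call to a reasoning model; here it is stubbed.

def generate(task, error=None):
    # A real system would prompt a model with the task description
    # (and the previous error, if any) and return its Python source.
    return "def solve(xs):\n    return sorted(xs)"

def synthesize_and_run(task, test_input, retries=3):
    error = None
    for _ in range(retries):
        source = generate(task, error)
        namespace = {}
        try:
            exec(source, namespace)                # load the generated program
            return namespace["solve"](test_input)  # run it on the input
        except Exception as e:
            error = str(e)                         # feed the error back
    raise RuntimeError("no working program found")

result = synthesize_and_run("sort a list of numbers", [3, 1, 2])
```

The symbolic artifact (the program) is produced and consumed as plain text, which is exactly the interface reasoning models already speak.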
This architecture could also <em>reuse</em> the reasoning models with just a little extra RL training, if not simply prompting them to generate programs.</p><p>Program synthesis is a promising approach towards AI with a higher generalization capacity (Knoop and Chollet, 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-29" href="#footnote-29" target="_self">29</a>.</p><p>Apart from scaling general problem solving/intelligence, program synthesis is also thought to scale risk estimates and safety cases/assurances: see the <a href="https://arxiv.org/abs/2405.06624">Guaranteed Safe AI agenda</a> (Dalrymple et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-30" href="#footnote-30" target="_self">30</a>.</p><p>The boundaries between programming languages, DSLs, ontologies, and data models are fuzzy, and indeed in languages designed for powerful composability, such as Lisps and <a href="https://metta-lang.dev/">MeTTa</a>, program and data representations are completely <a href="https://en.wikipedia.org/wiki/Homoiconicity">homoiconic</a>.</p><p>The <strong>composability</strong> of this or that language, ontology, or data model is often hotly debated. For example, <a href="https://www.aria.org.uk/media/3nhijno4/aria-safeguarded-ai-programme-thesis-v1.pdf">Safeguarded AI</a> embodies davidad&#8217;s view that even existing mathematics is not entirely sufficient for general world modelling, let alone any of the existing ontologies and languages. On the other hand, Walters et al. 
(2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-31" href="#footnote-31" target="_self">31</a> <a href="https://arxiv.org/abs/2502.04249">argue</a> that hierarchical probabilistic models, readily expressible in probabilistic PLs or DSLs such as <a href="https://www.pymc.io/welcome.html">PyMC</a> in Python, are enough to scale safety cases to any practical level of precision or risk tolerance.</p><p>The <strong>hijackability</strong> (exploitability) of formal ontologies and data models in application to agentic behaviour and decision-making is discussed under the labels of <a href="https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems">coherence arguments, completeness, and money-pump arguments</a> (Thornley, 2023)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-32" href="#footnote-32" target="_self">32</a>.</p><h3>Natural language</h3><p>Natural language is the &#8220;native&#8221; interface for LLMs.</p><p>A tiny sample of abstraction levels developed by people on top of natural language includes <em>language-based reasoning</em> (more or less corresponds to Aristotelean logic), <em>role-based approach to business process modeling</em> (implicitly underpins much of business workflow automation with LLMs a.k.a. 
&#8220;enterprise AI agents&#8221;), and <em>law</em>.</p><p>Specific attempts by large labs to turn language (prompts) into <em>reliable specifications</em> (like protocols) include Anthropic&#8217;s <a href="https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback">Constitutional AI</a> (Bai et al., 2022)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-33" href="#footnote-33" target="_self">33</a> and OpenAI&#8217;s <a href="https://arxiv.org/abs/2412.16339">deliberative alignment</a> (Guan et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-34" href="#footnote-34" target="_self">34</a>. In &#8220;<a href="https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf">Practices for Governing Agentic AI Systems</a>&#8221; (Shavit, Agarwal, Brundage, et al., 2023)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-35" href="#footnote-35" target="_self">35</a>, OpenAI suggested that AI agents can be controlled through natural language means by three different roles: the <em>model developer</em>, the <em>system deployer</em>, and the <em>user</em>:</p><ul><li><p>The model developer does post-training with methods including something like constitutional AI or deliberative alignment (supervening on machine learning methods discussed in the &#8220;Learning problem definitions and training data&#8221; section above).</p></li><li><p>The system deployer chooses the system prompt in natural language. 
(The system deployer could also do fine-tuning, which would also be the application of a natural language-based method supervening on a machine learning method.)</p></li><li><p>The user sets the <em>goals</em> and <em>instructions</em> for the AI agent in natural language through their messages.</p></li></ul><p>Natural language is <em><strong>somewhat</strong></em><strong> composable</strong>, but not very reliably and scalably so.</p><p>Examples of the <strong>hijackability</strong> of natural language abstractions include <strong>LLM jailbreaks</strong> (see Wang, Zhang, et al., 2025) [28] and parasitic memes affecting humans and LLMs alike. In <a href="https://www.full-stack-alignment.ai/paper">Full-Stack Alignment</a> (Edelman et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-36" href="#footnote-36" target="_self">36</a>, the natural language agent control paradigm is called <em>values-as-text (VAT)</em>, and the ways in which values-as-text get hacked are reviewed, such as through politicized slogans, which are instances of parasitic memes.</p><h3>Compound AI systems, knowledge/memory management, and cognitive architectures</h3><p>Designs and methods in this class emphasize &#8220;wrapping&#8221; LLM calls into higher-level systems to unlock agentic capabilities, generalization, robustness, and controllability. 
Examples of such methods include:</p><ul><li><p>End-to-end prompt tuning: see <a href="https://dspy.ai/">DSPy</a>.</p></li><li><p>An LLM observing other LLMs&#8217; (or its own) outputs: the basic method used throughout, such as for <em>guardrails</em>, <em>reflection</em>, <em>planning</em>, etc.</p></li><li><p>The so-called <em>multi-agent systems</em> (MAS) are usually just combinations of the previous technique (an LLM observing the outputs of the preceding LLM call sequence) and variable role-based prompting.</p></li><li><p>Execution of a symbolic model that is generated by an LLM and feeding the results back into the LLM: see the section &#8220;Languages, ontologies, and data models for LLM-generated programs, plans, and knowledge representation&#8221; above.</p></li><li><p>Using LLMs as probability samplers within a Bayesian agent architecture: see <a href="https://arxiv.org/abs/2402.17930">CLIPS</a>.</p></li><li><p>Multiple purpose-trained NNs optimising towards a shared objective and communicating in the activation/representation space: see Joint Embedding Predictive Architecture (LeCun, 2022) [3].</p></li><li><p>Agentic memory through note-taking and note graph curation: see <a href="https://arxiv.org/abs/2502.12110">A-MEM</a> (Xu et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-37" href="#footnote-37" target="_self">37</a>.</p></li><li><p>Knowledge graph management with LLMs: see <a href="https://arxiv.org/abs/2410.11531">AGENTiGraph</a> (Zhao et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-38" href="#footnote-38" target="_self">38</a>, <a href="https://www.system.com/platform/system-graph">System.com&#8217;s knowledge graph platform</a>.</p></li></ul><p>The <strong>composability</strong> of compound AI systems and knowledge/memory management methods varies a lot depending on the specific system and design:</p><ul><li><p>DSPy is explicitly designed to 
approach a programming-language degree of composability through <a href="https://dspy.ai/learn/programming/signatures/">module signatures</a>.</p></li><li><p>Naive composition of LLMs observing other LLMs&#8217; outputs is probably not very composable, with high risks of correlated failures between the original and the &#8220;reviewer&#8221; LLM calls, <a href="https://github.com/chroma-core/context-rot">context rot</a>, etc.</p></li><li><p>The composability (scalability) of knowledge and memory management systems depends on the composability of the knowledge ontology used, if any (for open systems that are not governed by a single entity, it&#8217;s impossible to agree on and maintain a single shared ontology). If no formal ontology is used, these systems are limited by the composability of natural language: see the previous section.</p></li></ul><p>Guardrails, reflection, planning, so-called &#8220;multi-agent&#8221; interactions, and similar compound AI methods are sometimes implemented with so-called &#8220;<a href="https://www.latent.space/i/161759114/impact-of-agent-frameworks">agent frameworks</a>&#8221;, such as <a href="https://github.com/langchain-ai/langgraph">LangGraph</a>, <a href="https://github.com/microsoft/autogen">AutoGen</a>, or <a href="https://github.com/crewAIInc/crewAI">CrewAI</a>. 
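The signature-based composition mentioned above can be sketched without any framework. To be clear, this is not DSPy's actual API, just the underlying idea: modules declare named input and output fields, so they can be chained and checked like typed functions regardless of the prompt or model inside (stubbed here with plain lambdas).

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

@dataclass
class Module:
    # Declared signature: named input/output fields, independent of
    # whatever prompt or model sits inside `run`.
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    run: Callable[[Dict[str, str]], Dict[str, str]]

    def __call__(self, **kwargs) -> Dict[str, str]:
        assert set(kwargs) == set(self.inputs), "input signature mismatch"
        result = self.run(kwargs)
        assert set(result) == set(self.outputs), "output signature mismatch"
        return result

# Stubbed "LLM calls"; real modules would prompt a model here.
draft = Module(("question",), ("answer",),
               lambda kw: {"answer": "draft: " + kw["question"]})
review = Module(("answer",), ("verdict",),
                lambda kw: {"verdict": "ok" if kw["answer"] else "revise"})

# Composition: the reviewer module observes the drafter's output,
# and the declared signatures make the chaining checkable.
verdict = review(**draft(question="What is composability?"))
```

The point is that the contract between modules lives in the signatures, not in the brittle prose of the prompts, which is what makes this style of composition more programming-language-like.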
Considering that this CAIS level is being constantly <a href="https://www.latent.space/p/noam-brown">eaten up by monolithic reasoning models</a>, and that it doesn&#8217;t seem to categorically increase the composability/scalability and robustness/controllability of reasoning models, it&#8217;s remarkable to me how much attention this level attracts.</p><h3>Agent interaction protocols and contracting</h3><p>The primary examples here, of course, are <a href="https://modelcontextprotocol.io/">MCP</a> (insofar as people think about it as a protocol for agent interaction and composition: see <a href="https://github.com/lastmile-ai/mcp-agent">mcp-agent</a>) and the <a href="https://a2a-protocol.org/latest/">A2A protocol</a>. Smart contracts are sometimes brought up, too (Karim et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-39" href="#footnote-39" target="_self">39</a>. See <a href="https://www.lesswrong.com/s/9SJM9cdgapDybPksi">Technologies for Intelligent Voluntary Cooperation</a> by Duettmann, Miller, and Peterson (<a href="https://foresightinstitute.substack.com/p/coming-soon">2022</a>) for a much deeper dive into many other related abstraction levels.</p><p>The incumbent (&#8220;pre-AI&#8221;) abstraction in this class is <em>contract law</em>. Observe that <strong>good old contract law is more composable/scalable than shiny new MCP and A2A</strong>: for example, MCP and A2A are completely oblivious to interactions between more than two primary parties. So, we should expect agent interaction protocols to evolve in the direction of contract law, or even adopt it wholesale if AI agents become <a href="https://en.wikipedia.org/wiki/Legal_person">legal persons</a>. 
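To make the contrast concrete, here is a toy data model (the structure and names are entirely invented for illustration): a bilateral session in the MCP/A2A style has exactly two ends, while a contract can bind any number of primary parties with obligations between arbitrary pairs; the escrow arrangement below needs three.

```python
from dataclasses import dataclass, field

@dataclass
class Contract:
    # A contract binds arbitrarily many parties with pairwise obligations,
    # unlike a bilateral (client-server style) protocol session.
    parties: set = field(default_factory=set)
    obligations: list = field(default_factory=list)  # (debtor, creditor, what)

    def add_obligation(self, debtor, creditor, what):
        self.parties |= {debtor, creditor}
        self.obligations.append((debtor, creditor, what))

escrow = Contract()
escrow.add_obligation("buyer", "escrow-agent", "deposit the payment")
escrow.add_obligation("seller", "buyer", "deliver the goods")
escrow.add_obligation("escrow-agent", "seller", "release payment on delivery")
# Three primary parties bound in one agreement, which a single
# bilateral session cannot express.
```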
See Goldstein and Salib (2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-40" href="#footnote-40" target="_self">40</a> for an argument for granting AIs legal personhood.</p><p>The <strong>hijackability</strong> of interaction and contracting protocols is studied under the rubrics of <a href="https://arxiv.org/abs/2412.16384">algorithmic contract theory</a>, game theory, and mechanism design (D&#252;tting et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-41" href="#footnote-41" target="_self">41</a>.</p><p>Note that while contract law is potentially more composable than other agent interaction mechanisms, <strong>the cost of exploits in contract law is much higher</strong> because of how slow and costly it is to patch the law compared to protocol specifications and technical mechanisms, especially if we consider that AI agents could change tactics and act at a pace several orders of magnitude faster than humans and (human) organizations.</p><h3>User interface, human&#8212;AI collaboration design</h3><p>The currently dominant UX paradigms for AI are simply rehashes of three very old ideas:</p><ul><li><p><strong>Chat</strong>: ChatGPT, Claude, Cursor, etc.</p></li><li><p><strong>Command line interface</strong>: Claude Code, Gemini CLI, etc.</p></li><li><p><strong>Delegation</strong>: OpenAI&#8217;s Operator, recently rebranded as Agent, Deep Research, OpenAI Codex.</p></li></ul><p><strong>All three fall short on composability with the human&#8217;s specific knowledge/competence/skill</strong> (for example, coding agents effectively force software engineers to <em>become</em> <em>frantic project managers</em> instead of <em>building their engineering skills</em>) <strong>as well as the general <a href="https://forum.effectivealtruism.org/posts/qK4GdBNiP6fufqzDy/reasoning-decay">agency, creativity, and reasoning capacities</a></strong>: they incentivize people to behave in 
agency-shrinking ways and make it hard to act in agency-expanding ways. What&#8217;s worse, all these effects actively <em>undermine humans&#8217; willingness (and, eventually, ability) to make up for vulnerabilities (&#8220;hijackabilities&#8221;) of the lower levels</em>. <a href="https://www.theregister.com/2025/07/07/scholars_try_to_fool_llm_reviewers/">Academics sneaking prompt injections into papers to fool reviewers who delegate their work to AIs</a> is a good example of this process.</p><h3>Some higher levels</h3><p>There are plenty of yet higher classes of levels relevant for AI agency architecture that I won&#8217;t discuss in this post, but I want to point to a few interesting ones.</p><p><strong>Microeconomic abstractions and platforms</strong>: cryptocurrencies, <a href="https://engineeringideas.substack.com/p/gaia-network-an-illustrated-primer">Free Energy Reduction</a> (FER), <a href="https://en.wikipedia.org/wiki/Prediction_market">prediction markets</a>, <a href="https://icml.cc/virtual/2024/workshop/29967">other agentic markets</a>.</p><p><strong>Media and large-scale platforms for human and AI interaction</strong>. Kevin Lu writes insightfully about this level in &#8220;<a href="https://kevinlu.ai/the-only-important-technology-is-the-internet">The only important technology is the Internet</a>&#8221;. 
See also: the <a href="https://llmstxt.org/">/llms.txt initiative</a>, <a href="https://github.com/nlweb-ai/NLWeb">NLWeb</a> (a protocol for the conversational web), Agentic Web Interface (L&#249; et al., 2025)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-42" href="#footnote-42" target="_self">42</a>, Meta&#8217;s Metaverse, Jim Rutt&#8217;s idea of <a href="https://jimruttshow.blubrry.net/the-jim-rutt-show-transcripts/transcript-of-ep-239-alex-fink-on-improving-information-quality/">info agents</a>, <a href="https://nostr.org/">Nostr</a> (a permissionless decentralised protocol for free speech, <a href="https://www.youtube.com/watch?v=NS5JI-ksaXs">Jack Dorsey&#8217;s new favourite</a>).</p><p><strong>Physical control, privacy, and data ownership</strong>. A lot of people assign very high value to having access to the model weights (&#8220;not your weights, not your brain/agent&#8221;), despite having no plans to fine-tune them, and despite the half-life of agent deployments (perhaps measured in months on average) making self-hosting uneconomical. Hence, this value must be due to privacy, surveillance, and censorship concerns.</p><p>Relatedly, I&#8217;ve proposed the <a href="https://engineeringideas.substack.com/p/personal-agents">Personal Agents</a> toolkit (as an alternative to cookie-cutter &#8220;agent product packages&#8221; from &#8220;Big Token&#8221;: ChatGPT, Gemini, Grok, and Claude) to foster the adoption of open-source agent designs that should in turn <em>enable open-ended innovation</em> at yet higher, <em>institution and governance</em> levels. 
<a href="https://ii.inc/web/blog/post/commons-week-july-2025">Intelligent Internet&#8217;s Contexts</a> and <a href="https://github.com/open-webui/open-webui">Open WebUI</a> already embody my vision of Personal Agents to a significant degree.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Cf. Smith, Samuel. &#8220;<a href="https://github.com/SmithSamuelM/Papers/blob/f7e545909d654de6e1ae2f1867bcfe8f201050c5/presentations/TSPSlides20230308.web.pdf">Trust Spanning-layer Protocol (TSP) Proposal</a>.&#8221; (2023), slide 61.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Cf. <a href="https://engineeringideas.substack.com/i/154060571/diversity-hourglass-architecture">OS as the middle level in the prototypical diversity hourglass architecture</a> (a section in previous post in this series).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>LeCun, Yann. &#8220;<a href="https://openreview.net/pdf?id=BZ5a1r-kVsf">A Path towards Autonomous Machine Intelligence</a>,&#8221; 2022. https://openreview.net/pdf?id=BZ5a1r-kVsf.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Zhi-Xuan, Tan, Lance Ying, Vikash Mansinghka, and Joshua B Tenenbaum. &#8220;<a href="https://arxiv.org/abs/2402.17930">Pragmatic Instruction Following and Goal Assistance via Cooperative Language-Guided Inverse Planning</a>.&#8221; arXiv.org, 2024. 
https://arxiv.org/abs/2402.17930.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>As contrasted with <em>compound AI systems</em>-level neurosymbolic approaches such as AlphaGeometry. More on compound AI systems below in the post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>van Bergen, Ruben, Justus H&#252;botter, and Pablo Lanillos. "<a href="https://arxiv.org/abs/2411.17438">Object-centric proto-symbolic behavioural reasoning from pixels</a>." arXiv preprint arXiv:2411.17438 (2024).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Elhage, et al., "<a href="https://transformer-circuits.pub/2022/solu/index.html">Softmax Linear Units</a>", Transformer Circuits Thread, 2022.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Meinke, Alexander, Bronson Schoen, J&#233;r&#233;my Scheurer, Mikita Balesni, Rusheb Shah, and Marius Hobbhahn. &#8220;<a href="https://arxiv.org/abs/2412.04984">Frontier Models Are Capable of In-Context Scheming</a>.&#8221; arXiv.org, 2024. https://arxiv.org/abs/2412.04984.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Of course, this observation is not new. 
In fact, this is <em>the</em> major theme of AI Safety-adjacent concerns and opposition to scaling frontier LLMs by leading corporations. However, this post shows that this concern is just a single corner of a much larger architecture space that surfaces <em>many more</em> concerns and engineering trade-offs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Greenblatt, Ryan, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, et al. &#8220;<a href="https://arxiv.org/abs/2412.14093">Alignment Faking in Large Language Models</a>.&#8221; arXiv.org, 2024. https://arxiv.org/abs/2412.14093.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Hwang, Sukjun, Brandon Wang, and Albert Gu. &#8220;<a href="https://arxiv.org/abs/2507.07955">Dynamic Chunking for End-To-End Hierarchical Sequence Modeling</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2507.07955.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, et al. &#8220;<a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</a>.&#8221; arXiv.org, 2025. 
https://arxiv.org/abs/2501.12948.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Xiang, Violet, Charlie Snell, Kanishk Gandhi, Alon Albalak, Anikait Singh, Chase Blagden, Duy Phung, et al. &#8220;<a href="https://arxiv.org/abs/2501.04682">Towards System 2 Reasoning in LLMs: Learning How to Think with Meta Chain-of-Thought</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2501.04682.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Cetin, Edoardo, Tianyu Zhao, and Yujin Tang. &#8220;<a href="https://arxiv.org/abs/2506.08388">Reinforcement Learning Teachers of Test Time Scaling</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2506.08388.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Tang, Yunhao, Sid Wang, Lovish Madaan, and R&#233;mi Munos. &#8220;<a href="https://arxiv.org/abs/2503.19618">Beyond Verifiable Rewards: Scaling Reinforcement Learning for Language Models to Unverifiable Data</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2503.19618.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>Shao, Rulin, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, and Pang Wei Koh. &#8220;<a href="https://arxiv.org/abs/2407.12854">Scaling Retrieval-Based Language Models with a Trillion-Token Datastore</a>.&#8221; arXiv.org, 2024. 
https://arxiv.org/abs/2407.12854.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>Bonnet, Cl&#233;ment, and Matthew V Macfarlane. &#8220;<a href="https://arxiv.org/abs/2411.08706">Searching Latent Program Spaces</a>.&#8221; arXiv.org, 2024. https://arxiv.org/abs/2411.08706.&#8204;</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>Deleu, Tristan, Mizu Nishikawa-Toomey, Jithendaraa Subramanian, Nikolay Malkin, Laurent Charlin, and Yoshua Bengio. &#8220;<a href="https://proceedings.neurips.cc/paper_files/paper/2023/hash/639a9a172c044fbb64175b5fad42e9a5-Abstract-Conference.html">Joint Bayesian Inference of Graphical Structure and Parameters with a Single Generative Flow Network</a>.&#8221; <em>Advances in Neural Information Processing Systems</em> 36 (December 15, 2023): 31204&#8211;31.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-19" href="#footnote-anchor-19" class="footnote-number" contenteditable="false" target="_self">19</a><div class="footnote-content"><p>LCM team, Lo&#239;c Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, et al. &#8220;<a href="https://arxiv.org/abs/2412.08821">Large Concept Models: Language Modeling in a Sentence Representation Space</a>.&#8221; arXiv.org, 2024. 
https://arxiv.org/abs/2412.08821.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-20" href="#footnote-anchor-20" class="footnote-number" contenteditable="false" target="_self">20</a><div class="footnote-content"><p>Wan, Zishen, Che-Kai Liu, Hanchen Yang, Chaojian Li, Haoran You, Yonggan Fu, Cheng Wan, Tushar Krishna, Yingyan Lin, and Arijit Raychowdhury. &#8220;<a href="https://arxiv.org/abs/2401.01040">Towards Cognitive AI Systems: A Survey and Prospective on Neuro-Symbolic AI</a>.&#8221; arXiv.org, 2024. https://arxiv.org/abs/2401.01040.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-21" href="#footnote-anchor-21" class="footnote-number" contenteditable="false" target="_self">21</a><div class="footnote-content"><p>Silver, David, Satinder Singh, Doina Precup, and Richard S Sutton. &#8220;<a href="https://www.sciencedirect.com/science/article/pii/S0004370221000862">Reward Is Enough</a>.&#8221; Artificial Intelligence 299 (May 24, 2021): 103535&#8211;35. https://doi.org/10.1016/j.artint.2021.103535.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-22" href="#footnote-anchor-22" class="footnote-number" contenteditable="false" target="_self">22</a><div class="footnote-content"><p>Chu, Tianzhe, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. &#8220;<a href="https://arxiv.org/abs/2501.17161">SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-Training</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2501.17161.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-23" href="#footnote-anchor-23" class="footnote-number" contenteditable="false" target="_self">23</a><div class="footnote-content"><p>Brown, Tom B, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 
&#8220;<a href="https://arxiv.org/abs/2005.14165">Language Models Are Few-Shot Learners</a>.&#8221; arXiv.org, 2020. https://arxiv.org/abs/2005.14165.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-24" href="#footnote-anchor-24" class="footnote-number" contenteditable="false" target="_self">24</a><div class="footnote-content"><p>The kind of &#8220;M types x N operations with a narrow waist&#8221; architecture discussed in the blog post &#8220;<a href="https://www.oilshell.org/blog/2022/02/diagrams.html">The Internet Was Designed With a Narrow Waist</a>&#8221; is closely related to <strong>bow-ties</strong> that are also a part of John Doyle&#8217;s architecture theory, and themselves are enabled by diversity hourglass architectures. I haven&#8217;t discussed bow-ties in the previous post in this series, but will return to them in the next post.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-25" href="#footnote-anchor-25" class="footnote-number" contenteditable="false" target="_self">25</a><div class="footnote-content"><p>Zha, Daochen, Zaid Pervaiz Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. &#8220;Data-Centric Artificial Intelligence: A Survey.&#8221; ACM Computing Surveys, January 6, 2025. https://doi.org/10.1145/3711118.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-26" href="#footnote-anchor-26" class="footnote-number" contenteditable="false" target="_self">26</a><div class="footnote-content"><p>Weng, Lilian. &#8220;<a href="https://lilianweng.github.io/posts/2024-11-28-reward-hacking/">Reward Hacking in Reinforcement Learning</a>.&#8221; Lil&#8217;Log (Nov 2024). 
https://lilianweng.github.io/posts/2024-11-28-reward-hacking/.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-27" href="#footnote-anchor-27" class="footnote-number" contenteditable="false" target="_self">27</a><div class="footnote-content"><p>Sharma, Mrinank, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Newton Cheng, et al. &#8220;<a href="https://arxiv.org/abs/2310.13548">Towards Understanding Sycophancy in Language Models</a>.&#8221; arXiv.org, 2023. https://arxiv.org/abs/2310.13548.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-28" href="#footnote-anchor-28" class="footnote-number" contenteditable="false" target="_self">28</a><div class="footnote-content"><p>Wang, Kun, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, et al. &#8220;<a href="https://arxiv.org/abs/2504.15585">A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2504.15585.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-29" href="#footnote-anchor-29" class="footnote-number" contenteditable="false" target="_self">29</a><div class="footnote-content"><p>Knoop, Mike and Fran&#231;ois Chollet. &#8220;<a href="https://arcprize.org/blog/beat-arc-agi-deep-learning-and-program-synthesis">How to Beat ARC-AGI by Combining Deep Learning and Program Synthesis</a>,&#8221; 2024. https://arcprize.org/blog/beat-arc-agi-deep-learning-and-program-synthesis.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-30" href="#footnote-anchor-30" class="footnote-number" contenteditable="false" target="_self">30</a><div class="footnote-content"><p>Dalrymple, David davidad, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, et al. 
&#8220;<a href="https://arxiv.org/abs/2405.06624">Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems</a>.&#8221; arXiv.org, 2024. https://arxiv.org/abs/2405.06624.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-31" href="#footnote-anchor-31" class="footnote-number" contenteditable="false" target="_self">31</a><div class="footnote-content"><p>Walters, Michael, Rafael Kaufmann, Justice Sefas, and Thomas Kopinski. &#8220;<a href="https://arxiv.org/abs/2502.04249">Free Energy Risk Metrics for Systemically Safe AI: Gatekeeping Multi-Agent Study</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2502.04249.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-32" href="#footnote-anchor-32" class="footnote-number" contenteditable="false" target="_self">32</a><div class="footnote-content"><p>Thornley, Elliott. &#8220;<a href="https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems">There Are No Coherence Theorems</a>.&#8221; Lesswrong.com, February 20, 2023. https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-33" href="#footnote-anchor-33" class="footnote-number" contenteditable="false" target="_self">33</a><div class="footnote-content"><p>Bai, Yuntao, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, et al. &#8220;<a href="https://arxiv.org/abs/2212.08073">Constitutional AI: Harmlessness from AI Feedback</a>.&#8221; arXiv.org, 2022. 
https://arxiv.org/abs/2212.08073.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-34" href="#footnote-anchor-34" class="footnote-number" contenteditable="false" target="_self">34</a><div class="footnote-content"><p>Guan, Melody Y, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, et al. &#8220;<a href="https://arxiv.org/abs/2412.16339">Deliberative Alignment: Reasoning Enables Safer Language Models</a>.&#8221; arXiv.org, 2024. https://arxiv.org/abs/2412.16339.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-35" href="#footnote-anchor-35" class="footnote-number" contenteditable="false" target="_self">35</a><div class="footnote-content"><p>Shavit, Yonadav, Sandhini Agarwal, Miles Brundage, Steven Adler, Cullen O'keefe, Rosie Campbell, Teddy Lee, et al. &#8220;<a href="https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf">Practices for Governing Agentic AI Systems</a>.&#8221; 2023. https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-36" href="#footnote-anchor-36" class="footnote-number" contenteditable="false" target="_self">36</a><div class="footnote-content"><p>Edelman, Joe, Tan Zhi-Xuan, Ryan Lowe, Oliver Klingefjord, Vincent Wang-Ma&#347;cianica, Matija Franklin, Ryan Othniel Kearns, Ellie Hain, Atrisha Sarkar, et al. &#8220;<a href="https://www.full-stack-alignment.ai/paper">Full-Stack Alignment: Co&#8209;Aligning AI and Institutions with Thick Models of Value</a>.&#8221; 2025. https://www.full-stack-alignment.ai/paper.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-37" href="#footnote-anchor-37" class="footnote-number" contenteditable="false" target="_self">37</a><div class="footnote-content"><p>Xu, Wujiang, Kai Mei, Hang Gao, Juntao Tan, Zujie Liang, and Yongfeng Zhang. 
&#8220;<a href="https://arxiv.org/abs/2502.12110">A-MEM: Agentic Memory for LLM Agents</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2502.12110.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-38" href="#footnote-anchor-38" class="footnote-number" contenteditable="false" target="_self">38</a><div class="footnote-content"><p>Zhao, Xinjie, Moritz Blum, Rui Yang, Boming Yang, Luis M&#225;rquez Carpintero, M&#243;nica Pina-Navarro, Tony Wang, et al. &#8220;<a href="https://arxiv.org/abs/2410.11531">AGENTiGraph: An Interactive Knowledge Graph Platform for LLM-Based Chatbots Utilizing Private Data</a>.&#8221; arXiv.org, 2024. https://arxiv.org/abs/2410.11531.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-39" href="#footnote-anchor-39" class="footnote-number" contenteditable="false" target="_self">39</a><div class="footnote-content"><p>Karim, Md Monjurul, Dong Hoang Van, Sangeen Khan, Qiang Qu, and Yaroslav Kholodov. &#8220;<a href="https://www.mdpi.com/1999-5903/17/2/57">AI Agents Meet Blockchain: A Survey on Secure and Scalable Collaboration for Multi-Agents</a>.&#8221; Future Internet 17, no. 2 (February 2, 2025): 57&#8211;57. https://doi.org/10.3390/fi17020057.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-40" href="#footnote-anchor-40" class="footnote-number" contenteditable="false" target="_self">40</a><div class="footnote-content"><p>Salib, Peter, and Simon Goldstein. &#8220;<a href="https://doi.org/10.2139/ssrn.5353214">AI Rights for Human Flourishing</a>.&#8221; 2025. https://doi.org/10.2139/ssrn.5353214.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-41" href="#footnote-anchor-41" class="footnote-number" contenteditable="false" target="_self">41</a><div class="footnote-content"><p>Duetting, Paul, Michal Feldman, and Inbal Talgam-Cohen. 
&#8220;<a href="https://arxiv.org/abs/2412.16384">Algorithmic Contract Theory: A Survey</a>.&#8221; arXiv.org, 2024. https://arxiv.org/abs/2412.16384.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-42" href="#footnote-anchor-42" class="footnote-number" contenteditable="false" target="_self">42</a><div class="footnote-content"><p>L&#249;, Xing Han, Gaurav Kamath, Marius Mosbach, and Siva Reddy. &#8220;<a href="https://arxiv.org/abs/2506.10953">Build the Web for Agents, Not Agents for the Web</a>.&#8221; arXiv.org, 2025. https://arxiv.org/abs/2506.10953.</p><p>&#8204;</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Trace LLM workflows at your app's semantic level, not at the OpenAI API boundary]]></title><description><![CDATA[The data architecture for provider-agnostic reproducibility and experiments with LLM agents and workflows]]></description><link>https://engineeringideas.substack.com/p/trace-llm-workflows-at-your-apps</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/trace-llm-workflows-at-your-apps</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Fri, 11 Jul 2025 06:35:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>"<a href="https://hackernoon.com/stop-prompting-start-engineering-15-principles-to-deliver-your-ai-agent-to-production">Stop Prompting, Start Engineering: 15 Principles to Deliver Your AI Agent to Production</a>" by Vladyslav Chekryzhov deserves far more attention than it has received. As a practicing AI engineer, I can tell the article is born from hard-won experience in production. 
The advice and checklists in that article are well worth following.</p><p>In this post, I want to <strong>discuss more concretely the implementation of some of the principles from Chekryzhov's article, and the resulting data architecture required for LLM agent/workflow apps to embody these principles.</strong> Consequently, I also share my thoughts on AI tracing/observability SaaSes such as LangSmith, Helicone, Langfuse, Arize, etc. TLDR: I think they are ultimately not suited for <em>provider-agnostic</em> reproducibility and experiments that rapid AI agent and workflow development demands.</p><h3>For predictable latency and reliability, workflows have to support many providers and models and be able to switch them at any step</h3><p>Here I quote the "<strong>3. Model as Config</strong>" section from Chekryzhov's article:</p><blockquote><p><strong>Problem:</strong> LLMs are rapidly evolving; Google, Anthropic, OpenAI, etc. constantly release updates, racing against each other across different benchmarks. This is a feast for us as engineers, and we want to make the most of it. Our agent should be able to easily switch to a better (or conversely, cheaper) model seamlessly.</p><p><strong>Checklist:</strong></p><ul><li><p>Model replacement doesn't affect the rest of the code and doesn't impact agent functionality, orchestration, memory, or tools</p></li><li><p>Adding a new model requires only configuration and, optionally, an adapter (a simple layer that brings the new model to the required interface)</p></li><li><p>You can easily and quickly switch models. Ideally&#8212;any models, at minimum&#8212;switching within a model family</p></li></ul></blockquote><p>I agree with this. 
What's more, if you develop a user-facing application (such as a chat bot or a support bot) for which latency is important, <strong>going fully multi-model, multi-provider is a must.</strong></p><p>Here's an incomplete list of ways a specific model or provider can fail at a specific agent/workflow step:</p><ul><li><p>The provider has an outage. (Hello, Anthropic!)</p></li><li><p>The provider has blocked/censored/refused your request because it got triggered by something in your dialogue, oftentimes actually benign. (Hello, Gemini!)</p></li><li><p>The model fails to return a valid response schema or a correctly formatted tool call. (Everyone, including OpenAI, but especially everyone else.) Sometimes the response is seemingly cut halfway, in which case retrying with the same provider will usually help. But often, the response just has a wrong schema/format, in which case retries may <em>not</em> help, even with non-zero temperature. In theory, a non-zero temperature setting <em>should</em> make the model&#8217;s outputs more variable, but in my experience, it sometimes doesn&#8217;t.</p></li><li><p>The model randomly failed tool call formatting and instead output the calls as code or XML tags in the content. (<a href="https://discuss.ai.google.dev/t/gemini-2-flash-api-returns-raw-markdown-instead-of-function-call/71964/3">Hello</a> <a href="https://github.com/RooCodeInc/Roo-Code/issues/4203">again</a>, Gemini.)</p></li><li><p>The model randomly hallucinated specific details that <em>must</em> be right, such as IDs that should have been tool call parameters and that the LLM should have picked up from the context, but didn't. 
Of course, models hallucinate all the time, but in these cases the hallucination is easily verifiable (e.g., the ID doesn't appear in the context) and hence the request should be retried, preferably with a different model.</p></li><li><p>The provider updated something small about the exact response format, structured outputs/tool calls schema processing, thinking/reasoning, or the like, and this caused one of the insufficiently flexible layers between your app's semantics and the provider to break. These layers can include: proxy services like OpenRouter, client libraries like LiteLLM or the OpenAI client, AI agent/tool call harnesses like PydanticAI, or your internal LLM call integration layer.</p></li><li><p>When using a proxy service such as OpenRouter, their updates can also silently break with certain providers or models in certain cases (structured outputs, tool calls, streaming, thinking) or their combinations.</p></li><li><p>You get a burst of usage and start hitting the throughput limits through one of your insufficiently "beefy" accounts. The remedy for this is proxying everything through OpenRouter, but there are many reasons why this probably does more harm than good: OpenRouter's own bugs (sometimes not circumventable), extra latency, the single point of failure, etc.</p></li><li><p>Even when the provider doesn't have an outage overall, they sometimes put certain requests "on hold", in which case it may take them minutes even to start streaming you tokens back. Everyone who has programmed with AI has experienced these lags.</p></li><li><p>If the agent step uses reasoning, it may randomly "overthink" a request, thinking for minutes on end on a relatively simple task. This could be almost as bad as a failure for latency-sensitive AI apps.</p><ul><li><p><em>Exact</em> circuit breakers won't help because the agent's thinking is not <em>exactly</em> repeated in a loop, merely <em>highly</em> repeated. 
"Fuzzy reasoning circuit breakers" <em>would</em> help, but this requires the next level of harness sophistication: an LLM monitoring the reasoning stream. It is very hard to implement and adds to costs, and streaming LLM responses are their own can of bugs and worms, <em>especially</em> with OpenRouter and LiteLLM.</p></li><li><p>Setting hard thinking/reasoning token limits (supported only by Anthropic at the moment, AFAIR?) or <code>"reasoning_effort": "low"</code> usually helps, but is often undesirable: we <em>want</em> to let the LLM (provider) decide how much to think somewhat on its own, based on the difficulty of the request. Requests that genuinely require longer thinking do happen.</p></li></ul></li></ul><p>I used to think that a primary and a single fallback model (for any specific workflow/agent step) were sufficient. In practice, we found that <strong>a chain of at least six(!) model configs was the minimum for</strong> <strong>&gt;99% success rate and latency below threshold, to safeguard against both prolonged and stray, one-off issues across different providers, accounts, and models.</strong></p><p>For example, in the application that I develop, the core reasoning/chat step currently has the following fallback chain of model configs:</p><ol><li><p>openrouter/google/gemini-2.5-pro-preview</p></li><li><p>openrouter/google/gemini-2.5-flash-preview:thinking</p></li><li><p>openrouter/openai/o4-mini</p></li><li><p>bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0 -- lower than Gemini Flash and o4-mini because it's often slow, and the response latency is important for our application.</p></li><li><p>gemini/gemini-2.5-pro -- same as the first model config in the chain, but directly through the Gemini API rather than through OpenRouter.</p></li><li><p>gemini/gemini-2.5-flash</p></li></ol><h3>Implementation challenge: providers' LLM interfaces are just different and hardly convertible between each other</h3><p>While I criticise OpenRouter and LiteLLM 
above, and they have <a href="https://github.com/BerriAI/litellm">very convoluted code</a> (OpenRouter is wisely closed source), I don't think it's because these proxy layers are built by bad engineers.</p><p>Rather, it reflects that there is <em>a ton</em> of <a href="https://engineeringideas.substack.com/p/intellectual-phases">essential complexity</a> in the task of conforming dozens of models and providers to the OpenAI API, which is a quasi-standard (and itself is a moving target, e.g. <a href="https://community.openai.com/t/reasoning-no-longer-available-in-api-responses/1116490">reasoning no longer available in API responses</a>). FWIW, <a href="https://standardcompletions.org/">Standard Completions</a> would largely remedy this, but it's a very distant future at the moment, <a href="https://github.com/standardcompletions/rfcs/issues/9">if it happens at all</a>.</p><p>When providers (rather than proxies such as OpenRouter who <em>focus</em> on this challenge) try to provide their own "OpenAI compatible" APIs, they often add bugs of their own. (Hello, Grok!) I'm yet to encounter a provider that is truly compatible with the OpenAI API.</p><p>The above challenges of transforming OpenAI-format requests into provider/model-specific requests, and then provider/model-specific responses into OpenAI-format responses, are hard enough, but transforming <em>between</em> provider/model-specific formats (OpenAI, Anthropic, Gemini, Bedrock) <em>across the feature matrix</em> (chat structure, structured outputs/response schemas, tool calls, reasoning, streaming) is practically impossible. 
We've encountered:</p><ul><li><p>Some models don't support a dedicated <code>"system"</code> prompt; the chat has to start with a <code>"user"</code> message.</p></li><li><p>Gemini <em>requires</em> a message's content to have separate <a href="https://ai.google.dev/api/caching#Content">parts</a> for efficient caching, while other "OpenAI compatible" providers are confused by multiple message parts.</p></li><li><p>Some providers <em>prohibit</em> consecutive <code>"assistant"</code> messages in a chat, requiring dummy empty <code>"user"</code> messages to be inserted between them.</p></li><li><p>Some providers prohibit an <code>"assistant"</code> response (content) and a tool_call in the same message, requiring artificially splitting responses from other providers that contain both <code>content</code> and <code>tool_calls</code>. (Obviously, this directly conflicts with the previous point.)</p></li><li><p>Response/structured outputs schemas: <a href="https://github.com/googleapis/python-genai/issues/460">what in the ever living f*** is this shambles Google...?</a></p></li><li><p>Some otherwise good providers and models don't support "native" reasoning, and therefore chain-of-thought reasoning has to be specifically prompted, and then the <code>&lt;thinking&gt;...&lt;/thinking&gt;</code> tags cut out of the response (see the <a href="https://docs.aws.amazon.com/nova/latest/userguide/prompting-chain-of-thought.html">official AWS user guide</a>). 
But wait, if you need structured outputs, it can't be <code>&lt;thinking&gt;...&lt;/thinking&gt;</code> tags; you will need to modify your schema to insert a <code>"general_thinking":</code> field...</p></li></ul><p>Even more generally, different models work best with different prompts (and less capable models outright require more specific prompting), potentially leading to a matrix (agent/workflow step, provider/model) of system prompts and context formats (such as, for the chat summarisation step, keeping the message list as is vs. condensing the message list in its entirety into a single prompt).</p><h3>Trace AI workflows at the application's semantic level</h3><p>All the incompatibilities between and specialisations for different LLM providers and models mean that to build for reproducibility and experiments with model- and provider-agnostic AI agents and workflows with heterogeneous steps (tool calls, structured outputs, reasoning, and streaming) we must <em><strong>trace workflows at the application's semantic level</strong></em><strong>, not at the lower-level "OpenAI-ish" API boundary,</strong> <strong>and </strong><em><strong>shape the request for the specific provider and model at runtime</strong></em><strong>.</strong> We can see this as "<a href="https://en.wikipedia.org/wiki/Late_binding">late binding</a> with providers' APIs and specific models' demands".</p><p>The application's data model <em>can</em> include abstractions like "message" and "role" as in the LLM chat APIs, if the application is genuinely a chat (e.g., in a support chat bot), but it shouldn't be limited or constrained to them.</p><p>For example, if some entities are pulled into the LLM context by IDs (e.g., via search), for LLMs to see them the data of these entities should appear somewhere in the system prompt or chat messages' content as plain text. 
If the workflow trace is captured at the "OpenAI-ish API/formats boundary", as is the case for most LLM tracing and observability services, this contextual data is "fossilized" within the trace. When debugging such a trace, we can't easily tell whether it was a data quality problem that may have been resolved independently from the AI workflow logic after the workflow took place in production, or an actual LLM hallucination that failed the workflow.</p><p>Another benefit of this "late binding" approach to tracing of AI applications over LLM observability SaaS is that it <strong>doesn't need to store requests for every LLM response</strong> because requests can always be re-created from the data. For chat-like applications, the storage overhead for long-running chats becomes quadratic, as every new turn requires storing the entire history again.</p><h3>Use immutable data schema/design to unify application's persistence and tracing</h3><p>If the application's trace should be kept at the same (semantic) data level at which the core application's logic operates, it becomes clear that they shouldn't be stored separately: the "trace database" can be <em>just</em> the production database(s) that use immutable data schema/designs.</p><p>Immutable data designs are quite simple to implement with all types of databases:</p><ul><li><p>Relational OLTP databases like PostgreSQL and MySQL have <a href="https://wiki.postgresql.org/wiki/Temporal_Extensions">temporal extensions</a> or built-in temporal tables features. Depending on the scale of the application (and how long in the past you would need to keep traces), this may already be good enough. Otherwise, it's possible to set up change data capture into a database in the next category, or better yet, just pick one as the primary storage from the start. 
Remember that LLM applications would never notice a millisecond difference in point query latency between OLTP and OLAP databases.</p></li><li><p>In OLAP, time-series, and streaming databases, such as ClickHouse, Databend, Apache Doris, StarRocks, Firebolt, TimescaleDB/TigerData, <a href="https://questdb.com/glossary/immutable-data-pattern/">QuestDB</a>, or RisingWave, the combination of the native <code>time</code> column + entity (record) ID identifies the immutable version of the entity's data at that point in time, which can be linked from other tables and databases.</p></li><li><p>Graph and document databases like Neo4j and ScyllaDB naturally lend themselves to graph- or chain-like data versioning with copy-on-write "head" entity updates, a la Git.</p></li><li><p>Preprocessed tabular data is stored in Hive catalogs or the modern alternatives: Apache Iceberg, Apache Hudi, or Delta Lake, which have built-in table versions that can simultaneously serve as the versions for the entities stored in these tables.</p></li></ul><p>The most elegant solution, however, would be to use the actual "database inside out" -- <a href="https://redplanetlabs.com/programming-model">Rama</a>, or "immutable databases" like <a href="https://github.com/xtdb/xtdb">XTDB</a> for the core of the AI workflow logic. I would recommend these for greenfield AI workflow applications if they are not ruled out by the organisation's technical strategy that may prescribe selecting from a certain list of databases that are already in use in the organisation (cf. <a href="https://lethain.com/magnitudes-of-exploration/">magnitudes of exploration</a>).</p><p>In our chat application, we store the user interaction session's data in a single value (keyed by the session ID) in DynamoDB, with atomic updates to prevent races with async messages from the user. To simulate immutability, all changes to the session are stored in the same document, in a separate "revision_history" field. 
Other pieces of the semantic data in our application are also stored in DynamoDB and in MySQL.</p><h3>Connecting the semantic data traces with LLM responses</h3><p>If an OLAP database is already used to store the semantic trace of the application, it's best to store LLM responses in a separate table (or multiple tables, one per workflow step type/kind) in the same database, to simplify coding your <a href="https://hamel.dev/blog/posts/evals-faq/#q-what-makes-a-good-custom-interface-for-reviewing-llm-outputs">custom evals interfaces</a>.</p><p>Otherwise, I think <a href="https://docs.victoriametrics.com/victorialogs/">VictoriaLogs</a> is optimal due to its efficiency, operational and configuration simplicity (no need to set up search indexes for every column! no need for an "ingestion pipeline"!), built-in automatic "flattening" of LLM responses (JSONs), and the built-in analytics console.</p><p>Since all providers already send responses with request IDs, there is no need to re-define these IDs for the tracing tables.</p><p>In addition to the raw LLM response fields from the provider, the table includes the workflow ID (which is the same as the "trace" ID) and the entity IDs (pointers to immutable versions of these entities) in the semantic data model that were used to construct the context (request) for this LLM call.</p><p>The request IDs whose responses <em>have directly contributed</em> to the creation of a version of the application's semantic entity can be added into array column(s) in the table row (or fields in the document in a NoSQL db) that represents this entity version.</p><p>LLM requests "contribute to the creation of the version of the entity" not only by literally generating chat messages, structured outputs, and whatnot that end up <em>constituting</em> the entity's data, but also by directing conditional logic:</p><ul><li><p>Pre-generation: fast LLM classification and/or pre-filtering of the user input or external 
events</p></li><li><p>Post-generation: guardrails, checking that the LLM didn't "forget its role" in the conversation.</p></li></ul><p>If a post-generation guardrail rejects an output of a model and the workflow step is retried (and succeeds) with a different provider or model, that original rejected LLM response <em>also</em> contributes to the new entity version, being the input for the LLM guardrail call that gave way to the eventually successful LLM response.</p><p>Storing LLM request IDs in the semantic data tables as "denormalised" metadata is much easier to program than doing the opposite, storing the IDs of the data that the LLM requests have contributed to. Denormalisation is not an issue because both the LLM response storage and the entities are immutable.</p><p>Still, collecting the IDs of all contributing LLM calls at the point of writing down the entity version can become quite burdensome without <a href="https://docs.python.org/3/library/contextvars.html">contextvars</a> (in Python) or their equivalents in other programming languages. Passing LLM request ID "bags" in <code>contextvars</code> between threads (rather than between coroutines) in Python is possible with a <a href="https://stackoverflow.com/a/71778109/648955">thread pool executor wrapper</a> that should be used throughout the application code. 
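A minimal sketch of this pattern (names are illustrative, not from the post): a <code>ContextVar</code> holds the bag of contributing request IDs, and an executor wrapper copies the submitter's context into worker threads so that appends there land in the same bag.

```python
import contextvars
from concurrent.futures import ThreadPoolExecutor

# Bag of LLM request IDs contributing to the entity version being built.
# Set to a fresh list at the start of each workflow step.
llm_request_ids: contextvars.ContextVar[list] = contextvars.ContextVar(
    "llm_request_ids")

def record_llm_call(request_id: str) -> None:
    # Called wherever an LLM response is received, on any thread.
    llm_request_ids.get().append(request_id)

class ContextPropagatingExecutor(ThreadPoolExecutor):
    """Runs each task inside a copy of the submitter's context, so
    ContextVar lookups in worker threads find the same list object
    (the copy shares the list by reference, so appends are visible
    back in the submitting thread)."""

    def submit(self, fn, /, *args, **kwargs):
        ctx = contextvars.copy_context()
        return super().submit(ctx.run, fn, *args, **kwargs)

llm_request_ids.set([])  # a new workflow step begins
with ContextPropagatingExecutor(max_workers=2) as pool:
    pool.submit(record_llm_call, "req-abc123").result()
print(llm_request_ids.get())  # all contributing IDs, ready to denormalise
```
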
This reinforces the importance of the keystone principle from Chekryzhov's article: "<strong>Own the Execution Path</strong>".</p><div><hr></div><p>In a follow-up post, I'll describe my experience building a custom trace reproducibility/debugging/evals interface using <a href="https://marimo.io/">marimo</a>.</p>]]></content:encoded></item><item><title><![CDATA[Personal agents]]></title><description><![CDATA[Motivation]]></description><link>https://engineeringideas.substack.com/p/personal-agents</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/personal-agents</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Tue, 17 Jun 2025 01:40:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Motivation</h3><p>I believe that the most important factor in whether our AI future goes broadly well or poorly is whether people quickly develop effective AI-ready (and AI-enabled) institutions and networks<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. In that, I agree with S&#233;b Krier's recent essay "<a href="https://www.aipolicyperspectives.com/p/maintaining-agency-and-control-in">Maintaining agency and control in an age of accelerated intelligence</a>".</p><p>Many academic groups, non-profit orgs (such as <a href="https://www.cip.org/">Collective Intelligence Project</a>, <a href="https://ai.objectives.institute/">AI Objectives Institute</a>, <a href="https://metagov.org/">Metagov</a>, and <a href="https://www.gaia-lab.de/">Gaia Lab</a>), and even some governmental agencies, such as Taiwan's Ministry of Digital Affairs, are currently working on new AI-ready institutions. 
However, these projects will likely remain theoretical exercises or prototypes unless there is a population of AI-enabled agents (individuals and organisations) eager to coordinate and solve problems together. This is because <a href="https://engineeringideas.substack.com/i/140971628/advanced-organisations-and-institutions-need-each-other">agents and institutions need each other to develop and grow in capability and sophistication</a>.</p><p>Thus, for new AI-enabled institutions to take root and develop, individuals and organisations have to be at least as AI-ready as these institutions.</p><p>Effective use of powerful AI and participation in new economic networks (such as stablecoin payments) obviously promise a lot of advantages to businesses. So, the AI modernisation of the business sphere is already well aligned with the standard economic incentives. It doesn't seem to me that this area needs any extra care or push on the margin.</p><p>However, for individuals, such incentives almost don't exist. Using AI tools at work is not the same as becoming a person ready to participate in AI-first social, political, and media networks (such as Jim Rutt's idea of the network of personal "<a href="https://jimruttshow.blubrry.net/the-jim-rutt-show-transcripts/transcript-of-ep-238-sam-sammane-on-humanitys-role-in-an-ai-dominated-future/">information agents</a>").</p><p>Currently, people mostly use siloed commercial AI apps from OpenAI, Google, Microsoft, and Perplexity. Although all these and other vendors will soon push agentic AI products aggressively, I suspect that these vendors will be reluctant to permit free exploration of the social or political agency of users because this could be politically risky for them and there is no commercial upside for them in doing this<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. 
So, it's likely that big vendors' agents will keep interacting with the external world on behalf of the users in mostly mundane commercial ways ("plan my next holiday trip") rather than become true companions and faithful representatives of the people in social and political domains, such as setting up a date for the human, recommending a friend, <a href="https://aligned.substack.com/p/a-proposal-for-importing-societys-values">representing the human in a political assembly</a>, and <a href="https://www.linkedin.com/feed/update/urn:li:activity:7334738981017800705/">enabling new types of collaboration between people</a>.</p><p><strong>From this I conclude that increasing the adoption of truly personal agents could be one of the highest-impact things to do on the margin to enable social, political, or media innovation.</strong></p><p>If the above is not a sufficient argument, the wider adoption of personal agents has more positive effects and indirect arguments for working on it:</p><p>(1) It <strong>reduces the power imbalance between people and corporations</strong>: people save money that doesn't go to corporations as subscription revenue. 
There are no or fewer deplatforming risks, as well as the "traditional" risks of surveillance capitalism and <a href="https://www.amazon.com/Hooked-How-Build-Habit-Forming-Products/dp/1591847788">behaviour manipulation</a> pervasive in the so-called <a href="https://en.wikipedia.org/wiki/Attention_economy">attention economy</a>.</p><p>(2) Individual human intelligence and agency augmentation is almost by definition the most anti-<a href="http://gradual-disempowerment.ai/">gradual disempowerment</a> and anti-<a href="https://intelligence-curse.ai/breaking/">intelligence curse</a> agenda among various other AI safety and "AI for good" agendas.</p><p>(3) Making it easier for people to run their fully personal agents for non-commercial affairs (socialisation, politics, commoning) that should preferably stay this way is a <a href="https://michaelnotebook.com/optimism/">non-market safety</a> project, so we should expect it to be more neglected by default than market safety projects, such as <a href="https://exa.ai/search?q=ai+guardrails+and+ai+safety+startup&amp;filters=%7B%22numResults%22%3A12%2C%22domainFilterType%22%3A%22include%22%2C%22type%22%3A%22auto%22%2C%22text%22%3A%22%5C%22true%5C%22%22%2C%22density%22%3A%22compact%22%2C%22useAutoprompt%22%3Atrue%2C%22resolvedSearchType%22%3A%22neural%22%7D&amp;resolvedSearchType=neural">improving AI robustness or steerability</a>. 
Of course, to be actually useful and widely adopted, personal agents <em>must</em> be robust and steerable, have long memory, and possess other characteristics equally attractive for business AI agents, but these capabilities are already actively developed in the open source AI agent frameworks (driven by the business demand), so personal agents can leverage these capabilities without differentially pushing them much.</p><p>Finally, note that the increasing capability of open-weights LLMs and compute becoming cheaper over time will make personal agents even more relevant in the future, because completely private agents will be enabled through the inference of open-weights models on rented GPUs. Today, by contrast, open-weights non-MoE models are not sufficiently robust as agents and are not sufficiently "deep" for thoughtful engagement with the human, which limits the practicality of such "completely private" setups (or greatly increases their costs, if someone is willing to host the largest DeepSeek, Qwen, or Llama models all on their own). Also, the increasing robustness of coding agents and DevOps agents (whether they are built over APIs or open-weights LLMs) will itself reduce the crucial barriers to the adoption and usage of personal agents, as I will discuss below.</p><h3>Personal agents offer mundane value</h3><p>The personal agents "movement" would descend ideologically from the self-hosted movement, which promotes and advocates for personal, private hosting of apps such as e-mail, calendar, task tracker, photos and file sharing, instead of using free cloud services by Google, Microsoft, or Apple.</p><p>It's safe to say that the self-hosted movement has failed: it hasn't gained sufficient traction for about 20 years. 
I think this is because self-hosting of office apps doesn't actually provide any benefits beyond the ideological satisfaction and the reduction of poorly felt risks of deplatforming, hacking, or data leaks.</p><p>I'm convinced that good intentions and mundane benefits could go much farther than just good intentions, and AI adoption is no different. It's hopeless to promote personal AI agents that are "safer" or "more private" but otherwise are equally or less useful than the agents offered by big vendors.</p><p>Fortunately, I think personal agents have better a priori odds of being adopted than self-hosted productivity apps of the previous era because personal agents do offer immediate and tangible value over agents from big vendors:</p><p><strong>Lower cost.</strong> Flat pricing (usually, 20$/mo) is inadequately high for most personal users, yet virtually all AI vendors employ flat pricing: from general platforms such as OpenAI, Google, Microsoft, and Perplexity, to personal AI tutors and psychotherapists such as <a href="https://auren.app/">Auren</a>, to professional agents such as Shortwave and Cursor. I may want to talk to my AI psychotherapist just once per month, and the cost of compute will be less than 10 cents. <em>All</em> personal agents can be run for just 2-3$/mo on data and app hosting + model inference API costs, which will be well below 10$/mo for most people across all their AI agents and apps.</p><p><strong>International availability.</strong> OpenAI and many other AI vendors use Stripe for payment processing in limited configurations, so that people without Visa or MasterCard cards cannot pay for their services. 
There are a lot of such people in developing countries.</p><p><strong>Unified context and usage history (memory).</strong> People often talk to AIs across several different vendors, partially because they want to compare results from different base LLMs (and each big vendor ties up the apps with their models) and partially because no vendor offers all the agents and apps that the users want. It's impossible to search, query, or reference this tapestry of usage traces. The personal agent platform eliminates this problem by storing <em>all</em> conversation and query history in a single memory layer such as <a href="http://github.com/topoteretes/cognee">Cognee</a>. Of course, the user could also maintain context boundaries by attaching different agents and apps to different memory system instances.</p><p><strong>Customisation.</strong> Want some agent to send you a notification every other day? The deep research agent to ignore results from a certain domain or author? Exchange certain information with your family's or friends' agents in specific situations? The coding agent should be able to do this quite reliably with a single prompt against the stock open-source version of the specific agent. By the end of 2025, coding agents should become so capable that non-programmers can rely on such customisation to work (and to warn them if they ask for something suspicious) without knowing anything about the source code of the agent they want to customise.</p><h3>Risks</h3><p>I'm aware that the benefits of the personal agent platform that I mentioned above come with their own risks. For example, keeping all personal agents on a single hosting account increases the blast radius if this account is stolen. Or, the "vibe coding agent" can introduce a vulnerability into the code or simply break it in a subtle way.</p><p>These risks seem like the only notable downsides of personal agents, both as a personal choice and as an agenda. 
I'm still advocating for a wide adoption of personal agents because the benefits seem to outweigh the risks. I also expect that the state of mundane computer security and LLM security (such as against jailbreaks and prompt injection) will get better rather than worse in the next couple of years. If you think I'm wrong about either of those, please let me know.</p><p>The simplest and probably the most likely risk with the deployment of personal agents is not clever prompt injections on the web pages that the agent reads, nor a vulnerability introduced accidentally when the human asks the coding agent to customise another app or agent, but the voluntary deployment of agents with malware, downloaded from "agent sharing" websites (perhaps, the next generation of shareware websites) or untrusted GitHub repositories.</p><p>It's simple to play an active positive role in mitigating this specific risk, as well as increasing the trust in personal agents overall and thus fostering their adoption: <strong>create a directory of vetted agent repositories (and the specific versions and commits within them) and continuously scan them for vulnerabilities with SoTA AI for code security.</strong> It is economical to do this necessary work with resource pooling, and perhaps with public or institutional funding, to reduce the risks for everyone.</p><h3>Levers for fostering the adoption of personal agents</h3><p>To summarise the above, here's how I see the main areas of work for helping personal agents spread:</p><p><strong>(1) Make open source agents more capable and useful at their main tasks than the analogous agents from big vendors.</strong> Open-source agents are at a disadvantage because they will perhaps be using stock LLMs with prompting, rather than LLMs post-trained specifically for the given agentic tasks. 
However, for most personal use cases, the difference may be small or non-existent, especially as the capabilities of the stock models increase.</p><p>(2) Help the open-source agent development ecosystem flourish by reducing the barrier to entry into this kind of development, through <strong>agent project templates, scaffolds, and a tested stack of infrastructure pieces (hosting platforms, databases, execution environments, etc.).</strong> <a href="https://github.com/AgentOps-AI/AgentStack">AgentStack</a> is an example of such a project; however, it's not focused on personal agents. Ideally, agent developers themselves start to seek compatibility with the <em>personal agents stack</em> (platform, toolkit) because it will aid their distribution, while the personal agents platform benefits from a wider variety of supported agents.</p><p><strong>(3) Eliminate or reduce the barriers to the adoption of personal agents, both technical and financial/jurisdictional.</strong> This is a crucial learning from the failure of the self-hosted movement: <strong>non-programmers must be able to set up their own personal agents platform with dead-simple, short, step-by-step instructions</strong>, ideally just to the point of running the manager agent that walks the human through the rest of the process in a dialogue and helps the user to maintain, evolve, and customise their agents. This process should work robustly enough that it gains a reputation of "just working" among non-programmers. <a href="https://brew.sh/">Homebrew</a> comes to mind as an example of a project with such a reputation.</p><p>Wrt. financial and jurisdictional barriers, reduce the number of separate payments the human needs to make and accounts to manage. 
<a href="https://openrouter.ai/">OpenRouter</a>, <a href="https://requesty.ai/">Requesty.ai</a>, and <a href="http://nano-gpt.com/">nano-gpt.com</a> do a great job at unifying LLM API bills (as well as enabling access), but none of them supports any embedding models (on the other hand, embedding-based RAG has recently been falling increasingly out of favour among AI agent developers). <a href="https://www.litellm.ai/#pricing">LiteLLM</a> does support embedding models, but doesn't onboard small customers yet.</p><p>Ideally, there should be a way to unify both model API and hosting bills, but unfortunately it doesn't currently seem to me that the best hosting services for the personal agent platform (such as <a href="http://fly.io/">Fly.io</a>) will be eager to enter this LLM proxy business, because it will be an unnecessary risk for them. It would be awesome if Fly.io or a similar hosting service (Digital Ocean, Vercel, Render, etc.) proves me wrong.</p><p>So, probably two separate bills (and accounts) are the minimum achievable today (with embedding model inference rationed through the Gemini Embedding API's free usage limit, for instance).</p><p>(4) <strong>Support distribution and discoverability of personal agent projects</strong>. Currently, it's surprisingly hard to even discover the coolest open-source agent projects on the block, despite them instantly amassing thousands of stars on GitHub. Perhaps I don't hang around the right Discord channels or subreddits, but doing either of these things already sounds like a deal breaker if we aim for a really wide adoption. 
<a href="https://huggingface.co/spaces">HuggingFace Spaces</a> and <a href="https://theresanaiforthat.com/">theresanaiforthat.com</a> might be good enough, so ideally these and similar platforms would add a tag for projects compatible with the personal agents toolkit.</p><p>(5) <strong>Create a directory of open-source agents scanned for malware and vulnerabilities</strong> (see the "Risks" section above) to minimise the chance of a major hack that can undermine people's trust in personal agents.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://engineeringideas.substack.com/p/gaia-network-an-illustrated-primer">Gaia Network</a> is one particular form of such network/institution that Rafael Kaufmann and collaborators in the <a href="https://www.gaia-lab.de/">Gaia Lab</a> have been shaping up.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>With a possible exception of Meta, whose positioning and business model writ large may, and ideally should be compatible with enablement of new social institutions. 
However, in practice, <a href="https://thezvi.substack.com/p/zuckerbergs-dystopian-ai-vision">it's much more likely that Meta will move in the exact opposite direction</a>: atomisation of people, which can usually be monetised more easily.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Architecture theory and the hourglass model]]></title><description><![CDATA[Everyone is talking about AI agent architectures, frameworks, and protocols at the moment.]]></description><link>https://engineeringideas.substack.com/p/architecture-theory-and-the-hourglass</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/architecture-theory-and-the-hourglass</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Wed, 08 Jan 2025 12:19:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kG-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Everyone is talking about AI agent architectures, frameworks, and protocols at the moment. Let me apply John Doyle&#8217;s <em>architecture theory<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></em> lens to this topic, as well as Micah Beck&#8217;s <em><a href="https://cacm.acm.org/research/on-the-hourglass-model/">hourglass model</a></em> (2019).</p><p>In this first post of the series, I present the key ideas from the architecture theory and the hourglass model. In the next post, I will apply these ideas in the domain of AI agents.</p><h3>The distinction between system <em>levels</em> and <em>layers</em></h3><p>The two core concepts in John Doyle&#8217;s architecture theory are <em>levels</em> and <em>layers</em>. 
They are distinct, and it&#8217;s important not to confuse them.</p><h4>Levels</h4><p><strong>Levels</strong> are the levels of modelling (coarse-graining, abstraction, renormalisation, weak emergence, interpretation) of a system. At different levels of modelling of the same system, the whole ontology and the types of variables in the dynamical model of the system change.</p><p><strong>Theories</strong>, <strong>ontics</strong>, <strong>ontologies</strong>, and <strong>spatiotemporal scales</strong> are among the loose synonyms for &#8220;levels&#8221; in this sense, or concepts associated 1-1 with them: for example, it could be said that a certain scientific or normative <em>theory</em> with a specific <em>ontic/ontology</em> describes <em>and defines</em> a specific system <em>level</em>.</p><p>Marr&#8217;s &#8220;<a href="https://www.albany.edu/~ron/papers/marrlevl.html">levels of analysis</a>&#8221; of a cognitive system (implementation, algorithm, and semantics) are an example of &#8220;Doyle&#8217;s levels&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>. You can describe the human brain/mind, as a physical object, on many <em>levels</em>: molecular dynamics, neuronal dynamics, <a href="https://en.wikipedia.org/wiki/Large-scale_brain_network">brain network</a>, circuit, and <a href="https://en.wikipedia.org/wiki/Two-streams_hypothesis">processing stream</a> dynamics, Hawkins&#8217; <a href="https://engineeringideas.substack.com/i/44683968/mental-reference-frames-of-concepts-have-at-most-seven-features">reference frame</a> (i.e., inference states) dynamics, reasoning dynamics such as <a href="https://en.wikipedia.org/wiki/Intrapersonal_communication">internal monologue</a> aka. 
chain-of-thought, psychological and developmental dynamics, etc.</p><p>In a coherent description stack of a system (such as a human brain/mind), the theories behind all the levels should be woven together into an <a href="https://www.lesswrong.com/posts/opE6L8jBTTNAyaDbB/a-multi-disciplinary-view-on-ai-safety-research#3_4__Weaving_together_theories_of_cognition_and_cognitive_development__ML__deep_learning__and_interpretability_through_the_abstraction_grounding_stack">abstraction&#8212;grounding graph</a>. Beware, this could be a source of confusion: the word &#8220;level&#8221; may imply that there should be a total order between the levels of analysis of a system. However, there is probably no order, i.e., an asymmetric emergence relation, between the levels of brain network dynamics and psychological dynamics.</p><p>Within the context of architecture theory, it seems useless to distinguish &#8220;mere renormalisations&#8221; (such as a transition from neuronal to brain network dynamics) from <a href="https://plato.stanford.edu/entries/model-theory/">model-theoretic</a> <em>interpretations</em> that give rise to the levels <em>across</em> the semantic hierarchy, such as Marr&#8217;s implementation &#8594; algorithm &#8594; semantics transitions. 
So, below in this post I will ignore these distinctions.</p><h4>Layers</h4><p>In Doyle&#8217;s terminology, <strong>layers</strong> are nested and/or compartmentalised <strong>(sub)systems</strong> (<strong>components</strong>, <strong>parts</strong>), i.e., <em>groups of variables (atoms, elements, objects)</em>, separated by Markov blankets (boundaries).</p><p><strong>Beware of the terminological confusion</strong>: in the fields of <em>network systems engineering</em> (and, more narrowly, <em>protocol engineering</em>), where Micah Beck&#8217;s <a href="https://cacm.acm.org/research/on-the-hourglass-model/">hourglass model</a> and the related <a href="https://en.wikipedia.org/wiki/End-to-end_principle">end-to-end principle</a> come from, the term <em>layer</em> is used to refer to [Doyle&#8217;s] <em>level</em>. Hence, the <a href="https://en.wikipedia.org/wiki/List_of_network_protocols_(OSI_model)">internet protocol layers</a> in the OSI stack would be called <em>levels</em> by Doyle. To minimise confusion, I&#8217;ll avoid using the term <em>layers</em> and will use other synonyms: <em>subsystems</em>, <em>components</em>, <em>parts</em>, <em>compartments</em> instead.</p><p>The phrase &#8220;system level&#8221; may also be confusing because it may hint at Doyle&#8217;s layer when the system boundaries fully nest within each other, such as the egg yolk within the whole egg within its shell. So, I will call <em>levels</em> simply &#8220;levels&#8221;, or <em>modelling/abstraction/renormalisation levels</em>.</p><p>The way the dynamic variables at a certain level are compartmentalised into subsystems (by drawing &#8220;imaginary&#8221; system boundaries/Markov blankets) is itself a subject of inference for the observer (aka. modeller, rational agent, scientist). 
Different ways to &#8220;slice up&#8221; the &#8220;whole modelling field&#8221; at the given level (aka &#8220;the whole system&#8221;) into (sub)systems could be more or less useful for specific practical goals that the observer has.</p><p>For example, when analysing the brain on the level of neuronal dynamics, neuroscientists may group neurons into subsystems in different ways, such as the cortical <em>layers</em>, cortical <em>columns</em>, or neuronal circuits, and then hypothesise the emergent properties of these subsystems and thus move to a higher <em>level</em> of analysis.</p><h3>Diversity hourglass architecture</h3><p><strong>Diversity hourglass</strong> is an architecture (a class of systems) with a particular pattern on three successive modelling <em>levels</em>: some model diversity on the lowest level, very little model diversity on the middle level, and the most diversity on the top level.</p><p>The middle, low-diversity level is called the <em>spanning layer</em> in the network systems and protocol engineering literature. Below, I&#8217;ll simply call it the &#8220;middle level&#8221;.</p><p>Also, the &#8220;hourglass&#8221; metaphor refers to two different aspects of this architecture (see more on this in the next section): (1) the low diversity of the alternative models/theories on the middle level, and (2) the relative <em>weakness</em>, <em>simplicity</em>, and <em>generality</em> of these model(s) on the middle level. 
As the network engineering literature focuses more on the second notion, it doesn&#8217;t use the &#8220;diversity&#8221; modifier and calls this architecture simply the <em>hourglass architecture</em> (<em>design</em>, <em>model</em>).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kG-6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kG-6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png 424w, https://substackcdn.com/image/fetch/$s_!kG-6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png 848w, https://substackcdn.com/image/fetch/$s_!kG-6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png 1272w, https://substackcdn.com/image/fetch/$s_!kG-6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kG-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png" width="284" height="227.68271954674222" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:566,&quot;width&quot;:706,&quot;resizeWidth&quot;:284,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kG-6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png 424w, https://substackcdn.com/image/fetch/$s_!kG-6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png 848w, https://substackcdn.com/image/fetch/$s_!kG-6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png 1272w, https://substackcdn.com/image/fetch/$s_!kG-6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F605c91f8-c5a7-41b5-a22b-e4f435b9e60f_706x566.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The depiction of diversity hourglass architecture, &#169; John Doyle</figcaption></figure></div><p>The prototypical example of diversity hourglass architecture is the modern computing ecosystem (PCs, phones, servers, etc.):</p><p>The <strong>hardware level</strong>&#8217;s objects are transistors, wires, electric signals, etc. 
This level is diverse: there are a lot of vendors, hardware specifications, and of course specific hardware products.</p><p>The <strong>operating system level</strong> vaguely starts from &#8220;computing platforms&#8221; (such as x86, ARM, or CUDA) with all their objects such as registers, cache levels, platform events, interrupts, and execution modes. They are followed by the &#8220;OS-proper&#8221; objects such as processes, threads, devices, memory maps, descriptors, and mounted file systems. The <strong>OS level is </strong><em><strong>not</strong></em><strong> diverse</strong> (relatively speaking): there are relatively few computing platforms and operating systems.</p><p>The <strong>software level</strong>&#8217;s objects and variables include those that the OS has decided to expose to the user space (and thus are shared with the OS level), such as processes, threads, memory maps, files, and more. This is a source of confusion. Of course, the software level also includes arbitrary objects, variables, and abstractions built on top of the OS-level objects and variables, such as elements of the execution models specific to programming languages (e.g., variables, data structures, types, channels, events, etc.), process containers, communication protocols, and more. Even higher, there are domain-level abstractions for specific industries, businesses or organisations, specific software projects within an organisation, specific scopes in a project&#8217;s source code, etc. Informally, the &#8220;software level&#8221; refers to all these finer levels (starting from the programming language and up) lumped together. The software level is the most diverse of all three, even more diverse than the hardware level.</p><h3>The findings of the architecture and the hourglass theories</h3><p>John Doyle makes statements about the diversity hourglass that could be seen as the main conclusions or &#8220;theorems&#8221; of his architecture theory. 
Micah Beck calls the main results of his <a href="https://cacm.acm.org/research/on-the-hourglass-model/">hourglass model</a> (theory) The Hourglass Theorem and The Deployment Scalability Tradeoff. In this section, I summarise these findings.</p><p><strong>The relatively low diversity in the middle level of the diversity hourglass is exactly what enables diversity on both the lower and the higher levels</strong>, provided there are evolutionary processes driving the diversity up in both the lower and higher levels. </p><p>Low diversity in the middle level <em>enables</em> the evolutionary processes in the lower and higher levels. I don&#8217;t remember if I saw a formal argument for this from Doyle or anyone else, but you can think of the following example: a uniform set of laws and regulations (&#8220;laws and regulations&#8221; being a <em>level</em> here, &#8220;a uniform set&#8221; meaning zero diversity) enables more businesses to evolve, apply themselves, and diversify in different product lines, geographies, customer demographics, etc.</p><p>However, the low diversity in the middle level is not by itself sufficient to enable diversification in the lower and higher levels. <em>The specific abstraction (theory, model, design) of the middle level matters.</em></p><p>First, the middle level&#8217;s <strong>weakness</strong> (also called <em>genericness</em> by Beck), as well as the low complexity/high <strong>simplicity</strong> of the model/abstraction/theory/ontology of the system on this level, enables more diversity on the lower levels because such a weak/simple model is simpler for the lower levels to <em>implement</em> (<em>support</em>, <em>enable</em>).</p><p>Note that weakness/genericness and simplicity are closely related concepts (I&#8217;m not even sure it makes much sense to distinguish between them), but <em>low diversity</em> is a categorically different thing. 
In the context of protocol engineering, low diversity refers to the fact that there is a single, &#8220;spanning&#8221; protocol that all other systems and protocols implement in the lower levels and use in the higher levels, whereas weakness/genericness and simplicity are possible properties of that spanning protocol itself.</p><p>Second, the abstractions and elements defined (entailed) by the middle level <em>for</em> the higher level may be more or less <strong>composable </strong>and<strong> recombinable</strong>, which will determine the &#8220;evolutionary breeding potential&#8221; on the higher level.</p><p>Beck combines composability with an extra, very informal property of &#8220;broad reach&#8221; or &#8220;broad applicability&#8221; and calls this combined property <strong>generality</strong> (nb. the difference from <em>genericness</em> mentioned above). Composability determines in part (but perhaps not in full) the &#8220;broadness of applicability&#8221; through the &#8220;computational power&#8221; reached by the middle level&#8217;s model. The ultimate ceiling here is Turing-completeness. 
It is reached by many programmatic abstractions, but of course not by many other real-life levels, such as law.</p><p>Another important property of the interface between the middle and the higher levels that Doyle emphasises is <strong>how prone it is to hacking or hijacking</strong> by viruses, parasites, and bad actors.</p><h3>Diversity hourglass&#8217;s benefits: diversity-enabled sweet spots and scalability</h3><p>Doyle proposes that<strong> the diversity of level models/theories and subsystem designs at the lower and higher levels</strong> in the diversity hourglass architecture (which is enabled by the low model/theory diversity at the middle level, and by <em>weak/generic</em>, <em>composable</em>, and <em>general</em> model design(s) at the middle level, as discussed above) <strong>enables combining heterogeneous subsystems (components, parts, layers) to achieve optimal properties for the whole system</strong>. </p><p>I will not justify the above statement here; please refer to Matni, Ames, and Doyle, 2024<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><p>For example, the evolution of human brains has combined fast but inaccurate &#8220;<a href="https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow">System 1</a>&#8221; inference with slow but accurate &#8220;System 2&#8221; reasoning to achieve optimally adaptive cognitive performance for humans in their environment. 
&#8220;System 1&#8221; and &#8220;System 2&#8221; here are thought to be two distinct <em>subsystems</em> on a certain level of brain modelling; the diversity of designs and hence operational characteristics is thought to be enabled by the low diversity and genericness of the lower &#8220;substrate&#8221; levels, such as the levels of neuronal dynamics and neuronal circuits.</p><p>Doyle calls these system designs, in which the whole system &#8220;takes the best&#8221; from its heterogeneous subsystems/components, <strong>diversity-enabled sweet spot (DeSS)</strong> [designs].</p><p>Beck and other authors in the field of network systems and protocol engineering focus more on the sociotechnical aspects and economic benefits of the hourglass architecture, such as protocol <strong>scalability</strong>, which in this context means the potential for <strong>broad adoption</strong> and huge <strong>economic utility</strong> to be derived from the use of a single &#8220;spanning&#8221; protocol.</p><p>This argument intersects with the informal argument for why a low-diversity middle level enables evolution in the higher levels to the fullest: wide interoperability creates a &#8220;huge market&#8221;, which in turn makes experimentation and bets on the higher levels more attractive due to potentially higher returns on successful experiments.</p><h3>Diversity hourglass&#8217;s risks</h3><p>In the context of AI agents, I&#8217;m not sure the scalability benefit of the hourglass architecture is very relevant: the utility of AI agents is probably going to be big enough, and the cost of their development low enough, that a lot of experimentation will happen even without the promise of maximally broad adoption. 
In fact, this &#8220;adoption amplification&#8221; effect of the hourglass architecture can be considered a <em>downside</em> when applied to AI agents, given the potential societal or institutional disruption due to <em>too quick</em> adoption, the &#8220;slide to criticality&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, and hence the idea that <a href="https://www.lesswrong.com/posts/YkwiBmHE3ss7FNe35/short-timelines-and-slow-continuous-takeoff-as-the-safest">Short timelines and slow, continuous takeoff as the safest path to AGI</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Other risks of the diversity hourglass architecture are also directly connected to its benefits:</p><ul><li><p>Low diversity of the middle level increases the &#8220;evolutionary breeding potential&#8221; not only for &#8220;good&#8221; systems, but also for <a href="https://www.youtube.com/watch?v=Bf4hPlwU4ys&amp;t=29m15s">viruses, parasites, and zombies</a>.</p></li><li><p>What&#8217;s worse, the scale of disruption (the &#8220;blast radius&#8221;) that could be caused by the viruses is <em>additionally</em> exacerbated by the broad adoption of the given middle level or &#8220;spanning&#8221; protocol.</p></li></ul><p>Several distinct <em>approaches</em> for addressing these risks have been proposed:</p><p><strong>Complete verification (proving) that the models/theories across the entire abstraction/level DAG are not hackable</strong>. 
In the domain of AI (agents), this approach has been called a &#8220;Guaranteed Safe AI&#8221; agenda (Dalrymple et al., 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>.</p><p><strong>Multi-level and multi-component (layered) </strong><em><strong>immunity system</strong></em><strong> for fighting viruses and parasites</strong>. Immunity systems themselves should leverage the benefits of the hourglass architecture, namely diversity-enabled sweet spot designs and scalability. Apart from the applications of the <em>system-level synthesis (SLS)</em> framework that underpins Doyle&#8217;s diversity-enabled sweet spots theory in control theory<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a> and game theory<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>, which are not very specific to the immunity domain, perhaps the closest work that I can find that takes this <em>systems immunity</em> perspective is (Ciaunica et al., 2023)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>. Yet, in application to AI agents, this approach is even less developed: the closest idea that has gained some prominence in the AI space is the <a href="https://www.aisafetybook.com/textbook/component-failure-accident-models#swiss-cheese-model">Swiss Cheese Model</a> for risk mitigation.</p><p>The <strong>resilience and safety engineering</strong> perspective: see (Dekker and Woods, 2024)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> for a recent work specifically in application to highly automated systems such as AI agents. 
This is also sometimes called a <em>complex systems</em> perspective, including in Dan Hendrycks&#8217; <a href="https://www.aisafetybook.com/textbook/complex-systems-for-ai-safety">AI Safety textbook</a>.</p><h3>Conclusion</h3><p>Steering towards the diversity hourglass architecture of AI agents<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a> is not automatically &#8220;good&#8221; because the diversity hourglass architecture entails both benefits (I haven&#8217;t even covered all of them in this post; I will elaborate on them in the following post) and risks.</p><p>A thoughtful hourglass architecture for AI agents should proactively mitigate the risks through the combination of</p><ul><li><p>Using <em>provably tamper-proof models</em> at certain levels of abstraction,</p></li><li><p>Designing &#8220;diversity-enabled sweet spot&#8221; <em>layered immunity systems</em> alongside the core functionality within this multi-level architecture, and</p></li><li><p>Accounting for the ideas from resilience engineering such as <em>slide to criticality</em> [4], <em><a href="https://www.youtube.com/watch?v=gFotUdLL2zs">robust yet fragile</a></em>, <em>graceful extensibility</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>, and more.</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>&#8220;John Doyle&#8217;s architecture theory&#8221; is primarily conveyed through multiple presentations by John Doyle (you can find them on YouTube) between 2019 and 2022. More materials can be found in <a href="https://www.dropbox.com/sh/7bgwzqsl7ycxhie/AABQB9L2J-XmCniwgyO3N83Ba?dl=0">Doyle&#8217;s public Dropbox folder</a>. 
For the most unifying and comprehensive published work, see footnote 3.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Or, <em>groups</em> of levels in the <a href="https://www.lesswrong.com/posts/opE6L8jBTTNAyaDbB/a-multi-disciplinary-view-on-ai-safety-research#3_4__Weaving_together_theories_of_cognition_and_cognitive_development__ML__deep_learning__and_interpretability_through_the_abstraction_grounding_stack">abstraction&#8212;grounding DAG</a> of levels/models/theories, where the grouping criterion should be that each group is a connected sub-DAG. The ontological distinction between &#8220;levels proper&#8221; and &#8220;groups of levels&#8221; looks hopeless to me at the moment, and is probably not that useful anyway, so I will mostly just call both levels and groups of levels simply &#8220;levels&#8221; below.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>N. Matni, A. D. Ames and J. C. Doyle, "<a href="https://arxiv.org/abs/2401.15185">A Quantitative Framework for Layered Multirate Control: Toward a Theory of Control Architecture</a>", in IEEE Control Systems Magazine, vol. 44, no. 3, pp. 52-94, June 2024, doi: 10.1109/MCS.2024.3382388.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>D. Alderson, J. Allspaw and D. 
Woods, "<a href="https://www.aqualab.cs.northwestern.edu/wp-content/uploads/2024/01/RE_perspective_day2.pdf">Re-architecting tomorrow&#8217;s internet for &#8220;survivability&#8221; (a resilience engineering perspective)</a>", in proceedings of <a href="https://dl.acm.org/doi/10.1145/3687234.3687238">NSF Workshop: Towards Re-architecting Today&#8217;s Internet for Survivability</a>, 2023.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>The opposite stance on offer here is Nathan Labenz&#8217;s &#8220;<a href="https://x.com/labenz/status/1795849846328295483">adoption accelerationist, hyperscaler pauser</a>&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Dalrymple, David davidad, Joar Skalse, Yoshua Bengio, Stuart Russell, Max Tegmark, Sanjit Seshia, Steve Omohundro, et al. &#8220;Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems.&#8221; arXiv.org, 2024. <a href="https://arxiv.org/abs/2405.06624">https://arxiv.org/abs/2405.06624</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Deglurkar, Sampada, Haotian Shen, Anish Muthali, Marco Pavone, Dragos Margineantu, Peter Karkus, Boris Ivanovic, and Claire J Tomlin. &#8220;System-Level Analysis of Module Uncertainty Quantification in the Autonomy Pipeline.&#8221; arXiv.org, 2024. 
<a href="https://arxiv.org/abs/2410.12019">https://arxiv.org/abs/2410.12019</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Neto, Michela Mulas, and Francesco Corona. &#8220;SLS-BRD: A System-Level Approach to Seeking Generalised Feedback Nash Equilibria.&#8221; arXiv.org, 2024. <a href="https://arxiv.org/abs/2404.03809">https://arxiv.org/abs/2404.03809</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Ciaunica, Anna, Evgeniya V. Shmeleva, and Michael Levin. &#8220;<a href="https://www.frontiersin.org/journals/integrative-neuroscience/articles/10.3389/fnint.2023.1057622/full">The Brain Is Not Mental! Coupling Neuronal and Immune Cellular Processing in Human Organisms</a>.&#8221; <em>Frontiers in Integrative Neuroscience</em> 17 (May 17, 2023). https://doi.org/10.3389/fnint.2023.1057622.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Sidney, and David D Woods. &#8220;<a href="https://www.researchgate.net/profile/David-Woods-19/publication/385746174_Wrong_Strong_and_Silent_What_Happens_when_Automated_Systems_With_High_Autonomy_and_High_Authority_Misbehave/links/673683f837496239b2bfeb21/Wrong-Strong-and-Silent-What-Happens-when-Automated-Systems-With-High-Autonomy-and-High-Authority-Misbehave.pdf">Wrong, Strong, and Silent: What Happens When Automated Systems with High Autonomy and High Authority Misbehave?</a>&#8221; <em>Journal of Cognitive Engineering and Decision Making</em> 18, no. 4 (April 23, 2024): 339&#8211;45. 
https://doi.org/10.1177/15553434241240849.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Saying &#8220;<em>designing</em> the diversity hourglass architecture&#8221; would not be correct here because this activity is not done by a single person or organisation. The common abstractions and levels that will be eventually most widely adopted depend on a myriad of theoretic proposals, technical innovations, marketing campaigns, and political efforts done by numerous actors. Cf. the <a href="https://cs.ccsu.edu/~stan/classes/CS410/Notes16/06-ArchitecturalDesign.html">architecture in the large</a> concept in systems engineering. </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Woods, David D. &#8220;<a href="https://link.springer.com/article/10.1007/s10669-018-9708-3">The Theory of Graceful Extensibility: Basic Rules That Govern Adaptive Systems</a>.&#8221; <em>Environment Systems and Decisions</em> 38, no. 4 (September 10, 2018): 433&#8211;57. https://doi.org/10.1007/s10669-018-9708-3.</p><p>&#8204;</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Differential knowledge interconnection]]></title><description><![CDATA[This post is a reply to Eugene Kirpichov's post on Linkedin. Eugene writes that contributing to general AI and information processing capabilities (including GPUs, general LLM technology, general data processing, etc.) 
is probably harmful overall because these capabilities effectively increase the speed at which the civilisation is moving but don't affect the trends, and the trends are negative right now because the civilisation is not on a sustainable trajectory, that is, the civilisation doesn't move towards increasing flourishing of all moral patients, human and non-human.]]></description><link>https://engineeringideas.substack.com/p/differential-knowledge-interconnection</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/differential-knowledge-interconnection</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Sat, 12 Oct 2024 12:37:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is a reply to Eugene Kirpichov's post <a href="https://www.linkedin.com/posts/eugenekirpichov_i-think-that-doing-any-fundamental-ai-research-activity-7249776414109962240-wY8S">on Linkedin</a>. Eugene writes that contributing to general AI and information processing capabilities (including <em>GPUs, general LLM technology, general data processing</em>, etc.) is probably harmful overall because these capabilities effectively increase the speed at which the civilisation is moving but don't affect the trends, and the trends are negative right now because the civilisation is not on a sustainable trajectory, that is, the civilisation doesn't move towards increasing flourishing of all moral patients, human and non-human.</p><p>I disagree with Kirpichov's proposition as is because I think it lacks important nuance in the definition of general AI and knowledge production capabilities. 
I think there are different kinds of general AI capabilities: some, I agree, are probably net harmful, but others, I think, are not. I will explain the difference below.</p><p>Note that this post is <em>not</em> about general vs. specialised AI, i.e., narrow AI applications. Kirpichov and I agree that specialised AI applications should be judged on a case-by-case basis. Different applications may lie anywhere on the range from clearly beneficial, such as AI for community deliberation (see the <a href="https://www.cip.org/">Collective Intelligence Project</a>), AI for human health (e.g., <a href="https://www.slingshot.xyz/">Slingshot AI</a>, <a href="https://healthcareagents.com/">HealthcareAgents</a>), AI for ocean ecosystems modelling and monitoring (e.g., <a href="https://wildflow.ai/">Wildflow AI</a>), or AI for decoding non-human communication (e.g., <a href="https://www.earthspecies.org/">Earth Species Project</a>), to clearly harmful applications, such as AI for spam, phishing, etc., with a thousand shades of benefit and harm in between.</p><p>Yet another dimension is the degree of generality of AI capabilities, from very broadly general, such as GPGPU computing capabilities, to very specialised AI capabilities, applicable only in a single narrow domain. The chances that capabilities developed for beneficial applications spill over into other, potentially harmful applications should be estimated. The standard example here is AI for drone navigation and autonomy, which, if developed for low-footprint delivery of goods, may proliferate into harmful applications like drone warfare and killer drones.</p><p>Now, to the question of distinguishing between better or worse <em>general</em> AI capabilities.</p><p>AI technologies help to create and test more models faster. 
I use the word "models" in the broadest sense here, including engineering designs, methods, organisational designs, social technologies, psycho-technologies, legal designs (laws, legal structures for organisations, contracts), industrial standards, etc., along with more standard epistemic theories a.k.a. <em>explanations</em>, as well as theories of ethics.</p><p>Models that are not refuted by test or practice become <em>knowledge</em>.</p><p>I agree with Kirpichov that currently, the civilisation is not on a sustainable trajectory. The civilisation as a whole lacks a lot of knowledge about what a sustainable civilisational design even looks like and how to get there from the current state. As David Deutsch wrote, "all evils are caused by insufficient knowledge".</p><p>AI capabilities could be used to obtain and leverage "good" knowledge: that is, the knowledge that nudges the civilisation towards a sustainable path. However, just as well, AI capabilities could be used to obtain and leverage "bad" knowledge: that is, the knowledge that exploits the flaws in the current design of the civilisation to gain advantage at a high collateral cost to other moral patients.</p><p>Therefore, we can infer that AI capabilities that differentially help or incentivise obtaining and leveraging more "good" than "bad" knowledge are probably on net beneficial.</p><p>However, how do we distinguish between "good" and "bad" knowledge? Can the usage of some knowledge be "good" today but "bad" tomorrow, or vice versa? 
How would we know?</p><p>I think we can work on this question "backwards".</p><h3>"Good" knowledge implies interconnection</h3><p>When the civilisation is in a sustainable state, all agents whose actions matter for the well-being of any moral patients have to predominantly use "good" knowledge.</p><p>By the definition of a sustainable civilisation given above, this knowledge should be interconnected with the (subjective/objective) knowledge about all moral patients' states of flourishing<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, sourced from them directly when possible, e.g. when their own verbal report is available, or, if unavailable, "to the best of our knowledge", that is, using our best methods for inferring the flourishing states of mute moral patients.</p><p>The phrase <em>knowledge interconnection</em> used above refers to how the models underlying that knowledge have been inferred and tested. "Interconnected inference" means obtaining the joint posterior of the models, and "interconnected testing" means verifying that the models hold up in practice in interactive scenarios rather than only in isolation. Inference and testing should also not be isolated from each other but rather create a learning loop.</p><p>Making the knowledge of every agent (not only humans and AIs but also organisations and states) interconnected with the knowledge about the flourishing states of all moral patients individually would be infeasibly expensive. Imagine that every business or AI agent had to consider the outcomes of all their decisions for every human and animal. 
Therefore, by necessity, the structures for integrating the knowledge about the states of moral patients and outcomes for them have to be hierarchical<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>Rafael Kaufmann and I have called these structures that ought to exist to align the civilisation with the flourishing of moral patients the <a href="https://engineeringideas.substack.com/p/gaia-network-an-illustrated-primer">Gaia Network</a>. The details of the technical architecture that we proposed for "connected model inference", federated learning via credit assignment, and trustworthy "connected model testing" (verification/validation) are not important for this post. These details are also still up for debate. Also, it's almost certain that if our civilisation reaches a sustainable trajectory, multiple alternative approaches to knowledge interconnection will coexist.</p><p>The big point is that the "good" knowledge embodied in a sustainable civilisation <em>must</em> be interconnected, pretty much by definition of a sustainable civilisation.</p><p>Note that the reverse may not be true: fully interconnected knowledge may not be used towards the flourishing of all patients. As an example, imagine a tight world-wide surveillance regime that doesn't value the flourishing of people and animals.</p><p>However, I don't think it means that we should suppress knowledge interconnection until we figure out how to ensure that knowledge is used benevolently. 
On the contrary, it seems to me that gradual interconnection of the knowledge embodied by economic agents is one of the very few operationalisable strategies for expanding world agents' circles of concern and care<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>.</p><h3>AI and data processing capabilities that foster knowledge interconnection</h3><p>From the interim conclusions above, we can posit that the kinds of AI and information processing capabilities that have a propensity to be used for obtaining and using interconnected knowledge are probably on net beneficial.</p><p>As is inherent to the discussions of differential technology development (see the <a href="https://michaelnotebook.com/dtd/index.html">recent Michael Nielsen's notes on the topic</a>), it's often frustratingly hard even to put a sign on the propensity of this or that capability for "knowledge interconnection". I cannot do this for general technologies like LLMs, or for general capabilities like causal reasoning and planning.</p><p>For example, LLMs can be deployed (and are, in fact, essential) both for "knowledge connection", such as when they are used for semantic knowledge graph mining, and for "disconnected" use cases, such as for automating myriads of tasks in the present "disconnected" business ecosystems.</p><p>For another example, capabilities for LLM distillation and miniaturisation may, on the one hand, foster the creation of knowledge graphs and well-structured public data because they "don't know much by themselves" and therefore have to rely on externalised knowledge, but, on the other hand, small LLMs make (online) automation much cheaper, which accelerates the present harmful trends.</p><p>Planning is essential for grounded federated learning and credit assignment, but is obviously also the core capability for AI automation that leverages existing "disconnected" knowledge for profit extraction.</p><p>Nevertheless, I think I can 
make at least a few claims relatively confidently.</p><p>Capabilities that seem to favor knowledge interconnection:</p><ul><li><p><em>Federated learning</em>, privacy-preserving <em>multi-party computation</em>, and <em>privacy-preserving machine learning</em>. See <a href="https://flower.ai/">Flower AI</a>, <a href="https://openmined.org/">OpenMined.org</a>.</p></li><li><p><a href="https://www.sciencedirect.com/science/article/pii/S0149763423004694">Federated inference and belief sharing</a>. Examples: prediction markets like <a href="https://manifold.markets/">Manifold</a>, <a href="https://www.digitalgaia.earth/">Digital Gaia</a>.</p></li><li><p>Protocols and file formats for data, belief, or claim exchange and validation, such as various blockchain and crypto projects, <a href="https://solidproject.org/">Solid</a>, <a href="https://en.wikipedia.org/wiki/ActivityPub">ActivityPub</a>, <a href="https://xtdb.com/">XTDB</a>, or <a href="https://github.com/apache/incubator-graphar">GraphAr</a>.</p></li><li><p>Semantic knowledge mining and hybrid reasoning on (federated) knowledge graphs and multimodal data, including tabular data. See <a href="https://arxiv.org/abs/2310.18318">OpenCog Hyperon</a>.</p></li><li><p><em>Structured or semantic search</em>, such as <a href="https://exa.ai/">exa.ai</a> and <a href="https://system.com/">system.com</a>.</p></li><li><p><em>ML interpretability</em>, including so-called <a href="https://arxiv.org/abs/2404.14082">mechanistic interpretability</a>. The usual "theory of impact" of interpretability is ensuring that we know whether huge LLMs act or make suggestions to us for benevolent or nefarious reasons, and corrupting LLMs' knowledge of dangerous topics via representation engineering. 
However, in the context of this post, representation interpretability is an essential piece of the research agenda of the cognitive science of flourishing, i.e., understanding the (subjective) states of moral patients: potentially AIs themselves(!), but also animals or humans, when the representations are obtained by passing video, audio, brain-computer interface signals, or other measurements of them through DNNs.</p></li><li><p>Datastore federation for <a href="https://arxiv.org/abs/2403.03187">retrieval-based LMs</a>.</p></li><li><p>Cross-language (such as English/French) retrieval, search, and semantic knowledge integration. This is especially important for languages with a low online presence. See <a href="https://cohere.com/research">Cohere for AI</a>.</p></li></ul><p>On the other hand, many stock AI capabilities, such as imitation learning, many (though not all) methods of reinforcement learning, image and video generation<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>, online (browser-based) task automation, and collaborative filtering, seem to have very little to do with knowledge interconnection.
Although it would probably be unfair to say that most of these AI capabilities favor obtaining and leveraging "isolated" knowledge either, if we take as the default premise that these capabilities will mostly be used to accelerate the current trends (of leveraging "disconnected" knowledge in an environment where most agents' circles of concern are rather narrow) rather than to change them, then we should expect the development of these capabilities to be harmful on net.</p><h3>Knowledge interconnection and the risk of industrial dehumanisation</h3><p>I agree with Andrew Critch that <a href="https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech#Extinction_by_industrial_dehumanization">post-AGI </a><em><a href="https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech#Extinction_by_industrial_dehumanization">industrial dehumanization</a></em><a href="https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech#Extinction_by_industrial_dehumanization"> is a major extinction risk for humanity</a>.</p><p>This challenges the view that the development of AI capabilities that favor knowledge interconnection will be on net beneficial if pursued right now. It corresponds to the possible position that I already mentioned above, namely that <em>any</em> intelligence amplification, even of the "interconnected" flavor, only exacerbates risks until we ensure that the knowledge is deployed stably and benevolently towards humans and other moral patients.</p><p>Currently, I conclude that the development of AI capabilities that favor knowledge interconnection on net reduces the risk of industrial dehumanisation.
This is because an economy aligned with the flourishing of humans, animals, and natural ecosystems will be much more complex and interconnected (hence, lower entropy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>) than the pure "machine economy", at least "locally"/temporarily. This means that <strong>AI capabilities favoring knowledge interconnection also differentially favor the development and sustainment of complex "human economy" industries, institutions, and social phenomena, such as healthcare, agriculture, education and enlightenment, family and romance, deliberative or liquid democracy, philosophy and ethics, religion, communities, culture, animal and ecosystem preferences, etc.</strong></p><p>In fact, I earlier wrote <a href="https://www.lesswrong.com/posts/PcTLHamp236afJxxT/some-for-profit-ai-alignment-org-ideas?commentId=JC6xpCyvwwD5kqyHw">comments</a> that are very compatible with both this post and Critch's recent post calling for the differential development of "human economy" industries. This post focuses specifically on AI capabilities and technologies and attempts to find "good" ones among them, but of course I fully support the development of human- and animal-centered AI applications as well.</p><p>Nevertheless, I take seriously the "industrial dehumanisation challenge" to the <em>differential knowledge interconnection</em> thesis that I lay out above in this post. I'm not at all sure about my own conclusions.
I'm interested in your thoughtful opinions on this subject.</p><h3>Relation to the simplicity/acc manifesto</h3><p>A few months ago, I drafted the <a href="https://engineeringideas.substack.com/i/146642286/summary-the-simplicityacc-manifesto">simplicity/acc manifesto</a>:</p><blockquote><ul><li><p>Apply AI power to create simple software.</p></li><li><p>Create more a la carte tools (such as debuggers, observability, modellers, simulators, security analysers, verifiers, AI-first DevOps, CI/CD tools) to empower AI to create, maintain, and explain simple software more reliably and effectively.</p></li><li><p>Create more real-world-facing software than software-facing software.</p></li><li><p>Make it easier for system designers and developers to receive and account for diverse feedback from the real world and the stakeholders.</p></li><li><p>Spend the software complexity &#8220;budget&#8221; on the essential complexity of accommodating diverse, interacting users&#8217; and stakeholders&#8217; needs rather than on the accidental complexity of &#8220;self-consumed&#8221; software.</p></li></ul></blockquote><p>The simplicity/acc manifesto calls for more focus on AI applications rather than the development of general AI capabilities, and specifically on applications that "face the real world" rather than other software and software-derived "virtual" economy elements such as finance.</p><p>This manifesto may look somewhat in conflict with the differential knowledge interconnection thesis. Although the last two items of the manifesto are very synergistic with knowledge interconnection for advancing the flourishing of moral patients, the second point, about "creating a la carte tools for software engineering", looks perhaps slightly antagonistic to it.
So, today I would probably remove it from the manifesto.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I don't discuss here the disagreements about what subjective or objective states or "life journeys" of moral patients are really desirable between welfarism, utilitarianism, hedonism, eudaimonism, ecocentrism, and other relevant theories of ethics. Without loss of generality, we can assume that moral uncertainty and ethical portfolio views should be applied and the "goodness" of such and such state of such and such moral patient(s) weighted accordingly.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Fields, Chris. "<a href="https://chrisfieldsresearch.com/FEP-compart-pre.pdf">The free energy principle induces compartmentalization.</a>" Biochemical and Biophysical Research Communications (2024): 150070.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Witkowski, Olaf, Thomas Doctor, Elizaveta Solomonova, Bill Duane, and Michael Levin. 
"<a href="https://www.sciencedirect.com/science/article/pii/S0303264723001399">Toward an ethics of autopoietic technology: Stress, care, and intelligence.</a>" Biosystems 231 (2023): 104964.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>A potential counter-point here, per Andrew Critch's "<a href="https://www.lesswrong.com/posts/Kobbt3nQgv3yn29pr/my-theory-of-change-for-working-in-ai-healthtech">My theory of change for working in AI healthcare</a>", is that image and video generation are mostly used for human entertainment, which is a part of "human economy", and fostering human economy is preferable to fostering "machine economy". </p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>See Beren Millidge's &#8220;<a href="https://www.lesswrong.com/posts/d74pb97TAqNKwJkc5/bcis-and-the-ecosystem-of-modular-minds">BCIs and the ecosystem of modular minds</a>&#8221;.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Table transfer protocols: improved Arrow Flight and alternative to Iceberg]]></title><description><![CDATA[This article is the ultimate one in the five-piece series:]]></description><link>https://engineeringideas.substack.com/p/table-transfer-protocols-improved</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/table-transfer-protocols-improved</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Sat, 07 Sep 2024 10:37:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0KNY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>This article is the final one in the five-piece series:</p><p><strong>1. &#8220;<a href="https://engineeringideas.substack.com/p/the-future-of-olap-table-storage">The future of OLAP table storage is not Iceberg</a>&#8221;</strong> argues that object storage-based <em>open table formats</em> (Apache Iceberg, Apache Hudi, and Delta Lake), although they <em>may</em> completely cover all analytical querying needs for <em>some</em> data teams, impose several significant limitations and inefficiencies for some OLAP use cases, and therefore shouldn&#8217;t be trumpeted as the &#8220;future&#8221; of columnar table storage.</p><p><strong>2. &#8220;<a href="https://github.com/apache/arrow/issues/43762">Proposal: generic streaming protocol for columnar data</a>&#8221;</strong> proposes a low-level, reactive/asynchronous streaming protocol for data in Arrow or another <a href="https://blog.lancedb.com/lance-v2/">columnar container</a> format. The proposed protocol occupies the same niche as <a href="https://arrow.apache.org/docs/format/DissociatedIPC.html">Arrow Dissociated IPC</a>, but is more general and flexible.</p><p><strong>3. &#8220;<a href="https://engineeringideas.substack.com/p/table-transfer-protocols-for-universal">Table transfer protocols for universal access to OLAP databases</a>&#8221;</strong> describes <em>the M &#215; N interoperability problem</em> between OLAP databases and processing engines, which <a href="https://engineeringideas.substack.com/i/148153268/the-problem-of-olap-data-interoperability">have both greatly diversified in recent years</a>.
I discuss the existing solutions to this problem, including the usage of the aforementioned open table formats, <a href="https://cloud.google.com/bigquery/docs/biglake-intro#connectors">BigQuery Storage APIs</a>, and <a href="https://arrow.apache.org/docs/format/FlightSql.html">Arrow Flight</a>, and describe the limitations of each of these solutions that are too significant to ignore.</p><p>Then I propose a new family of <strong>table transfer protocols: Table Read, Table Write, and Table Replication protocols</strong>. These protocols are layered on the streaming protocol for columnar data proposed in the previous article. These protocols don&#8217;t have the limitations of the other solutions to the interoperability problem between OLAP databases and processing engines.</p><p>Finally, in that article, I describe Table Read protocol in detail.</p><p><strong>4. &#8220;<a href="https://engineeringideas.substack.com/p/table-write-protocol-for-interop">Table Write protocol for interop between diverse OLAP databases and processing engines</a>&#8221;</strong> describes Table Write protocol: a <em>pull-based</em> data ingestion protocol that uses Table Read protocol <em>in reverse</em>: the target database reads the records to write from the writing source, such as a processing engine.</p><p>This design is agnostic to the transaction, isolation, and write atomicity semantics that the target database uses, but is conducive to ensuring <em>exactly once</em> delivery guarantees on the higher level of abstraction (the transaction or ingestion task management).</p><p>Another byproduct of making Table Write protocol based on Table Read protocol is getting an &#8220;almost free&#8221; <a href="https://engineeringideas.substack.com/i/148485754/table-replication-method">distributed table replication method (protocol)</a> from any OLAP database that implements Table Read protocol into any database that implements the target side of Table Write protocol.</p><p><strong>5.
&#8220;Overview of table transfer protocols&#8221;</strong> (this post) summarises the most important points from the previous posts. For reference, I&#8217;ll generously link to the most relevant sections of the preceding posts in this series throughout the text below.</p><h2>How table transfer protocols are different from Arrow Flight</h2><h3><strong>Table Read protocol</strong></h3><p>Unlike Arrow Flight, Table Read protocol affords <strong>fine-grained control of</strong></p><ul><li><p><strong>load distribution</strong> both on the <a href="https://engineeringideas.substack.com/i/148153268/slice-and-partition-breakdown">server (database)</a> and <a href="https://engineeringideas.substack.com/i/148153268/agents-logic">client (processing engine)</a> sides,</p></li><li><p><strong><a href="https://engineeringideas.substack.com/i/148153268/partition-locations">resource usage</a> on the server (database) side</strong>, enabling database nodes to serve table data for the processing engine concurrently with other real-time queries or data ingestion, and</p></li><li><p><strong>network traffic</strong> between the server and client sides. 
Network traffic size could be <a href="https://engineeringideas.substack.com/i/148153268/column-sections-and-transfer-encodings">traded off</a> with resource usage on the database side.</p></li></ul><p>Also, Table Read protocol affords <a href="https://engineeringideas.substack.com/i/148153268/read-consistency">consistent reads</a> and resilience in the face of <a href="https://engineeringideas.substack.com/i/148153268/clients-logic">server</a> or <a href="https://engineeringideas.substack.com/i/148153268/preemption-of-partition-consumption">client</a> node failures or <a href="https://engineeringideas.substack.com/i/148153268/table-data-availability">table data rebalancing</a>/redistribution concurrent with long-running Read jobs (i.e., Table Read protocol interactions).</p><p><strong>These properties are achieved <a href="https://engineeringideas.substack.com/i/148153268/basic-principles">in cooperation</a> between the server and client sides of the Read jobs.</strong></p><p>Arrow Flight doesn&#8217;t afford cooperation between the protocol interaction sides to achieve these properties. If the client accesses the database via Arrow Flight, the database has to take sole responsibility for consistency, resiliency, and load distribution, but this is more complicated and less efficient: <em>cooperative mechanisms for consistency and resiliency are simpler</em>. Therefore, <strong>Table Read protocol requires less end-to-end implementation complexity to get consistent reads and resiliency</strong> than Arrow Flight.</p><p>Table Read protocol also strives to be maximally <a href="https://engineeringideas.substack.com/i/148153268/basic-principles">agnostic about the architectures of both databases and processing engines</a> that could implement it, while enabling optimal read performance between any pairings of the Read job sides.
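</p><p>To make the cooperative mechanics concrete, here is a minimal Python sketch (all names are invented for illustration; the actual protocol prescribes message types, not this API) of the partition bookkeeping behind a Read job: clients claim partitions, acknowledge consumed ones, and a failed client's unacknowledged partitions return to the pool for surviving clients to re-claim.</p>

```python
# Hypothetical sketch of cooperative Table Read bookkeeping.
# All names are invented; the real protocol is defined in the linked
# posts, not by this code.
from dataclasses import dataclass, field

@dataclass
class ReadJob:
    """Server-side state: partitions of one table snapshot to be read."""
    pending: set = field(default_factory=set)
    in_flight: dict = field(default_factory=dict)  # partition -> client id
    done: set = field(default_factory=set)

    def claim(self, client_id: str):
        # A client asks for the next partition to consume.
        if not self.pending:
            return None
        part = self.pending.pop()
        self.in_flight[part] = client_id
        return part

    def ack(self, part):
        # The client confirms it fully consumed the partition.
        self.in_flight.pop(part, None)
        self.done.add(part)

    def fail(self, client_id: str):
        # A failed client's unacknowledged partitions go back to pending,
        # so surviving clients can re-read them (cooperative resilience).
        for part, owner in list(self.in_flight.items()):
            if owner == client_id:
                del self.in_flight[part]
                self.pending.add(part)

job = ReadJob(pending={"p0", "p1", "p2"})
a = job.claim("client-a")
b = job.claim("client-b")
job.ack(a)
job.fail("client-b")       # client-b dies before acknowledging
c = job.claim("client-a")  # its partition becomes claimable again
job.ack(c)
```

<p>This sketch omits the data transfer itself and shows only the claim/acknowledge/re-claim bookkeeping that lets both sides cooperate on resilience.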
Table Read protocol is supposed to work approximately equally well for</p><ul><li><p>Single-node or distributed servers or clients,</p></li><li><p><a href="https://engineeringideas.substack.com/i/147163081/data-warehouses-designed-for-olap-manage-their-table-storage-themselves">Disk-first, object storage-first, or other approaches to table data storage</a>,</p></li><li><p>Various approaches to <a href="https://engineeringideas.substack.com/i/148485754/index-statistics-and-segment-metadata-replication">metadata storage</a>,</p></li><li><p>CPU-only, <a href="https://github.com/apache/arrow/issues/43762">GPU-only, or mixed processing</a>,</p></li><li><p>IO-bound or compute-bound query patterns, thanks to flexible network traffic control (see above), and the option to <a href="https://engineeringideas.substack.com/i/148153268/table-processing-plan">push-down processing to storage nodes</a>.</p></li></ul><p>Table Read protocol is also agnostic about the <a href="https://engineeringideas.substack.com/i/148153268/read-consistency">consistency/isolation model</a> of the database.</p><h3><strong>Table Write protocol</strong></h3><p>Unlike <a href="https://arrow.apache.org/docs/format/FlightSql.html#id4">Arrow Flight&#8217;s bulk ingestion path</a>, Table Write protocol supports <strong>distribution on both the writing source and target database&#8217;s sides</strong> by virtue of <a href="https://engineeringideas.substack.com/i/148485754/the-best-table-write-protocol-is-table-read-protocol-in-reverse">being essentially Table Read protocol with server and client sides flipped</a>, with only a few non-trivial additions. These additions are primarily needed to support handling the total size of written records <a href="https://engineeringideas.substack.com/i/148485754/handling-write-jobs-larger-than-source-servers-memory">exceeding the memory allocated to the Write job on the processing engine&#8217;s worker nodes</a>, thus enabling query processing and ML frameworks with a strong preference for memory-only operation (such as Bodo, cuDF, Dask, Apache DataFusion, DuckDB/MotherDuck, oneDAL, Polars, Ray, Theseus, and others) to effectively use table transfer protocols in the &#8220;<a href="https://engineeringideas.substack.com/i/148485754/the-etl-loop">ETL loop</a>&#8221; with the database.</p><p>Table Write protocol is designed specifically for processing engines as the source sides in Write jobs (i.e., Table Write protocol interactions). Processing engines typically provide consistent-at-the-offset reads, which enables end-to-end <em>exactly once</em> delivery/ingestion guarantees for the data that &#8220;flows&#8221; through them. <a href="https://engineeringideas.substack.com/i/148485754/handling-server-restarts">Table Write protocol exposes at-the-offset reading semantics</a> (instead of hiding them behind the veneer of &#8220;simpler&#8221; abstractions) <strong>to &#8220;plug into&#8221; the </strong><em><strong>exactly once</strong></em><strong> delivery flow</strong> that may include a Write job at the end or as an intermediate data exchange between different systems.</p><p>Not coincidentally, the consumption of a shared log (such as Kafka) is also <a href="https://engineeringideas.substack.com/i/148485754/the-best-table-write-protocol-is-table-read-protocol-in-reverse">the industry standard for resilient ingestion in distributed OLAP databases</a>. However, since processing engines provide consistent-at-the-offset reads like a shared log themselves, there is no need to stick a Kafka topic between sources and targets (databases) in Table Write protocol.</p><h3><strong>Scope</strong></h3><p>Table transfer protocols are lower-level than Arrow Flight SQL. I intentionally leave out of the scope of table transfer protocols:</p><ul><li><p>Data access or writing permissions.
This concern is left, for example, to data governance systems like Unity Catalog.</p></li><li><p><a href="https://engineeringideas.substack.com/i/148485754/out-of-scope-transaction-semantics">Transactions</a>. They are left, for example, to the database layer if only one database system is used throughout the processing pipeline, or to ETL/ELT tools (such as dbt) or distributed transaction systems (such as Apache Seata or Temporal) when more than one database, queue, or source/sink service is involved in the pipeline.</p></li><li><p>Table creation, and the management and configuration of perpetual ingestion jobs/streams (a-la DDL). This logic is left to the target databases. However, insert, update, upsert, delete, and <a href="https://engineeringideas.substack.com/i/148485754/replication-of-delete-files">partition-specific</a> semantics (a-la DML) <a href="https://engineeringideas.substack.com/i/148485754/record-writing-dml-and-output-semantics">could be specified</a> for record writing/ingestion in Table Write protocol.</p></li><li><p>The query syntax and semantics, such as SQL. At the moment, I think it&#8217;s a good idea for table transfer protocols to embrace <a href="https://substrait.io/">Substrait</a> as the only format for expressing <a href="https://engineeringideas.substack.com/i/148153268/table-processing-plan">queries</a>, <a href="https://engineeringideas.substack.com/i/148153268/slices-and-partitions">partition breakdown of table data</a>, <a href="https://engineeringideas.substack.com/i/148485754/partitions">schema of the written records</a>, and negotiation of the <a href="https://engineeringideas.substack.com/i/148153268/table-processing-plan">processing logic push-back</a> from the server to the client side.
Substrait plans can be extended in many ways and at different places, which should be flexible enough to support time-travel semantics in feature stores such as Hopsworks, Chronon, and others.</p></li></ul><p>While table transfer protocols are lower-level than Arrow Flight SQL, they are higher-level than <a href="https://arrow.apache.org/docs/format/Flight.html">Arrow Flight RPC</a>. Arrow Flight RPC is oblivious to table-level processing logic such as projections, filtering, aggregations, table read consistency, and table data partitioning:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0KNY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0KNY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png 424w, https://substackcdn.com/image/fetch/$s_!0KNY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png 848w, https://substackcdn.com/image/fetch/$s_!0KNY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png 1272w, https://substackcdn.com/image/fetch/$s_!0KNY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0KNY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png" width="1456" height="629" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:629,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0KNY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png 424w, https://substackcdn.com/image/fetch/$s_!0KNY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png 848w, https://substackcdn.com/image/fetch/$s_!0KNY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png 1272w, https://substackcdn.com/image/fetch/$s_!0KNY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1697db02-9392-42b7-9d2a-1bb9c56c7142_1826x789.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Thus, <strong>table transfer protocols are designed to be </strong><em><strong>always </strong></em><strong>used by other data tools and systems</strong>: SQL querying layers, transaction managers, processing pipeline planners and orchestrators, semantic layers, ML training, inference, or data science frameworks, feature stores, CDC or <a href="https://engineeringideas.substack.com/i/148485754/table-replication-method">table replication systems</a>, etc.</p><p>Sometimes, though not always, processing engines and databases that take part in the specific table transfer protocol interaction would also play the role of these data tools, such as when the database is its own SQL querying layer, semantic layer, and transaction manager, the processing engine is also
an ML framework, etc.</p><p>Regardless, table transfer protocols also permit pulling these functions away from the storage and processing layers in line with the <a href="https://materializedview.io/p/databases-are-falling-apart">&#8220;database disassembly&#8221; trend</a>.</p><h3><strong>Complicatedness</strong></h3><p><strong>Table transfer protocols are more complicated than Arrow Flight</strong>: table transfer protocols define more node roles, request and response types, states the interacting nodes could be in, etc.</p><p>I believe that most of this complexity is <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet#Summary">essential</a> for table transfer protocols to provide</p><ul><li><p><em>Cooperability</em> between the client and server sides,</p></li><li><p><em>Universality</em> wrt. database and processing engine architectures, features, and semantic models,</p></li><li><p><em>Flexibility</em> for efficient support of different query types, workload patterns, and heterogeneous compute hardware, and</p></li><li><p><em>Composability</em> with higher-level systems and functions.</p></li></ul><p>In particular, as I already mentioned above, cooperative mechanisms to achieve consistent reads, &#8220;exactly once&#8221; delivery guarantees, resilience, and optimal load distribution are usually simpler than those where only one side of the protocol interaction (usually the database) tries to take full responsibility for achieving these properties, as implied by Arrow Flight.</p><p>Note that all the <strong>attractive properties of table transfer protocols come </strong><em><strong>just</strong></em><strong> from making these protocols lower level and more tweakable</strong> than Arrow Flight, rather than from using some clever or novel techniques and distributed algorithms.</p><p>Moreover, <strong>achieving some of these nice properties with Arrow Flight may be just impossible for the database unless it has a certain distributed architecture</strong>, such
as <a href="https://engineeringideas.substack.com/i/148153268/table-read-protocol-overview-and-comparison-with-arrow-flight">reverse proxy (&#8220;frontend&#8221;) nodes on the read path</a> for availability and resilience, or <a href="https://engineeringideas.substack.com/i/148485754/the-best-table-write-protocol-is-table-read-protocol-in-reverse">a shared log on the write path</a> for &#8220;exactly once&#8221; delivery. Sure, some databases, such as data warehouse offerings from public clouds, have these elements in their distributed architectures. But most OLAP and time-series databases (and especially their open-source/on-premise tiers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>) don&#8217;t have these elements. Thus, <strong>table transfer protocols enable building resilient and consistent distributed data stacks without obligatory DBaaS and PaaS subscriptions</strong>, possibly outside the cloud.</p><p>As I noted above, table transfer protocols are designed to be implemented and directly used by database, data processing, and other data tool engineers rather than by data engineers who combine these systems for their data stacks. This fact also makes me think that the complicatedness of table transfer protocols is the right tradeoff. For example, the <a href="https://www.postgresql.org/docs/current/protocol.html">PostgreSQL wire protocol</a> is also very complicated, as it essentially has to be, to provide its functionality.
Still, many OLAP databases adopt this protocol because it enables plugging into the PostgreSQL ecosystem.</p><p>With all that said, I also shared some thoughts about how table transfer protocols could be made less burdensome for databases and processing engines to implement <a href="https://engineeringideas.substack.com/i/148153268/table-read-protocol-implementation-notes">here</a> and <a href="https://engineeringideas.substack.com/i/148485754/table-write-protocol-implementation-notes">here</a>. I also don&#8217;t claim that I have already come up with the simplest possible designs for the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol for columnar data</a> and table transfer protocols. I welcome your feedback and ideas about their designs!</p><h2>Table transfer protocols enable the benefits of Iceberg without its limitations</h2><p>Object storage-based open table formats for the data lakehouse architecture (Iceberg, Hudi, and Delta Lake) radically solve the interoperability problem between databases and processing engines by removing the database layer from the data stack entirely: data is ingested into and queried from the object storage directly by diverse processing engines.
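</p><p>To make the "no database layer" idea tangible, here is a toy, stdlib-only Python sketch (all names are invented; this is not any real table format): writers and readers share nothing but a directory of immutable files and a manifest listing the committed ones.</p>

```python
# Toy illustration of the lakehouse idea: independent "engines"
# interoperate through immutable data files plus a manifest, with no
# database server in between. Real open table formats (Iceberg, Hudi,
# Delta Lake) add schemas, snapshots, and atomic commits on top of this
# basic structure. All names here are invented for illustration.
import csv
import json
import os
import tempfile

store = tempfile.mkdtemp()  # stands in for an object storage bucket

def commit_file(rows, name):
    """'Writing engine': write an immutable data file, then register it
    in the manifest so that readers can discover it."""
    with open(os.path.join(store, name), "w", newline="") as f:
        csv.writer(f).writerows(rows)
    manifest = os.path.join(store, "manifest.json")
    files = []
    if os.path.exists(manifest):
        with open(manifest) as f:
            files = json.load(f)
    files.append(name)
    with open(manifest, "w") as f:
        json.dump(files, f)

def scan_table():
    """'Reading engine': assemble the table from the files listed in the
    manifest, regardless of which engine wrote them."""
    with open(os.path.join(store, "manifest.json")) as f:
        files = json.load(f)
    rows = []
    for name in files:
        with open(os.path.join(store, name), newline="") as f:
            rows.extend(csv.reader(f))
    return rows

commit_file([["1", "a"], ["2", "b"]], "part-0.csv")
commit_file([["3", "c"]], "part-1.csv")
table = scan_table()
```

<p>In this toy, concurrent writers would race on the manifest; solving exactly that race with atomic metadata operations is a large part of what the real open table formats provide.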
The interface is not so much the object storage API/protocol as the format of the files.</p><p>It&#8217;s important to disambiguate the pros and cons of using object storage as the data storage service in the lakehouse architecture from the pros and cons of using open table formats such as Iceberg per se.</p><p>There are other object storage-only (or object storage-first) databases, such as Databend, Firebolt, GreptimeDB, LanceDB, OpenObserve, Oxla, Snowflake, Quickwit, and others, that do <em>not</em> use any of the &#8220;big three&#8221; open table formats (Iceberg, Hudi, Delta Lake) yet share with open table formats the benefits and limitations of using cloud object storage as the primary data storage service.</p><h3><strong>Object storage</strong></h3><p>The benefits of cloud object storage for analytical, search, and vector querying are primarily the <strong>low cost</strong> of &#8220;cold&#8221; storage that is never or very rarely accessed, and the very high <strong>aggregate read throughput</strong> for very IO-intensive queries.</p><p>The limitations of object storage for analytical, search, and vector querying are:</p><ul><li><p>Object storage-imposed <a href="https://engineeringideas.substack.com/i/147163081/network-io-amplification-inability-to-filter-or-pre-aggregate-rows-on-the-storage-side">read amplification and network overheads of random file access</a> <strong>make certain analytical indexes ineffective</strong>.</p></li><li><p>The &#8220;<a href="https://engineeringideas.substack.com/i/147163081/the-tradeoff-between-ingestion-data-latency-and-small-file-problem-with-storage-amplification">small file problem</a>&#8221;: the high overhead of small files (and of file writing as such) that <strong>prevents storing very fresh data in object storage</strong> or greatly increases the cost of doing so.</p></li></ul><p><strong>Modern databases avoid these limitations</strong> by having a layer of nodes for caching and updating recent table data, metadata, and 
indexes in memory or on SSDs.</p><p>At the same time, these databases can <strong>retain both main benefits of using object storage (low cost for rarely used data and high read throughput)</strong> by either making this layer completely ephemeral (like Databend, DuckDB/MotherDuck, Firebolt, LanceDB, Redshift, Snowflake, and others do), or by aggressively <em>tiering</em> table data to object storage, like ClickHouse, Doris, Druid, Pinot, SingleStore, StarRocks, and others <em>could</em> be configured to do.</p><p>Table transfer protocols (as well as Arrow Flight) can leverage this caching layer because they are just network protocols that databases can implement. However, accessing the data lake through Iceberg or another open table format is bound to miss freshly ingested data or be <a href="https://engineeringideas.substack.com/i/147163081/object-storage-based-table-formats-miss-the-innovations-in-file-formats-for-data-partitions">ineffective for certain queries that greatly benefit from specialised indexes</a>.</p><h3><strong>Table formats</strong></h3><p>Open table formats embody an ambitious and technically impressive idea of removing the database from the data stack by carefully re-implementing its critical functions, namely the management of table schemas, metadata, and transactions, within the so-called <em>catalog</em> component, separated from the processing engines that do the meat of ETL and query processing.</p><p>However, it seems to me that there are some <em>non-technical</em> downsides to the wide adoption of this idea.</p><h4><strong>Innovation speed and file format inertia</strong></h4><p>If Iceberg &#8220;wins&#8221; analytical data stack deployments decisively, it will put the entire industry in an unusual situation in which innovation in table storage formats and performance hinges on a <em>committee</em> design process of <em>on-disk</em> file formats.</p><p>This sounds like a recipe for <a 
href="https://engineeringideas.substack.com/i/147163081/object-storage-based-table-formats-miss-the-innovations-in-file-formats-for-data-partitions">slow progress</a> at first and a legacy drag and &#8220;data format inertia&#8221; later.</p><p>Compliant Iceberg readers (processing engines) should already be able to read Avro, ORC, and Parquet. But these formats are not the last word in the history of columnar file formats: see&nbsp;<a href="https://www.youtube.com/watch?v=bISBNVtXZ6M&amp;ab_channel=VeloxCon">Nimble</a>,&nbsp;<a href="https://blog.lancedb.com/lance-v2/">Lance</a>,&nbsp;<a href="https://docs.activeloop.ai/technical-details/data-format">DeepLake Data Format</a>, <a href="https://github.com/maxi-k/btrblocks">BtrBlocks</a>,&nbsp;and <a href="https://github.com/spiraldb/vortex">Vortex</a>, not to mention the steadily improving internal file formats of ClickHouse, Databend, Doris, Druid, Pinot, StarRocks, and other databases. Iceberg will always face the trade-off of adding new formats for efficiency vs. imposing ever more implementation burden on processing engines.</p><p>Because table transfer protocols always assume <em>some</em> compute on the database side, they demand that the database side at least always supports column data transfer in the <a href="https://arrow.apache.org/docs/format/Columnar.html">Arrow</a> format, potentially in addition to other, more specialised or optimised formats. (Obviously, this is also the main idea of <a href="https://arrow.apache.org/docs/format/FlightSql.html">Arrow Flight</a>. However, table transfer protocols <a href="https://engineeringideas.substack.com/i/148153268/column-sections-and-transfer-encodings">don&#8217;t limit themselves to the Arrow format</a>.)</p><h4><strong>Vendor competition and the &#8220;lowest common denominator&#8221; effect</strong></h4><p>There are a lot of OLAP and time-series database technologies, and a lot of engineers who love to hack on them. 
If Iceberg takes over the OLAP space as the table storage format, these engineers and companies won&#8217;t just decide that area is &#8220;solved&#8221; and move to work on something else. This also won&#8217;t make sense objectively, considering that there are limitations and inefficiencies inherent to storing data <em>only</em> in object storage (as discussed above) and that Parquet is not the &#8220;silver bullet&#8221; file format that covers all use cases optimally.</p><p>It seems much more likely that database and data warehouse companies will start to &#8220;build around&#8221; Iceberg. For example, they can add custom-made indexes to Parquet files for specific use cases, store them in a separate SSD- or NVMe-based semi-ephemeral storage tier, or collect and cache advanced metadata outside Iceberg&#8217;s standard metadata format.</p><p>BigQuery already does this: see section 3.3 in (<a href="https://research.google/pubs/biglake-bigquerys-evolution-toward-a-multi-cloud-lakehouse/">Levandoski et al., 2024</a>), and <a href="https://research.google/pubs/biglake-bigquerys-evolution-toward-a-multi-cloud-lakehouse/">Snowflake does this, too</a>. 
These and other vendors will make their &#8220;accelerated Iceberg&#8221; as functional and efficient as possible, and the metadata and indexes that they propagate to the &#8220;true&#8221; Iceberg as minimal as possible, to maximise the value added by their platform and minimise the chances that their clients switch to another vendor, while keeping the right to say that they &#8220;use Iceberg&#8221; to calm customers&#8217; concerns about vendor lock-in.</p><p>I don&#8217;t imply that such vendors&#8217; behaviour is wrong: all power to data warehouse vendors to build their competitive advantage!</p><p>However, in this scenario, it&#8217;s pointlessly wasteful that when users access these Iceberg layers from alternative processing engines and runtimes, bypassing the data warehouse&#8217;s native processing layers, they will not leverage the caches, indexes, and extra metadata that the data warehouse maintains anyway.</p><p>This arrangement is also not advantageous to data warehouse vendors themselves. First, they spend resources to sync their internal table metadata store with austere Iceberg metadata. Second, when users query Iceberg directly instead of using the data warehouse&#8217;s processing layer, the vendor &#8220;loses&#8221; some processing to another vendor or system, whereas if that processing were done on the vendor&#8217;s compute, the vendor would have charged the user for it.</p><p>Table transfer protocols (as well as Arrow Flight) enable interoperability between databases (data warehouses) and diverse processing engines <strong>while not inhibiting the healthy competition among the database vendors</strong>: they can use accelerated hardware setups and innovate on the approaches to data partitioning, indexing, and metadata storage while keeping their innovations proprietary if they choose to.</p><p>Databases will also retain as much processing compute as they deserve. 
The client can request a &#8220;thicker&#8221; or &#8220;thinner&#8221; <a href="https://engineeringideas.substack.com/i/148153268/table-processing-plan">table processing plan</a> depending on how much the vendor will charge for these different plans, traded off with the cloud provider&#8217;s egress costs (if any) of transferring smaller or larger results of the respective table processing plans, and the cost of performing the remaining processing on the client side. Similar calculations can be done for the end-to-end latency of the processing job.</p><h3>The lock-in question</h3><p>Another advantage of Iceberg and other open table formats I haven&#8217;t mentioned yet is that they provide an ironclad guarantee that the data team can <strong>easily move all data to another data warehouse vendor or different cloud provider</strong>. This issue is periodically raised with OLAP and time-series databases that don&#8217;t use Parquet or ORC as their data partition file formats, such as Druid and Pinot (see <a href="https://github.com/apache/pinot/issues/12315">example</a>).</p><p>If the database implements Table Read protocol, it needs to implement very little extra logic (just the <code>REPLICATION_INIT</code> and <code>REPLICATION_STEP</code> RPCs: see details <a href="https://engineeringideas.substack.com/i/148485754/table-replication-method">here</a>) to support export/replication. Also, the database doesn&#8217;t need to run anything in its control plane to monitor the replication job.</p><p>Considering the above, it would be hard for database vendors to justify not providing this functionality to their customers: it would appear as straightforwardly anti-competitive behaviour on their part. 
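</p><p>To make the &#8220;very little extra logic&#8221; claim concrete, here is a minimal, illustrative sketch of such an export/replication loop. The <code>REPLICATION_INIT</code>/<code>REPLICATION_STEP</code> RPC names come from the article, but everything else (class and method names, payload shapes, in-memory &#8220;partitions&#8221;) is hypothetical: the point is only that the client drives the job by tracking offsets, so the source database keeps no replication state in its control plane.</p>

```python
# Hypothetical sketch of a client-driven export loop over the
# REPLICATION_INIT / REPLICATION_STEP RPCs mentioned above.
# All names and payload shapes are illustrative, not a real API.

class ReplicationSource:
    """Stands in for a database serving replication RPCs."""

    def __init__(self, partitions):
        # partition id -> list of row batches
        self._partitions = partitions

    def replication_init(self, table):
        # Return a snapshot descriptor: partition ids with starting offsets.
        return {pid: 0 for pid in self._partitions}

    def replication_step(self, pid, offset, max_batches=1):
        # Return the next batches and the new offset. The *client* tracks
        # offsets, so the source keeps no per-job state in its control plane.
        batches = self._partitions[pid][offset:offset + max_batches]
        return batches, offset + len(batches)


def replicate(source, table):
    """Drive a full export; resumable from `offsets` after a client crash."""
    offsets = source.replication_init(table)
    replica = []
    progress = True
    while progress:
        progress = False
        for pid in offsets:
            batches, new_offset = source.replication_step(pid, offsets[pid])
            if batches:
                replica.extend(batches)
                offsets[pid] = new_offset
                progress = True
    return replica
```

<p>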
For open-source databases, it also wouldn&#8217;t be hard for the community to implement and maintain replication RPCs for the database.</p><p>Finally, <strong>many OLAP databases already provide export to Parquet and Iceberg even though they don&#8217;t use them as their native table storage format</strong>. These databases include BigQuery, ClickHouse, Databend, Doris, StarRocks, SingleStore, Snowflake, and others. When choosing these databases, data teams can be sure they <em>can</em> export their data if they decide to, while table transfer protocols would enable them to enjoy the benefits of interoperability with different processing engines without paying Iceberg&#8217;s overhead.</p><h2>Conclusion</h2><p>I buy into the composable data systems vision (<a href="https://www.vldb.org/pvldb/vol16/p2679-pedreira.pdf">Pedreira et al., 2023</a>; Voltron&#8217;s &#8220;<a href="https://voltrondata.com/codex/a-new-frontier">Composable Codex</a>&#8221;). However, I don&#8217;t think Apache Iceberg should be the centrepiece in the future composable data stacks for OLAP, search, and AI workloads.</p><p><a href="https://arrow.apache.org/docs/format/FlightSql.html">Arrow Flight</a> protocol is closer to enabling universal database and processing engine composability. However, lower-level protocols would enable the composed data systems to achieve resilience, better performance, load balancing, consistency, and other nice properties more effectively than with Arrow Flight. 
I&#8217;ve called such lower-level protocols <strong>table transfer protocols</strong> and drafted their design in this article series.</p><h3>Which vendors may be interested in table transfer protocols</h3><p><strong>Specialised and challenger database vendors who want to innovate on the storage formats and the ingestion architecture</strong> (e.g., for hybrid transactional and analytical processing, HTAP), and who realise that they <strong>cannot win the entire processing stack</strong> and therefore want to open up for diverse processing engines. I think good examples of such databases might be CedarDB, CnosDB, Druid, GreptimeDB, Hopsworks, InfluxDB, LanceDB, OpenObserve, TimescaleDB, QuestDB, Quickwit, and possibly<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Pinot, ClickHouse, Databend, Doris, Firebolt, Oxla, SingleStore, StarRocks, and others.</p><p><strong>Specialised and challenger processing engine vendors</strong> who are currently at a disadvantage to established technologies (Spark, Flink, and Trino) in the <strong>number of &#8220;source&#8221; and &#8220;sink&#8221; integrations</strong>. I think good examples of such processing engines might be Bodo, DataFusion, DuckDB/MotherDuck, Theseus, and others.</p><p><strong>Accelerated and specialised hardware platform vendors</strong> who want the databases and processing engines to utilise their hardware (compute, networking, and storage) most effectively via the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol for columnar data</a>. Cf. 
Nvidia&#8217;s <a href="https://rapids.ai/">Rapids</a> (including <a href="https://github.com/rapidsai/cudf">cuDF</a>) and Intel&#8217;s oneAPI Data Analytics Library (<a href="https://github.com/oneapi-src/oneDAL">oneDAL</a>).</p><p><strong>High-level data systems</strong>: ETL/ELT orchestrators and schedulers, change data capture, data movement, and replication tools, semantic layers, <a href="https://github.com/rilldata/rill">BI and operational analytics interfaces</a>, <a href="https://engineeringideas.substack.com/p/the-open-source-stack-for-decision">decision intelligence</a> and causal inference algorithms, data science apps and frameworks, feature stores, ML training frameworks, ML inference apps and platforms, data catalogs and governance, and others. High-level data systems can interface with databases and coordinate with the processing engines more effectively with the help of table transfer protocols than with Arrow Flight, and provide better end-to-end consistency and <em>exactly once</em> delivery guarantees.</p><h3>Project status and what&#8217;s next</h3><p>The work on this article series was sponsored by&nbsp;<a href="https://rilldata.com/">Rill Data</a>. Rill Data is interested in developing the open data ecosystem rather than promoting any specific database solution.&nbsp;<a href="https://github.com/rilldata/rill">Rill&#8217;s technology</a>&nbsp;is compatible with various databases mentioned in this article series, including ClickHouse, Apache Druid, and DuckDB.</p><p>I&#8217;m interested in starting a working group to develop table transfer protocols, perhaps within the Apache Arrow project. However, this also depends on interest from the vendors of databases, processing engines, and high-level data tools. 
If you represent some vendor who might be interested in helping with developing table transfer protocols and supporting libraries, please drop me a line at <a href="mailto:leventov@apache.org">leventov@apache.org</a>.</p><h2>Appendix: <strong>A word about &#8220;big&#8221; vs. &#8220;small&#8221; data</strong></h2><p><em>(Note: this section was originally published in <a href="https://engineeringideas.substack.com/p/table-transfer-protocols-for-universal">another article</a> in this series.)</em></p><p>Jordan Tigani recently demonstrated that&nbsp;<a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/">the vast majority of teams actually don&#8217;t have &#8220;Big&#8221; data</a>, and those who do rarely query it on a large scale, anyway. From my perspective, there is nothing to argue about here; I think Tigani is correct.</p><p>However, I design table transfer protocols specifically to address possible performance and efficiency limitations of Iceberg, even though most data teams will likely never encounter these limitations, or may save only pennies on their small workloads. Isn&#8217;t there a contradiction here?</p><p>My &#8220;theory of impact&#8221; is <em>not</em> that table transfer protocols will save most data teams that much money, or most use cases that much query latency. Rather,&nbsp;<strong>the performance and efficiency of table transfer protocols should give data teams the confidence to choose OLAP databases with native table storage formats instead of open table formats like Iceberg.</strong></p><p>Data teams may think that data and query volumes will increase in the future, and that they therefore need a &#8220;grown-up&#8221;, Big Data solution that is guaranteed to scale. Many of these teams end up being wrong, as Jordan Tigani finds.</p><p>My response to this is not to try to argue these data engineers out of their affinity for &#8220;big data&#8221;, object storage-based solutions. 
I think individual data teams are often right when they choose the data lakehouse architecture:&nbsp;they rationally hedge against their own risk, at the cost of industry-wide inefficiency. <strong>Table transfer protocols should help data teams hedge against this &#8220;scaling risk&#8221;, thus enabling them to pick more efficient solutions from the beginning.</strong></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Sometimes, a commercial cloud-based tier of an open-source database is essentially a different system because it can more easily leverage the same primitives that data warehouses from public clouds use.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>However, the databases in this group seem to have ambitions to own the processing stack more thoroughly.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Table Write protocol for interop between diverse OLAP databases and processing engines]]></title><description><![CDATA[In the previous article, I argued that there is an opportunity for creating a new family of protocols: table transfer protocols, namely Table Read and Table Write protocols to address the M x N interoperability problem between OLAP, timeseries, vector, and search databases]]></description><link>https://engineeringideas.substack.com/p/table-write-protocol-for-interop</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/table-write-protocol-for-interop</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Wed, 04 Sep 2024 13:20:21 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the <a href="https://engineeringideas.substack.com/p/table-transfer-protocols-for-universal">previous article</a>, I argued that there is an opportunity for creating a new family of protocols: <em><strong>table transfer protocols</strong></em><strong>, namely Table Read and Table Write protocols to address the M x N interoperability problem</strong> between <strong>OLAP, timeseries, vector, and search databases</strong> on the one hand and <strong>data processing/query and ML engines</strong> on the other hand.</p><p>I <a href="https://engineeringideas.substack.com/i/148153268/existing-solutions-are-not-enough">discussed</a> why existing technologies and protocols that tackle this interoperability problem, namely ADBC (Arrow Database Connectivity), BigQuery Storage APIs, open table formats (Apache Iceberg, Apache Hudi, and Delta Lake), and Arrow Flight are insufficient and have different downsides. Arrow Flight comes closest.</p><p>I&#8217;ve already described the <a href="https://engineeringideas.substack.com/i/148153268/table-transfer-protocols">common design principles for table transfer protocols</a> and a more concrete proposal for <a href="https://engineeringideas.substack.com/i/148153268/table-read-protocol-walkthrough">Table Read protocol</a>. 
In this article, <strong>I propose semi-concrete designs of Table Write protocol and a table replication method</strong> based on Table Read and Table Write protocols. In this series&#8217;s <a href="https://engineeringideas.substack.com/p/table-transfer-protocols-improved">next and final article</a>, I will review table transfer protocols again and discuss their place in the <a href="https://materializedview.io/p/databases-are-falling-apart">disassembled database architecture</a> and data stack.</p><h3>Use cases for Table Write protocol</h3><h4>The ETL loop</h4><p>The need for <em>distributed columnar writing (data ingestion)</em> appears when a processing engine has completed a distributed computation job (after pulling data onto multiple worker nodes from some database(s) using Table Read protocol, or from a data lake) and wants to write the results back into a new table in the database, or update existing rows.</p><p>Arrow Flight doesn&#8217;t permit data ingestion from distributed writer nodes.</p><p>Open table formats (Iceberg, Hudi, Delta Lake) permit distributed writes, but if the target database uses these table formats, the read path of Arrow Flight becomes unnecessary: processing engines can read table data directly from the object storage.</p><p>On the other hand, open table formats are inefficient in many OLAP and HTAP use cases, as I argued in &#8220;<a href="https://engineeringideas.substack.com/p/the-future-of-olap-table-storage">The future of OLAP table storage is not Iceberg</a>&#8221;. 
So, it would be sad if the recent diversification of processing engines and ML runtimes pushed more data teams to use Iceberg as their only OLAP storage, unnecessarily leaving efficiency on the table and locking themselves into dependence on cloud object storage.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>Thus, Table Write protocol&#8217;s primary motivational use case is <strong>distributed writing of processing job (or ML inference) results</strong> from the processing engine&#8217;s nodes back into the database.</p><p>This use case enables closing the loop:</p><ol><li><p>The processing engine does a distributed read of input data from the distributed database using Table Read protocol,</p></li><li><p>The engine performs a distributed processing job,</p></li><li><p>The engine writes the results back into the database in parallel from multiple worker nodes or GPUs via Table Write protocol.</p></li></ol><p><strong>In the above loop, columnar table data is not passed through a single node at any point, or possibly even through CPU memory at all</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>.</p><p>Table Write protocol can also be used to ingest data from a single node (a single data transformation/generation process, or a single-node source database). 
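</p><p>The three-step loop above can be sketched as follows. This is an illustrative toy, not the protocol itself: plain functions over in-memory partitions stand in for the Table Read/Write interactions, and a thread pool stands in for the engine&#8217;s distributed workers.</p>

```python
# Toy sketch of the ETL loop described above. table_read/table_write are
# stand-ins for Table Read/Write protocol calls; a thread pool stands in
# for the processing engine's distributed workers.
from concurrent.futures import ThreadPoolExecutor

SOURCE = {"part-0": [1, 2, 3], "part-1": [4, 5], "part-2": [6]}
TARGET = {}

def table_read(pid):
    # Step 1: each worker pulls its partition straight from a database node.
    return SOURCE[pid]

def transform(rows):
    # Step 2: the distributed processing job (here, a trivial map).
    return [r * 10 for r in rows]

def table_write(pid, rows):
    # Step 3: each worker writes its result partition back in parallel.
    TARGET[pid] = rows

def worker(pid):
    # No partition's data ever passes through a single coordinator node.
    table_write(pid, transform(table_read(pid)))

with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(worker, SOURCE))
```

<p>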
Even in this case, Table Write protocol should have advantages over Arrow Flight: <strong>writing different data partitions to multiple (distributed) database nodes</strong>, and the possibility of <strong>achieving resilient writing with </strong><em><strong>exactly once</strong></em><strong> delivery guarantees</strong> with less end-to-end implementation complexity than Arrow Flight would require<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, similar to how <a href="https://engineeringideas.substack.com/i/148153268/table-read-protocol-overview-and-comparison-with-arrow-flight">Table Read protocol enables consistent and resilient reads with less end-to-end implementation complexity than Arrow Flight</a>.</p><h4>Distributed table replication between different database technologies</h4><p>Another target use case for Table Write protocol is <strong>distributed table replication</strong> via a combination of Table Read and Table Write protocols. The main difference from the above use case (&#8220;the ETL loop&#8221;) is that there is no processing engine between the source and target databases.</p><p>Additionally, such a table replication method would enable the most efficient and consistent (a la change data capture) import/export of tables between different databases that use different storage formats, <strong>without intermediary Kafka or Parquet/Iceberg storage</strong>.</p><h3>The best Table Write protocol is Table Read protocol in reverse</h3><p>The primary way to ingest data with <em>exactly once</em> delivery guarantees into a distributed database is to give database nodes access to a shared log of records to write (or commands/operations to execute) and let the nodes atomically commit (or reach a distributed consensus about) the read offsets within the log. 
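</p><p>A minimal sketch of this offset-commit pattern (illustrative only; in a real database, the ingested rows and the offset would be persisted in one atomic commit, or the offset would be agreed upon via distributed consensus):</p>

```python
# Illustrative sketch of exactly-once pull-based ingestion: the node
# commits the consumed offset together with the ingested rows, so after
# a restart it resumes from the last durably applied offset and never
# applies a record twice.

LOG = ["r0", "r1", "r2", "r3", "r4"]  # the shared record log

class IngestingNode:
    def __init__(self):
        # In a real database, both fields would be persisted in a single
        # atomic commit (or agreed upon via distributed consensus).
        self.committed_offset = 0
        self.table = []

    def consume(self, log, crash_at=None):
        offset = self.committed_offset
        while offset < len(log):
            if offset == crash_at:
                return  # simulate a node crash mid-ingestion
            row = log[offset]
            # "Atomic" commit of the row together with the advanced offset:
            self.table = self.table + [row]
            self.committed_offset = offset = offset + 1

node = IngestingNode()
node.consume(LOG, crash_at=2)  # applies r0, r1, then "crashes"
node.consume(LOG)              # restart: resumes from committed offset 2
assert node.table == ["r0", "r1", "r2", "r3", "r4"]  # no loss, no duplicates
```

<p>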
If during ingestion some node crashes or becomes unavailable to the part of the cluster that maintains the consensus, the database&#8217;s cluster manager restarts consumption from the log offset just after the last record that has been durably consumed and processed (as known to the database&#8217;s consensus).</p><p>On the source data ingestion path (i.e., before ETL/ELT), the role of this shared log is usually played by Kafka or similar systems: Kinesis, Google Pub/Sub, Redpanda, WarpStream, <a href="https://transactional.blog/blog/2024-database-startups#_queues">and others</a>. These systems are populated with data from IoT gateways, change data capture agents, log and metric collectors, etc.</p><p>The approach with database-side offset management is recommended in the <code>KafkaConsumer</code> <a href="https://kafka.apache.org/38/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#rebalancecallback">documentation</a> to achieve exactly once delivery guarantees. This approach is also extensively described in <a href="https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime">Apache Pinot&#8217;s design document for partition-level Kafka consumption</a>.</p><p>Apart from exactly once delivery guarantees, this pull-based ingestion approach also permits the database to control node assignment for ingesting each data partition, and to change this assignment at any time. The latter may be needed in the event of database cluster scale-up, scale-down, or data rebalancing that happens concurrently with a long-running data ingestion job.</p><p>With push-based ingestion, this flexibility is only possible if there is a layer of proxy nodes in front of the &#8220;core&#8221; database server nodes, or if the database&#8217;s cluster and data distribution management is fully externalised to Kubernetes (the same cluster that hosts the nodes that do the &#8220;push&#8221;). 
Neither of these is a common OLAP database setup.</p><p>Yet another advantage of pull-based ingestion from a shared log is &#8220;almost free&#8221; replication, as database replica nodes (for a given table, partition, or data segment) can all consume data from the same log, instead of transmitting records between each other after one of the replicas has consumed them. The latter approach (&#8220;internal replication&#8221;) requires extra burdensome background processing, internal buffers, etc. This could be entirely avoided with pull-based ingestion from a shared log.</p><p><a href="https://cloud.google.com/bigquery/docs/write-api-best-practices#manage_stream_offsets_to_achieve_exactly-once_semantics">BigQuery Storage Write APIs</a> offer a concise, push-based ingestion API with exactly once delivery guarantees <em>by providing a simplified version of the Kafka <strong>producer</strong> API</em>. In other words, BigQuery internalises the shared record log mentioned above. If the client pushes data into BigQuery via consumption from <em>another</em> log or a system that offers consistent-at-the-offset reads of records to ingest, such as a Flink or Spark Streaming processing pipeline, BigQuery&#8217;s approach becomes wasteful: in effect, there are two duplicative systems providing the semantics of a log in front of the &#8220;core&#8221; database nodes.</p><p>Per Table Write protocol&#8217;s design use cases, writing sources are specifically processing engines such as Flink or Spark that already provide offset or block-based consumption in their &#8220;sink&#8221; interfaces, or other databases (when Table Write protocol is used for table replication) that could provide consistent-at-the-offset reads via Table Read protocol. 
Therefore, it would be wasteful for Table Write protocol to use BigQuery&#8217;s push-based approach to ingestion.</p><p>It turns out that <strong><a href="https://engineeringideas.substack.com/i/148153268/table-read-protocol-walkthrough">Table Read Protocol</a> already takes care of most aspects that would be nice to have in an efficient pull-based ingestion protocol, namely at-the-offset read consistency, load distribution and resource usage control, fault tolerance, control of transfer encodings for columns, and <a href="https://github.com/apache/arrow/issues/43762">streaming of columnar data</a> (including GPU off-loading).</strong></p><p>This leads to a decision to make Table Write protocol essentially <em>Table Read protocol in reverse</em>: that is, the <strong>source side</strong> (such as a processing or ML engine) in the Table Write protocol interaction (called the <strong>Write job</strong> below) acts as the <em>server side</em> in the &#8220;underlying&#8221; Table Read protocol interaction (aka the <em>Read job</em>) and the <strong>target side</strong> in the Write job acts as the <em>client side</em> in the underlying Read job<em>.</em></p><p>Another benefit of using Table Read protocol in reverse as Table Write protocol is getting a <strong>no-intermediary distributed table replication method &#8220;for free&#8221;</strong>. All databases that implement the server side of Table Read protocol can act as sources in the Write job. See the &#8220;Table replication method&#8221; section below for more details.</p><p>If a processing engine implements both the client and server sides of Table Read protocol for the &#8220;ETL loop&#8221;, it could also use Table Write protocol for <em>inter-stage data exchange</em>. 
<a href="https://github.com/apache/datafusion-ballista">DataFusion Ballista</a> uses Arrow Flight for this purpose.</p><h2>Table Write protocol: details</h2><h3>Node roles</h3><p>I&#8217;ll use the same terms for node roles <a href="https://engineeringideas.substack.com/i/148153268/node-roles">as in Table Read protocol</a>.</p><p>The Write job&#8217;s orchestrator/controller node role on the <strong>source side</strong> (such as a processing engine) is still called <strong>Coordinator</strong>.</p><p>The orchestrator/controller node role on the <strong>target side</strong> (i.e., the database the data is written into) is called <strong>Agent</strong>.</p><p>The nodes on the source side that hold the data to be written and will send it to the target are called <strong>Servers</strong>.</p><p>The target side&#8217;s nodes that receive and store data are called <strong>Clients</strong>.</p><p>If either the source or target side of the Write job is a single-process dataframe library runtime, an embedded database, or a single-node database, the above node roles are played by threads or coroutines within that process.</p><h3>Out-of-scope: <strong>transaction semantics</strong></h3><p>As I <a href="https://engineeringideas.substack.com/i/148153268/cross-cutting-and-out-of-scope-concerns">described</a> in the previous post, table transfer protocols are &#8220;headless&#8221;. This means that table transfer protocols don&#8217;t define transaction semantics for ACID guarantees.</p><p>Transactions are usually managed by the target database<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. 
Some OLAP databases use the standard SQL transaction protocol (BEGIN&#8230;COMMIT, or implicit autocommit of SQL statements) and others have different ways to organise ingestion: for example, Apache Druid has &#8220;<a href="https://druid.apache.org/docs/latest/ingestion/tasks">indexing/ingestion tasks</a>&#8221;. Also, many OLAP and timeseries databases manage continuously running services to ingest data from Kafka or other perpetual data sources.</p><p>Thus, Write jobs are supposed to be embedded in either:</p><ul><li><p><em>Transaction protocol interactions,</em> which are jointly managed by the <strong>transaction manager</strong> and the <strong>transaction client</strong> (roles) on the target and source sides, respectively. An SQL console is a typical example of a transaction client.</p></li><li><p><em>Data ingestion tasks/jobs/streams,</em> which are jointly managed by the <strong>ingestion manager</strong> and (optionally) the <strong>ingestion client</strong> (roles) on the target and source sides, respectively. Here are some examples of ingestion clients: a process with a SparkContext or a SparkSession, an ETL/ELT tool, a table replication system, a feature store, or an ML/MLOps platform.</p></li></ul><p>Transaction and ingestion clients should interact with the target database directly before starting the Write job, and then &#8220;commit&#8221; the results after the Write job is completed if the database requires explicit transaction commits or completion signals for data ingestion jobs.</p><h3><strong>Write job initiation</strong></h3><p><code>Transaction Client or Agent &#8594; Coordinator: READ_FOR_WRITE<br>&nbsp;&nbsp;Agent location, job context, table processing plan,<br>&nbsp;&nbsp;partitioning hints</code></p><p><code>READ_FOR_WRITE</code> is an optional step in the Write job. 
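For concreteness, the READ_FOR_WRITE payload can be sketched as a plain data structure (a hypothetical Python rendering; the field types are illustrative, since the protocol prescribes only the fields, not a wire format):

```python
from dataclasses import dataclass, field

# Hypothetical rendering of the READ_FOR_WRITE payload. The protocol
# itself does not prescribe a serialisation, only the fields.
@dataclass
class ReadForWrite:
    agent_location: str           # where Coordinator should send WRITE_INIT
    job_context: dict             # target table, tx/ingestion task ID, "root" Write job ID, ...
    table_processing_plan: bytes  # e.g., a serialised Substrait plan
    partitioning_hints: dict = field(default_factory=dict)

req = ReadForWrite(
    agent_location="grpc://agent.example:8815",
    job_context={"table": "events", "tx_id": 42},
    table_processing_plan=b"<substrait plan>",
)
```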
Upon receiving this request, Coordinator behaves the same as when it prepares the <a href="https://engineeringideas.substack.com/i/148153268/the-initial-request-and-coordinators-logic">step #2 response in Table Read protocol</a>, except that it sends the response as a <code>WRITE_INIT</code> request to the given Agent location and with the given job context.</p><p>A <code>READ_FOR_WRITE</code> request may be sent by Agent itself in a replication job (see &#8220;Table replication method&#8221; below for details), or when Agent retries to obtain Server locations for temporarily unavailable partitions. In this case, if Agent doesn&#8217;t know its own public address, it may use the <code>reuse-connection://</code><a href="https://arrow.apache.org/docs/format/Flight.html#connection-reuse"> schema convention</a> from Arrow Flight.</p><p><code>Coordinator &#8594; Agent: WRITE_INIT<br>&nbsp;&nbsp;job context, write operation, field names,<br>&nbsp;&nbsp;[ partition: (filter, relation,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[ location: (URI, serving hints), ]), ]<br>Response:<br>&nbsp;&nbsp;Write job ID</code></p><p>The <em>job context</em> includes all information needed to anchor this Write job on the target side: the target table name, the transaction ID, the ingestion task/job/stream ID, the ID of the &#8220;root&#8221; Write job to track completions across the whole job tree (see the section &#8220;Handling Server restarts&#8221; below), etc.</p><p>Coordinator receives the context from the transaction or ingestion client, or from Agent itself via a <code>READ_FOR_WRITE</code> message, or by other means.</p><p>When Agent receives a <code>WRITE_INIT</code> request, it immediately creates an ID for this new Write job and sends it back to Coordinator before doing its other logic, described in the &#8220;Agent&#8217;s logic&#8221; section below.</p><h4><strong>Record writing (DML) and output semantics</strong></h4><p>The write operation (a 
Substrait&#8217;s <a href="https://github.com/substrait-io/substrait/blob/bc1b93f/proto/substrait/algebra.proto#L577">WriteOp</a> + extra fields for upsert behaviour definition, such as the default values for unspecified fields) determines the semantics for writing the records in this Write job: whether they are inserted or upserted into the target table, or update existing rows, or delete existing rows.</p><p>However, <code>WRITE_OP_CTAS</code> (Create Table As Select) write operation is not supported: table creation semantics are out of scope for Table Write protocol. The transaction client should create the table separately before letting Coordinator initialise Write job(s) via <code>WRITE_INIT</code> requests.</p><p>On the other hand, there is a type of write operation that is not currently codified in Substrait but would be useful to implement for Table Write protocol, in particular for the cases when it is used for table replication: the record writing semantics are determined by some column in the records themselves. This would be useful when the source database in a table replication job natively uses &#8220;tombstone&#8221; bitmaps to indicate deleted rows in data segments<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>Note: Substrait <a href="https://github.com/substrait-io/substrait/blob/bc1b93f/proto/substrait/algebra.proto#L554">WriteRel</a>&#8217;s <code>OUTPUT_MODE_MODIFIED_RECORDS</code> is not supported in Table Write protocol because it doesn&#8217;t seem there is any use case for it. Even returning just the number of modified rows may not be possible in databases that use the &#8220;write first, compact/deduplicate rows later&#8221; approach to upserts, updates, and deletes. 
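Returning to the per-record write-op column proposed above, here is a toy sketch of how a target node could apply such records (in Python; the "__op" column name and marker values are assumptions, not part of Substrait or the protocol):

```python
# Toy sketch: records carry their own write semantics in a special column
# (hypothetically named "__op"), e.g., derived from the source database's
# tombstone bitmaps. A target node applies them to its keyed table state.
def apply_records(table: dict, records: list, key: str = "id") -> dict:
    for rec in records:
        op = rec.pop("__op")
        if op == "delete":
            table.pop(rec[key], None)   # tombstone: drop the row if present
        elif op == "upsert":
            table[rec[key]] = rec       # insert or overwrite by key
        else:
            raise ValueError(f"unknown write op: {op}")
    return table

state = {1: {"id": 1, "v": "old"}}
apply_records(state, [
    {"__op": "upsert", "id": 1, "v": "new"},  # overwrites row 1
    {"__op": "upsert", "id": 2, "v": "x"},    # inserts row 2
    {"__op": "delete", "id": 3},              # deletes row 3 (absent: no-op)
])
```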
The number of modified rows can be requested along with other statistics of the Write job(s) from the transaction or ingestion manager, via some protocol specific to the target database.</p><h4><strong>Partitions</strong></h4><p>The list of <em>partitions</em> in the <code>WRITE_INIT</code> request is the equivalent of <a href="https://github.com/substrait-io/substrait/blob/bc1b93f/proto/substrait/algebra.proto#L554">WriteRel</a>&#8217;s <code>input</code> relation: the latter would be the union of partitions&#8217; <em>relations</em>. At the same time, this list of partitions is a simplified version of Coordinator&#8217;s <a href="https://engineeringideas.substack.com/i/148153268/the-initial-request-and-coordinators-logic">step #2</a> response to Agent in Table Read protocol, except that <em>slices</em> are omitted (it could be assumed that there is a slice wrapping each partition).</p><p>The partition is defined by its <em>filter</em> (a <a href="http://substrait.io/">Substrait</a> expression), in the same way as in Table Read protocol. See more on partition semantics <a href="https://engineeringideas.substack.com/i/148153268/slices-and-partitions">in Table Read protocol description</a>.</p><p>Each partition&#8217;s <em>relation</em> (a Substrait&#8217;s <a href="https://github.com/substrait-io/substrait/blob/bc1b93f/proto/substrait/algebra.proto#L462">Rel</a>) defines the <em>physical</em> schema of the written records in the partition, as they will appear on the wire between the source&#8217;s Servers and the target&#8217;s Clients. Relations are specific for each partition because there may be different fields in different partitions, for example, if the source side is another database in a table replication, and the schema of the table in the source database has been changed. 
The same field may also have different types in different partitions as long as the target database can upcast all these types to the column type in the target table.</p><p>The relation shouldn&#8217;t include the partition&#8217;s filter. If all partitions in the Write job share the same relation, it&#8217;s possible to send it only once and omit it from each partition structure. <em>Field names</em> are always extracted in this way, as they should be shared between partitions: that&#8217;s why the partition&#8217;s <code>relation</code> field is a Substrait <code>Rel</code> rather than a <a href="https://github.com/substrait-io/substrait/blob/bc1b93f/proto/substrait/algebra.proto#L454">RelRoot</a>.</p><p>When the source side is a processing engine, the <em>locations</em> list of each partition should typically have just one element, i.e., the location of the processing engine&#8217;s worker node that holds the processing results. However, if the source side of the Write job is a database (in the course of a table replication job), partitions may have multiple locations, which can enable better load distribution on the source side and higher overall speed of the table replication job.</p><p>The location list for a partition may also be empty in response to a <code>READ_FOR_WRITE</code> request. This indicates that the partition is temporarily unavailable. Agent should handle this in the same way <a href="https://engineeringideas.substack.com/i/148153268/table-data-availability">as in Table Read protocol</a>: periodically repeat <code>READ_FOR_WRITE</code> requests with the original table processing plan specialised to the unavailable partition&#8217;s filter and with the parent Write job&#8217;s ID as the &#8220;root&#8221; Write job ID. 
See also the section &#8220;Handling Server restarts&#8221; below.</p><h3>Multiple Write jobs within a single transaction or ingestion task/job/stream</h3><p>Coordinator can initiate multiple Write jobs within a transaction or an ingestion task/job/stream: Coordinator can send multiple <code>WRITE_INIT</code> requests to Agent with different lists of partitions. These Write jobs <em>may</em> overlap in time.</p><p>Writing from a perpetual data source, such as a stream processing engine like Flink or hybrids of databases with stream processing systems like Materialize, RisingWave, <a href="https://transactional.blog/blog/2024-database-startups#_stateful_streaming">and others</a>, can be implemented as a <em>series</em> of Write jobs within a single ingestion task/job/stream. Yet, every Write job has a limited number of partitions, and every partition&#8217;s filter is &#8220;closed&#8221;, such as <code>time BETWEEN $X and $Y</code> rather than <code>time &gt; $X</code>.</p><h3>Agent&#8217;s logic</h3><p>Upon receiving the <code>WRITE_INIT</code> request from Coordinator, Agent, Coordinator, Clients, and Servers generally behave <a href="https://engineeringideas.substack.com/i/148153268/agents-logic">according to Table Read protocol, starting from step #3</a>.</p><p>Of course, in determining which target database&#8217;s node should act as Client for each partition, Agent should consider the target side&#8217;s existing data segment placement on the database nodes, data distribution rules for the target table, and the current load of database nodes. 
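As a toy illustration of that selection (hypothetical scoring; a real database would use its own placement and load metrics):

```python
# Toy Client selection for a partition: prefer a node that already owns
# segments of the partition's key range, break ties by current load.
def pick_client(partition_key, nodes):
    # nodes: list of (node_id, owned_partition_keys, load) tuples
    def score(node):
        node_id, owned, load = node
        return (partition_key not in owned, load)  # owners first, then least loaded
    return min(nodes, key=score)[0]

nodes = [
    ("db-1", {"p0", "p1"}, 0.9),  # owns p0 and p1, heavily loaded
    ("db-2", {"p1"}, 0.2),        # owns p1, lightly loaded
    ("db-3", set(), 0.1),         # owns nothing, idle
]
pick_client("p1", nodes)  # "db-2": owns p1 and is less loaded than db-1
```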
In other words, the difference from the &#8220;standard&#8221; client-side load distribution logic <a href="https://engineeringideas.substack.com/i/148153268/agents-logic">as described for Table Read protocol</a> is that the target side of the Write job (i.e., the client side of the underlying Read job) may be stateful if database nodes themselves are stateful.</p><p>If the source side&#8217;s VPC rules prohibit inbound connections, Agent can send to Coordinator the Client locations that will consume data from the given partition locations. The source side&#8217;s Servers <em>must</em> establish connections with the corresponding Clients to enable the transfer of data from Servers:</p><p><code>Agent &#8594; Coordinator: WRITE_PRE_CONNECT_CLIENTS<br>&nbsp;&nbsp;[ (Server location, Client locations), ]<br>Coordinator &#8594; Server(s): WRITE_PRE_CONNECT_CLIENTS<br>&nbsp;&nbsp;Client locations<br>Server &#8594; Client(s): WRITE_PRE_CONNECT</code></p><p>Note that the connection from Coordinator to Agent is already established via the <code>WRITE_INIT</code> request, which makes the <code>WRITE_PRE_CONNECT_CLIENTS</code> request possible.</p><p>If Clients get swapped in the course of the Write job, Agent can send <code>WRITE_PRE_CONNECT_CLIENTS</code> requests repeatedly to update the lists of Client locations.</p><h3>Handling Write jobs larger than source Servers&#8217; memory</h3><p>If the source side&#8217;s Servers are transferring more data in this Write job than they can hold in memory at once (or the processing engine prefers to reserve memory for other concurrent jobs), Coordinator <em>must</em> set Servers&#8217; <a href="https://engineeringideas.substack.com/i/148153268/partition-locations">serving parallelism limit</a> to one for these Servers in the <code>WRITE_INIT</code> request.</p><p>After that, Servers <em>must</em> break down large partitions into multiple <a 
href="https://engineeringideas.substack.com/i/148153268/column-sections-and-transfer-encodings">sections</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> as part of the <em>consumer info</em> that each Server sends to Client at <a href="https://engineeringideas.substack.com/i/148153268/clients-logic">step #5</a> of the underlying Read job, even though there may be no difference in the offered per-column transfer encodings between the sections (which is the original &#8220;justification&#8221; for the concept of sections in Table Read protocol).</p><p>To permit multiple replica nodes of the target database to ingest the same partition concurrently as Clients, Servers must also track serving parallelism and enforce the limit per Client, rather than in total across Clients, and Agent must send partition consumption tasks to Clients at <a href="https://engineeringideas.substack.com/i/148153268/agents-logic">step #3</a> and synchronise their completion so that the (typically two) replica Clients that consume the same partition can take advantage of Server&#8217;s per-Client tracking of the serving parallelism limit.</p><p>By doing what is described above, Coordinator, Agent, and Servers cooperate to execute <a href="https://engineeringideas.substack.com/i/148153268/streaming-column-data-from-server-to-client">sequential consumption</a> of the partitions and sections by Clients. Then, the source side can leverage <code>WRITE_PROGRESS</code> messages to dismiss successfully ingested partitions (and sections within partitions) promptly. 
This enables Write jobs with the total size of written records larger than the size of memory allocated to these jobs on Servers through which the records flow.</p><h3>Write job progress</h3><p>After a Client successfully consumes a data section from a Server (i.e., a <a href="https://github.com/apache/arrow/issues/43762">columnar data streaming protocol</a> interaction for the section ends with a <code>COMPLETE</code> signal) and successfully persists this data to its disk or object storage, Client writes the number of consumed and persisted rows to the target side&#8217;s metadata store, or, perhaps, an ephemeral metadata &#8220;substore&#8221; specific for the Write job and tied to this Write job&#8217;s parent transaction or ingestion task/job/stream lifetime, as supported by <a href="https://streamnative.io/blog/introducing-oxia-scalable-metadata-and-coordination">Oxia</a>.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p><strong>Then, Agent executes the following logic concurrently with <a href="https://engineeringideas.substack.com/i/148153268/agents-logic">its Read job logic</a>:</strong></p><pre><code>ingested_row_counts =
  new map&lt;partition, map&lt;consumption_token, integer&gt;&gt;
min_ingested_row_counts = new map&lt;partition, integer&gt;

async for (partition, consumption_token, ingested_row_count) updates
  in metadata_store:
    ingested_row_counts[partition][consumption_token] =
      ingested_row_count
    if size(ingested_row_counts[partition]) &lt; replication_factor:
        // Not all replicas have reported any progress for the partition
        //   yet, skip
        continue
    min_count = min_ingested_row_counts[partition]
    new_count = min_value(ingested_row_counts[partition])
    if new_count &gt; min_count:
        min_ingested_row_counts[partition] = new_count
        <strong>Agent &#8594; Coordinator: WRITE_PROGRESS
          partition, new_count</strong>
</code></pre><p>Agent listens to the metadata store. When it sees that all replica Clients, identified by their <a href="https://engineeringideas.substack.com/i/148153268/preemption-of-partition-consumption">consumption tokens</a> (note that the metadata store schema is internal to the target side, so this ID need not be equal to the consumption token as it appears in Table Read protocol, although that is a natural choice), have consumed N rows within the partition, and N is higher than the previous minimum for that partition, Agent sends a <code>WRITE_PROGRESS</code> message to Coordinator.</p><p>There is more than one replica Client (usually, two) if the target database wants to piggy-back Table Write protocol to get &#8220;almost free&#8221; replication, as I mentioned above in the section &#8220;The best Table Write protocol is Table Read protocol in reverse&#8221;, and as illustrated in Apache Pinot&#8217;s documentation <a href="https://cwiki.apache.org/confluence/display/PINOT/Consuming+and+Indexing+rows+in+Realtime#ConsumingandIndexingrowsinRealtime-DesignofLowLevel(PartitionLevel)consumptioninPinot">here</a>.</p><p>When Coordinator receives a <code>WRITE_PROGRESS</code> message, it registers the message in its own metadata store (to support high availability and Write job restarts) and forwards it to the relevant Servers. Servers then release the resources (memory, primarily) used to serve the data sections that are now completely ingested, to enable Write jobs larger than source Servers&#8217; memory: see above.</p><h3><strong>Handling Server restarts</strong></h3><p>If one of the source&#8217;s Servers crashes or is restarted, Client sends to Agent a <a href="https://engineeringideas.substack.com/i/148153268/clients-logic">step #6</a> message in Table Read protocol. 
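The pseudocode above translates into runnable form roughly as follows (a synchronous Python sketch; the shape of the metadata-store update stream is an assumption):

```python
# Sketch of Agent's progress logic: report WRITE_PROGRESS for a partition
# only when the *minimum* ingested row count across replica Clients advances.
def track_progress(updates, replication_factor=2):
    ingested = {}    # partition -> {consumption_token: ingested_row_count}
    min_counts = {}  # partition -> last reported minimum
    messages = []    # emitted (partition, new_count) WRITE_PROGRESS messages
    for partition, token, count in updates:
        ingested.setdefault(partition, {})[token] = count
        if len(ingested[partition]) < replication_factor:
            continue  # not all replicas have reported any progress yet
        new_count = min(ingested[partition].values())
        if new_count > min_counts.get(partition, 0):
            min_counts[partition] = new_count
            messages.append((partition, new_count))
    return messages

track_progress([
    ("p0", "client-a", 100),  # single replica so far: nothing reported
    ("p0", "client-b", 80),   # min(100, 80) = 80: report (p0, 80)
    ("p0", "client-a", 120),  # min(120, 80) = 80: no advance
    ("p0", "client-b", 150),  # min(120, 150) = 120: report (p0, 120)
])
```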
Then Agent sends to Coordinator a <code>READ_FOR_WRITE</code> request (equivalent to step #7 in Table Read protocol) where it retransmits the job context that it received from Coordinator in the <code>WRITE_INIT</code> request (adding the current Write job&#8217;s ID as the &#8220;root&#8221; Write job ID if there is none yet in the job context), <code>partition.relation &amp; partition.filter</code> as the table processing plan, and empty partitioning hints.</p><p>Coordinator responds with a new <code>WRITE_INIT</code> request which may have a more fine-grained partition breakdown, for example, if the old Server crashed precisely because the original partitions were too big and didn&#8217;t fit in memory. And even if Coordinator returns a single partition with the same filter as before, the new Server may break this partition into <a href="https://engineeringideas.substack.com/i/148153268/column-sections-and-transfer-encodings">sections</a> differently than the old Server did.</p><p>When a Client reconnects to the new Server and initiates streaming of the new partition&#8217;s sections, it sends a <code>row_offset</code> equal to the already ingested (consumed and persisted) number of rows in <em>the whole original partition</em> at <a href="https://engineeringideas.substack.com/i/148153268/streaming-column-data-from-server-to-client">step #10</a> of the underlying Read job.</p><p>The new Server may be able to figure out very cheaply whether any rows should be transferred for the given section and the given (partition) row offset, or it may need to &#8220;shadow consume&#8221; the section without sending the actual data. 
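The row-offset arithmetic for a re-split partition can be sketched as follows (illustrative; it assumes the Client learns each new partition's row count as it consumes them sequentially):

```python
# Given `ingested` rows already consumed and persisted from the original
# partition, compute the row_offset to send for each new partition after
# the original partition was re-split into ordered partitions of known sizes.
def row_offsets(ingested, new_partition_sizes):
    offsets = []
    for size in new_partition_sizes:
        offsets.append(min(ingested, size))  # rows to skip in this partition
        ingested = max(0, ingested - size)   # remaining rows to skip
    return offsets

# 250 rows were already ingested; the partition is re-split into 100+100+100:
row_offsets(250, [100, 100, 100])  # [100, 100, 50]
```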
If the section needs to be skipped completely, Server responds to the step #10 request from Client with the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol</a>&#8217;s <code>COMPLETE</code> signal immediately.</p><p>If Coordinator broke down the original partition into several smaller partitions with different Server locations, Coordinator <em>must</em> order partitions in its <code>WRITE_INIT</code> message such that they coincide with the original partition&#8217;s row order, and Clients <em>must</em> consume the new partitions sequentially to send the correct <code>row_offset</code> for each of the new partitions (subtracting the number of rows consumed in the preceding partitions).</p><h3>Write job completion</h3><p>Table Write protocol has <em>implicit completion</em>: both Coordinator and Agent <em>must</em> keep track of the partitions in the Write job (as listed in the <code>WRITE_INIT</code> request) whose transfer has been completed, as Agent signals to Coordinator in the <code>WRITE_PROGRESS</code> messages described above. 
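A minimal sketch of this bookkeeping, including child Write jobs spawned for retried partitions (a hypothetical helper; the protocol does not prescribe any particular data structure):

```python
# Implicit-completion bookkeeping, kept symmetrically by Coordinator and
# Agent: a Write job is done once all its partitions are transferred and
# all child Write jobs (spawned for retried partitions) are done themselves.
class JobTracker:
    def __init__(self):
        self.pending = {}   # job_id -> set of unfinished partitions
        self.children = {}  # root job_id -> set of child job_ids

    def start(self, job_id, partitions, root=None):
        self.pending[job_id] = set(partitions)
        if root is not None:
            self.children.setdefault(root, set()).add(job_id)

    def partition_done(self, job_id, partition):
        self.pending[job_id].discard(partition)

    def is_complete(self, job_id):
        kids = self.children.get(job_id, set())
        return not self.pending.get(job_id) and all(
            self.is_complete(k) for k in kids)

t = JobTracker()
t.start("w1", ["p0", "p1"])
t.start("w2", ["p1-retry"], root="w1")  # spawned after a Server restart
t.partition_done("w1", "p0")
t.partition_done("w1", "p1")
t.is_complete("w1")  # False: child job w2 is still pending
t.partition_done("w2", "p1-retry")
t.is_complete("w1")  # True: whole tree complete, resources can be released
```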
Coordinator and Agent must also keep track of the tree of Write jobs as created by <code>WRITE_INIT</code> requests with specified &#8220;root&#8221; Write job IDs in the job context.</p><p>When Coordinator and Agent see that the transfer of all partitions in this Write job is complete, and that all child Write jobs of the &#8220;root&#8221; Write job are complete as well, they both, without any additional communication with each other, release all resources associated with this Write job and signal the Servers and Clients involved in this Write job to do the same.</p><h3><strong>Write job abortion</strong></h3><p>As transaction and ingestion task/job/stream semantics are not defined in Table Write protocol, it doesn&#8217;t need to be concerned with abortion and error messaging.</p><p>The Write job might fail due to a data (consistency) conflict or crashes of the target&#8217;s Clients, or it could be aborted externally due to overload.</p><p>If the Write job abortion originates from actions on the level of Table Write protocol rather than higher levels, Agent registers the abortion signal and error messages with the transaction or ingestion manager rather than with the source side&#8217;s Coordinator.</p><p>It&#8217;s the responsibility of the transaction/ingestion manager to send the termination signal to the transaction/ingestion client, which in turn propagates this termination signal to Coordinator. All these interactions are out of the scope of Table Write protocol: they happen according to the already existing protocols between the target database, the processing engine (or another source, such as another database, in the context of table replication), and the system or process that acts as the transaction or ingestion client.</p><p>Then, Coordinator sends signals to Servers to release all resources associated with the Write job. 
These signals are internal to the source side and thus out of scope for Table Write protocol.</p><p>As per Table Read protocol, Agent and Coordinator must <a href="https://engineeringideas.substack.com/i/148153268/high-availability-ha-of-agent-and-coordinator">monitor each other&#8217;s aliveness</a>. When the counterpart appears unresponsive, they terminate all connections and release resources on their side of the Write job. They should report this to their transaction/ingestion manager or transaction/ingestion client, respectively.</p><h3>Table Write protocol: implementation notes</h3><p>It should be relatively simple for most processing engines to implement the source side of Table Write protocol, i.e., the server side of Table Read protocol. They can leverage their implementations of inter-stage (and inter-node) data exchange.</p><p>If the processing engine already implements the server side of Arrow Flight protocol, as DataFusion does, implementing the server side of Table Read protocol becomes even simpler: Table Read protocol is &#8220;progressive&#8221;, which means an implementation may support only the &#8220;core&#8221; features that are mostly equivalent to Arrow Flight. 
The primary difference would be the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol for columnar data</a> instead of the standard FlightData streaming in <a href="https://arrow.apache.org/docs/format/Flight.html">Arrow Flight RPC</a>, or <a href="https://arrow.apache.org/docs/format/DissociatedIPC.html">Arrow Dissociated IPC</a>.</p><p>On the target side of Table Write protocol, i.e., the client side of Table Read protocol, the databases could reuse much of the heavy lifting they have done to implement pull-based Kafka ingestion, and combine it with the work they have done to implement importing and external querying of data stored in columnar formats, such as Parquet or Arrow.</p><p>However, despite the reuse of concepts and logic from Table Read protocol, the server-side and client-side implementations of Table Read protocol for a database (the latter is necessary for it to act as the target side in Table Write protocol) are completely separate efforts.</p><h2>Table replication method</h2><p>An efficient table replication method <em>across database types and storage formats</em> is possible via a combination of Table Read and Table Write protocols: the source side of the Write job is the database from which the table is replicated, and the target side of the Write job is the database into which the table is replicated.</p><p>The table replication method automatically inherits all the nice properties of Table Read and Table Write protocols: read consistency, resiliency, efficient load distribution, and the controllability of resource usage, all on both sides of the replication process.</p><p>The proposed table replication method consists of a series of Write jobs. I&#8217;ll call <strong>Replicator</strong> the system that acts as the transaction or ingestion <em>client</em> for these Write jobs (see &#8220;Node roles&#8221; section above). 
Replicator&#8217;s logic looks as follows:</p><pre><code>if target_db uses task/job/stream-based ingestion:
    ingestion_task_id =
      ... (Start ingestion task/job/stream in target_db.)

<strong>Replicator &#8594; Coordinator (source_db): REPLICATION_INIT
  processing_plan, replication params
Replicator &#8592; Coordinator: REPLICATION_PHASE
  augmented_processing_plan</strong>

repeat until cancelled (maybe with delay between steps):
    if target_db uses transactions:
        // Note: EXECUTE_COMMAND (and TRANSACTION_RESULT below) are
        //   dummy. Actually, Replicator uses the transaction protocol
        //   specific to the target DB, such as the PostgreSQL wire
        //   protocol that many OLAP databases adopt.
        <strong>Replicator &#8594; Agent (target_db): EXECUTE_COMMAND
          "COPY INTO $target_table
           FROM $source_db WITH $augmented_processing_plan"</strong>
        // Executed on Agent: context = {table=target_table, tx_id=...}
        <strong>Agent &#8594; Coordinator: READ_FOR_WRITE
          <a href="https://arrow.apache.org/docs/format/Flight.html#connection-reuse">reuse-connection://?</a>, context, augmented_processing_plan, ...</strong>
        ... (The entire Write job happens in background.)
        <strong>Replicator &#8592; Agent: TRANSACTION_RESULT
          result</strong> // Success or Failure, after the Write job is complete.
    else: // target_db uses task/job/stream-based ingestion
        context = {task_id=ingestion_task_id}
        <strong>Replicator &#8594; Coordinator: READ_FOR_WRITE
          agent_location, context, augmented_processing_plan, ...</strong>
        ... (The entire Write job happens in background.)
        <strong>Replicator &#8592; Coordinator:
          result</strong> // Success or failure, after the Write job is complete.
    if result == Success:
        <strong>Replicator &#8594; Coordinator: REPLICATION_STEP
          augmented_processing_plan, replication params
        Replicator &#8592; Coordinator: REPLICATION_PHASE
          augmented_processing_plan</strong> // New plan
</code></pre><p><code>REPLICATION_INIT</code> and <code>REPLICATION_STEP</code> requests from Replicator to Coordinator, with their <code>REPLICATION_PHASE</code> responses, do most of the interesting work here.</p><p>Upon receiving a <code>REPLICATION_INIT</code> request, Coordinator augments the given processing plan (it may be a simple column selection and projection from the source table) with extra special predicates that <a href="https://engineeringideas.substack.com/i/148153268/read-consistency">ensure consistent reads</a>. These extra predicates are opaque to Replicator and the target database&#8217;s Agent, and their nature depends on the source database&#8217;s approach to consistent reads. See more details at the link above.</p><p>Upon receiving a <code>REPLICATION_STEP</code> request, Coordinator inverts that extra predicate in the given (previous) augmented processing plan. For example, if that extra predicate has the form <code>segment.snapshot_number &lt;= $N</code>, Coordinator returns a new processing plan augmented with a predicate like <code>segment.snapshot_number &gt; $N and segment.snapshot_number &lt;= $X</code>, where <code>X</code> is the latest segment snapshot number.</p><h3>Index, statistics, and segment metadata replication</h3><p>Table Read protocol permits transferring <a href="https://engineeringideas.substack.com/i/148153268/column-indexes">column indexes</a>, as well as column- and section-level statistics (they could be treated as a kind of index), alongside column data to make replication faster at the cost of higher network traffic.</p><p>When the source and target databases are two separate instances of the same database technology, the source side&#8217;s Servers can also transfer arbitrary internal data segment- or column-level metadata in extension or metadata fields in their <a href="https://engineeringideas.substack.com/i/148153268/clients-logic">step #5</a> messages, or <a 
href="https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata">Arrow&#8217;s metadata</a> as transferred in the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol</a>.</p><p>Note how the approach to metadata transfer in this table replication method differs from that of the open table formats (Iceberg, Hudi, and Delta Lake), which externalise the exact metadata storage format and medium, i.e., object storage. The table replication method based on Table Write protocol offers greater flexibility: the metadata may be stored in raw files on object storage like in Iceberg and Delta Lake, in HBase or another wide column store <a href="https://hudi.apache.org/blog/2023/11/01/record-level-index/">like in Hudi</a>, in ZooKeeper or a Raft-based consensus system with in-memory caching like <a href="https://clickhouse.com/docs/en/guides/sre/keeper/clickhouse-keeper">in ClickHouse</a>, <a href="https://docs.databend.com/guides/overview/editions/dc/architecture">in Databend</a>, Druid, and Pinot, in a distributed key-value store such as Cassandra like in <a href="https://engineering.fb.com/2021/06/21/data-infrastructure/tectonic-file-system/">Facebook&#8217;s Tectonic</a>, in files directly on worker or database nodes alongside the data like in systems built with <a href="https://streamnative.io/blog/introducing-oxia-scalable-metadata-and-coordination">Oxia</a>, in the front-end server&#8217;s key-value store with in-memory caching like <a href="https://doris.apache.org/community/design/metadata-design/">in Doris</a> and StarRocks, in a relational DBMS such as Aurora, AlloyDB, or Cockroach like in <a href="https://quickwit.io/">Quickwit</a>, in a disaggregated columnar storage <a href="https://research.google/pubs/big-metadata-when-metadata-is-big-data/">like in BigQuery</a>, or in hybrid and tiered ways where different parts of metadata are stored in different formats and on different mediums.</p><p>The approaches to storing 
metadata may be completely different in the source and the target databases in the replication job, but neither side needs to concern itself with the metastore architecture and format on the opposite side.</p><h3>Replication of &#8220;delete files&#8221;</h3><p>The source database can emulate replication of data segments with different <em>writing semantics</em> a la Iceberg&#8217;s <a href="https://iceberg.apache.org/spec/#row-level-deletes">delete files</a> with the help of the new write operation type proposed in the section &#8220;Record writing (DML) and output semantics&#8221; above: the writing semantics are determined by a column within the record itself. Note that the source database&#8217;s nodes may transfer the same value in this special column for entire partitions&#8217; sections, so its transmission is almost free with Arrow&#8217;s <a href="https://arrow.apache.org/docs/format/Columnar.html#run-end-encoded-layout">Run-End Encoding</a> for the column (as transferred on the level of the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol for columnar data</a>). 
The alternative design is making write operations specific to each partition in the <code>WRITE_INIT</code> message rather than shared.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>This is because without the read throughput provided by public cloud&#8217;s object storage, such as with on-premises setups of OSS object storage like MinIO where the number of nodes in the object storage layer doesn&#8217;t greatly exceed the number of nodes doing query processing, analytics queries on tables stored in Iceberg, Hudi, or Delta Lake formats would run even slower.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The <a href="https://github.com/apache/arrow/issues/43762">streaming protocol for columnar data</a> proposed for use in Table Read and Table Write protocols permits loading columnar data directly from NVM-based storage to GPU memory and back, using transfer off-loading via <a href="https://github.com/ofiwg/libfabric">libfabric</a> or <a href="https://github.com/openucx/ucx">UCX</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>In fairness, if the database&#8217;s architecture is built around a shared, durable log (like WAL), and natively implements distributed transaction management, implementing <a href="https://arrow.apache.org/docs/format/FlightSql.html#id4">Arrow Flight-style SQL bulk ingestion</a> becomes relatively simple, unlike Table Write protocol, which would require significant development effort and complexity anyway. 
I believe that some HTAP databases, such as Google&#8217;s <a href="https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage">AlloyDB</a> and CedarDB (see <a href="https://vldb.org/pvldb/vol17/p3290-schmidt.pdf">Schmidt et al., 2024</a>), have these features and thus can offer a resilient data ingestion interface with exactly-once delivery via Arrow Flight protocol more easily than via Table Write protocol. However, the vast majority of distributed OLAP databases don&#8217;t have a shared durable log in their architecture, so for them, offering a resilient data ingestion interface with exactly-once delivery guarantees (and without sticking Kafka in between) would be simpler via Table Write protocol.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>The emerging alternative is cross-system transaction managers such as <a href="https://seata.apache.org/">Apache Seata</a>, <a href="https://temporal.io/">Temporal</a>, etc. 
to orchestrate transactions on top of multiple OLAP and other databases.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>I use the term &#8220;data segment&#8221; to disambiguate between physical partitions at the source and logical partitions as defined <a href="https://engineeringideas.substack.com/i/148153268/slices-and-partitions">in Table Read protocol</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>In Table Read protocol, there is an extra abstraction called <em>column group</em>, so, more precisely, sections belong to <a href="https://engineeringideas.substack.com/i/148153268/consumption-structure-column-groups">column groups</a> within partitions, not partitions themselves.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>If the target side <a href="https://engineeringideas.substack.com/i/148153268/high-availability-ha-of-agent-and-coordinator">doesn&#8217;t implement Agent&#8217;s high availability</a>, the metadata &#8220;store&#8221; could be just Agent&#8217;s memory, and Clients &#8220;write into this metadata store&#8221; by sending messages to Agent.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Table transfer protocols for universal access to OLAP databases]]></title><description><![CDATA[Update: for the overview of table transfer protocols, see the following article.]]></description><link>https://engineeringideas.substack.com/p/table-transfer-protocols-for-universal</link><guid 
isPermaLink="false">https://engineeringideas.substack.com/p/table-transfer-protocols-for-universal</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Mon, 26 Aug 2024 18:54:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Update: for the overview of table transfer protocols, see the <a href="https://engineeringideas.substack.com/p/table-transfer-protocols-improved">following article</a>.</em></p><h2>The problem of OLAP data interoperability</h2><p>The field of <strong>OLAP, columnar, time-series, search, and vector databases</strong> is burgeoning. There are a lot of serious projects under active development that deliver state-of-the-art performance in different use cases. These include AlloyDB, BigQuery, CedarDB, ClickHouse, CosmosDB, Databend, Apache Doris, Apache Druid, Firebolt, GreptimeDB, Hopsworks, InfluxDB, LanceDB, OpenObserve, Oxla, Apache Paimon, Apache Pinot, QuestDB, Redshift, RisingWave, SingleStore, Snowflake, StarRocks, Synapse Analytics, TimescaleDB, Quickwit, Vastdata, VictoriaMetrics, and <a href="https://benchmark.clickhouse.com/">many</a> <a href="https://transactional.blog/blog/2024-database-startups#_olap_sql">more</a>.</p><p>Lots of innovation and diversification are also happening on the side of <strong>processing and ML engines</strong>. Apart from the usual suspects: Apache Spark, Apache Flink, Apache Hive, and Trino, there are Azure ML, Bodo, cuDF, Dask, Apache DataFusion, Dremio, DuckDB/MotherDuck, Timeplus Proton, Polars, Ray, SageMaker, Theseus, Velox, Vertex AI, and others.</p><p>Many OLAP databases can also act as processing engines, which is especially useful for cross-database joins. 
BigQuery, Redshift, Azure Synapse, Snowflake, ClickHouse, Databend, Doris, StarRocks, and others can do this.</p><p>The huge diversity on both the database and processing engine sides presents the classical <strong>M</strong> <strong>&#215;</strong> <strong>N interoperability problem</strong>.</p><p>Established processing engines such as Spark and Trino have spent a lot of effort building efficient bespoke integrations with many popular OLAP databases: e.g., see the list of <a href="https://trino.io/docs/current/connector.html">Trino connectors</a>. But even Spark and Trino don&#8217;t cover the &#8220;long tail&#8221; of columnar databases and time-series stores. When the developers of new databases want to make their tables efficiently accessible from many different processing engines, they have to write a lot of custom integrations. Similarly, new processing engines are less appealing if they can efficiently read the data from (and write to) only a few databases.</p><p>In this post, I propose <strong>a solution to this interoperability problem: a family of table transfer protocols</strong>: <strong>Table Read protocol</strong> for querying, <strong><a href="https://engineeringideas.substack.com/p/table-write-protocol-for-interop">Table Write protocol</a></strong> for data ingestion, and a <a href="https://engineeringideas.substack.com/i/148485754/table-replication-method">table replication method</a> on top of Table Read and Write protocols.</p><h2>Existing solutions are not enough</h2><h4><strong>ConnectorX, ADBC, Arrow Dissociated IPC</strong></h4><p><a href="https://github.com/sfu-db/connector-x">ConnectorX</a> and <a href="https://arrow.apache.org/docs/format/ADBC.html">ADBC</a> can take advantage of data partitioning and distribution on the database side, but not the processing engine&#8217;s side, that is, when the processing engine consists of multiple nodes. 
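</p><p>For instance, ConnectorX&#8217;s partitioned read splits one query into per-range sub-queries that it issues against the database in parallel, roughly as in this simplified sketch (not ConnectorX&#8217;s actual code; the table and column names are made up):</p>

```python
# Simplified sketch of database-side partitioned reads a la ConnectorX's
# read_sql(..., partition_on=..., partition_num=...): split the value range
# of a numeric column into N sub-ranges, one sub-query per range.

def split_query(query, partition_on, lo, hi, partition_num):
    step = -(-(hi - lo + 1) // partition_num)  # ceiling division
    subqueries = []
    for start in range(lo, hi + 1, step):
        end = min(start + step, hi + 1)
        subqueries.append(
            f"SELECT * FROM ({query}) t "
            f"WHERE {partition_on} >= {start} AND {partition_on} < {end}"
        )
    return subqueries

parts = split_query("SELECT * FROM lineitem", "l_orderkey", 0, 999_999, 4)
assert len(parts) == 4
assert parts[0].endswith("l_orderkey >= 0 AND l_orderkey < 250000")
```

<p>Each sub-query can saturate a separate database connection, but all the results still land in a single client process.</p><p>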
So, they work well for single-process dataframe engines, such as Pandas and Polars, but not for distributed processing in Spark, Trino, etc.</p><p>Arrow <a href="https://arrow.apache.org/docs/format/DissociatedIPC.html">Dissociated IPC</a> is a point-to-point transfer protocol that can &#8220;accelerate&#8221; ADBC but doesn&#8217;t add client-side distribution.</p><p>Using ADBC with Arrow Flight <em>does</em> provide processing engine-side distribution semantics. However, Arrow Flight has other shortcomings that I discuss in the section &#8220;Arrow Flight protocol&#8221; below.</p><h4><strong>Open table formats: Iceberg, Hudi, Delta Lake</strong></h4><p>Another solution that has gained popularity recently is making the OLAP databases store data in an <strong>object storage-based table catalog, based on the Apache Iceberg, Apache Hudi, or Delta Lake table formats</strong>. Then, processing engines can read the table data from the object storage (or distributed file system) in these standardised table formats.</p><p>However, as I explained in the <a href="https://engineeringideas.substack.com/p/the-future-of-olap-table-storage">previous article</a>, object storage-based table format design fundamentally limits the performance and efficiency of queries. This is especially relevant in OLAP use cases with high query volumes, the need for low query latency, or real-time data updates.</p><p>Iceberg, Hudi, and Delta Lake also restrict the data partition file formats to Apache Parquet, ORC, or Avro. 
<a href="https://engineeringideas.substack.com/i/147163081/object-storage-based-table-formats-miss-the-innovations-in-file-formats-for-data-partitions">This stifles innovation in columnar data file layouts</a>.</p><h4><strong>BigQuery Storage APIs</strong></h4><p><a href="https://cloud.google.com/bigquery/docs/reference/storage/">BigQuery Storage APIs</a> (Read API and Write API) have been designed by Google exactly for interoperability between multiple underlying table storage formats, and thus they abstract from the storage details. Storage access efficiency and scalability were also essential design criteria<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> for these APIs.</p><p>Here&#8217;s a nice picture by Google showing BigQuery (BigLake) Storage APIs&#8217; role as the interoperability layer:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eBEl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eBEl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png 424w, https://substackcdn.com/image/fetch/$s_!eBEl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png 848w, 
https://substackcdn.com/image/fetch/$s_!eBEl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!eBEl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eBEl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png" width="1456" height="773" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:773,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:288716,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eBEl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png 424w, https://substackcdn.com/image/fetch/$s_!eBEl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png 848w, 
https://substackcdn.com/image/fetch/$s_!eBEl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png 1272w, https://substackcdn.com/image/fetch/$s_!eBEl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa4fae5a-e2b5-4566-a3dc-cf256f175fe0_2102x1116.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Thus, on the purely technical level, BigQuery Storage APIs already mostly fit the bill of table transfer protocols that I&#8217;m 
proposing.</p><p>However, there is a huge non-technical issue with BigQuery Storage APIs: Google completely controls their development. This makes it very unlikely that both databases and processing engines will adopt the BigQuery Storage APIs, especially considering that both the subject databases and processing engines usually directly compete with BigQuery and Google&#8217;s Vertex AI engine.</p><p>Also, BigQuery Storage APIs are too specialised for BigQuery&#8217;s managed storage and cloud environments. For example, these APIs are not optimised for direct interaction with storage nodes, but only for access through proxies that most open-source databases don&#8217;t even have in their designs. This point is discussed further in the section &#8220;Table Read protocol overview and comparison with Arrow Flight&#8221; below.</p><h4><strong>Arrow Flight protocol</strong></h4><p>Arrow Flight consists of two protocol layers: lower-level <a href="https://arrow.apache.org/docs/format/Flight.html">Arrow Flight RPC</a> and higher-level <a href="https://arrow.apache.org/docs/format/FlightSql.html">Arrow Flight SQL</a>. The Arrow Flight RPC layer is responsible for distributing data transfer, potentially across both database (storage) nodes and client (processing engine) nodes.</p><p>Arrow Flight can be combined with Arrow Dissociated IPC for accelerator-aware point-to-point data transfer.</p><p>Arrow Flight is a subproject of Apache Arrow. Thus, Arrow Flight&#8217;s development and governance are open, unlike the development and governance of BigQuery Storage APIs.</p><p>However, there are also a few very unfortunate limitations in Arrow Flight.</p><p>First, Arrow Flight SQL defines distribution only on the read (query) path. The write path (data ingestion) connects a single writer node and a single database (storage) node. Write distribution is needed when the processing engine wants to write the results of a distributed job back into the database. 
If the storage is based on open table formats (Iceberg, Hudi, or Delta Lake), the table catalog can organise parallelised writing, bypassing Arrow Flight. However, most distributed OLAP databases don&#8217;t want to use open-format tables as their primary storage because this would <a href="https://engineeringideas.substack.com/p/the-future-of-olap-table-storage">limit their query performance and efficiency</a>, as I noted above.</p><p>Second, Arrow Flight always transfers columnar data in <a href="https://arrow.apache.org/docs/format/Columnar.html">Arrow columnar format</a>. This inflates the network IO, CPU, and memory usage of storage nodes if all that is needed from these nodes is to read highly compressed column data from disks and send this data to the processing engine or to replication workers, without any filtering or transformation.</p><p>This network IO inflation makes Arrow Flight rather inefficient in some of the use cases of open table formats: cross-cloud backup/replication and large-scale JOINs where no pre-aggregation or row-level filter can be pushed down to the level of storage nodes. 
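</p><p>A toy illustration of this inflation (zlib stands in for a database&#8217;s native compressed column format here; the numbers are illustrative, not a benchmark):</p>

```python
import zlib

# If a storage node must decode its compressed column data into Arrow format
# just to ship it unfiltered, the bytes on the wire grow by roughly the
# column's compression ratio, compared with passing the stored bytes through.

raw = b"".join(i.to_bytes(8, "little") for i in range(100_000))  # an int64 column
compressed = zlib.compress(raw, level=6)
assert len(compressed) < len(raw)  # pass-through transfer is strictly smaller

inflation = len(raw) / len(compressed)
print(f"pass-through: {len(compressed):,} bytes; decoded: {len(raw):,} bytes "
      f"({inflation:.1f}x more on the wire)")
```

<p>The same asymmetry applies to CPU and memory on the storage nodes, which must decompress and re-encode data they would otherwise only read and forward.</p><p>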
This is unfortunate because it means that <strong>even if the data team is happy with their OLAP database&#8217;s performance in user-facing queries, and even if this database supports Arrow Flight access, there are lingering use cases that are not supported well and can make the data team second-guess their choice and perhaps even switch to open table formats, sacrificing the performance and cost efficiency gains.</strong></p><p>This adds to the gravity of open table formats like Iceberg and prevents diverse OLAP and time-series databases from rejoicing in the diverse ecosystem of processing engines without at least the very cumbersome and wasteful always-on, two-way sync between the database&#8217;s native storage formats and metadata management, and open table formats.</p><p>Arrow Flight also misses some interesting possibilities for improving extensibility, resilience, data availability, and load distribution. See the section &#8220;Table Read protocol overview and comparison with Arrow Flight&#8221; below for elaboration.</p><p>Considering all these factors in aggregate, and that Arrow Flight is still implemented by very few databases<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, <strong>I think there is a strong case for creating a new set of table transfer protocols</strong>, drawing the ideas and learnings from</p><ul><li><p>Open table formats: Iceberg, Hudi, and Delta Lake, and the catalogs that implement authentication, authorization, and <a href="https://www.youtube.com/watch?v=9tUCy-v6gxg">replication</a> for these table formats,</p></li><li><p>Distributed OLAP databases and data warehouses: ClickHouse, Druid, Pinot, Doris, StarRocks, BigQuery, Snowflake, Redshift, and others,</p></li><li><p>Distributed storage systems that achieve certain properties with the help of &#8220;fat clients&#8221;, such as <a 
href="https://engineering.fb.com/2021/06/21/data-infrastructure/tectonic-file-system/">Facebook&#8217;s Tectonic</a> and <a href="https://cppconf.ru/en/archive/2022/talks/93be214997614a66a22a7493dc82c202/">OK.ru&#8217;s S3-compatible storage</a>,</p></li><li><p>BigQuery Storage API,</p></li><li><p>Arrow Flight and Arrow Dissociated IPC,</p></li><li><p>&#8220;<a href="https://memoria-framework.dev/docs/applications/storage/">Computational storage</a>&#8221; in <a href="https://memoria-framework.dev/">Memoria</a>,</p></li><li><p><a href="https://rsocket.io/">RSocket</a>: a protocol providing Reactive Streams semantics,</p></li><li><p>Accelerator-aware async data transfer protocols: <a href="https://proceedings.mlsys.org/paper_files/paper/2023/hash/47d096470b10eba0c1805697c4445101-Abstract-mlsys2023.html">PyTorch RPC</a> and <a href="https://mercury-hpc.github.io/user/overview/">Mercury</a>,</p></li><li><p>Flexible columnar data formats: <a href="https://github.com/facebookincubator/nimble">Nimble</a>, <a href="https://blog.lancedb.com/lance-v2/">Lance</a>, and <a href="https://github.com/spiraldb/vortex">Vortex</a>,</p></li><li><p>Distributed metadata management: <a href="https://streamnative.io/blog/introducing-oxia-scalable-metadata-and-coordination">Oxia</a>,</p></li></ul><p>and other systems.</p><h2>Table transfer protocols</h2><p>First, disclaimer: in this and the following articles, I don&#8217;t aim to describe the table transfer protocols in complete detail. If such protocols are to be developed, they should be designed openly with inputs from many people with diverse perspectives and expertise. 
So, the descriptions below are sometimes not very precise.</p><p>My two primary goals with these protocol descriptions are to demonstrate that:</p><ol><li><p>Improvements too significant to dismiss are possible over the current design of Arrow Flight.</p></li><li><p>A single set of protocols can cover <em>all</em> functions of open table formats like Iceberg, for most data teams, including atomic distributed table writes and replication.</p></li></ol><p>However, I describe only Table Read protocol below to reduce the size of this article (already way too long). I describe <a href="https://engineeringideas.substack.com/p/table-write-protocol-for-interop">Table Write protocol</a> and a <a href="https://engineeringideas.substack.com/i/148485754/table-replication-method">table replication method</a> in the following article.</p><h3>Basic principles</h3><p>In the protocol design that I present below, I tried to follow three main principles:</p><p><strong>Progressive</strong>: there are simpler and more advanced versions of distribution and work splitting, columnar data encoding, transport, and other aspects that different sides of the protocol interaction can support. 
The sides negotiate using the most suitable and efficient methods that they both support.</p><p><strong>Cooperative</strong>: reliability and resilience, speed and resource efficiency, trust and security can be achieved most effectively when all nodes participating in the protocol interaction on both sides act cooperatively.</p><p><strong>Not opinionated</strong>: the protocol doesn&#8217;t specifically favour</p><ul><li><p>Cloud or on-premise setups,</p></li><li><p>Disaggregated object storage or disk-based storage,</p></li><li><p>Single-node, distributed, or serverless databases and processing engines (clients),</p></li><li><p>IO-bound or compute-bound processing patterns,</p></li><li><p>Any specific query language or API: SQL, Spark APIs, dataframe APIs, etc.</p></li></ul><h3>Cross-cutting and out-of-scope concerns</h3><p>I omit the discussion of authentication, access authorization, security, and proxying aspects in the protocol descriptions below.</p><p>Authentication and encryption could be added straightforwardly with proven mechanisms already used in Arrow Flight, BigQuery Storage APIs, and many other protocols.</p><p>Access authorization, I think, is better left to higher-level systems that build <em>on top</em> of table transfer protocols:</p><ul><li><p>SQL querying and transaction (BEGIN..COMMIT) facades</p></li><li><p>Table replication systems</p></li><li><p>Data governance proxies/facades such as Unity Catalog</p></li><li><p>Semantic layers such as Rill, Hashboard, Cube, etc.</p></li><li><p>Feature stores, ML training and inference systems</p></li></ul><p>Consequently, the primary concerns of these and other similar systems are out of scope for table transfer protocols.</p><p>Note the difference from Arrow Flight SQL, which covers SQL queries and transactions. Thus, the table transfer protocols described below are lower-level than Arrow Flight SQL.</p><p>At the same time, table transfer protocols are higher-level than Arrow Flight RPC. 
Arrow Flight RPC is oblivious to table-level processing logic such as projections, filtering, aggregations, table read consistency, and table data partitioning:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!emli!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!emli!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png 424w, https://substackcdn.com/image/fetch/$s_!emli!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png 848w, https://substackcdn.com/image/fetch/$s_!emli!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png 1272w, https://substackcdn.com/image/fetch/$s_!emli!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!emli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png" width="1456" height="629" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:629,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183496,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!emli!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png 424w, https://substackcdn.com/image/fetch/$s_!emli!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png 848w, https://substackcdn.com/image/fetch/$s_!emli!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png 1272w, https://substackcdn.com/image/fetch/$s_!emli!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4feccdf3-7282-4bbf-9d17-4048bff69106_1826x789.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For other differences from Arrow Flight, see the section "Table Read protocol overview and comparison with Arrow Flight&#8221; below.</p><h4><strong>Node roles</strong></h4><p>I&#8217;ll call a <strong>Read job</strong> the entire distributed Table Read protocol interaction between all node roles.</p><p>Within this article, I use the following terms for node roles participating in the Read job:</p><p><strong>Agent</strong> is responsible for orchestrating the Read job on the <strong>client side</strong>, i.e., the side of the processing engine or other <em>table data consumer</em>. 
This role can be played by Spark Master, for example.</p><p><strong>Coordinator</strong> is responsible for orchestrating the Read job on the <strong>server side</strong>, (also called <strong>database side</strong> interchangeably), i.e., the side of a database, a storage system, or other <em>table data producer</em>.</p><p>In databases with query broker and data server separation, such as Apache Doris, Apache Druid, or Apache Pinot, Coordinator&#8217;s role can be played by the query broker node/process (called Frontend/FE in Doris, and Broker in Druid and Pinot) that serves the Read job. In databases without such separation, such as ClickHouse, Coordinator&#8217;s role is played by any database node.</p><p><strong>Clients</strong> are the workers who request and consume some portion of the requested table data (this portion is determined by Agent). This role can be played by Spark&#8217;s Executors, for example. If the client side is a single process, such as a local dataframing library, Client can be implemented by a bunch of threads within the same process with Agent&#8217;s logic.</p><p><strong>Servers</strong> (identified by <strong>locations</strong>) are the nodes or serverless data access processes that serve portions of table data to Clients and optionally do some storage-side processing of the data, such as row-level filtering or aggregation.</p><p>This role can be played by stateful database processes such as ClickHouse or Doris&#8217;s Backend node, or serverless functions that read table data from disaggregated table storage, such as LanceDB or MotherDuck.</p><p>Also, the <em>processing engine&#8217;s workers can play the role of Servers for subsequent stages in the multi-stage processing plan</em>, or in Table Write protocol interactions. 
I will discuss this in more detail in the next article.</p><h4><strong>Base RPC and session layers</strong></h4><p>All protocol steps can be layered on top of gRPC with Protobuf or FlatBuffers encoding, although this is not significant for any of the protocol design aspects, and is merely a &#8220;default&#8221; choice that is already used by the Arrow Flight protocol and BigQuery Storage APIs.</p><p>The exception is <a href="https://github.com/apache/arrow/issues/43762">accelerator-aware column streaming</a>: see steps 11-&#8230; of Table Read protocol, described in the section &#8220;Streaming column data from Server to Client&#8221;.</p><h2>Table Read protocol walkthrough</h2><h3><strong>The initial request and Coordinator&#8217;s logic</strong></h3><p><code>1. Agent &#8594; Coordinator:<br>&nbsp;&nbsp;table processing plan: Substrait, partitioning hints<br>2. Agent &#8592; Coordinator:<br>&nbsp;&nbsp;[ slice: (filter, processing plan, [partition,]), ]<br>// partition&#8217;s type:<br>&nbsp;&nbsp;&nbsp;(filter, [ location: (URI, serving hints), ])</code></p><h4><strong>Table processing plan</strong></h4><p>The <em>table processing plan</em> (encoded in the <a href="https://substrait.io/">Substrait</a> format) is an arbitrary processing plan with filters, projections, grouping and window aggregations <em>on a single table or relation</em>. The relation doesn&#8217;t need to be materialised on the server side before the beginning of the Read job: in fact, it could be the output relation of a multi-table join that the server side performs concurrently with serving the output to the client. Alternatively, if Agent (such as Spark Master) specifically wants to do a heavy join on the client (Spark&#8217;s) side, it can start two Read jobs in parallel and then plan them together.
So, the requirement for the processing plan to be single-table doesn&#8217;t principally constrain the applicability of the Table Read protocol, but limits its scope and simplifies implementations.</p><p>Agent doesn&#8217;t need to know beforehand whether the server side supports the aggregation function or other pieces of the submitted plan. If the server side can&#8217;t do some parts of the plan, <em>Coordinator can push the corresponding parts of the table processing plan back to the client side</em>, retaining only what it can perform on Servers, as described below. This permits very &#8220;thin&#8221; server sides, such as Parquet (ORC, Lance) format-aware object storage nodes, to participate in Table Read protocol, assuming that the client is a processing engine whose Clients are ready to take up the table processing plan, potentially involving distributed row shuffle between Clients.</p><p>This &#8220;partial processing plan push-back&#8221; is included in the scope of the protocol, rather than, for example, Coordinator simply responding with an error to plans that the server side can&#8217;t execute, because the whole plan can inform the slice and partition breakdown that Coordinator returns to Agent.</p><h4><strong>Slices and partitions</strong></h4><p>Coordinator returns to Agent a set of <em>slices</em>. Slices are minimal (i.e., most granular) groups of <em>partitions</em>, such that completing the table processing plan on the client side doesn&#8217;t require any data exchange between different slices. If Agent goes on to task one Client with consuming and processing data of each slice, these Clients (each dedicated to their slice) won&#8217;t need to exchange any data with each other to compute a subset of the results of the table processing plan. However, if Agent cannot create a Client large enough to process all of a slice&#8217;s data, Agent will need to schedule intermediate shuffling or data movement stages.
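To make the slice-independence property concrete, here is a toy Python sketch (the data and the `process_slice` helper are made up for illustration): one worker per slice computes its share of a group-by aggregation, and no rows move between workers because slices never share group keys.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical: each slice is a list of partitions; each partition is a
# list of (group_key, value) rows already filtered on the server side.
slices = [
    [[("a", 1), ("b", 2)], [("a", 3)]],   # slice 0
    [[("c", 4)], [("c", 5), ("d", 6)]],   # slice 1
]

def process_slice(slice_partitions):
    # One Client per slice: aggregate without exchanging rows with
    # Clients that handle other slices.
    totals = {}
    for partition in slice_partitions:
        for key, value in partition:
            totals[key] = totals.get(key, 0) + value
    return totals

with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_slice, slices))

# Because slices don't overlap in group keys, the union of the partial
# results is the final result -- no merge step between Clients needed.
result = {k: v for part in partials for k, v in part.items()}
print(result)  # {'a': 4, 'b': 2, 'c': 9, 'd': 6}
```

If slices did share group keys, the final dict-union step would silently drop data, which is exactly why the protocol makes slices the unit that needs no cross-slice exchange.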
See the section &#8220;Agent&#8217;s logic&#8221; below for further discussion.</p><p>Slices are related to data distribution on the client side. Partitions are related to data distribution on the server side of the Read job.</p><p>If Coordinator hasn&#8217;t pushed any processing logic back to the client side, Coordinator <em>must</em> return each partition wrapped into its own slice (1-1 correspondence), because partitions will already contain the results of the table processing plan.</p><p>In the context of this protocol description, slices and partitions are logical and are defined by their <em>filters</em>: boolean-valued functions encoded in Substrait. A partition may not coincide with the notion of <em>data partition</em> that may be used within the database, such as a physical file or section containing a portion of the table data. Some databases also use the term <em>data segment</em> for the latter; below, I&#8217;ll also call them &#8220;data segments&#8221; to prevent confusion with &#8220;partitions&#8221;, as I use &#8220;partitions&#8221; in a different meaning here.</p><p>An informal expectation set by the protocol is that the server side only returns partitions that Servers can effectively fetch without touching (most of) the rows that don&#8217;t match the partition filter. This can be achieved by checking the partition filter against the data segment&#8217;s metadata, or by using Servers&#8217; indexes for these columns (or the data segment&#8217;s sort order by these columns, which traditional databases call &#8220;cluster indexing&#8221;).
However, partitions could be more granular than data segments (one segment corresponds to multiple partitions), less granular (one partition includes multiple segments), or a mixture of these: more granular along some dimensions and less granular along others.</p><p>Neither Agent nor Clients have to be able to execute the partition filter in its entirety, for example, if the database uses a proprietary hash function for data partitioning, or if it uses a partition key that doesn&#8217;t end up as a result column. Still, filters have to be encoded in Substrait so that Agent can extract the remaining partitioning information from them to inform the workload distribution among Clients.</p><p>Along with each slice filter, Coordinator returns to Agent the <em>slice processing plan</em> that Servers could still apply to the rows in this slice on their side, before returning the results to Clients<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>. The slice processing plan could be a simpler part of the table processing plan, potentially all the way to a simple fetch of the input table&#8217;s columns and returning them to Clients without any processing on the server side. For consistency, each slice processing plan also includes the corresponding slice filter.</p><p>The reasons why Coordinator may push processing back to Clients for the slice could be:</p><ul><li><p>None of the Servers that host the slice&#8217;s data can perform the table processing plan in full, e.g.
if Servers don&#8217;t know how to apply the specified aggregation, or</p></li><li><p>The table processing plan has a window or a group-by aggregation that demands that rows be moved between Servers, and the database doesn&#8217;t want to do any distributed processing for this Read job and prefers to push it back to the client side that is specialised in distributed processing.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a></p></li></ul><p>For example, if Agent sends to Coordinator a table processing plan that includes a group-by on a column that is not part of the database&#8217;s distribution/partitioning key for the table, Coordinator can return a single slice whose filter is taken from the table processing plan, and a slice processing plan that is a simple fetch of the input table&#8217;s columns. In other words, Coordinator has pushed all processing apart from the predicate pushdown back to the client side.</p><p>Note that the processing push-back may not be uniform across all slices. The partitioning strategy for the table in the database may differ for recent data and older data segments. This may affect the capacity of the Servers that host this or that partition to perform the table processing plan on their side.</p><h4><strong>Slice and partition breakdown</strong></h4><p>Filters of all returned slices must logically add up (i.e., if OR&#8217;ed together) to the table processing plan&#8217;s filter, and the filters of all partitions within a slice must logically add up to their parent slice&#8217;s filter.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>Partition filters can include conditions that determine data distribution/segmentation on the server side, as well as conditions on sort columns or columns with range access indexes <em>within</em> individual data segments.
Since Agent may not know the database&#8217;s internal data partitioning scheme for the table, let alone the indexing strategy, Agent doesn&#8217;t know <em>a priori</em> how many slices and partitions Coordinator will return, and what extra conditions Coordinator will include in slice and partition filters.</p><p>The slice and partition breakdown returned by Coordinator is partially determined by the distribution of data segments on Servers<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>: Coordinator cannot return fewer partitions (and, therefore, slices, because every partition can belong only to a single slice) than is necessary to reflect the differences in the lists of Server locations that host different partitions (see the section &#8220;Partition locations&#8221; below). For example, if one data segment is stored only on Server A and another segment is stored only on Server B, Coordinator cannot return fewer than two partitions in total. Meanwhile, it depends on the semantics of the table processing plan whether Coordinator will put these two partitions within a single slice (if the execution of the remaining parts of the table processing plan will require moving around partition rows), or can put each partition into a separate slice.</p><p>The above consideration may still under-determine the slice and partition breakdown.
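The add-up requirement on filters can be sanity-checked mechanically. A minimal sketch, with filters modelled as plain Python predicates rather than Substrait expressions, and with an invented slice breakdown; checking over sample rows stands in for a real logical-equivalence proof:

```python
def adds_up(parent_filter, child_filters, sample_rows):
    """Check that the child filters OR together to the parent filter
    over a sample of rows."""
    for row in sample_rows:
        if parent_filter(row) != any(f(row) for f in child_filters):
            return False
    return True

# Hypothetical breakdown: the table filter is x < 100; Coordinator
# returns two slices that split the range on a partitioning column.
table_filter = lambda row: row["x"] < 100
slice_filters = [
    lambda row: row["x"] < 100 and row["x"] % 2 == 0,
    lambda row: row["x"] < 100 and row["x"] % 2 == 1,
]

rows = [{"x": x} for x in range(200)]
print(adds_up(table_filter, slice_filters, rows))  # True
```

The same check applies one level down: the partition filters within a slice must OR together to that slice's filter.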
Coordinator may have latitude in augmenting the slice and partition filters with different columns from the database&#8217;s partitioning key for the table, or with different columns that are indexed on the data segment level.</p><p>If Coordinator has pushed back to the client side a part of the table processing plan that includes a time-based window aggregation or a grouping aggregation, and the time or grouping columns can be included in the partition filters, Coordinator <em>must</em> do this in order to help the client side reduce the fan-in/out factor of the ensuing shuffle or aggregation.</p><p>Further, partition sizing can depend on some client-side factors:</p><ul><li><p>The maximum number of Clients that can consume data in this Read job,</p></li><li><p>Whether these Clients all consume the same data (such as if the client executes a broadcast join with the table consumed in this Read job) or not, and</p></li><li><p>The priority and latency requirements of this Read job: such as whether this Read job is a piece of an interactive user-facing query or a low-priority background job.</p></li></ul><p>Partition sizing can also depend on server-side factors, such as the database&#8217;s priority for serving this client, or its intent to permit this client to consume only a small portion of the database&#8217;s network and disk IOPS and bandwidth, so that the latency of the main query load and data ingestion traffic in the database is unaffected by this Read job.</p><p>The main tradeoff in partition sizing is the following: more IOPS and processing overhead on the server side (bad for the server side, if it minds this) could enable better saturation of the bisection bandwidth between Servers and Clients, or a more fine-grained or favourable distribution of workload on the client side.</p><p>Designing a generic algorithm or protocol for negotiating all these factors between Agent and Coordinator is not the goal of this article.
I use the term <em>partitioning hints</em> for whatever information Agent sends to Coordinator in this negotiation at step #1.</p><p>Suffice it to say here that there exists a baseline heuristic that doesn&#8217;t require any negotiation and should work at least as well as the read path with open table formats (Parquet, Hudi, or Delta Lake): Coordinator can size the total number of partitions that it returns to Agent to <em>approximately</em> match the number of data segments that pass the table processing plan&#8217;s filter. These partitions could correspond to data segments one-to-one, or group together approximately N data segments such that approximately 1/Nth of each data segment&#8217;s rows are selected. As was noted above, this &#8220;alternative partitioning&#8221; could be motivated by the grouping or time-based aggregation in the table processing plan.</p><h4><strong>Partition locations</strong></h4><p>For every partition, Coordinator returns not a single Server location, but several locations that can serve that partition. This is possible if the database&#8217;s serving replication factor is greater than one.</p><p>Returning several locations enables Agent to optimise the physical placement of Clients and load distribution between Servers. Coordinator does not know how to optimally distribute the load between Servers yet, because Coordinator doesn&#8217;t know the details of the overall distributed job that Agent is doing (the Read job that is the subject of Table Read protocol is a part of this overall job, but perhaps not the only one). In turn, these details may not be determined by Agent before it receives the partition breakdown and Server locations from Coordinator, so Agent cannot pass all the necessary information to Coordinator at step #1. See the section &#8220;Agent&#8217;s logic&#8221; below for further discussion.</p><p>Coordinator returns <em>serving hints</em> alongside each location.
Serving hints could be specific to each unique partition&#8212;(Server) location pair, not just the location alone. Serving hints could include:</p><ul><li><p><em>Performance priority hint</em> can define the order among locations that Coordinator deems optimal. This may be informed by:</p><ul><li><p>Data sorting and indexing, which may differ between locations and thus be more or less conducive to the parent slice processing plan (and the expected additions to these plans with grouping columns at step #3).</p></li><li><p>The extra costs: a location could point to a serverless function (e.g., AWS Lambda) that will pull the partition data from the object storage (e.g., S3). The interaction with this location will cost extra money for both serverless runtime and object storage access. Still, this could be a viable fall-back location if the &#8220;primary&#8221; Servers fail, or cannot sustain the Read job&#8217;s bursty bandwidth requirements.</p></li><li><p>The maximum bandwidth limit that the Server will cap this client or this Read job with.
As noted above, the database may intentionally limit the IO resources that external clients&#8217; demands can consume.</p></li><li><p>Current load of the Server, if known to Coordinator.</p></li></ul></li><li><p><em>Server-side processing behaviour</em>: whether the Server <em>always accepts</em> execution of the corresponding processing plan (as sent to it at step #4), <em>always refuses</em> execution (e.g., if this is a &#8220;thin&#8221; Server that doesn&#8217;t have the resources for any non-trivial processing), or <em>dynamic</em>: the Server accepts or refuses execution based on the current load of the Server at the moment of the request from Client (step #4).</p></li><li><p><em>Serving parallelism limit</em>: the maximum number of partitions the Server can serve concurrently to all Clients within this Read job.</p></li></ul><h3><strong>Agent&#8217;s logic</strong></h3><p>Agent determines which partitions each Client should request and consume from Servers based on the overall processing job that it is doing. This overall job may be equal to the Read job that is the subject of the Table Read protocol interaction, or the Read job could be just a part of a bigger, distributed (cross-database) join, a broadcast join, or something else. Then,</p><p><strong>For each Client:<br></strong><code>3. Agent &#8594; Client:<br>&nbsp;&nbsp;[ (slice processing plan,<br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[partition: (filter, locations, consumption token),]), ]</code></p><p>The lists of partitions for the specific slice (identified by the slice processing plan here) that Agent sends to different Clients don&#8217;t necessarily have to include <em>all</em> partitions for the given slice as Coordinator sent them to Agent at step #2. Indeed, in many Read jobs, these lists of partitions in multi-partition slices will be reduced to just one specific partition sent to each Client.
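That fan-out step can be sketched in a few lines; `assign_partitions` is a hypothetical helper, and a real Agent would weigh serving hints, Client resources, and data locality rather than doing plain round-robin:

```python
def assign_partitions(partitions, num_clients):
    # Hypothetical helper: round-robin a multi-partition slice across
    # Clients, so each Client receives a disjoint subset of partitions.
    assignments = [[] for _ in range(num_clients)]
    for i, partition in enumerate(partitions):
        assignments[i % num_clients].append(partition)
    return assignments

parts = [f"partition-{i}" for i in range(7)]
assignments = assign_partitions(parts, 3)
print(assignments[0])  # ['partition-0', 'partition-3', 'partition-6']
```

Disjointness is the important property here: since each Client consumes its own partitions and then shuffles rows by the grouping column, no partition should be consumed twice (except in broadcast-style jobs, where every Client deliberately receives the same partition).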
Consider the example that was used above, in which Coordinator receives a table processing plan with a grouping on a column that is not a part of the table&#8217;s partitioning key on the database side. Thus, Coordinator pushes back this plan and returns a single slice with many partitions. Upon receiving such a slice and partition breakdown and faced with the need to perform grouping and aggregation on the client side, Agent sends the partitions to different Clients that will just shuffle the rows by the grouping column. Agent also arranges that the shuffled rows are consumed by another stage in the processing pipeline that does the final aggregation.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>Agent may throttle sending these requests to Clients rather than doing them in one burst if doing the latter would predictably saturate the serving parallelism limits of some of the Server locations.</p><p>If the client and server sides share a resource manager, and Agent spawns or selects Clients specifically for this Read job, Agent should use the information about Server locations for network-aware scheduling.</p><p>When sizing memory and CPU resources for Clients, Agent should also take into account the server-side processing behaviour of the Servers each Client is going to consume data from: that is, if Servers <em>always accept</em> the slice processing plan, Clients are guaranteed to use fewer resources than if Servers <em>always refuse</em> processing or have <em>dynamic</em> behaviour.</p><p>Agent could also modify a Server location&#8217;s serving hints when sending them to different Clients. Imagine that 100 Clients do a broadcast join and each Client needs to consume the same partition. Coordinator indicated that two Server locations can serve this partition.
Agent can then explicitly modify the priority hints of the two locations sent to Clients such that exactly 50 Clients think that one Server is preferable, and 50 Clients think that the other Server is preferable. This is needed because Clients don&#8217;t communicate and coordinate with each other in Table Read protocol. Coordinator may have indicated that these two Servers have equal priority. Absent Agent&#8217;s tie-breaking, Clients could only select among Server locations with equal priority at random, which may skew the load of Servers unnecessarily.</p><p>Another reason to modify Servers&#8217; priorities (as sent to different Clients) is to ensure that no Server will hit its serving parallelism limit, at least in the &#8220;no failure case&#8221;, that is, when all Servers respond to Clients&#8217; requests and the load is <em>not</em> redistributed to the remaining Servers.</p><p>For the description of the <em>consumption token</em>, see the section &#8220;Preemption of partition consumption&#8221; below.</p><h3>Client&#8217;s logic</h3><p><strong>Steps 4-9 determine the exact Server each Client is going to consume each partition from.</strong> In these steps, Clients and Servers also determine whether the slice processing plan will be executed on the server or the client side.</p><p>Steps 4-9 are embedded in somewhat complicated asynchronous logic, executed on each Client. This complexity addresses the need to handle dynamic server-side processing behaviour, rejections, Server unavailability, and backoff due to exceeded serving parallelism limits.</p><pre><code>location_queue = new concurrent queue&lt;(location, partition)&gt;

consumer_queue =
  new concurrent queue&lt;(partition, location, consumer info, plan)&gt;
partition_queue = new concurrent queue&lt;partition&gt;

Upon receiving the step #3 or step #9 request from Agent:
    for every received (slice_processing_plan, partitions):
        for partition in partitions:
            partition.processing_plan = slice_processing_plan
            partition.try_server_side_processing = true
            partition_queue.add(partition)

async for every (location, partition) from location_queue:
    if partition.try_server_side_processing and \
      location.serving_hints.server_side_processing_behaviour != Always Refuse:
        server_side_plan = partition.processing_plan
        my_plan = no_processing
    else:
        server_side_plan = no_processing
        my_plan = partition.processing_plan
    <strong>4. Client &#8594; Server (identified by location):
      partition.filter, server_side_plan, partition.consumption_token,
      consumer_hints</strong>
    <strong>5. Client &#8592; Server:
      status: Processing Accepted | Processing Refused |
        Serving Rejected | On Hold,
      consumer_info</strong>
    if Processing Accepted:
        consumer_queue.add(
          (partition, location, consumer_info, my_plan))
    else:
        if Processing Refused:
            partition.locations[location].refused = true
        elif Serving Rejected, or Server didn't respond:
            // Try with another candidate Server.
            partition.locations.remove(location)
        elif On Hold (if Server reached its serving parallelism limit):
            // Try with another candidate Server, or wait before a retry
            //   if there are no other Servers.
            partition.locations[location].on_hold_backoff_period = ...
        partition_queue.add(partition)

async for every partition from partition_queue:
    if all(map(partition.locations, lambda loc: loc.refused)):
        // All Servers refused to do the processing on their side,
        //   now ask them again without imposing any non-trivial plan.
        partition.try_server_side_processing = false
    location = ... (Select a priority location for this partition
      that hasn't rejected serving yet and hasn't refused the slice
      processing plan (if any), while respecting locations' "On Hold"
      backoffs.)
    if location:
        location_queue.add((location, partition))
    else:
        <strong>6. Client &#8594; Agent: partition.filter, partition.processing_plan</strong>
        // Steps 7-9 repeat steps 1-3 on the scope of this specific
        //   partition's filter and parent slice processing plan:
        <strong>7. Agent &#8594; Coordinator:
          partition.processing_plan &amp; partition.filter, ...
        8. Agent &#8592; Coordinator: ...</strong> // See step #2
        <strong>9. Agent &#8594; Client(s): ...</strong> // See step #3
        // Then Client(s) that received step #9 requests execute the
        //   "Upon receiving the step #3 or step #9 request from Agent:"
        //   block above.
        

async for every (partition, location, consumer_info, plan)
  from consumer_queue:
    (See section "Streaming column data from Server to Client" below.)
</code></pre><p>In step #4, Client sends to one selected Server the partition filter and the corresponding slice processing plan. Server can respond (step #5) with one of the following statuses:</p><p><strong>Processing Accepted</strong>: Server is going to execute the requested processing plan on its side. Server <em>may</em> start background disk or object storage I/O and processing plan execution immediately, expecting Client to start requesting the results soon (see steps 10-&#8230; below).</p><p>Server also <em>pins all data segments underlying the requested partition</em>, so that even if table data rebalancing is initiated in the database cluster concurrently, Server is guaranteed to keep these data segments until after the streaming of this partition&#8217;s data from Server to Client is complete. However, this partition on this Server may already become inaccessible for all other requesters, ensuring that the Server is not stuck with keeping this partition indefinitely because some requesters repeatedly request it.</p><p><strong>Processing Refused</strong>: Server refuses to execute the requested processing plan because Server is CPU- or memory-bound at the moment, but Server can still serve the input columns of the table to Client. Client will try to find another Server location that won&#8217;t refuse to execute the partition plan. Client <em>must</em> be able to execute the plan on its side if all Server locations for the partition have refused to execute the plan.</p><p>Server <em>must not</em> respond with &#8220;Processing Refused&#8221; if the requested processing plan is trivial (called <code>no_processing</code> in the above pseudocode block), i.e., after Client already got a refusal from all candidate Servers and now tries just to fetch the input table columns.
However, if Server still wants to shed this Client&#8217;s request due to temporary overload, it can respond with the &#8220;On Hold&#8221; status.</p><p><strong>Serving Rejected</strong>: Server no longer serves the requested partition. This can routinely happen if table data rebalancing has happened in the database cluster since Coordinator&#8217;s response at step #2. If all locations in Client&#8217;s list reject serving the partition or go unresponsive, Client initiates another round of fetching the partition&#8217;s Server locations from Coordinator: see steps 6-9.</p><p>Note that the database cluster (represented by Coordinator during this Read job) is not obliged to halt all cluster rebalancing operations over the entire length of the Read job. Such pinning is only needed for the specific Server&#8212;Client interaction, from the moment the Server has responded with &#8220;Processing Accepted&#8221; until the data consumption is complete, as noted above.</p><p><strong>On Hold</strong>: Server can serve the partition, but not right now, because Server&#8217;s serving parallelism limit is reached, or the Server is generally overloaded. Client can then request the partition from other locations, or from the same Server after some backoff period. (Note: the exact backoff logic is not detailed in the above pseudocode.)</p><h4><strong>Consumption structure: column groups</strong></h4><p>If Server returns &#8220;Processing Accepted&#8221; at step #5, it also returns <em>consumer info</em>. Consumer info specifies how result columns can be consumed by Client from Server. Consumer info can have the following data type:<br><code>[column group:<br>&nbsp;&nbsp;[column: [section: (transfer encodings, indexes),],],]</code></p><p>First, columns are arranged into one or more <em>column groups</em>.
Column groups are the way for Server to indicate that some groups of columns <em>must</em> be consumed by Client in lockstep, that is, with a shared consumption offset.</p><p>Server may want to impose this limitation when shared I/O or CPU processing is needed to serve these columns. If Client were free to consume these columns independently one after another rather than in lockstep, Server would have to either hold the I/O or CPU processing results for the whole partition in memory (which might not even be possible if the partition&#8217;s data is bigger than Server&#8217;s memory), or repeat this I/O or CPU processing multiple times.</p><p>The opposite of this is shared-nothing column I/O or processing, such as when Server&#8217;s column serving amounts to fetching these columns from files on disk in a columnar format like Parquet or Lance, and relaying the column data to Client over the network. In this case, even larger-than-memory column consumption doesn&#8217;t impose shared I/O, and columns could be consumed by Client serially, i.e., one after another, (almost) as effectively as in lockstep. Then, Server can return each column nested in its own column group. Client could still consume them in lockstep if its own processing logic demands this, but is not obliged to.</p><h4><strong>Column sections and transfer encodings</strong></h4><p>Second, each column is divided into one or more <em>sections</em>, and each column section can be transferred in one <em>encoding</em> from the list of available <em>transfer encodings</em> for the section.</p><p>The whole purpose of sections is to reflect the differences in the lists of available transfer encodings (and indexes, see below) <em>across</em> sections.
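To picture the lockstep constraint on column groups, here is a toy sketch (the column data and group layout are invented): within a group, all columns advance through one shared consumption offset, which in Python amounts to zipping the columns together.

```python
# Two column groups: the first must be consumed in lockstep (shared
# offset); the second contains a single independently consumable column.
column_groups = [
    {"user_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]},  # lockstep
    {"note": ["a", "b", "c"]},                              # independent
]

def consume_group_lockstep(group):
    # A single consumption offset for all columns in the group:
    # every column advances by one row per step, never independently.
    names = list(group)
    for row in zip(*(group[n] for n in names)):
        yield dict(zip(names, row))

rows = list(consume_group_lockstep(column_groups[0]))
print(rows[0])  # {'user_id': 1, 'amount': 10.0}
```

A shared-nothing Server would instead return `user_id` and `amount` each in its own group, leaving Client free to stream one column fully before the other.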
In turn, the point of offering Client a choice among transfer encodings is to permit transfer in the highly compressed on-disk column format, in addition to the standard <a href="https://arrow.apache.org/docs/format/Columnar.html">Arrow columnar format</a>.</p><p>Thus, like slices and partitions, sections are logical entities, and their boundaries may <em>not</em> coincide with the boundaries of the data segments underlying the requested partition on this Server.</p><p>If Server does non-trivial processing of a column, Server is expected to return just one section for this column, offering either only one transfer encoding (Arrow), or two encodings: Arrow and a re-compression transfer encoding.</p><p>If Server does <em>not</em> process column data before serving it to Client, Server can return the list of sections corresponding to the list of data segments underlying the requested partition. Each section offers two transfer encodings: the on-disk format in which the column is stored in the specific data segment, and the Arrow encoding. Client can always fall back to the Arrow encoding if it doesn&#8217;t know how to deserialise Server&#8217;s on-disk encoding for the given column section. In this case, section boundaries for the column <em>do</em> coincide with data segment boundaries.</p><p>Columns within a single group <em>must</em> have the same number of sections, and these sections <em>must</em> hold the same numbers of result rows so that streaming the column group from Server to Client can be organised with a single consumption offset, as indicated above. The exact row numbers themselves may not be known in advance, such as if Server performs row-level filtering.</p><p>Columns within different groups, however, may have different numbers of sections, or even different counts of result rows altogether. Cf. &#8220;Feature 3: Flexibility&#8221; in the <a href="https://blog.lancedb.com/lance-v2/">Lance v2</a> description.
This feature could be used to access columns in feature stores like <a href="https://medium.com/airbnb-engineering/chronon-airbnbs-ml-feature-platform-is-now-open-source-d9c4dba859e8">Chronon</a> and other data stores with materialised aggregations, such as <a href="https://github.com/MaterializeInc/materialize">Materialize</a> or <a href="https://github.com/risingwavelabs/risingwave">RisingWave</a>.</p><p>If Client prioritises reducing the network data transfer size over everything else, such as if this Read job exports data from a database hosted in a cloud provider that charges for egress traffic, Client may send the desired (highly compressed) column encodings as <em>consumer hints</em> at step #4, and Server may agree to perform this re-compression on its side by offering these transfer encodings back, alongside the on-disk and Arrow encodings.</p><h4><strong>Column indexes</strong></h4><p>Apart from different transfer encodings, Server may also offer to transfer to Client some of the available indexes for each column section, such as inverted indexes, Bloom filters, geo-indexes, etc. (see examples of <a href="https://docs.pinot.apache.org/basics/indexing">Apache Pinot&#8217;s indexes</a>), provided that Client knows how to deserialise (access) and use them.</p><p>Arrow&#8217;s <a href="https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages">Dictionaries</a> for dictionary-encoded columns can be treated as a kind of index, too, so that Client can skip downloading the dictionaries for some columns if Client doesn&#8217;t need them.</p><p>The option to transfer indexes alongside column data is also useful for table or database replication systems that can be built on top of Table Read protocol.</p><h4><strong>Column sections vs.
partitions</strong></h4><p>The notion of column sections may appear to duplicate the notion of partitions themselves: why not mandate that each partition spans <em>at most</em> one data segment (or a fraction of a data segment), which would ensure that each partition has only one <em>list</em> of available transfer encodings and one <em>list</em> of available indexes, obviating the need for sections?</p><p>I kept column sections in the protocol to leverage databases with <a href="https://streamnative.io/blog/introducing-oxia-scalable-metadata-and-coordination">Oxia</a>-based (or &#8220;Oxia-style&#8221;) metadata management, that is, databases in which <em>detailed data segment metadata is co-located on Servers with the data segments themselves</em>. In such systems, Coordinator doesn&#8217;t hold most of the metadata, nor even the specific data segment boundaries, which appear only on the Servers, and <em>partitions are not only the logical unit of table data introduced by Table Read protocol but also the database&#8217;s own internal unit of table data distribution</em>, coarser than data segments.</p><p>I don&#8217;t know of any database that uses Oxia in particular, yet. Some databases may already have Oxia-style metadata management, but I&#8217;m not sure. In any case, I&#8217;d argue that this style of metadata management would make a lot of sense for scalable OLAP data warehouses: storing all metadata in ZooKeeper is a known scalability bottleneck of Apache Druid and Apache Pinot, for example, among other systems. So, I expect more databases to adopt Oxia-style metadata management in the future.</p><h3><strong>Streaming column data from Server to Client</strong></h3><p><strong>Executed on each Client:</strong></p><pre><code>async for every (partition, location, consumer_info, plan)
  from consumer_queue:
    <em>// Sequentially, async, or in parallel:</em>
    for every column_group in consumer_info:
        num_sections = len(column_group[0])
        <em>// Sequentially only!</em>
        for section_number in range(num_sections): 
            per_column_transfer_encodings =
              map(column_group, lambda col: ... (Select the transfer
                encoding among the options, based on Client's priorities
                (speed vs. transfer size) and what it <em>can</em> decode.))
            indexes_to_transfer =
              ... (Select what Client needs for its plan.)
            <strong>10. Client &#8594; Server:
              group.id, section_number, row_offset,
              per_column_transfer_encodings, indexes_to_transfer</strong>
            <strong>11-... Client &#8596; Server:
              data plane characteristics negotiation</strong> // Multiple steps
            <strong>1X-... Client &#8596; Server:
              async streaming of Arrays w/ flow control</strong> // Many steps
        async apply the plan and the downstream logic imposed by Agent
          (out of scope of Table Read protocol) to the accumulated
          column data; depending on the downstream needs, this may also
          be outside of the (parallel)
          "for every column_group in consumer_info:" block,
          and applied to all column groups in lockstep.
</code></pre><p>Steps 11-&#8230; constitute a generic <a href="https://github.com/apache/arrow/issues/43762">streaming protocol for columnar data</a> that I designed specifically for this Table Read protocol because Arrow <a href="https://arrow.apache.org/docs/format/DissociatedIPC.html">Dissociated IPC</a> appeared too <em>ad hoc</em> to me. The Dissociated IPC protocol is also just a couple of months old, so I&#8217;m not breaking any established convention here. But the rationale and functionality of this proposed streaming protocol are the same as Dissociated IPC&#8217;s: enable packing server-side streams of (Arrow) Arrays into contiguous regions of memory on Client, ready for parallel processing, e.g., on GPU.</p><p>Another objective of the new streaming protocol for columnar data is to be flexible enough to accommodate non-Arrow-encoded columns, such as those permitted by different column transfer encodings (see discussion in the previous section).</p><p>Note that each column group&#8217;s <em>section</em> is an independent streaming protocol interaction between Client and Server, with independent flow control. This is because different sections&#8217; column transfer encodings may result in different data plane configurations in the streaming protocol.</p><p>Client <em>must not</em> request sections within a column group concurrently, only sequentially. Server may decline a streaming protocol interaction if another interaction (with the previous section number) is still in progress. This is the memory control provision for the server side.
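</p><p>The per-interaction flow control used in steps 1X-&#8230; can be illustrated with a small credit-accounting (&#8220;request-n&#8221;) sketch in the Reactive Streams style; the class and method names here are mine, not part of the protocol:</p>

```python
from collections import deque

class CreditFlowController:
    """Application-level flow control for one section's streaming
    interaction: the consumer grants credits, and the producer may emit
    at most that many Arrays before waiting, which bounds server-side
    memory by the outstanding credit."""

    def __init__(self, initial_credit):
        self.credit = initial_credit
        self.pending = deque()  # arrays produced but not yet consumed

    def produce(self, array):
        """Server side: emit an array if credit remains, else signal
        that the producer must wait for more credit."""
        if self.credit <= 0:
            return False
        self.credit -= 1
        self.pending.append(array)
        return True

    def consume(self, grant=1):
        """Client side: take one array and grant `grant` new credits."""
        array = self.pending.popleft()
        self.credit += grant
        return array
```

<p>Because the credit grants live in Client&#8217;s own code rather than in the transport, Client can consume several column groups in lockstep by granting credits to each of them in the same application-level loop.</p><p>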
If the client side needs more transfer parallelism, Agent may request that as a partitioning hint at step #1 above to begin with, and Coordinator <em>may</em> respond with smaller partitions.</p><p>Client <em>may</em> consume different column groups concurrently, although, as an extra resource-controlling measure, Server may perhaps add to <code>consumer_info</code> at step #5 an indicator that column groups must also be consumed by Client only sequentially. If Server permits concurrent consumption of column groups, Client <em>may</em> consume them in lockstep because the streaming protocol uses application-level flow control embedded in Client&#8217;s code, rather than transport-level or otherwise &#8220;hidden&#8221; flow control. This idea is inherited from <a href="https://rsocket.io/about/faq">RSocket</a> and <a href="https://reactivex.io/">Reactive Streams</a>.</p><p>Transfer semantics of <em>indexes</em> (see the section &#8220;Column indexes&#8221; above) are equivalent to transfer semantics for <a href="https://arrow.apache.org/docs/format/Columnar.html#dictionary-messages">DictionaryBatches</a>: see the discussion of DictionaryBatches in the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol description</a>. In fact, Dictionaries themselves are treated just as a specific kind of index.</p><p>The <code>row_offset</code> that Client sends to Server at step #10 is a provision for Table Write protocol, which is based on Table Read protocol. For Read jobs that are not part of Table Write protocol&#8217;s interactions, <code>row_offset</code> always equals 0.</p><h4><strong>Data segment-wide buffer tagging and transfer deduplication</strong></h4><p>The transfer of <em>data segment-wide buffers</em> in the style of <a href="https://blog.lancedb.com/lance-v2/">Lance v2</a> (called &#8220;file-wide buffers&#8221; there) may need to be deduplicated across different column groups.
It should be relatively straightforward to implement tagging and transfer deduplication for such buffers, for example, by off-loading their transfers to a separate &#8220;synthetic stream&#8221; (see &#8220;mode 4&#8221; in the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol for columnar data</a>, and the section on &#8220;multiplexing&#8221;). Client can then check if it still holds a copy of this buffer, and skip the transfer if so.</p><h3>Availability, consistency, and resilience</h3><h4><strong>Table data availability</strong></h4><p>If some of the data segments needed for the Read job are temporarily unavailable on the database side, for example, if the only Server hosting them is restarting, Coordinator should return an <em>empty list of locations</em> for the corresponding partitions. Agent can then periodically retry step #1 with the slice processing plan specialised with the partition filter, as at step #7.</p><p>Steps 6-9 (see the section &#8220;Client&#8217;s logic&#8221; above) cover the possible partition unavailability if the database side performs table data rebalancing among Servers concurrently with the Read job.</p><h4><strong>High Availability (HA) of Agent and Coordinator</strong></h4><p>High availability, as well as progress checkpointing and crash recovery of Agent and Coordinator, are out of scope for table transfer protocols. If a client or a database chooses to implement high availability of Agent or Coordinator, respectively, these node roles might be played by distributed systems in their own right (or, say, a Kubernetes pod). As far as the table transfer protocols are concerned, however, they are single nodes.</p><p>By default, a crash of either Agent or Coordinator entails a failure of the Read job.
Agent and Coordinator should check each other for aliveness, and terminate all the protocol interaction&#8217;s activities and release its resources on their side (client or server side, respectively) if they detect that their counterpart is down.</p><h4><strong>Read consistency</strong></h4><p>Support for read consistency in the face of failing partition requests (due to table data rebalancing on the database side) and failing Clients can be layered on top of the above description of Table Read protocol. At step #2, Coordinator may augment all slice processing plans with a special predicate that ensures that a certain snapshot of table data is read. This special predicate may be implemented as a <a href="https://substrait.io/extensions/">Substrait extension</a> if the server side&#8217;s table consistency model lies outside of table semantics, or simply as an extra filter a la <code>row._last_updated &lt; $TIME</code> if this is the database&#8217;s approach to consistent reads.</p><p>More sophisticated distributed databases may use a <a href="https://en.wikipedia.org/wiki/Vector_clock">vector clock</a> with MVCC sequence versions rather than a simple scalar time. If the vector clock is too large to pass around between Clients and Servers, Coordinator may generate a shorthand &#8220;id&#8221; for this vector clock specifically for this Read job.
However, with such a design, Servers should dereference this &#8220;id&#8221; upon receiving requests at step #4 from Clients, which also takes extra time.</p><p>Regardless of the server side&#8217;s consistency model, all the steps after step #2 in this Read job just propagate the slice processing plans downward without touching the parts they don&#8217;t understand, including these predicates for read consistency.</p><p>Upon receiving the step #7 request from Agent, Coordinator recognises that the requested table processing plan already includes the read consistency predicate, thus realising that this is a retry request, and maintains the predicate in the processing plan(s) returned at step #8.</p><p>For read consistency to be maintained at step #8, Coordinator must prepare its response wrt. a <a href="https://en.wikipedia.org/wiki/Serializability">serializable</a> metadata view, that is, the mapping between Servers and the data segments that they host. This requirement is trivially satisfied if Coordinator is a single node, but is more demanding if Coordinator is a distributed system itself (see the previous section). Most open-source databases use <a href="https://zookeeper.apache.org/">ZooKeeper</a> to ensure this.
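</p><p>The simple scalar-time variant of the read-consistency predicate could be sketched as follows; the plan representation is a stand-in dict, not Substrait, and all field names are assumptions:</p>

```python
def pin_snapshot(slice_plan, snapshot_time):
    """Coordinator-side step #2: augment a slice processing plan with a
    read-consistency filter in the style of `row._last_updated < $TIME`.
    If the plan already carries a snapshot (a step #7 retry), keep the
    existing predicate unchanged."""
    if "snapshot" in slice_plan:
        return slice_plan  # retry: maintain the original snapshot
    filters = list(slice_plan.get("filters", []))
    filters.append({"column": "_last_updated", "op": "<",
                    "value": snapshot_time})
    return {**slice_plan, "filters": filters, "snapshot": snapshot_time}
```

<p>All intermediate steps would propagate this predicate untouched, so only Coordinator needs to understand it.</p><p>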
<a href="https://streamnative.io/blog/introducing-oxia-scalable-metadata-and-coordination">Oxia</a> also offers similar guarantees.</p><h4><strong>Preemption of partition consumption</strong></h4><p>Suppose Agent loses the connection to a Client that is in the process of consuming a lot of data from the server side, assumes the Client dead, and respawns a new Client to do the same consumption, while the original Client is actually still alive and keeps consuming data from Servers. This &#8220;accidental overprovisioning&#8221; of Clients endangers Servers&#8217; serving parallelism limits and may make the &#8220;secondary&#8221; Client(s) predictably fail due to timeouts, either failing the entire Read job or at least making the Read job several times longer.</p><p>To prevent this failure scenario, Agent should generate a &#8220;consumption token&#8221; for each Client&#8212;partition pair and send them to Clients at step #3. When Agent assumes some Client is dead and respawns a new Client, it creates new tokens that refer to the dead Client&#8217;s tokens as &#8220;parents&#8221;. Then, if Server receives requests (step #4) with new tokens, it immediately aborts the partition data processing and streaming associated with the parent tokens, thus freeing the necessary resources.</p><p>Consumption tokens could also be used to implement workload re-distribution (&#8220;stealing&#8221;) from slow Clients or Servers. Such slowness may be a result of data skew.</p><h3>Table Read protocol: implementation notes</h3><p>Table Read protocol&#8217;s implementation complexity could be partially mitigated by creating libraries for certain operations that different node roles should perform, particularly surrounding Substrait plan wrangling.
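</p><p>The consumption tokens from the &#8220;Preemption of partition consumption&#8221; section above are another candidate for such a shared library; this sketch keeps only the abort-the-parent logic, and all names are invented for illustration:</p>

```python
class ConsumptionTokenRegistry:
    """Server-side view of consumption tokens: when a step #4 request
    arrives with a token whose parent is still being served (a Client
    that Agent presumed dead), the parent's streaming is aborted first,
    freeing the associated resources."""

    def __init__(self):
        self.active = {}  # token -> opaque per-stream state

    def begin(self, token, parent=None):
        if parent is not None and parent in self.active:
            self.abort(parent)  # free the presumed-dead Client's stream
        self.active[token] = {"parent": parent}

    def abort(self, token):
        # In a real Server: cancel processing, close connections,
        # release buffers associated with the token's stream.
        self.active.pop(token, None)
```

<p>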
It should be possible to implement these functions only once: in Rust for the core logic, using FlatBuffers to pass Substrait plans across the language boundary between Rust and Java, C++, or Go runtimes.</p><p>Default implementations of the <a href="https://github.com/apache/arrow/issues/43762">streaming protocol for columnar data</a> could be derived from <a href="https://rsocket.io/about/implementations">RSocket implementations</a>.</p><h2>Table Read protocol overview and comparison with Arrow Flight</h2><p>In this section, I summarise the differences between Table Read protocol and Arrow Flight. This section doesn&#8217;t add new information about the protocol over the &#8220;Table Read protocol walkthrough&#8221; above. For the differences in scope between Table Read (and Table Write) protocol and Arrow Flight, see the section &#8220;Cross-cutting and out-of-scope concerns&#8221; above.</p><p>Table Read protocol makes load distribution (both on the client and server sides), reliability, awareness of/readiness for table data redistribution in the database, network data transfer optimisation, and database-side resource (CPU, memory, disk I/O, network I/O, etc.) management explicit, first-class concerns.</p><p>This enables flexible layering of other protocols and systems on top of Table Read protocol. Table <em>Write</em> protocol is the foremost example: it leverages the properties of Table Read protocol (wrt. load distribution, reliability, and other concerns) to provide efficiency and <em>exactly once</em> writing guarantees. I will describe Table Write protocol in the next article (spoiler: client and database sides are swapped, and the database nodes consume &#8220;partitions to be written&#8221;).</p><p>Table Read protocol&#8217;s amenability to composition is due to these concerns (load distribution, reliability, resource management, etc.) being
<em>shared</em> between the client and the server sides in the Read job.</p><p>On the surface, Arrow Flight is simpler than Table Read protocol, but to achieve the same properties as Table Read protocol wrt. load distribution, reliability, etc., a lot of complexity would have to be encapsulated on the database side. Cooperative system design, in contrast, achieves these properties with simpler techniques. Therefore, the database-side complexity of implementing a &#8220;resilient Arrow Flight&#8221; would be even greater than the complexity of Table Read protocol, described above. So, this (relative) simplicity of Arrow Flight is deceptive.</p><p>Moreover, achieving some of these properties is impossible without a layer of database-specific reverse proxy nodes between &#8220;core&#8221; database nodes and Clients. Usually, only &#8220;very serious&#8221; cloud-native data warehouses such as BigQuery, Redshift, Azure Synapse, or Snowflake have such proxies in their design, and most open-source OLAP databases don&#8217;t have them. Also, these proxies prevent data transfer off-loading to NVLink- or InfiniBand-based RDMA protocols<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>.</p><p>I defer a comprehensive comparison of table transfer protocols with open table format-based catalogs (Iceberg, Delta Lake, and Hudi) to the <a href="https://engineeringideas.substack.com/i/148607428/table-transfer-protocols-enable-the-benefits-of-iceberg-without-its-limitations">following article</a>.</p><div><hr></div><p>Thanks to <a href="https://rilldata.com/">Rill Data</a> for sponsoring this work. Rill Data is interested in developing the open data ecosystem rather than promoting any specific database solution.
<a href="https://github.com/rilldata/rill">Rill&#8217;s technology</a> is compatible with various databases mentioned in this article, including ClickHouse, Apache Druid, and DuckDB.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>See &#8220;<a href="https://research.google/pubs/biglake-bigquerys-evolution-toward-a-multi-cloud-lakehouse/">BigLake: BigQuery&#8217;s Evolution toward a Multi-Cloud Lakehouse</a>&#8221; (Levandoski et al., 2024).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>At the moment, I know of only Apache Doris and InfluxDB implementing Arrow Flight.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Individual Servers could still push back processing to Clients on the partition level: see the discussion of the <em>server-side processing behaviour</em> hint below.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This could be additionally controlled by a boolean hint that Agent sends to Coordinator about whether the client side prefers Servers to take up this distributed processing stage, or prefers Servers to push it back to Clients.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>With a possible exception for certain
window aggregation processing semantics: it may demand partition filters to &#8220;overhang&#8221; their parent slice&#8217;s filter.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>Either as primary durable storage or as disk cache for data segments durably stored in the object storage: I mentioned these different designs in the <a href="https://engineeringideas.substack.com/i/147163081/data-warehouses-designed-for-olap-manage-their-table-storage-themselves">previous article</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>The interface between shuffling Clients and the second stage of such a processing job is outside of the scope of Table Read protocol.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Excepting, perhaps, the cloud provider&#8217;s own accelerator platforms that are privy to the corresponding data warehouse&#8217;s internal APIs, such as Google&#8217;s Vertex AI (integrated with BigQuery), Amazon&#8217;s SageMaker (integrated with Redshift), etc.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[The future of OLAP table storage is not Iceberg]]></title><description><![CDATA[Introduction]]></description><link>https://engineeringideas.substack.com/p/the-future-of-olap-table-storage</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/the-future-of-olap-table-storage</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Tue, 30 Jul 2024 18:37:49 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Introduction</h2><p>In the last few years, table catalogs based on Apache Iceberg, Apache Hudi, or Delta Lake <em>table formats</em> (a.k.a. table catalog specifications) for data warehousing, embodying the <a href="https://vutr.substack.com/p/do-we-need-the-lakehouse-architecture">lakehouse architecture</a>, have been widely adopted by data teams and celebrated as a success of the open data management ecosystem.</p><p>All these table formats entirely rely on object storage services (like Amazon S3) or distributed file systems (like HDFS) for storing both the data partition files (usually in Parquet format) and the metadata about the tables and the data partitions. This relieves table catalogs of worrying about data and metadata durability, read availability, and write availability.</p><p>However, as I detail below in this post, this way of modularising table catalogs is fundamentally limited in performance and cost efficiency relative to the architecture in which <strong>table storage is integrated with query (pre-)processing and data ingestion functionalities</strong>, such as in BigQuery, ClickHouse, Apache Doris, Apache Druid, Firebolt, Apache Pinot, Redshift, StarRocks, and other data warehouses. These limitations are especially pronounced in <em>high query volume or low query latency (&#8220;online&#8221;) </em>use cases (OLAP).</p><p>These inefficiencies increase the costs (and/or query and data ingestion latency) for all data teams that use table catalogs based on Iceberg, Hudi, or Delta Lake as &#8220;single gateways&#8221; to their data, either as higher service provider bills or increased provisioning requirements in self-managed setups.
There are also additional drawbacks:</p><ul><li><p>In SaaS setups, the data function is exposed to service failure and service quality risks from more distinct service providers.</p></li><li><p>In self-managed setups, data teams have to manage three separate systems: the object storage (or distributed FS), the table catalog, and the data processing engine, instead of two: the data warehouse that implements all the table catalog, data ingestion, and query processing functions and, optionally, the object storage (or distributed FS) for cheaper storage of cold data.</p></li></ul><p>I suggest <strong>open table transfer protocols</strong> functionally equivalent to <a href="https://cloud.google.com/bigquery/docs/reference/storage">BigQuery Storage APIs</a>. The protocols should be implemented by open-source OLAP data warehouses. They would make the data just <strong>as accessible from different processing engines as object storage-based table formats do</strong>. But unlike adopting these table formats, this would come at <strong>no additional cost, system risk, and operational complexity for data teams who already use some OLAP data warehouse</strong> for its performance benefits and don&#8217;t plan to migrate off of it.</p><div><hr></div><p>This post is the first part of a four-article series.
In this article, I zoom into the architectural limitations of object storage-based table formats relevant to OLAP use cases.</p><p>In the <a href="https://engineeringideas.substack.com/p/table-transfer-protocols-for-universal">second article</a>, I introduce table transfer protocols and describe the design of Table Read protocol.</p><p>In the <a href="https://engineeringideas.substack.com/p/table-write-protocol-for-interop">third article</a>, I will describe Table Write protocol and a table replication method.</p><p>In the <a href="https://engineeringideas.substack.com/p/table-transfer-protocols-improved">fourth article</a>, I summarise the preceding articles and discuss table transfer protocols&#8217; trade-offs relative to object storage-based table formats from the data team&#8217;s perspective. Disclaimer: I think there are few such trade-offs, and if the table transfer protocols were supported by data warehouses and processing engines as widely as Iceberg is supported today, most teams would choose to &#8220;unlock&#8221; the data in their OLAP data warehouse via the table transfer protocols rather than Iceberg for purely technical reasons.</p><div><hr></div><h2>Data warehouses designed for OLAP manage their table storage themselves</h2><p>All OLAP data warehouses with transparent data tiering between different types of disks and/or nodes in the cluster (ClickHouse, Apache Doris, StarRocks, Apache Druid, and Apache Pinot<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>) have to manage their table storage<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> on top of the ordinary file system, memory management, and networking APIs in Linux.
This includes taking care of data and metadata durability, read availability, and write availability at the level of the distributed system.</p><p>ClickHouse, Apache Doris, and StarRocks can also evict cold data from the primary (node-attached) disk storage to object storage or a distributed FS for storage efficiency<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, but it doesn&#8217;t change the fact that they need to manage their tables on their own, coherently across both disk and object storage tiers.</p><p>Most other data warehouses that have a disaggregated architecture with object storage at the bottom, namely Databend, Firebolt, Amazon Redshift (with Redshift Managed Storage), and Snowflake, implement read- and write-through SSD caching (e.g., <a href="https://blog.det.life/i-spent-7-hours-reading-another-paper-to-understand-more-about-snowflakes-internal-4960b2a68841">distributed ephemeral storage</a> in Snowflake) and <em>de facto</em> also have to maintain a table storage view on top of these SSD disks, even if they asynchronously store table metadata in Iceberg.
The SSD cache management is pretty much equivalent to &#8220;full&#8221; table catalog/storage management, with only minor differences in replication and eviction strategies.</p><p>The performance and efficiency trade-offs between these architectural approaches in OLAP-ready data warehouses are very nuanced and application-specific:</p><ol><li><p>Transparent storage tiering between node types, disk types, and object storage (or distributed FS), as in ClickHouse, Apache Doris, and StarRocks.</p></li><li><p>Keeping all data and metadata in the object storage and doing SSD caching on top, as in Databend, Firebolt, Amazon Redshift (with RMS), and Snowflake (native storage).</p></li><li><p>A custom, node-disaggregated table storage, as in BigQuery (native storage).</p></li><li><p>Synchronisation of metadata between the data warehouse&#8217;s custom metadata management system and Iceberg, as in Snowflake (with Iceberg tables managed in Snowflake catalog) and BigQuery (with BigLake-managed Iceberg tables).</p></li><li><p>Other hybrid approaches that fit none of the above descriptions, such as in Apache Druid and Apache Pinot.</p></li></ol><p>Discussing these trade-offs is beyond the scope of this article. The point that I want to make here is that in <em>all</em> these approaches, data warehouses manage table storage themselves in one way or another rather than rely on austere object storage-based table formats and their implementations as separate services (such as Apache Hive&#8217;s Metastore, Tabular, Databricks Unity Catalog, AWS Glue Data Catalog, Dremio, etc.)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>. Otherwise, their OLAP performance and efficiency would suffer.
I explain why in the next section.</p><h2>Why object storage-based table formats are inefficient for OLAP use cases</h2><h4><strong>Network IO amplification: inability to filter or pre-aggregate rows on the storage side</strong></h4><p>When a query includes a simple filter on a table column that is neither a sort column in the Parquet file nor the table partition column, the object storage has to send the entire column of the Parquet file over the network to be filtered on the side of the query processing node. This is wasteful if the filter is highly selective and simple (e.g., an equality or a numeric comparison), so it doesn&#8217;t require much of the storage node&#8217;s CPU time to apply.</p><p>Read amplification also occurs when the query computes total (i.e., non-grouped) aggregates (or aggregates grouped by a low-cardinality column) such as sum, min, max, sum + count (to compute the average in the end), or <a href="https://datasketches.apache.org/">data sketches</a> for approximate quantiles, distinct count, and more. The object storage nodes cannot compute the pre-aggregation results per data partition file and then send these results to query processing nodes for the final aggregation across partitions. The data sketch structure, let alone a single numerical aggregation result, is order(s) of magnitude smaller than the original column to send over the network.</p><h4><strong>Request amplification when accessing many data partition files on the same storage nodes</strong></h4><p>Often, there are many more data partition files to be processed in a query than there are query processing nodes (and storage nodes, with self-managed object storage or distributed FS).
To read each partition file, a separate <em>series of network requests</em> must be made to the object storage (one request to read the Parquet file footer, then a separate request to read every needed column within each row group) because the query processing nodes don&#8217;t know how the file&#8217;s blocks are physically placed on the object storage nodes.</p><p>This request amplification costs money on the most popular S3, GCS, and Azure Blob object storage services: they all charge for every request made to their services.</p><p>However, even if the object storage service doesn&#8217;t charge extra for each request (e.g., Wasabi), query processing nodes still incur the overhead of opening and managing 1-2 orders of magnitude more network connections, data buffers, and query execution sub-streams than would otherwise be needed. In a setup with data warehouse-managed table storage on disks or self-managed object storage, the data could be transmitted using just one network connection per pair of storage and query processing nodes that have to hand over data for a particular query and execution topology.</p><h4><strong>The tradeoff between ingestion data latency and the &#8220;small file problem&#8221; with storage amplification</strong></h4><p>The copy-on-write approach in the object storage-based table formats to updating metadata and <a href="https://www.dremio.com/blog/row-level-changes-on-the-lakehouse-copy-on-write-vs-merge-on-read-in-apache-iceberg/">copy-on-write or merge-on-read</a> approaches to updating data partitions during ingestion lead to high write and storage amplification due to the object storage overhead of managing a lot of small files (the so-called &#8220;small file problem&#8221;) if the data engineer wants to commit data to the object storage with high frequency, e.g., once every few minutes, to make the ingested data available for querying with that latency.</p><p>Table catalogs mitigate this storage amplification
with background compaction and &#8220;vacuum&#8221; procedures. These procedures themselves are relatively CPU-intensive and lead to even <em>more</em> write amplification. The latter is not a concern for almost all object storage services that don&#8217;t charge anything extra for written bytes and file deletes, but this could be somewhat of a concern in setups with self-managed object storage or distributed FS because the burden of the heightened disk write volume and file churn will fall on their shoulders. </p><p>However, even if the data team is ready to pay this write amplification price for the data ingestion latency of a few (tens of) minutes, <em>instant</em> insertion latency is completely unachievable unless stateful ingestion nodes with disks are considered by the table catalog. This cannot fit into object storage-only table format designs of Iceberg, Delta Lake, and Hudi.</p><p>This is why ClickHouse, Druid, Pinot, Doris, StarRocks, Firebolt, and other data warehouses ingest data on nodes with SSD disks (making these inserts queryable instantly) and manage batching, indexing, and compaction of the data partitions into columnar format (stored also on disks, or in object storage) asynchronously. This is also why Snowflake has ultimately added <a href="https://docs.snowflake.com/en/user-guide/tables-hybrid">hybrid tables</a>. 
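</p><p>A back-of-the-envelope model makes the tradeoff between commit frequency and write amplification described above concrete. The commit interval, partition count, and compaction-pass count below are illustrative assumptions, not measurements of any system:</p>

```python
# Rough model of the "small file problem": frequent commits create many
# small files, and background compaction then rewrites the same bytes.
# All parameters are illustrative assumptions.

def files_per_day(commit_interval_minutes: float, partitions_per_commit: int) -> int:
    """Each commit writes at least one new file per touched partition."""
    commits_per_day = (24 * 60) / commit_interval_minutes
    return int(commits_per_day * partitions_per_commit)

def write_amplification(compaction_passes: int) -> int:
    """The initial write, plus one full rewrite per compaction pass."""
    return 1 + compaction_passes

# Committing every 5 minutes into 10 partitions yields 2880 small files/day,
# and two compaction passes mean every ingested byte is written 3 times.
print(files_per_day(5, 10))       # 2880
print(write_amplification(2))     # 3
```

<p>Halving the commit interval (better data freshness) doubles the daily file count, and compaction can only trade that file count back for extra written bytes.</p><p>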
Object storage-based table catalogs are blind to this freshly ingested data by design.</p><h2>Object storage-based table formats miss the innovations in file formats for data partitions</h2><p>There is a lot of innovation in the space of file formats for columnar data partitions: see <a href="https://www.youtube.com/watch?v=bISBNVtXZ6M&amp;ab_channel=VeloxCon">Nimble</a>, <a href="https://blog.lancedb.com/lance-v2/">Lance</a>, <a href="https://github.com/maxi-k/btrblocks">BtrBlocks</a>, <a href="https://github.com/spiraldb/vortex">Vortex</a>, not to mention internal file formats of ClickHouse, Doris, Druid, Pinot, StarRocks, and other data warehouses that also steadily improve.</p><p>Apart from &#8220;mundane&#8221; improvements in the file layout for columnar data and skip indexing that permits reading as few file blocks as possible, the new file formats innovate on new column data types and encodings, such as for vector embeddings. These file formats also have first-class support for secondary indexes, such as inverted/bitmap, full-text, vector, geo, or star-tree indexes. These indexes deliver the most benefit in use cases with a high volume of (OLAP) queries of a particular kind, such as those generated by interactive analytics, text search, geo search, and AI apps.</p><p>Secondary indexes in data partition files also synergise with storage-side filtering and pre-aggregation (as discussed in the section &#8220;Network IO amplification&#8221; above) because secondary indexes reduce the CPU and memory requirements of these pre-computations and thus permit storage nodes (which can be relatively CPU- and memory-poor) to do more queries with such pre-computations in parallel.</p><p>Note that the argument in this section is not purely technical, and is not about using object storage per se. 
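</p><p>As a minimal illustration of the skip indexing mentioned above, here is a sketch of a min/max &#8220;zone map&#8221;, the simplest block-skipping structure; real formats keep much richer per-block statistics and secondary indexes, and the data below is purely illustrative:</p>

```python
from typing import List, Tuple

def build_zone_map(values: List[int], block_size: int) -> List[Tuple[int, int]]:
    """Per-block (min, max) statistics over fixed-size blocks of a column."""
    zone_map = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        zone_map.append((min(block), max(block)))
    return zone_map

def blocks_to_read(zone_map: List[Tuple[int, int]], lo: int, hi: int) -> List[int]:
    """Indices of blocks that may contain values in [lo, hi];
    all other blocks can be skipped without any read."""
    return [i for i, (bmin, bmax) in enumerate(zone_map)
            if bmax >= lo and bmin <= hi]

values = [1, 3, 5, 20, 22, 25, 90, 95, 99]
zm = build_zone_map(values, 3)     # [(1, 5), (20, 25), (90, 99)]
print(blocks_to_read(zm, 21, 24))  # [1]: only the middle block is fetched
```

<p>The same pruning principle is what makes storage-side filtering and pre-aggregation cheaper: the fewer blocks survive the index check, the less CPU, memory, and network IO the pre-computation costs.</p><p>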
Data warehouses can benefit from innovative data partition file formats even for the cold data that is stored in the object storage rather than on the disks attached to the data warehouse&#8217;s nodes.</p><p>Rather, this argument is about the fact that object storage-based table formats are forced to make the data partition file formats part of their specifications.</p><p>Although the &#8220;big three&#8221; table formats allow some flexibility in data partition file formats, that is, choosing between Apache Avro, ORC, and Parquet, they cannot easily add new formats to this list because every processing engine, not just the table catalog implementations, must separately support reading them. Many processing engines would lag in adding support for these new file formats. This would undermine the original selling point of the table formats, namely, unlocking the table data for querying from arbitrary processing engines in parallel to the primary data warehouse.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a></p><p>In practice, even if Iceberg, Hudi, or Delta Lake eventually add new data partition file formats (such as Nimble), it will take a lot of time and discussion among the numerous stakeholders of these table formats.</p><p>I don&#8217;t imply that new things should be added hastily to open-source table formats on which a lot of stakeholders depend. 
Unfortunately, on-disk format specifications necessarily move very slowly.</p><p>However, a table transfer <em>network protocol</em> would be a more compact interface that can nevertheless benefit from the innovation in the storage layer, while also <em>enabling</em> this innovation by abstracting from storage layout concerns, as I describe in the <a href="https://engineeringideas.substack.com/p/table-transfer-protocols-for-universal">second part</a> of this three-article series.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Apache Druid and Apache Pinot delegate to the object storage for the durability of &#8220;historical&#8221; data partitions but ensure read and write availability, as well as the durability of freshly ingested data on their own.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>I use the term &#8220;table storage&#8221; here rather than &#8220;table catalog&#8221; because few of these data warehouses explicitly encapsulate the functions of table catalogues (such as Iceberg) in an internal component or abstraction. Rather, the realisation of these functions is spread across the data warehouse internals and is intertwined with the implementation of query execution, resource management, and other things that data warehouses do. This lack of encapsulation and, therefore, the lack of (process) isolation of the table storage abstraction is not an issue in itself, except in very exotic situations such as the data warehouse failing cluster-wide due to a &#8220;poison pill&#8221; query. 
If the data warehouse were to implement the open table storage protocol that I propose in the second part of this series, it would naturally introduce this table storage abstraction at least at the level of its source code organisation.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Object storage and distributed file systems apply <a href="https://en.wikipedia.org/wiki/Erasure_code">erasure coding</a> to achieve &lt;1.5x storage overhead for durability. In all open-source data warehouses with disk-based storage that I know of, erasure coding is not implemented (<a href="https://github.com/ClickHouse/ClickHouse/issues/2804#issuecomment-410338530">and is probably impractical because it also involves query performance trade-offs</a>) and a simple replication scheme is used. This entails 2x or 3x storage overhead.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>An example of the latter, &#8220;table catalog-first&#8221; architecture is Redshift Spectrum (or Redshift Serverless) + AWS Glue Data Catalog for managing Iceberg tables + AWS S3 for storing Iceberg data and metadata. 
If this approach were performance-optimal and delivered acceptable latency in OLAP use cases, there would be no need for the Redshift Managed Storage offering.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This concern would be somewhat alleviated if a lot of these query engines converged on using <a href="https://datafusion.apache.org/">Apache DataFusion</a> as an embedded library for storage access. Then, only this library would need to promptly add support for reading the new data partition file formats.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[simplicity/acc: Why We Must End Human Programming Jobs]]></title><description><![CDATA[Software engineers often gravitate towards projects with minimal non-software components and minimal direct interaction with the real world.]]></description><link>https://engineeringideas.substack.com/p/simplicityacc-why-we-must-end-human</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/simplicityacc-why-we-must-end-human</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Mon, 15 Jul 2024 17:33:09 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Software engineers often gravitate towards projects with minimal non-software components and minimal direct interaction with the real world. 
This tendency, akin to searching for lost keys under a streetlight, stems from their desire to avoid real-world bottlenecks and business risks that can leave them feeling idle or powerless to influence the fate of the product they are working on.</p><p>It's well-documented that the financial industry has grown disproportionately and its present GDP share is much bigger than its actual contribution to the economy. However, this over-bloated financial industry still fails to protect the economy from major crises, such as the Great Recession of 2007-2008.</p><p>I'd conjecture that the same happens with software.</p><p>Nathan Marz of <a href="https://redplanetlabs.com/">Red Planet Labs</a> suggests radically simplifying the development of scalable web apps (by 100x in terms of software engineering effort, while also generally making systems more reliable and efficient) by cutting through the <a href="https://blog.redplanetlabs.com/2024/01/09/everything-wrong-with-databases-and-why-their-complexity-is-now-unnecessary/">bloated and over-complicated database programming paradigm</a>. Software engineers don't seem to be overly excited about this. These developments could eliminate many well-paid software engineering jobs at large companies. Software engineers act as bureaucrats who hold on to their power by retaining headcount.</p><p>This &#8220;bureaucratisation&#8221; of the software industry may seem like a (relatively) benign way to distribute resources in the economy. This is how banks have established and grown their compliance departments in part as a &#8220;social responsibility&#8221; response, to employ a lot of accountants and other back-office workers whose jobs became unnecessary with the computer automation of finance and accounting operations in the 1990s.</p><p>Alas, bloated, inefficient, unreliable, wicked software will be written by human software engineers to increase their job security. 
<strong>The side effect of this is that the software brings less value to its users and other stakeholders.</strong></p><p>Open-source business models also incentivise companies to create software that is difficult to operate because then these companies can sell their support services. Open-source software companies are also incentivised to create overly complicated and/or sprawling service APIs (or just new interfaces, standards, and protocols, leading to the fragmentation of standards) to lock in their customers by making it harder for competitor companies or open-source enthusiasts to re-implement these APIs.</p><p>To be clear, I don&#8217;t claim that professional programmers always write software as complicated and bloated as possible and are not motivated by the real-world value of their products. This would be absurd. Of course, many programmers care about the elegance and simplicity of the software they are creating. I also don&#8217;t think that most programmers overcomplicate software consciously and deliberately.</p><p>Creating <em>ideally simple</em> software for the task, i.e., with no <a href="https://en.wikipedia.org/wiki/No_Silver_Bullet">accidental complexity</a> at all, requires a lot of cognitive effort in itself, which may be impractical to expend, given the expected lifetime of the software. So, software created in practice is expected to carry some accidental complexity.</p><p>However, at the end of the day, it appears that the vast majority of software systems (created by human programmers today) are significantly overcomplicated beyond that optimum. 
And many important systems are over-complicated astronomically.</p><p>So, regardless of whether this should be explained by bad incentives or simply cognitive limitations of human programmers (creating simple software requires thinking a little harder), I argue that <strong>we, human software engineers, should replace ourselves with AI software engineers as soon as possible and as fully as possible.</strong></p><p>AI programming is coming. AI will even be able to make sense of large swaths of spaghetti code, scattered across dozens of files in million-LOC codebases. AI could soon write 99% of new production code, but will not do so in certain industries, as described below.</p><h3><strong>Software-regulated industries</strong></h3><p>In areas and industries where it will be mandated that all code is still reviewed and vetted by people, such as aerospace, the hardware designs and the safety standards will likely grow only more complicated, in part because systems and software engineers (humans) assisted by AIs could &#8220;hold together&#8221; larger and more complicated designs. Safety specifications and protocols also tend to grow ever more complicated.</p><p>Apart from aerospace engineering, this may happen in nuclear, power systems, and medical device engineering. 
This is unlikely to make this software noticeably safer or more reliable (sans significant breakthroughs in <a href="https://groups.google.com/g/guaranteed-safe-ai">AI-assisted automatic systems verification</a>), but will surely increase its development and maintenance costs.</p><p>Fortunately, this seems unlikely to happen in car autopilots, where Tesla already replaced complicated autopilot software with end-to-end NNs (i.e., <a href="https://karpathy.medium.com/software-2-0-a64152b37c35">software 2.0</a>), and regulators didn&#8217;t object.</p><h3><strong>Other industries</strong></h3><p>In the industries without software certification for safety or a strong &#8220;human programmer lobby&#8221; (there are few industries with such a lobby: perhaps OS, compiler, and database engineering), when it becomes evident to the business that AI can program better than humans <em>and</em> debug software better than humans <em>and</em> explain the behaviour of the software to stakeholders (including by writing pseudocode and drawing architecture diagrams if needed) better than humans, the business should eventually dismiss almost all human programmers from their jobs entirely, so that humans don&#8217;t even review the code written by AI.</p><p>I&#8217;m unsure about the strength of the &#8220;human programmer lobby&#8221; in the financial and banking industries (and not everything depends on this lobby). But it would be very unfortunate if finance and banking succumbed to the paradigm where people are required to review and vet all code, as described in the previous section.</p><p>Presently, in most software development teams, only human programmers themselves (rather than product designers, product managers, or business analysts) ultimately hold the most complete view of the functional behaviour of the software, let alone its operational characteristics. 
To dismiss human programmers, the model of software&#8217;s behaviour should be externalisable (e.g., to documentation), recoverable from source rather than programmers&#8217; heads, and explainable to people at any requested level of detail, all by AI. So, I think it&#8217;s a good idea to develop projects in this area, such as <strong>AI tools for generating, improving, and finding inconsistencies in software documentation, software architecture diagramming, etc.</strong></p><h3>Governing software complexity with metrics</h3><p>After the business gives up the idea that humans should understand all the code that operates the products, the complexity of their stacks should be governed by a suite of <a href="https://en.wikipedia.org/wiki/Programming_complexity#Measures">software complexity metrics</a>. I think it&#8217;s probably a good idea to develop new such metrics today.</p><p>Governing software complexity is important not just internally for the business, e.g., because hard-to-predict emergent failure modes may stem from complicated interactions of too many components, and so that AI doesn&#8217;t accidentally write software it won&#8217;t know how to fix later (cf. <a href="https://www.laws-of-software.com/laws/kernighan/">Kernighan&#8217;s Law</a>).</p><p>The users and other stakeholders of software products should also govern its complexity because, as I noted above, over-complicated software tends to be less efficient and harder to maintain and operate, i.e., to have a higher total cost of ownership. Without external oversight, it will be too easy for software vendors to argue that their software has minimal accidental complexity and externalise the overhead to the users.</p><h3>Open-source software</h3><p>I expect rapid commodification in the open-source ecosystem. 
Open-source software vendors will lose their clout if &#8220;AI SREs/DevOps/DBOps&#8221; emerge that can operate their software as well as humans.</p><p>The software shouldn&#8217;t necessarily be no-configuration, low-configuration, or &#8220;self-running&#8221; by design: it&#8217;s fine and relatively cheap for an LLM-based AI agent to operate the software. But it&#8217;s important for the software to have a good observability harness, or perhaps even first-class thinking about the accompanying SRE agent. I think it&#8217;s a good idea to <strong>develop frameworks for creating such SRE agents for various bespoke software from source, docs, tests, and telemetry.</strong></p><p>When choosing open-source software these days, I&#8217;d recommend paying attention to the quality of its observability harness, as well as the helpfulness, understandability, and configurability of its logs, or its <a href="https://en.wikipedia.org/wiki/Instrumentation_(computer_programming)">instrumentability</a>, more than before.</p><p>One of the important effects of open source software is the reduced cost of software creation thanks to the re-use of components, as well as sharing the competencies of operating certain software (such as databases) between companies, considering that growing such competencies in human operators and SREs is expensive and takes a long time.</p><p>With frameworks for building AI software operators, this should get much cheaper and faster, hence it will make more sense to build more bespoke, integrated software for the needs of a particular product. At the same time, software integration permits reducing the integral complexity of the system, as Nathan Marz observes <a href="https://blog.redplanetlabs.com/2024/01/09/everything-wrong-with-databases-and-why-their-complexity-is-now-unnecessary/">here</a>.</p><p>Thus, it&#8217;s also plausible that open-source software, unlike API standards and specifications, will deteriorate. 
Instead, AI programmers will tend to create &#8220;collapsed&#8221; software bundles (&#8220;<a href="https://www.monolithic.dev/">monoliths</a>&#8221; if you wish) tailored for the specific task. Generally, this should be a good thing, albeit retaining some affinity to a limited number of software architecture patterns will still be important for explaining software to humans.</p><h3><strong>The reaction of human programmers</strong></h3><p>As long as human software engineers are employed by companies, they will have an incentive to grow the complexity of their software stacks for job security. Thus, <em>for the benefit of the stakeholders of these companies, they should shed human software engineers as quickly as possible.</em> This is a less extreme version of what Musk did when he bought Twitter. Or perhaps, this fast reaction was the only way to perform the &#8220;surgery&#8221; and not trigger an &#8220;immune reaction&#8221; by the organisation that might have killed it.</p><p>Unfortunately, this is not for the benefit of these software engineers and their families. Furthermore, software engineers often don&#8217;t have a lot of transferable skills.</p><p>The reaction of many software engineers will be to start their own businesses to develop software products.</p><p>As per the observation at the beginning of this post, these products will be biased towards being software-centric, i.e., software for software that manages other software, thus still promoting software bloat and an excessive (suboptimal) level of competition in software-centric industries such as database engineering, observability SaaS engineering, DevOps and CloudOps engineering, etc.</p><p>I don&#8217;t see a policy measure that could effectively counteract this dynamic. 
But <strong>everything that makes real-world-facing product engineering broadly smoother, easier, and less frustratingly slow for programmers should help</strong>, such as:</p><ul><li><p>Better and cheaper sensors and other robotic and IoT components</p></li><li><p>Open hardware standards and interfaces</p></li><li><p>Better customer feedback systems, perhaps borrowing or developing some ideas from <a href="https://pol.is/home">Pol.is</a></p></li><li><p>Temporal pattern recognition, pattern categorisation, anomaly and emergent behaviour detection AI on top of raw multimodal sensor data (video, audio, telemetry). Example: <a href="https://www.motifanalytics.com/">Motif Analytics</a>, but think about the same ideas applied to robotics, hardware products, and swarm systems analytics.</p></li><li><p>Systems for temporally joining multimodal data and metrics from different systems and sources operating at the same time and place (or used by the same company or human), with <a href="https://engineeringideas.substack.com/p/the-open-source-stack-for-decision">causal inference</a>, anomaly detection, and root cause analysis on top of these correlated data streams. 
This should make it easier to debug unanticipated system interference (or seize the benefit of unanticipated system synergy) in real-world deployments.</p></li></ul><p>And even something relatively extravagant, such as <a href="https://www.earthspecies.org/">decoding animal languages</a>, which may create new product markets for animals (I&#8217;m not joking).</p><h3>Summary: the simplicity/acc manifesto</h3><ul><li><p>Apply AI power to create simple software.</p></li><li><p>Create more a la carte tools (such as debuggers, observability, modellers, simulators, security analysers, verifiers, AI-first DevOps, CI/CD tools) to empower AI to create, maintain, and explain simple software more reliably and effectively.</p></li><li><p>Create more real-world-facing software than software-facing software.</p></li><li><p>Make it easier for system designers and developers to receive and account for the diverse feedback from the real world and the stakeholders.</p></li><li><p>Spend the software complexity &#8220;budget&#8221; on the essential complexity of accommodating diverse, interacting users&#8217; and stakeholders&#8217; needs rather than on the accidental complexity of &#8220;self-consumed&#8221; software.</p></li></ul><div><hr></div><p>P.S. 
Thanks to <a href="https://world.hey.com/dhh">David Heinemeier Hansson</a>, <a href="https://www.youtube.com/playlist?list=PLZdCLR02grLrEwKaZv-5QbUzK0zGKOOcr">Rich Hickey</a>, and <a href="https://x.com/nathanmarz">Nathan Marz</a> (in alphabetical order) whose writing and presentations inspired a good part of my thinking above.</p>]]></content:encoded></item><item><title><![CDATA[The two-tiered society]]></title><description><![CDATA[On AI and Jobs: How to Make AI Work With Us, Not Against Us With Daron Acemoglu]]></description><link>https://engineeringideas.substack.com/p/the-two-tiered-society</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/the-two-tiered-society</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Mon, 13 May 2024 07:58:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>On <a href="https://www.humanetech.com/podcast/ai-and-jobs-how-to-make-ai-work-with-us-not-against-us-with-daron-acemoglu">AI and Jobs: How to Make AI Work With Us, Not Against Us With Daron Acemoglu</a></p><p>Here is Claude.ai's summary of Daron Acemoglu's main ideas from the podcast:</p><blockquote><ul><li><p>Historically, major productivity improvements from new technologies haven't always translated into benefits for workers. It depends on how the technologies are used and who controls them.</p></li><li><p>There are concerns that AI could further exacerbate inequality and create a "two-tiered society" if the benefits accrue mainly to a small group of capital owners and highly skilled workers. 
Widespread prosperity is not automatic.</p></li><li><p>We should aim for "machine usefulness" - AI that augments and complements human capabilities - rather than just "machine intelligence" focused on automating human tasks. But the latter is easier to monetize.</p></li><li><p>Achieving an AI future that benefits workers broadly will require changing incentives - through the tax system, giving workers more voice, government funding for human-complementary AI research, reforming business models, and effective regulation.</p></li><li><p>Some amount of "steering" of AI development through policy is needed to avoid suboptimal social outcomes, but this needs to be balanced against maintaining innovation and progress. Regulation should be a "soft touch."</p></li><li><p>An "AI disruption reduction act," akin to climate legislation, may be needed to massively shift incentives in a more pro-worker, pro-social direction before AI further entrenches a problematic trajectory. But some temporary slowdown in AI progress as a result may be an acceptable tradeoff.</p></li></ul></blockquote><p>The prospect of a two-tiered socioeconomic order looks very realistic to me, and it is scary.</p><p>On the one hand, this order won't be as static as feudal or caste systems: sure thing, politicians and technologists will create (at least formal) systems for vertical mobility between the lower tier, i.e., people who just live off UBI, and the higher tier: politicians, business leaders, chief scientists, capital and land owners.</p><p>On the other hand, in feudal and caste systems people in all tiers have their role in the societal division of labour from which they can derive their sense of usefulness, purpose, and self-respect. It will be more challenging for those "have-nots" in the future AI world. 
Not only will their labour not be valued by the economy; their family roles will also be eroded: teacher for their own kids (why would kids respect them if AI is vastly more intelligent, empathetic, ethical, etc.?), lover for their spouse (cf. VR sex), bread-winner (everyone is on UBI, including their spouse and kids). And this assumes they will have a family at all, which is increasingly rare, whereas in feudal and caste societies most people were married and had kids.</p><p>Vertical mobility institutions will likely grow rather dysfunctional as well, akin to the education systems in East Asia where the youth are totally deprived of childhood and early adulthood in the cutthroat competition for a limited number of cushy positions at corporations, or academic tenure in the US. If the first 30 years of people's lives are a battle for a spot in the "higher tier" of the society, it will be very challenging for them to switch to a totally different mindset of meditative, non-competitive living like doing arts, crafts, gardening, etc.</p><p>Although many people point out the dysfunctionality of positional power institutions like the current academia, governments, or corporations, the alternative "libertarian" spin on social mobility in the age of AI is not obviously better: if AI enables very high leverage in business, social, or media entrepreneurship, the resulting frenzy may be too intense for the entrepreneurs, their customers, or both.</p><h2>Response approaches</h2><p>I'm not aware of anything that looks to me like a comprehensive and feasible alternative vision to the two-tiered society (if you know of such a vision, please let me know).</p><p>Daron Acemoglu proposes several economic and political responses that sound at least like they could help to steer the economy and the society in some alternative place, without knowing what place that is (which in itself is not a problem: vice versa, thinking of any alternative vision as a likely target would be 
a gross mistake and a disregard for unknown unknowns):</p><ul><li><p>Tax reforms to favour employment rather than automation</p></li><li><p>Foster labour voice for a better power balance at companies</p></li><li><p>A federal agency that provides seed funding and subsidies for human-complementary AI technologies and business models. Subsidies are needed because "machine usefulness" is not as competitive as "machine intelligence/automation", at least within the current financial system and economic fabric.</p></li><li><p>Reforming business models, e.g., a "digital ad tax" that should change the incentives of media platforms such as Meta or TikTok, and improve mental health. Cf. <a href="https://www.heymaven.com/">Maven social network</a> without followers and likes.</p></li></ul><p>This all sounds good to me, but this is not enough. We also need other political responses (cf. <a href="https://cip.org/">The Collective Intelligence Project</a>), and new design ideas in the methodology of human&#8211;AI cooperation (such as in the <a href="https://engineeringideas.substack.com/i/143489768/interlude-why-bother">business analytics space</a>), social engineering (cf. <a href="https://www.gameb.wiki/index.php?title=Game_B">Game B</a>), and psychology, as a minimum.</p><p>If you know of interesting research in these or other directions that could help reach a non-tiered society, please comment.</p><p>Upd. 
See comments on <a href="https://forum.effectivealtruism.org/posts/DQjcCtDFiJQPyyA4i/the-two-tiered-society">EA Forum</a> and <a href="http://lesswrong.com/posts/P6LHd2Js7jGdvf4E8/the-two-tiered-society">Lesswrong</a>.</p>]]></content:encoded></item><item><title><![CDATA[The vision for AI-assisted decision-making in 3-5 years]]></title><description><![CDATA[In the previous post, I suggested a possible AI-first technological stack for (business) decision intelligence, almost not saying anything about the decision-making method itself, apart from the first paragraph where I said that it should be based on state-of-the-art decision theories.]]></description><link>https://engineeringideas.substack.com/p/the-vision-for-ai-assisted-decision</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/the-vision-for-ai-assisted-decision</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Thu, 11 Apr 2024 17:04:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/youtube/w_728,c_limit/oEJBH5pFims" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the <a href="https://engineeringideas.substack.com/p/the-open-source-stack-for-decision">previous post</a>, I suggested a possible AI-first <em>technological stack</em> for (business) decision intelligence, almost not saying anything about the decision-making <em>method</em> itself, apart from the first paragraph where I said that it should be based on state-of-the-art decision theories. 
But in reality, telling decision makers in business or public service that they should use &#8220;quantum-like Bayesian decision theory&#8221; is just counterproductive: not only because these people obviously won&#8217;t treat such advice seriously and won&#8217;t have time for that, but also because people <em>already</em> use evolutionarily optimal decision theories intuitively, and trying to make well-honed intuitive processes more conscious (&#8220;rational&#8221;) often backfires.</p><h3>&#8220;Behind the scenes&#8221; decision-making management</h3><p>If we don&#8217;t consider direct brain-computer interfacing (in the longer term, we must consider it, but this is out of the scope of this writing), <strong>the best way to improve human decision making is for the AI to converse with people naturally, propose and debate different options/solutions</strong>, surface various considerations and decision constraints, and discuss the risks. This conversation should not necessarily be exclusively oral: it may include phases where people or AIs prepare, read, and edit decision documents a la Amazon&#8217;s six-pagers.</p><p>Below in this writing, <strong>I will call this concept of an AI that collaborates with people to make better decisions a </strong><em><strong>co-executive AI</strong></em><strong> (cEAI).</strong></p><p>cEAI needs to build &#8220;theories of mind&#8221; (ToMs) of the decision makers and experts it is interacting with (while also recognising and accounting for the fundamental limitation of such mental modelling: it will never be very accurate or complete), identify the weaknesses in the collective understanding or the main disagreements between the people involved in making the decision (including itself, i.e., the cEAI, in this circle), and plan the further preparation, discussion, and the making of the call.</p><p>Of course, this &#8220;decision-making process management&#8221; done by cEAI should be transparent to 
people, whether the decision makers themselves or other people who are supervising, investigating, or reflecting on the process, either in real-time or <em>post factum.</em> cEAI might itself proactively reveal some aspects of this management backstage to the decision makers if cEAI deems this will be useful in the given situation for ultimately making a better decision. But in general, <strong>during a tough decision-making process, human attention should be freed as much as possible from this kind of &#8220;meta&#8221; process management and instead focused fully on the &#8220;object-level&#8221; problem at hand.</strong> Human attention is a very limited resource!</p><h3>Improving decision makers&#8217; world models</h3><p>Among other steps in the decision-making process, cEAI may conclude that some of the people involved are not familiar with some evidence or even don&#8217;t have some foundational theories or skills that are important for making reasonably informed and high-quality judgments in the given context.</p><p>The &#8220;evidence&#8221; may be some numerical facts (&#8220;What&#8217;s our customer churn and ROI in the last quarter?&#8221;) or the elements of some quantitative causal models of the problem domain (&#8220;What&#8217;s the causal relationship between individual vaccination and the risk of getting a disease?&#8221;) or anecdotal evidence: for example, cEAI may recommend that the decision makers talk with actual customers to get a feel for their experience. 
Note that even if cEAI has access to the recordings of the interviews with these customers and can process this anecdotal feedback much better at scale than human decision-makers, considering the ToM limitations that I mentioned above, it will generally be better to try to expose human decision-makers to as much &#8220;raw&#8221; uncompressed information as possible (which might still not be very much though, given their human time and information-processing bandwidth constraints).</p><p>The &#8220;foundational theories or skills&#8221; could be anything from basic math (cf. the recurring proposals of making some math proficiency level at the time of the election a formal eligibility criterion for the U.S. presidency) or sciences (can people who are not professional social psychologists, demographers, or anthropologists by trade be in charge of a social policy, or is simply consulting these experts, or an AI that has read all the textbooks and papers on these subjects, enough?) to the mastery of the professional domain (can someone who hasn&#8217;t progressed through the rungs of an industrial company such as an automaker or a fast-food chain from the bottom be an effective leader in such a company, maybe setting aside the trust and authority aspects and focusing only on the quality of the decisions?).</p><p>There are some interesting open questions on whether some of these things are actually needed: there is a general Enlightenment-inspired idea that better education is always better for ethics and decision making, but the evidence is not black and white: for example, in <em>The Tyranny of Merit</em>, Michael Sandel points out that many of the best government leaders were <em>not</em> well-educated. 
In turn, this raises a reasonable counter-argument that perhaps the <em>content</em> of the traditional university education or MBA programs is not what leaders should learn.</p><p>Regardless, I think that human decision makers should at least understand the mathematical models that cEAI may suggest using for the domain, such as regressions or differential equation models. The models that are suitable for most problem domains are not very difficult to grasp (namely, regressions and differential equation models are late middle school or early high school level math), but in specific domains calling for more complex models, the quality of decisions may be limited by the ability of the people to understand these models. I expand on this point below in this writing.</p><p>I mention the scenario of &#8220;cEAI recommending the human to upskill or learn some theory before making a decision&#8221; to generalise the idea that to make a better judgment, humans exercise not just their business or political intuition or ethics in isolation, but the entirety of models in their heads (on all levels: <a href="https://www.lesswrong.com/posts/fqfAmAGFLKpsnjfJB/goal-alignment-without-alignment-on-epistemology-ethics-and">methodology, science, and fact</a>). These models cannot be disentangled from one another at the moment the judgement is made. 
The &#8220;decision-making discussion&#8221; between humans and the cEAI may and should help to mutually correct the models of the world that they hold, but probably mostly on the fact and domain model levels, rather than on the basic science and methodological levels.</p><h3>Interlude: why bother?</h3><p>You may wonder at this point: if cEAI will have so much more knowledge, &#8220;reasoning compute&#8221;, attention space, and probably soon even the human theory of mind, why should businesses and governments even bother keeping humans in the loop and not delegate decision making completely to the AI?</p><p>The temporary answer, which should hold for just about the next 3-5 years, is that people can glue together observations from many sources, which may span across organisational boundaries, into a unified model, such as when a decision maker visits the customer&#8217;s site and talks to their employees offline. The AI alone will not be able to cross these boundaries any time soon: at the very least, it will take the next 3-5 years until always-on wearable recording devices such as <a href="https://humane.com/">AI Pin</a> are adopted in business en masse. And this will take even longer than 5 years in the public sector, where it will be met with much stronger resistance due to concerns about privacy/security and general conservatism. 
Replacing humans with robots altogether would also, obviously, take longer than just equipping people with small wearables.</p><p>There are other temporary answers, such as:</p><ul><li><p>Humans are better than AIs at noticing subtle patterns and clues of something going wrong from a stream of multimodal unstructured data: video/audio observations and chatting with people.</p></li><li><p>Humans are better than AIs at curiosity-driven exploration and inquiry that may lead to the discovery of these patterns from the previous item.</p></li><li><p>Humans are better than AIs at weighing multiple aspects of a decision: technical/engineering, economic, legal, PR, social, ethical, etc.</p></li><li><p>Humans are better than AIs at glueing the pieces of their world models, not only across the organisational boundaries as discussed above, but also across the different system levels: from the interpersonal relations among the team members to macro-economic and political trends.</p></li></ul><p>While probably mostly true as of 2024, I think these advantages of humans will not hold against AI for more than three years. Therefore, after that point, the AI&#8217;s decision quality will be limited by the ability to do the &#8220;leg work&#8221; of gathering the multimodal unstructured data that today enables people to make better intuitive inferences and thus decisions. In other words, this is the same limitation that I first noted in this section. 
The AI will overcome it with the sufficient uptake of wearable recording devices, and, for non-digital businesses, with the installation of more cameras at the production and service sites.</p><p>Then will come the turn of legal, political, and economic answers:</p><ul><li><p>The laws will not change anytime soon and will maintain that only <em>humans</em> are liable for all business, public service, medical, financial, and legal decisions, and it&#8217;s absurd to hold people accountable for the decisions (made on their behalf by AI copilots) that they <em>cannot</em> understand. In turn, this will steer the market offerings of AI decision-making copilots such that they can in principle teach people the models the AI is using in decision making, save for criminal negligence due to laziness or a YOLO attitude.</p></li><li><p>Even if employees&#8217; functions are effectively reduced to carrying the AI device around and rubber-stamping its decisions, trade unions or other forces in corporate politics will prevent businesses from replacing humans in these roles with robots altogether.</p></li><li><p>Paying people salaries for doing meaningless jobs (as in the previous item) is a more effective system of economic resource distribution (and, perhaps, of social order) than replacing all people with robots, which would force the governments to pay the same people the same money via universal basic income. People on UBI may overflow the tourist sites, become alcohol/drug/gaming addicts in unhealthy numbers, and generally be less healthy and live less happily without the constraints of a daily routine. 
It may take decades to transition society into the new regime in which all today&#8217;s decision-making jobs could be handed off to robots without causing sudden adverse effects for either the economy or society.</p></li></ul><p>The first two of these three answers might plausibly play out and stall the AI proliferation in decision making for decades, though I can also imagine that the monetary incentives will cause mass non-compliance or regulatory arbitrage and thus these restrictions will be ineffective.</p><p>But it&#8217;s still interesting to understand whether these legal and political obstacles are an &#8220;accidental evil&#8221; (or an &#8220;accidental good&#8221;, if we consider that they buy time for the societal transition mentioned in the third item), or whether they help to protect a <em>good</em> blueprint for human&#8212;AI collaboration in high-stakes decision making. After all, the laws and political decisions ultimately ought to promote and protect the <em>collective good</em> to the best of our guess, rather than to ossify historical accidents. 
I&#8217;m not satisfied with politically convenient platitudes from the marketing playbooks of AI startups like &#8220;we believe in empowering humans in their work rather than replacing them with AIs&#8221;.</p><p>And I think I&#8217;ve found a deep reason why <strong>we should augment human decision makers rather than replace humans with AI executives altogether: this is the only way to maintain the </strong><em><strong>relevance of human ethical judgement</strong></em><strong> and thus to keep </strong><em><strong>alignment</strong></em><strong> a meaningful target even in principle.</strong> It dawned on me that, in the absence of really high-fidelity human brain simulations, it may be a category error to call various &#8220;scalable oversight/alignment&#8221; approaches <em>alignment</em> in very complex situations that no human has been able to comprehend: the humane, ethical, or &#8220;aligned&#8221; judgement may not just be unknown, it may not <em>exist</em> in this case. There is more to discuss here about how economics and society should be structured in order to make this comprehension feasible (barring breakthroughs in brain-computer interfacing), but this is out of the scope of this writing.</p><p>It remains to note in this section that <strong>we must not squander the next 3-5 years when human&#8212;AI executive teams will still have an advantage over independent AI executives (agents)</strong>: we should capitalise on this temporary edge (literally, through VC funding and technology and product development) to keep the human&#8212;AI collaboration competitive for as long as possible (cf. 
<a href="https://twitter.com/AndrewCritchPhD">Andrew Critch&#8217;s humane/acc</a>) to make it comparatively easier for politicians, legislators, and the public to resist the temptation to enable an unhinged economy with AI agents.</p><p><a href="https://blog.research.google/2024/01/amie-research-ai-system-for-diagnostic_12.html">Google&#8217;s medical diagnosis system</a> outperforming human clinicians assisted by the very same system demonstrates, first, that we don&#8217;t have much time left (although physiology is more &#8220;closed world&#8221; than business and public service, which makes medical decisions simpler), and second, that AIs designed to perform independently don&#8217;t become good human assistants as a byproduct: <em><strong>collaborativity</strong></em><strong> must be an explicit and key design goal of cEAI, not just interpretability</strong> (apart from accuracy, robustness, adaptability, and other usual design criteria for decision systems).</p><h3>Functions of the co-executive AI</h3><p>The best form factor for cEAI appears to be a natural-language interface accessible across different apps, a la Copilot for Microsoft 365.</p><p>Compared to the &#8220;chat with your data&#8221; wave of AIs such as Julius or Hex, cEAI could be better summarised as &#8220;<strong>chat with your problem</strong>&#8221;.</p><p>When the human initiates the discussion, <strong>cEAI understands the problem: what domain it belongs to and whether the situation requires some decision or not:</strong> per Peter Drucker, this should be the first step in every decision-making process! 
At this stage, the cEAI only uses textbooks about management and, optionally, a description of the structure and the domain breakdown of the business (usually maintained in a corporate wiki).</p><p>For the given domain and the problem, <strong>cEAI builds a qualitative causal model, if one doesn&#8217;t exist yet, from natural-language conversations with the executive or domain experts</strong>, challenging humans&#8217; models if cEAI&#8217;s own common sense diverges from the human models. Products like <a href="http://Rainbird.ai">Rainbird.ai</a> already implement this: </p><div id="youtube2-oEJBH5pFims" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;oEJBH5pFims&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/oEJBH5pFims?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>If the qualitative model is insufficient for making a decision, <strong>cEAI generates one or several quantitative models</strong> (that is, parameterisations/augmentations of the qualitative causal model) as alternative explanations for the historical data observed in the domain. This could be done with something like <a href="https://github.com/OpenCodeInterpreter/OpenCodeInterpreter">Open Code Interpreter</a> in conjunction with AutoML techniques. To make this possible, the variables from the qualitative causal model should be connected with (integrated into) the <a href="https://www.google.com/search?q=semantic+layer">semantic layer</a> of the data. 
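For illustration, here is a minimal sketch (in plain Python; every name and number below is hypothetical) of the step just described: a qualitative causal chain spend -> leads -> revenue parameterised as two linear mechanisms fitted to historical data, with a simple interventional query on top. A real cEAI would of course generate much richer probabilistic models, e.g., PyMC programs.

```python
# Minimal sketch: parameterise a qualitative causal chain
#   marketing_spend -> leads -> revenue
# as one fitted linear mechanism per causal edge (hypothetical data).

def fit_linear(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical historical observations of the three variables.
spend   = [10, 20, 30, 40, 50]
leads   = [105, 195, 310, 395, 505]
revenue = [55, 95, 160, 195, 255]

# One linear mechanism per edge of the causal graph.
leads_model   = fit_linear(spend, leads)
revenue_model = fit_linear(leads, revenue)

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

def do_spend(value):
    """Interventional prediction: propagate do(spend=value) down the chain."""
    expected_leads = predict(leads_model, value)
    return predict(revenue_model, expected_leads)

print(round(do_spend(60)))  # expected revenue at a not-yet-observed spend level
```

The point is only that each causal edge gets its own fitted mechanism, so an intervention can be propagated along the graph rather than read off a single correlational fit.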
The semantic layer should also include the notion of domain boundaries to limit what cEAI needs to cover with the quantitative model.</p><p><strong>For human data scientists, analysts, and decision makers, the quantitative model, even when it does get built, represents just a part of the deeper understanding of the problem domain that people acquire when they eyeball the data points, explore various data visualisations, debug erroneous data, feature code, or the semantic layer, and observe the performance of other models that are tried before the &#8220;final&#8221; one is chosen.</strong> If cEAI does the data science for people, it still needs to help people build this deeper picture in their heads by <strong>creating custom &#8220;curriculums&#8221; which may consist of all the same things: data point examples, visualisations, semantic layer code or configs, model comparisons, or even small </strong><em><strong>tasks</strong></em><strong> designed deliberately to calibrate the quantitative model in people&#8217;s heads</strong>, such as prediction, estimation, and fill-in-the-blank tasks. The main difference from humans doing this on their own is that a guided learning process will take much less time for anyone who is not an extremely skilful data scientist. 
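As a toy illustration of such calibration tasks (again, all names and numbers are hypothetical), cEAI could generate prediction tasks from the fitted quantitative model and score how closely a person's estimates track it:

```python
def make_prediction_tasks(model, inputs):
    """Turn a fitted quantitative model (any callable input -> prediction)
    into fill-in-the-blank prediction tasks for a human to attempt."""
    return [{"input": x, "answer": model(x)} for x in inputs]

def score_calibration(tasks, human_answers, tolerance=0.15):
    """Fraction of tasks where the human's estimate falls within
    `tolerance` relative error of the model's answer."""
    hits = sum(
        1
        for task, guess in zip(tasks, human_answers)
        if abs(guess - task["answer"]) <= tolerance * abs(task["answer"])
    )
    return hits / len(tasks)

# Hypothetical fitted model: expected revenue as a function of ad spend.
revenue_model = lambda spend: 5.0 * spend + 1.0

tasks = make_prediction_tasks(revenue_model, [10, 20, 40, 80])
# A human whose mental model runs consistently 10% high is still
# well-calibrated under the 15% tolerance:
human_estimates = [t["answer"] * 1.1 for t in tasks]
print(score_calibration(tasks, human_estimates))
```

A low score on such tasks would tell cEAI which regions of the model the person's intuition has not yet internalised, and hence what the next curriculum item should be.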
In terms of the implementation, the teaching part of this function is similar to existing teacher AIs such as <a href="https://khanmigo.ai/">Khanmigo</a>, but the curriculum generation part seems to be genuinely novel.</p><p>As I started to discuss in previous sections of this post, <strong>cEAI should maintain a theory of mind of the human decision maker to know when to defer to human judgement because they have access to some evidence that cEAI doesn&#8217;t</strong> or because the human&#8217;s expertise is hard or impractical to communicate as an explicit causal model at the moment, and conversely, when the human judgement differs from cEAI&#8217;s likely due to human ignorance rather than their superior expertise. Further, cEAI could learn something about people&#8217;s <strong>preferences</strong> from their decisions and use this, e.g., to help multiple executives find an agreeable solution to a problem. Although SoTA LLMs can already build surprisingly decent ToMs &#8220;intuitively&#8221; (<a href="http://arxiv.org/abs/2302.02083">Kosinski, 2023</a>), cEAI probably needs to manage these ToMs more explicitly for transparency and debuggability.</p><p>Finally, cEAI can do some downstream functions related to causal modelling and decision intelligence leveraging existing algorithms, such as inferring, augmenting, or verifying the causal graph from the data (see <a href="https://github.com/py-why/causal-learn">causal-learn</a>), detecting and explaining data anomalies (<a href="http://arxiv.org/abs/1912.02724">Janzing et al., 2019</a>), and generating solutions for a problem (see <a href="https://evolution.ml/">evolution.ml</a>).</p><h3>Simpler models are even more important for collaborativity than interpretability</h3><p>Making models simpler is good for robustness, computational efficiency, and interpretability, but it is even more important for cEAI&#8217;s collaborativity. 
Due to the opacity of the human mind both to itself and to cEAI, assessing how well humans are calibrated with respect to complicated models is impossible, and reliably teaching such models to human executives also becomes almost impossible.</p>]]></content:encoded></item><item><title><![CDATA[The Open-Source Stack for Decision Intelligence]]></title><description><![CDATA[Probabilistic models are integral to state-of-the-art decision theories, such as Bayesian a.k.a.]]></description><link>https://engineeringideas.substack.com/p/the-open-source-stack-for-decision</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/the-open-source-stack-for-decision</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Mon, 11 Mar 2024 19:53:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Probabilistic models are integral to state-of-the-art decision theories, such as <a href="https://www.lesswrong.com/tag/evidential-decision-theory">Bayesian a.k.a. evidential</a>, <a href="https://www.sciencedirect.com/science/article/abs/pii/S0360835221002114">quantum-like Bayesian</a>, <a href="https://plato.stanford.edu/entries/decision-causal/">causal</a>, and <a href="https://www.lesswrong.com/tag/functional-decision-theory">functional</a>/<a href="https://www.lesswrong.com/tag/updateless-decision-theory">updateless</a>/<a href="https://www.lesswrong.com/tag/timeless-decision-theory">timeless</a> decision theories. To enable widespread and inclusive adoption of these decision theories by businesses and in public service, there is a pressing need for an open-source decision intelligence stack. 
This stack should span from sensing (data collection) and actuation to online learning and natural-language decision assistance.</p><h2>The Decision Intelligence Stack</h2><p>The proposed decision intelligence (DI) stack can be broken down into the following levels:</p><ol><li><p>Sensors and edge devices: cameras, wearables, etc.</p></li><li><p>IoT device management, data collection and processing systems, and databases: MQTT brokers/servers, log collectors, change data capture, Airbyte, Kafka, Flink, data lake storage, Spark, Ray, ClickHouse, Rockset, Postgres, etc.</p></li><li><p>ML computing frameworks: PyTorch, JAX/Flax</p></li><li><p>Probabilistic ML algorithms for time-series and other forms of data: probabilistic <a href="https://probml.github.io/ssm-book/root.html">state-space models</a>, (Bayesian) RNNs, regression trees, VAEs, GNNs. Recent libraries like <a href="https://github.com/probml/dynamax">dynamax</a> and <a href="https://github.com/pymc-devs/pymc-bart">PyMC-BART</a> are filling the gap in perpetually developed and supported implementations of these algorithms.</p></li><li><p>DNN (mechanistic) interpretability libraries and frameworks: needed for communicating inferences to humans (via LLMs). Currently non-existent, to the best of my knowledge.</p></li><li><p>Frameworks for probabilistic inference over a system of components: Stan, <a href="https://github.com/pymc-devs/pymc">PyMC</a>, <a href="https://github.com/pyro-ppl/numpyro">NumPyro</a>. They support various approximate inference methods, including sampling-based, variational, particle-based, and <a href="https://arxiv.org/abs/2402.05098">GFlowNet-based methods</a>. See <a href="https://repository.kaust.edu.sa/server/api/core/bitstreams/89397792-401a-4f18-a964-e74eaf7eab9e/content">&#352;trumbelj et al. 
(2024)</a> for an overview.</p></li><li><p>MLOps frameworks for ML and probabilistic inference: Airflow-like tools integrated with PyMC or NumPyro, automatically scheduling online learning tasks when new observations are available (a-la <a href="https://github.com/probml/rebayes">rebayes</a>), and updating past states via retrospective inference.</p></li><li><p>Hyperparameter optimization/AutoML frameworks: Optuna</p></li><li><p>Causal discovery and inference libraries: <a href="https://github.com/py-why/causal-learn">causal-learn</a>, <a href="https://github.com/py-why/dowhy">DoWhy</a>, <a href="https://github.com/pymc-labs/CausalPy">CausalPy</a></p></li><li><p>Libraries of domain-specific models and probabilistic program templates: <a href="https://github.com/pymc-labs/pymc-marketing">PyMC-Marketing</a>, <a href="https://github.com/microsoft/BatteryML">BatteryML</a></p></li><li><p>Baseline knowledge bases for augmenting proprietary models: <a href="https://orkg.org/">Open Research Knowledge Graph</a>, <a href="https://www.system.com/">system.com</a></p></li><li><p>Data catalog/metadata storage to help LLMs generate probabilistic model code: OpenMetadata</p></li><li><p>LLM to generate and modify probabilistic model code, using previous model versions, results, domain knowledge, data catalogs, and public knowledge as inputs: <a href="https://discourse.pymc.io/t/i-created-a-pymc-gpt-on-openai/13612">PyMC-GPT</a> is a first step in this direction.</p></li><li><p>UI to visualize and explore data, causal models (applying DNN interpretability if needed), and counterfactual predictions. 
Current options include Python visualization libraries (matplotlib, seaborn, etc.), Graphviz-style causal graph visualization, <a href="https://github.com/microsoft/lida">LIDA</a> for LLM-generated visualizations, <a href="https://github.com/rilldata/rill">Rill</a> for data exploration, and <a href="https://github.com/causalens/dara">Dara</a> for causal-graph-aware visual app building.</p></li><li><p>API frontend for causal inference: frameworks like <a href="https://github.com/bentoml/BentoML">BentoML</a></p></li><li><p>Decision-making LLM that uses the (discovered) causal graph for <a href="https://github.com/RManLuo/Awesome-LLM-KG">LLM+KG reasoning</a> and causal inference API as a tool (<a href="https://github.com/lucidrains/toolformer-pytorch">Toolformer</a>-style): <a href="https://arxiv.org/abs/2402.02392">DeLLMa</a></p></li></ol><h2>Challenges and Opportunities</h2><p>While some key layers in this stack are well-populated with mature tools, many others have few actively developed options. Apart from established data processing and ML components not specific to causal inference, the layers are rarely integrated with each other.</p><p>The depth and complexity of the stack can be daunting, and it's likely that no organization in the world has implemented it in an approximately complete form. As <a href="https://causalens.com/resources/white-papers/i-need-causal-ai-now-what/">argued by causaLens</a>, it's currently impractical for organizations to weave together a stack of open-source tools for end-to-end causal inference instead of opting for a vertically integrated solution. 
However, this highlights the need for an accessible and well-integrated open-source stack.</p><p>Given the recent <a href="https://redmonk.com/sogrady/2024/02/07/pendulum-return/">consolidation trend</a> in the software industry, the focus should be on good integration and user experience with an opinionated, general-purpose combination of components and algorithms, rather than on configurability and flexibility. Both the <a href="https://github.com/py-why">PyWhy</a> and <a href="https://github.com/pymc-devs">PyMC</a> ecosystems, the most notable probabilistic and causal inference ecosystems in Python today, may be focusing too much on implementing diverse algorithms instead of polishing a vertically integrated experience.</p><p>Note that choosing specific components for the stack doesn&#8217;t mean cutting off possible use cases: the stack should remain broad in capability. Rather, choosing components is about removing the choice among functionally equivalent options, such as databases supporting the same query patterns, or probabilistic inference algorithms achieving approximately the same result through different means.</p><p>While the DI stack itself should be as general and broad-capability as possible, the go-to-market strategy for propelling it should initially focus on one application domain at a time. Infrastructure and application monitoring (and cost management) is an interesting choice as one of the first such domains, currently dominated by expensive SaaS platforms (Datadog, New Relic, Splunk) with a few dynamic open-source challengers (<a href="https://github.com/signoz/signoz">SigNoz</a>, <a href="https://github.com/highlight/highlight">Highlight</a>, <a href="https://github.com/coroot/coroot">Coroot</a>). Many of the higher levels of the DI stack could be omitted for infrastructure monitoring, and non-parametric models are sufficient for infrastructure performance monitoring. 
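As a minimal illustration of that last point (hypothetical numbers, plain Python), a distribution-free detector based on the rolling median and the median absolute deviation is often enough for flagging latency or cost spikes:

```python
import statistics

def mad_anomalies(series, window=10, threshold=3.5):
    """Flag indices whose deviation from the rolling median exceeds
    `threshold` robust z-scores. Non-parametric: no distributional
    assumptions about the metric are made."""
    flagged = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        med = statistics.median(recent)
        mad = statistics.median(abs(x - med) for x in recent) or 1e-9
        # 0.6745 scales the MAD to be comparable to a standard
        # deviation under Gaussian data.
        robust_z = 0.6745 * (series[i] - med) / mad
        if abs(robust_z) > threshold:
            flagged.append(i)
    return flagged

# Hypothetical latency measurements (ms) with one spike at index 10.
latencies = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 250, 20, 21]
print(mad_anomalies(latencies))
```

Because the median and MAD are robust to the very outliers being detected, this kind of detector needs no fitted parametric model of "normal" behaviour, which is what makes the higher, model-heavy levels of the DI stack optional for this domain.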
SigNoz and Highlight also notably share the &#8220;opinionated integration philosophy&#8221; that I recommended for the DI stack above.</p>]]></content:encoded></item><item><title><![CDATA[AI alignment as a translation problem]]></title><description><![CDATA[Yet another way to think about the alignment problem]]></description><link>https://engineeringideas.substack.com/p/ai-alignment-as-a-translation-problem</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/ai-alignment-as-a-translation-problem</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Mon, 05 Feb 2024 13:23:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!279E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>Yet another way to think about the alignment problem</h3><p>Consider two learning agents (humans or AIs) that have made different measurements of some system and have different interests (concerns) regarding how the system should be evolved or managed (controlled). Let&#8217;s set aside the discussion of bargaining power and the wider game both agents play and focus on how the agents can agree about a specific way of controlling the system, assuming the agents have to respect each other&#8217;s interests.</p><p>For such an agreement to happen, both agents must see the plan for controlling the system of interest as beneficial from the perspective of their models and decision theories<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
This means that they can find a <em>shared model</em> that they both see as a generalisation of their respective models, at least in everything that pertains to describing the system of interest, their respective interests regarding the system, and control of the system.</p><p>G&#246;del&#8217;s theorems prevent agents from completely &#8220;knowing&#8221; their own generalisation method<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, so the only way for agents to arrive at such a shared understanding is to present each other some symbols (i.e., classical information) about the system of interest, learn from them and incorporate this knowledge into their model (i.e., generalise from the previous version of their model)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, and check if they can come up with decisions (plans) regarding the system of interest that they both estimate as net positive.</p><p>Per Quine&#8217;s <a href="https://en.wikipedia.org/wiki/Indeterminacy_of_translation">indeterminacy of translation</a> argument, the agents cannot access the actual <em>meaning</em> (semantics, knowledge) that they attach to the symbols that they exchange with each other, both during the &#8220;generalisation&#8221; and &#8220;decision/plan selection&#8221; steps (because decisions and plans are also communicated through symbols!) Therefore, the agents cannot know that they generalise &#8220;correctly&#8221;, <em>actually</em> <em>possess a shared model</em>, and haven&#8217;t been fooled (advertently or inadvertently) by each other.</p><p>Note that without loss of generality, the above process could be interleaved with actual control according to some decisions and plans deemed &#8220;good enough&#8221; or sufficiently low-risk after some initial alignment and collective deliberation episode. 
After the agents collect new observations, they could have another alignment episode, then more action, and so on.</p><h3>Scalable oversight, weak-to-strong generalisation, and interpretability</h3><p>To me, the above description of the alignment problem demonstrates that &#8220;<a href="https://www.lesswrong.com/posts/hw2tGSsvLLyjFoLFS/scalable-oversight-and-weak-to-strong-generalization">scalable oversight and weak-to-strong generalisation</a>&#8221; are largely misconceptions of this problem, except insofar as oversight is the implementation of the <em>power balance</em> between humans and AIs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> or the <em>prevention of deception</em> (I factored out both these aspects from the above picture).</p><p>Yes, there will always be <em>something</em> that humans perceive about their systems of interest (including themselves) that AIs won&#8217;t perceive, but this looks to be on track to shrink rapidly (consider glasses or even contact lenses that record all visual and audio to assist people). There will probably be much more information about systems of interest that AIs perceive and humans don&#8217;t. So, rather than a weak-to-strong generalisation problem, we have a <strong>(bidirectional) AI&#8212;human </strong><em><strong>translation</strong></em><strong> problem</strong>, with more emphasis on <em>human learning from AIs</em> than <em>AIs learning from humans</em>.</p><p>This is also why I think Lucius Bushnaq&#8217;s &#8220;<a href="https://www.lesswrong.com/posts/FDrgcfY8zs5e2eJDd/charbel-raphael-and-lucius-discuss-interpretability">reverse engineering</a>&#8221; agenda is stronger than the mechanistic interpretability agenda. 
Both are a kind of &#8220;translation from AIs to humans&#8221;, but the first aims to <em>teach humans AI&#8217;s &#8220;native&#8221; concepts</em>, whereas the latter at least to some degree tries to impose the pre-existing human belief structure onto AIs.</p><h3>More &#8220;natural&#8221; and brain-like AI</h3><p>The &#8220;translation&#8221; problem statement presented above also implies that if we care about the generalisation trajectory of the civilisation, we should better equip AIs with our inductive biases (i.e., generalisation methods) rather than make AIs with less of our inductive biases and then hope to align them with weak-to-strong generalisation.</p><p>Most generally, this makes me think that <strong>&#8220;<a href="https://bostonglobalforum.org/news/letter-on-a-natural-ai-based-on-the-science-of-computational-physics-biology-and-neuroscience-policy-and-societal-significance/">Natural AI</a>&#8221; (a.k.a. <a href="https://www.lesswrong.com/s/HzcM2dkCq7fwXBej8">brain-like AI</a>) is one of the most important among theoretical, &#8220;long shot&#8221; alignment agendas</strong>.</p><p>More incrementally, and &#8220;prosaically&#8221; if you wish, I think AI companies should implement &#8220;<a href="https://arxiv.org/abs/1709.08568">the consciousness prior</a>&#8221; in the reasoning of their systems. 
Bengio touched on this in his <a href="https://slideslive.com/39014230/towards-quantitative-safety-guarantees-and-alignment">recent talk</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!279E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!279E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg 424w, https://substackcdn.com/image/fetch/$s_!279E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg 848w, https://substackcdn.com/image/fetch/$s_!279E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!279E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!279E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg" width="1222" height="686"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:686,&quot;width&quot;:1222,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:299112,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!279E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg 424w, https://substackcdn.com/image/fetch/$s_!279E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg 848w, https://substackcdn.com/image/fetch/$s_!279E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!279E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec061b10-e2cd-45fc-9cee-2e66506fce65_1222x686.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>I won&#8217;t even mention other inductive biases that neuroscientists, cognitive scientists, and psychologists argue people have, such as the algorithms that people use in making decisions, the kinds of models that they learn, the role of emotions in learning and decision-making, etc., because I don&#8217;t know the full landscape well enough. Bengio&#8217;s &#8220;consciousness prior&#8221; is the one that I&#8217;m aware of and that seems to me completely uncontroversial because the structure of human language is strong evidence for it. If you disagree with me on this or know other human inductive biases that are very uncontroversial, please leave a comment.</p><p>Since merely &#8220;preaching&#8221; to AI companies to adopt more human-like approaches to AI won&#8217;t help, the prosaic call for action on this front is to <strong>develop algorithms, ML architectures, and systems that make the adoption of more human-like AI economically attractive</strong>.
(This is also a prerequisite for making it politically feasible to discuss turning this guideline into a policy if needed.)</p><p>One particularly promising lever is showing enterprises that by adopting AI that <a href="https://openreview.net/forum?id=25D2NyVVpr">samples compact causal models</a> that explain the data, they can <em>mine new knowledge and scale the decision intelligence across the organisational levels more cheaply</em>: two AIs trained on different data and for different narrow purposes can come up with causal explanations (i.e., new knowledge) on the intersection of their competence without re-training, but rather trying to combine the causal models that they sample and back-test them for support (which is very fast).</p><p>Another interesting approach that might apply to sequential time datasets (with a single context/stream) is to train a foundation <a href="https://github.com/topics/state-space-models">state-space model</a> (SSM) for predicting the time-series data, run the model through the timeseries data, and treat the hierarchy of state vectors at the end of this run as a hierarchy of &#8220;possible causal circuits in <a href="https://www.lesswrong.com/tag/superposition">superposition</a>&#8221; which could then be recombined with causal graphs/circuits from different AIs or learned by the same AI from different context.</p><p>Finally, we could create an external (system-level) incentive for employing such AIs at organisations: enable <a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency">cross-organisational learning in the space of causal models</a>.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>For brevity, I will call &#8220;model(s) and decision theories&#8221; used by an agent simply a <em>model</em> of an 
agent.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>See &#8220;<a href="https://www.lesswrong.com/posts/xqxXrAohXSD3akYCg/meaningful-things-are-those-the-universe-possesses-a">Meaningful things are those the universe possesses a semantics&nbsp;for</a>&#8221;.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Let&#8217;s also set aside the concern that agents try to manipulate each other&#8217;s beliefs, and assume the trust and deception issue is treated separately.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>However, it doesn&#8217;t seem so to me: if humans can oversee AIs and control the <em>method</em> of this oversight, they must already have complete power over AIs.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Gaia Network: An Illustrated Primer]]></title><description><![CDATA[This post is primarily written by Rafael Kaufmann, my contributions were minimal.]]></description><link>https://engineeringideas.substack.com/p/gaia-network-an-illustrated-primer</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/gaia-network-an-illustrated-primer</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Thu, 25 Jan 2024 11:35:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp" length="0"
type="image/jpeg"/><content:encoded><![CDATA[<p><em>This post is primarily written by Rafael Kaufmann; my contributions were minimal. Warning: a long read!</em></p><p>In our <a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency">first LW post</a> on the Gaia Network, we framed it as a solution to the challenges of building safe, transformative AI. However, the true potential of Gaia as a &#8220;world-wide web of causal models&#8221; goes far beyond that, and in fact, justifying it in terms of its value to other use cases is key to showing its viability for AI safety. At the same time, the previous post focused more on the &#8220;what&#8221; and &#8220;why&#8221;, and didn&#8217;t talk much about the &#8220;how&#8221;. In this piece, we&#8217;ll correct both of these flaws: we&#8217;ll visually walk through the Gaia Network&#8217;s mechanics, with concrete use cases in mind.</p><p>The first two parts will cover use cases related to making science more effective and efficient. These would already be sufficient to justify the importance of building the Gaia Network: as&nbsp;<em>&#8220;<a href="https://www.goodreads.com/quotes/213539-science-is-the-only-news-when-you-scan-a-news">science is the only news</a>&#8221;</em>, improving science can have a huge positive multiplier effect on our future survival and prosperity. Yet despite a workforce of 8.8 million researchers and funding that adds up to 1.7% of global GDP, science is rightly criticized for inefficiency and limited accountability. The third part will expand beyond the epistemic (scientific) benefits of the Gaia Network and towards pragmatic impact - i.e., making&nbsp;<em>all</em> decision-making more effective and efficient, which impacts the entire world population and GDP.
And the last two sections will focus on the applications of the Gaia Network to existential risk - first specifically with regard to AI safety, and finally as a general tool for collective sensemaking and coordination around the Metacrisis.</p><p>For brevity&#8217;s sake, we will not<em>&nbsp;</em>cover any of the implementation details or mathematical grounding. We&#8217;ll focus on the core concepts and capabilities, and try to explain them in plain language. We&#8217;ll also skim over many of the &#8220;hard parts&#8221;: the economics and trust modeling. Finally, we will not cover the arguments for convergence and resilience of the network; these have already been sketched out in our&nbsp;<a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency#:~:text=anti%2Dcollusion%20mechanisms.-,The%20convergence%20and%20resilience%20arguments,-Will%20Gaia%20do">previous paper</a>, and merit a more formal and in-depth analysis than we can incorporate into this primer. If there&#8217;s some hand-waving below that makes you uncomfortable, please let us know in the comments and we will attempt to assuage you.</p><p>The beginning dwells a while on Bayesian statistics, as these are foundational concepts for the Gaia architecture. Feel free to skip the footnotes if you&#8217;re overwhelmed. Also, note that everything below assumes explicit or clear-box models (where model parameters have names that reflect their intended semantics). In a future article, we&#8217;ll discuss how to incorporate black-box models like neural networks, where most components (neurons) have opaque semantics (or are mostly <a href="https://www.anthropic.com/index/towards-monosemanticity-decomposing-language-models-with-dictionary-learning">polysemantic</a>).</p><p>So let&#8217;s get started.
Fast forward to a few years from now&#8230;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><h2>Better science bottom-up</h2><p>You&#8217;re a plant geneticist working on the analysis of some experimental results that you want to publish. You have a model of how your new maize strain improves yields, and you&#8217;ve tested it against an experimental data set. (In the example pseudocode below, we use a Python-based syntax for concreteness, but this could be implemented in any statistical analysis software or framework, like R or Julia or even Excel spreadsheets.)</p><pre><code>def model(strainplanted, soiltype, rainfall, cropyield):
    ## Set parameter priors
    deltayield ~ Normal(...)
    avgyield_control ~ Normal(...)
    avgyield_experimental ~ Normal(avgyield_control + deltayield, ...)
    &#120573;_soiltype ~ Normal(...)
    &#120573;_rainfall ~ Normal(...)
    ...


    ## Define likelihood of the target variable cropyield
    ## given the covariates and parameters:
    ## p(cropyield | strainplanted, soiltype, rainfall, ...params)
    with plate("field") as field:
        with plate("t") as t:
            if strainplanted == "control":
                baseyield = avgyield_control
            else:
                baseyield = avgyield_experimental
            soiltype_effect = &#120573;_soiltype[soiltype]
            rainfall_effect = &#120573;_rainfall * rainfall
            cropyield ~ Normal(baseyield + soiltype_effect + rainfall_effect, ...)</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N3Uc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N3Uc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png 424w, https://substackcdn.com/image/fetch/$s_!N3Uc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png 848w, https://substackcdn.com/image/fetch/$s_!N3Uc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png 1272w, https://substackcdn.com/image/fetch/$s_!N3Uc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N3Uc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png" width="1452" height="968"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:968,&quot;width&quot;:1452,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:364884,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N3Uc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png 424w, https://substackcdn.com/image/fetch/$s_!N3Uc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png 848w, https://substackcdn.com/image/fetch/$s_!N3Uc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png 1272w, https://substackcdn.com/image/fetch/$s_!N3Uc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F804c8474-02e4-4c0a-9373-4f956d2eb2a0_1452x968.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Like most scientific analyses, this is a hierarchical model, where your local variables represent observations or states of the current context &#8211; say, the yield in each given season and field &#8211; and are influenced by&nbsp;<em>parameters</em> that represent more generic or abstract variables &#8211; average yield for your strain across all fields and seasons, which in turn depends on the expected yield improvement from a given genomics technique.
(The latter is generic enough that it&#8217;s not really specific to your study, which is why it&#8217;s highlighted in orange below.)</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lNEG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lNEG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp 424w, https://substackcdn.com/image/fetch/$s_!lNEG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp 848w, https://substackcdn.com/image/fetch/$s_!lNEG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp 1272w, https://substackcdn.com/image/fetch/$s_!lNEG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lNEG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp" width="548" height="356" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:356,&quot;width&quot;:548,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:5560,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lNEG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp 424w, https://substackcdn.com/image/fetch/$s_!lNEG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp 848w, https://substackcdn.com/image/fetch/$s_!lNEG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp 1272w, https://substackcdn.com/image/fetch/$s_!lNEG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3713e8e2-a443-4da9-91bf-fa1d56841060_548x356.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Running this model on a data set can be understood as propagating information through the graph. First, the priors for the parameters inform the expected distributions for the local variables. Then as we gather observations for some variables, that information flows back up, giving updated posteriors for the parameters.
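</p><p>As a toy illustration of this two-way flow (a hand-rolled sketch, not Gaia Network code: it assumes a conjugate Normal prior and known observation noise, so the update has a closed form), here is how a handful of observed field yields pulls a parameter&#8217;s prior into a sharper posterior:</p><pre><code>import statistics

def normal_update(prior_mean, prior_var, obs, obs_var):
    """Conjugate Normal-Normal update: combine a Normal prior over a mean
    parameter with Normal-likelihood observations of known variance."""
    n = len(obs)
    obs_mean = statistics.fmean(obs)
    # Precisions (inverse variances) add; the posterior mean is the
    # precision-weighted average of the prior mean and the sample mean.
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)
    post_mean = post_var * (prior_mean / prior_var + n * obs_mean / obs_var)
    return post_mean, post_var

# Prior belief about the experimental strain's average yield (made-up units).
prior_mean, prior_var = 9.0, 4.0
# Three observed fields: their information "flows back up" into the parameter,
# moving the posterior mean toward the data and shrinking the variance.
post_mean, post_var = normal_update(prior_mean, prior_var, [11.2, 10.8, 11.5], 1.0)</code></pre><p>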
The amount of information (uncertainty reduction or negentropy) being propagated can be understood as a flow on this graph<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> and indeed can be estimated as an output of many kinds of common inference algorithms.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a></p><p>It&#8217;s really useful to think informally of the&nbsp;<em>free energy</em> of the model as the discrepancy between the inferred distribution and the information we have available, between priors and observations.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a> Zero free energy is the ideal state in which all information has been fully incorporated into the inference, is completely internally consistent, and explains away all the uncertainty in the system. Typically we can&#8217;t achieve zero free energy, as there&#8217;s always some uncertainty (whether aleatoric or epistemic), but we want to minimize it so that our model doesn&#8217;t have &#8220;extra&#8221;, unwarranted uncertainty. To get a better understanding of the concept of free energy and its role in Bayesian modeling and active inference, there are many excellent resources available; we particularly recommend&nbsp;<a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008420">this paper by Gottwald and Braun</a>. 
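</p><p>To make the &#8220;discrepancy&#8221; reading concrete (a hand-worked toy, not part of any Gaia specification), take the model theta ~ Normal(0, 1), y ~ Normal(theta, 1) with a single observation y = 2: the free energy of a candidate Gaussian posterior q = Normal(m, s2) is minimized exactly at the true Bayesian posterior Normal(1, 0.5), and any other q carries &#8220;extra&#8221;, unwarranted free energy:</p><pre><code>import math

def free_energy(m, s2, y=2.0):
    """Variational free energy F(q) = KL(q || prior) - E_q[log p(y | theta)]
    for theta ~ Normal(0, 1), y ~ Normal(theta, 1), q = Normal(m, s2)."""
    kl = 0.5 * (s2 + m * m - 1.0 - math.log(s2))
    expected_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((y - m) ** 2 + s2)
    return kl - expected_log_lik

f_exact = free_energy(1.0, 0.5)  # q equals the exact posterior: the minimum
f_prior = free_energy(0.0, 1.0)  # q left at the prior: higher free energy</code></pre><p>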
Going forward, you can just think of Free Energy Reduction (FER) as a standard unit of account used by each model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CHg-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CHg-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp 424w, https://substackcdn.com/image/fetch/$s_!CHg-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp 848w, https://substackcdn.com/image/fetch/$s_!CHg-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp 1272w, https://substackcdn.com/image/fetch/$s_!CHg-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CHg-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp" width="873" height="461" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:873,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:42910,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CHg-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp 424w, https://substackcdn.com/image/fetch/$s_!CHg-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp 848w, https://substackcdn.com/image/fetch/$s_!CHg-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp 1272w, https://substackcdn.com/image/fetch/$s_!CHg-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d9bddf6-5cf3-4c4d-8c13-0970087ca215_873x461.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption"><em>Source:&nbsp;<a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008420">Gottwald and Braun (2020)</a></em></figcaption></figure></div><p>But here&#8217;s a problem: How do you set priors for your parameters in the first place? Sure, you expect your strain to increase yield, but it would be circular reasoning to build that expectation into the priors. The common practice is to use a&nbsp;<em>flat prior</em> (also known as a weakly informative or regularization prior) that incorporates only information that you have an objective or incontrovertible reason to believe in (ex: penalizing unreasonably low or high values).
This can be seen as &#8220;not sneaking information into the model&#8221;, to avoid fooling yourself (and your stakeholders, the people who will use your study results to make decisions) by publishing unjustifiably confident results.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a> However, typically, most parameters in your study do&nbsp;<em>not</em> represent hypotheses you&#8217;re actively trying to learn about; instead, they represent assumptions that&nbsp;<em>are&nbsp;</em>justified by previous studies or expert opinions. For those, you want the opposite kind of prior, a&nbsp;<em>sharp&nbsp;</em>or&nbsp;<em>strong prior.</em></p><p>In the past, if you were very lucky, there would be a published meta-analysis about the parameters for each of your assumptions, to save you the pain of combing through thousands of PDFs, understanding each, and copy-pasting numbers from the relevant tables into your workspace. Unfortunately, this work was so mind-numbingly boring, expensive, thankless and error-prone, that high-quality meta-analyses were exceedingly rare.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> To make matters worse, unlike the toy example above, real-world scientific models often utilize hundreds to thousands of parameters, and often far more if machine learning is used. 
Gathering the outputs of every relevant study for every relevant parameter, by hand, was infeasible, so we ended up with constant wheel reinvention and cargo-culted, unjustified assumptions, often used as point estimates with no uncertainty attached.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a></p><p>No longer: now you can simply connect your local model to the Gaia Network by annotating each parameter (in our example, average yield and drought tolerance, for both the control group of traditional maize and the experimental group of genetically modified maize). Your annotation attaches each parameter to a global namespace called the Gaia Ontology. You can browse the Ontology to see the exact definition of the parameter, with example code, and make sure you&#8217;re using the right one. Many other scientists have published their own studies on the Gaia Network; each published study contributes a posterior distribution for its parameters, and these are algorithmically aggregated into a &#8220;sort of weighted average&#8221; called a&nbsp;<em>pooled distribution.</em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a></p><p>So at inference time, the Gaia engine just queries the network for the current pooled distributions for each of these parameters &#8211; effectively conducting a meta-analysis on the fly &#8211; and adopts them as priors.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a></p><pre><code>@gaia_bind(deltayield={
    "v0.Agronomy.YieldImprovementPct": {
        "species": "v0.Agronomy.Species.Maize",
        "intervention": "v0.Agronomy.Genomics.CRISPR"}})

def model(strainplanted, soiltype, rainfall, cropyield):
    # Model code is unchanged
    ...</code></pre><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nN4s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f0b4a7-b23f-40b8-8e76-074acf75ce38_614x488.webp"><img src="https://substackcdn.com/image/fetch/$s_!nN4s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3f0b4a7-b23f-40b8-8e76-074acf75ce38_614x488.webp" width="614" height="488" alt="" loading="lazy"></a></figure></div><p>As the illustration above shows, your model is importing information from other studies in the network and using it to increase FER. Gaia keeps track of the &#8220;<a href="https://stats.stackexchange.com/questions/421741/what-is-the-credit-assignment-problem-in-machine-learning-and-deep-learning">credit assignment</a>&#8221;, which will prove valuable in the next step: publishing your work.</p><p>To contribute to the network, all you have to do is commit your study to GitHub. Gaia will save your posterior distributions for all parameters that you&#8217;ve annotated, and share them back with every other study in the network. Your study and your peer studies each have an&nbsp;<em>update chain</em>, an append-only sequence of distributions representing the state of posteriors from each study&#8217;s perspective.
These are effectively independent representations of the state of knowledge of the parameters in question.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a></p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Yvnc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14dcd994-66a7-43c7-9159-f7795be4fd11_1600x805.webp"><img src="https://substackcdn.com/image/fetch/$s_!Yvnc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14dcd994-66a7-43c7-9159-f7795be4fd11_1600x805.webp" width="1456" height="733" alt="" loading="lazy"></a></figure></div><p>So, immediately you can see that any other agricultural studies about different experimental strains will have their posteriors affected by adding your study to the pool of updates.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> This effect can be quite large if few studies are being pooled, but it converges, so that after some point the updates become minimal.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a></p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9AEe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc9548d-28bb-46e0-b842-bef9ad328694_891x451.webp"><img src="https://substackcdn.com/image/fetch/$s_!9AEe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bc9548d-28bb-46e0-b842-bef9ad328694_891x451.webp" width="891" height="451" alt="" loading="lazy"></a></figure></div><p>But Gaia doesn&#8217;t just propagate posterior updates to &#8220;sibling&#8221; studies: If there are higher-level models for which your parameter is a leaf, it will propagate up to those as well. For instance, a model that forecasts advances in crop technology and their impact on global food security:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!piR9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a5d9ec-9ba2-4b65-8392-ac82ecfe8e54_600x727.webp"><img src="https://substackcdn.com/image/fetch/$s_!piR9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa3a5d9ec-9ba2-4b65-8392-ac82ecfe8e54_600x727.webp" width="600" height="727" alt="" loading="lazy"></a></figure></div><p>Note that, by publishing your model on the Network, you&#8217;re&nbsp;<em>not</em> exporting any information other than updates to the values of the specific parameters that you&#8217;ve connected &#8211; in particular, you&#8217;re not sharing any of the underlying data. (This is a &#8220;privacy-centric&#8221; inference approach, analogous to&nbsp;<a href="https://en.wikipedia.org/wiki/Federated_learning">federated learning</a>. In a follow-up article, we&#8217;ll discuss how we can solve the problems of trust that this imposes.)</p><p>As mentioned above, the Gaia Protocol assigns credit to every publication (also called attribution).
The mechanism for attribution is primarily &#8220;subjective&#8221;: each node (i.e., each study) just measures the net FER impact of each contribution as it&#8217;s incorporated into its own update chain.</p><p>Above, we mentioned that the pooled distribution is a &#8220;sort of&nbsp;<strong>weighted</strong>&#8221; average across the individual studies&#8217; posteriors. So where do the weights come from? The Gaia Protocol also answers this question in a bottom-up, &#8220;subjective&#8221; way. Nodes can independently infer the &#8220;right&#8221; weights for each parameter and study. To do so, they can use arbitrary &#8220;metamodels&#8221;, ranging from simple &#8220;beauty contest&#8221; models that just aggregate the net FER impact that&nbsp;<em>other&nbsp;</em>studies have attributed to a contribution; to &#8220;web of trust&#8221; models that infer the presence of low-quality studies or deliberate fraud via social-network-style analysis; to true &#8220;metamodels&#8221; that infer study quality and parameter relevance using outside data, such as the publisher&#8217;s credentials, analyses of the model code, and third-party verifications of the data. This means that, at least in the short term, the pooled distribution for a given parameter is actually different depending on which node you ask! Even if all nodes have seen all the updates in the same order, they can give arbitrarily different weights to them. But as the different metamodels themselves accumulate quality signals, nodes eventually converge on a shared inference of which metamodel to use for which kind of parameter. (As discussed in the introduction, we will not attempt to justify the claim that this protocol converges and is resilient to noise and misinformation/fraud.
For now, see the arguments&nbsp;<a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency#:~:text=anti%2Dcollusion%20mechanisms.-,The%20convergence%20and%20resilience%20arguments,-Will%20Gaia%20do">here</a>.)</p><h2>Funding science &#8211; retroactively and prospectively</h2><p>Now approaching this from the opposite perspective, say you&#8217;re an analyst at a philanthropic foundation, trying to make recommendations for a prize that will be awarded to the most impactful scientific studies. Rather than relying solely on recommendations from the scientific community, or on &#8220;impact factors&#8221; that just measure popularity, you can query the Gaia Network to get quantitative, apples-to-apples impact metrics.</p><p>First, we should just note that being able to understand the &#8220;graph of science&#8221; in a live, transparent way &#8211; what the research questions are, how well-developed each one is, how much effort each receives in explore vs. exploit mode, and how they connect to each other &#8211; is a game-changer. In the past, you needed to pay expensive fees for products like Web of Science and Scopus, which were based on manual curation and benefitted from the opaqueness of text-on-PDFs as the primary means of scientific communication. Having all the world&#8217;s science directly represented as machine-readable, connected models on Gaia &#8211; just like code and its dependencies on a package manager &#8211; makes all analytics orders of magnitude easier.</p><p>Now, back to the question of impact. Here we should distinguish two kinds of impact: epistemic impact &#8211; how much a given study has contributed to reducing uncertainty in the Network &#8211; and pragmatic impact &#8211; how much it has contributed to improving decisions.
We&#8217;ll leave the pragmatic impact for later and focus on the epistemic impact for now.</p><p>So, for every model on the network, it&#8217;s easy to compute how much it contributed to FER flow across the network &#8211; what&#8217;s called credit assignment in neural networks. We just look at the net flow across the model boundary, which is accounted for by the Gaia Protocol:</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E_Wl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2baec940-c36b-41c0-91bc-c91c446377e7_1154x1130.webp"><img src="https://substackcdn.com/image/fetch/$s_!E_Wl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2baec940-c36b-41c0-91bc-c91c446377e7_1154x1130.webp" width="1154" height="1130" alt="" loading="lazy"></a></figure></div><p>Some care needs to be taken here. First of all, note that the FER credited to a model for its contributions is always computed by the model that&#8217;s&nbsp;<em>receiving</em> the contributions (it&#8217;s a&nbsp;<em>subjective&nbsp;</em>value). Plus, there might be significant differences in modeling practices between fields, which may distort the calculations. Later on, when we talk about economics, we&#8217;ll see that the Protocol also needs a way to turn that subjective value accounting into intersubjective, mutually agreed-upon &#8220;local exchange rates&#8221;.
For now, let&#8217;s just say that you compute a normalization constant for each domain and use it to get a normalized, apples-to-apples net FER flow across domains.</p><p>So this covers&nbsp;<a href="https://eightify.app/summary/economics-and-finance/retroactive-public-goods-funding-by-jonas-seiferth#:~:text=TLDR%20Retroactive%20public%20goods%20funding,instead%20of%20worrying%20about%20funding.">retroactive funding</a> in the form of prizes. But this isn&#8217;t (and shouldn&#8217;t be) how most science gets funded. Most researchers cannot internalize the risk and cost of self-funding their own work upfront and hoping for retroactive funding later. Instead, funders &#8211; who have access to cheaper capital, lower marginal risk sensitivity, and the other advantages that come with a big pile of cash &#8211; contract with researchers upfront to trade capital now for a future flow of impact. Before the Gaia Network, establishing effective contracts was very challenging, as it was extremely hard to predict impact, even for the researchers themselves, let alone for the funders. (In economic parlance, this was a classic principal-agent problem created by uncertainty and information asymmetry.) Now, the Gaia Network itself provides the solution: it contains&nbsp;<em>metascience models&nbsp;</em>that model the flow of FER across the network and use it to design interventions &#8211; adding more models and more data to specific fields and individual lines of research &#8211; that are likely to deliver the highest future flow of FER. 
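</p><p>To make the accounting concrete, here is a toy sketch of net FER flow across model boundaries with per-domain normalization. All model names, FER numbers and normalization constants are invented for illustration; this is a sketch of the idea, not the Protocol&#8217;s actual mechanism.</p>

```python
# Toy credit-assignment sketch: each edge records the free-energy reduction
# (FER) that a receiving model credits to a contributing model. A model's net
# FER flow is the credit it earned minus the credit it assigned to its own
# inputs. All names and numbers below are invented for illustration.
from collections import defaultdict

def net_fer_flow(edges):
    """edges: (contributor, receiver, fer) triples; `fer` is the free-energy
    reduction the receiver credits to the contributor (a subjective value)."""
    net = defaultdict(float)
    for contributor, receiver, fer in edges:
        net[contributor] += fer  # credit for reducing someone else's surprise
        net[receiver] -= fer     # debit for the help received
    return dict(net)

def normalize_by_domain(net, domain_of, scale):
    """Divide each model's net flow by a per-domain normalization constant,
    a stand-in for the 'local exchange rates' discussed above."""
    return {model: flow / scale[domain_of[model]] for model, flow in net.items()}

edges = [
    ("soil_model", "crop_model", 3.0),
    ("crop_model", "climate_model", 1.0),
    ("climate_model", "crop_model", 2.0),
]
net = net_fer_flow(edges)  # net flows sum to zero across the network
normalized = normalize_by_domain(
    net,
    domain_of={"soil_model": "agronomy", "crop_model": "agronomy",
               "climate_model": "climate"},
    scale={"agronomy": 1.0, "climate": 2.0},
)
```

<p>Note that the net flows sum to zero by construction: FER credit is conserved across the network, which is what makes it usable as an accounting currency.</p><p>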
Funders and researchers alike can use these models to guide where they should spend the most time and resources.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a> Compared with the recent past, when there were no meaningful metrics of scientific productivity or value added, let alone predictive models of how to improve them, the Gaia Network is a game-changer for science funding.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!thaz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4e24914a-302b-4b11-9284-773c405e4d11_607x535.webp" width="607" height="535" alt="" loading="lazy"></figure></div><h2>A distributed oracle for decision-making</h2><p>The above covers the advancement of science as its own objective. However, the same capabilities can aid&nbsp;<em>any</em> decision-making that pertains to the real world<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-16" href="#footnote-16" target="_self">16</a> &#8211; what we&#8217;ve called pragmatic impact above. Indeed, the Gaia Network has given everyone an actionable, reliable way to &#8220;trust the science&#8221; &#8211; not just on big things like climate change and pandemics, but also on day-to-day things like your diet, exercise, relationships, and so on. 
And the same applies to business decision-making, which is where we&#8217;ll focus next.</p><p>Say you manage Ada&#8217;s Acres, a large farming operation in the US Midwest. You&#8217;re planning the next planting for your 30,000 hectares, and as usual, your suppliers are trying to push you new seeds, new herbicides, and all manner of hardware and software. Meanwhile, your usual buyers are all calling to let you know that global demand forecasts are through the roof, so you stand to gain a lot of money if you have an outstanding harvest. However, you&#8217;ve noticed that the soil has grown increasingly poor and in need of fertilizer, and that herbicide resistance has increased a lot as well. The weather has been increasingly volatile, and you know it&#8217;s only a matter of time before you have a major crop failure. Maybe it&#8217;s time to start giving regenerative farming a real shot?</p><p>Luckily, your farm operations software is now connected to the Gaia Network. It gives you a predictive digital twin of your farm that directly learns not just from every scientific experiment in agronomy, but from the &#8220;natural experiments&#8221; carried out by every other farm that uses the Network. 
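</p><p>To give a flavor of the what-if queries such a digital twin could answer, here is a toy Monte Carlo sketch. Everything in it &#8211; the practice parameters, the soil-health feedback, the failure threshold &#8211; is invented for illustration; a real twin would be learned from the Network rather than hard-coded.</p>

```python
# Toy farm digital-twin query: Monte Carlo roll-outs of a candidate practice,
# scoring expected yield and probability of crop failure over several seasons.
# All parameters are invented for illustration.
import random

PRACTICES = {
    # practice: (mean yield t/ha, yield std dev, soil-health drift per season)
    "conventional": (9.0, 2.5, -0.05),
    "regenerative": (7.5, 1.5, +0.03),
}

def simulate(practice, seasons=5, runs=2000, failure_yield=4.0, seed=0):
    """Return (expected yield, probability that a season's yield counts as a
    crop failure) under a crude soil-health feedback model."""
    mean, std, drift = PRACTICES[practice]
    rng = random.Random(seed)
    yields, failures = [], 0
    for _ in range(runs):
        soil = 1.0  # relative soil-health multiplier
        for _ in range(seasons):
            season_yield = max(0.0, rng.gauss(mean * soil, std))
            yields.append(season_yield)
            failures += season_yield < failure_yield
            soil += drift  # conventional degrades soil; regenerative builds it
    return sum(yields) / len(yields), failures / (runs * seasons)

yield_conv, p_fail_conv = simulate("conventional")
yield_regen, p_fail_regen = simulate("regenerative")
```

<p>In this toy model, the conventional bundle wins on short-term expected yield while the regenerative one wins on failure probability &#8211; exactly the kind of short-term/long-term trade-off the twin is meant to surface.</p><p>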
So you can simulate the effects of any combination of practices, seed strains and products and estimate the outcomes, both short-term (expected yield and probability of crop failure for the next harvest) and long-term (soil health and herbicide resistance).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ypQC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcc1a814b-b3d9-48d0-aff8-a9cb84ef6cde_541x628.webp" width="541" height="628" alt="" loading="lazy"></figure></div><p>So that was a &#8220;small&#8221; (operational) use case. Now let&#8217;s zoom out to strategy<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-17" href="#footnote-17" target="_self">17</a>: let&#8217;s say that you&#8217;re the CEO of Acme Foods, a major food company. In light of increased droughts and crop failures, you&#8217;re trying to invest in your supply chain to minimize the risk of supply shortages. Your innovation teams have aggregated a long list of potential investments in precision farming, genomics, and regenerative agriculture. In the past, assembling an investment portfolio out of that long list would have required a long, expensive and very political negotiation exercise. Now that all your suppliers are connected to the Gaia Network and share limited access with you, your portfolio management system becomes a distributed digital twin of your supply chain. 
You can run complex distributed queries across all the nodes, simulate the effects of different investment combinations and different sets of assumptions (like climate and pest spread scenarios), factor in things like unintended consequences, and pull out an aggregate like a Pareto frontier for the investment profile you want.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!bMf9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cdde31c-2630-499a-9aaa-1e8c981659f7_782x521.webp" width="782" height="521" alt="" loading="lazy"></figure></div><p>Most of the demand for the intelligence in the Gaia Network will come from these&nbsp;<em>decision engines</em> (DEs), like the farm operations software and the portfolio management system. Combined with the Protocol&#8217;s ability to assign credit where it&#8217;s due, this demand can supply the signals and incentives for a better&nbsp;<em>supply</em> of intelligence: more and better models in the places where they are most needed by decision-makers. In a future paper, we will further develop our vision of how these signals can be developed into a complete market and contracting mechanism for directing applied research, exploration and analysis: what we call the&nbsp;<em>knowledge economy</em>.</p><p>Even further, if we have &#8220;non-local&#8221; DEs that use Gaia models to design coordinated strategies that internalize the benefits of cooperation between multiple agents, then we can turn those DEs into Gaia models themselves! 
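</p><p>As a concrete example of the aggregates such decision engines can pull out, here is a toy sketch of the Pareto-frontier computation from the Acme Foods example above. The candidate portfolios and their (return, risk) numbers are invented; a real DE would get them from distributed simulation.</p>

```python
# Toy Pareto-frontier extraction over simulated portfolio outcomes:
# higher expected return is better, lower risk is better; keep only
# portfolios not dominated on both objectives. Numbers are invented.

def pareto_frontier(portfolios):
    """portfolios: dict name -> (expected_return, risk)."""
    frontier = {}
    for name, (ret, risk) in portfolios.items():
        dominated = any(
            other_ret >= ret and other_risk <= risk
            and (other_ret, other_risk) != (ret, risk)
            for other, (other_ret, other_risk) in portfolios.items()
            if other != name
        )
        if not dominated:
            frontier[name] = (ret, risk)
    return frontier

candidates = {
    "precision_farming": (0.12, 0.30),
    "genomics":          (0.18, 0.45),
    "regenerative":      (0.10, 0.15),
    "mixed":             (0.11, 0.35),  # dominated by precision_farming
}
frontier = pareto_frontier(candidates)
```

<p>The frontier keeps the high-return, the low-risk, and the balanced options, and drops anything that is strictly worse than another candidate on both axes.</p><p>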
They become&nbsp;<em>decision models</em> performing &#8220;<a href="https://www.sciencedirect.com/science/article/pii/S1364661312001957">planning as inference</a>&#8221; on behalf of agents (individuals and collectives), helping to solve&nbsp;<em>all kinds of principal-agent problems</em>. In the example above, the food company can use a DE not only to infer the best investments for its own goals but also to design adequate contracts and incentives that will best align the goals and constraints of all the players in the supply chain. This&nbsp;<em>delegation economy&nbsp;</em>will also be further explored in a future paper.</p><h2>A distributed oracle for AI safety</h2><p>The above discussion of decision-making is our link to AI safety. Yoshua Bengio&nbsp;<a href="https://slideslive.com/39014230/towards-quantitative-safety-guarantees-and-alignment">has proposed</a> to tackle AI safety by building an &#8220;AI scientist&#8221; &#8211; a comprehensive probabilistic world model that would serve as a universal gatekeeper to evaluate the safety of&nbsp;<em>every</em> high-stakes action from&nbsp;<em>every</em> AI agent, instead of attempting to design safety&nbsp;<em>into</em> agents. This is similar to Davidad&#8217;s&nbsp;<a href="https://www.lesswrong.com/posts/jRf4WENQnhssCb6mJ/davidad-s-bold-plan-for-alignment-an-in-depth-explanation">Open Agency Architecture</a> (OAA) proposal. But of course, developing such a monolithic, centralized and comprehensive gatekeeper from scratch would be an extremely costly and lengthy undertaking. 
Further, as Bengio&#8217;s proposal makes clear, the AI scientist needs to have &#8220;epistemic humility&#8221;: its evaluations need to incorporate the limitations and uncertainty of its own model so that it doesn&#8217;t confidently allow actions that seem safe at the time but turn out to be unsafe in retrospect.</p><p>We argue that the Gaia Network, including the DEs that work as decision models, qualifies perfectly for the job of a distributed AI scientist. The DEs can query the diverse and constantly evolving knowledge in the network to form an &#8220;effective world model&#8221; with epistemic humility built in. They can provide the demand signals and resources to improve and expand the world model. They can then use this model to simulate counterfactual outcomes of actions that take into account all available local context and dependencies between contexts, and use these simulations to approximately estimate probabilities for outcomes (marginalization). They can factor in the preferences and safety constraints of all agents that use the Network, which they have already shared in order to enable the DEs to help with their own decisions. 
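</p><p>A minimal sketch of such an uncertainty-aware gate follows. It is purely illustrative &#8211; this is not Bengio&#8217;s actual formula, and the one-parameter world model is invented: we estimate the probability of harm by marginalizing over posterior samples of world dynamics, then gate on a pessimistic upper bound so that model uncertainty counts against the action.</p>

```python
# Toy epistemically-humble risk gate: marginalize harm probability over
# posterior samples of an (invented) one-parameter world model, then use an
# upper confidence bound so uncertain actions are rejected conservatively.
import math
import random

def p_harm_upper_bound(action_effect, world_samples, harm_threshold,
                       rollouts=200, z=2.0, seed=0):
    rng = random.Random(seed)
    estimates = []
    for sensitivity in world_samples:      # posterior over world dynamics
        harms = 0
        for _ in range(rollouts):          # marginalize over outcome noise
            outcome = action_effect * sensitivity + rng.gauss(0.0, 1.0)
            harms += outcome > harm_threshold
        estimates.append(harms / rollouts)
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / len(estimates)
    return min(1.0, mean + z * math.sqrt(var))  # pessimistic (humble) bound

def allow(action_effect, world_samples, harm_threshold, risk_budget=0.05):
    """Gatekeeper: permit the action only if even the pessimistic harm
    estimate stays within the risk budget."""
    return p_harm_upper_bound(action_effect, world_samples,
                              harm_threshold) <= risk_budget

posterior_rng = random.Random(1)
world_samples = [posterior_rng.gauss(1.0, 0.3) for _ in range(20)]
```

<p>Because the gate scores the upper bound rather than the mean, an action that merely&nbsp;<em>seems</em> safe under a poorly constrained model gets rejected until the model improves &#8211; the &#8220;epistemic humility&#8221; property.</p><p>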
This gives all the terms in Bengio&#8217;s notional risk evaluation formula (adapted from&nbsp;<a href="https://slideslive.com/39014230/towards-quantitative-safety-guarantees-and-alignment">slide 17 here</a>):</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!9oOX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2269006c-ebdd-434b-8e77-1706c82b0b17_645x371.webp" width="645" height="371" alt="" loading="lazy"></figure></div><p>Possibly the most important aspect of this design &#8211; which comes particularly to light when comparing it to the OAA design &#8211; is that&nbsp;<em>none of the above components is specific to AI safety</em>; they are just repurposed from existing and day-to-day use cases for which the users/agents already have the incentives to share the required information with the Gaia Network and the DEs. This means that tackling AI safety is no longer &#8220;<a href="https://www.lesswrong.com/posts/jRf4WENQnhssCb6mJ/davidad-s-bold-plan-for-alignment-an-in-depth-explanation#:~:text=one%20of%20the%20most%20ambitious%20scientific%20projects%20in%20human%20history">one of the most ambitious scientific projects in human history</a>&#8221;, but rather a &#8220;fringe benefit&#8221; from our pursuit of knowledge and of better decision-making. 
It also benefits, in turn, from all the improvements to the efficiency and effectiveness of those pursuits already produced by past and ongoing advances in computational statistics and machine learning &#8211; and from all that the Gaia Network will generate by connecting and interoperating the many millions of such models in existence and by increasing the ROI of creating and improving models.</p><p>This outcome does not depend on AI safety funders, nor on the foibles of political will in the scientific and policy communities, nor on the desire of billions of humans to independently share their preferences with an elicitor. All that is required &#8211; beyond some cheap work on core infrastructure, modeling and developer experience &#8211; is the same economic behaviors and incentives that exist today: the desire for profit, the pursuit of greater scientific knowledge, and the existence of institutions willing and able to internalize the cost of coordinated action.</p><p>An overview of this architecture, adapted from our&nbsp;<a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency">last post</a>, is given below.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!EyYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72b841d0-b466-494d-8eaa-8087fd2fb6db_1528x1600.webp" width="1456" height="1525" alt="" loading="lazy"></figure></div><h2>A distributed oracle for the Metacrisis</h2><p>The very same architecture helps us identify shared pathways through the Metacrisis. Below is a visual of the high-level causal model we have in mind when thinking about the Gaia Network&#8217;s role. By connecting all the relevant domain models and making apparent not only their interdependencies but also their common causes &#8211; the &#8220;generator functions&#8221; or underlying self-reinforcing dynamics &#8211; Gaia helps us understand the likely future outcomes of current trends and establish strategies with the highest potential for nudging our global course away from the two catastrophic attractors that currently seem most likely (chaos and totalitarianism). Not only that, but as we&#8217;ve seen, Gaia-powered DEs are also used as&nbsp;<em>coordination surfaces:&nbsp;</em>shared tools for establishing and monitoring contracts, treaties and institutions, with unprecedented scale and reliability. 
While this &#8220;infrastructure for model-augmented wisdom&#8221; doesn&#8217;t immediately or inherently solve conflicts of power and interests, it does provide a consistent, repeatable and scalable&nbsp;<em>institution</em> for achieving and retaining incremental advances towards a positive-sum, cooperative&nbsp;<a href="https://rkauf.medium.com/the-gaia-attractor-41e5af33f3b7">Gaia Attractor</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hchm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hchm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp 424w, https://substackcdn.com/image/fetch/$s_!hchm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp 848w, https://substackcdn.com/image/fetch/$s_!hchm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp 1272w, https://substackcdn.com/image/fetch/$s_!hchm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hchm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp" 
width="820" height="1679" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1679,&quot;width&quot;:820,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108964,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hchm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp 424w, https://substackcdn.com/image/fetch/$s_!hchm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp 848w, https://substackcdn.com/image/fetch/$s_!hchm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp 1272w, https://substackcdn.com/image/fetch/$s_!hchm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9e2d6b42-bdbd-4c14-9882-8aba8478074f_820x1679.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Adapted from Potentialism, via&nbsp;<a href="https://www.sloww.co/meta-crisis-101/">Sloww</a></figcaption></figure></div><h2>Conclusion: Back from the future</h2><p>We just claimed that <em>a lot</em> will change in &#8220;a few years from now&#8221;. How realistic is this? Here&#8217;s the&nbsp;<em>really&nbsp;</em>good news: all the capabilities described above can be implemented with today&#8217;s technology.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-18" href="#footnote-18" target="_self">18</a> Not only that: we&#8217;re already doing it. We have assembled several organizations and individuals into a growing&nbsp;<a href="https://gaiaconsortium.org/">Gaia Consortium</a>, and have of course been leveraging loads of existing components and building some of our own. 
Examples:</p><ul><li><p><a href="https://oceanprotocol.com/">Ocean Protocol</a> and&nbsp;<a href="https://source.network/defra-db">DefraDB</a>: Decentralized computing and data management.</p></li><li><p>Fangorn (coming soon): a decentralized platform for building and performing (active) inference on Gaia-connected state space models.</p></li><li><p><a href="https://www.sentient-hubs.com/">Sentient Hubs</a>: Federated model-based decision support.</p></li></ul><p>We are simultaneously working on specific applications of the Gaia Network, focusing primarily on bioregional economies and sustainable supply chains. These have been useful for providing concrete use cases (some of which we saw above) and resourcing. But ultimately we intend to evolve this into a fully open and collaborative R&amp;D effort to build the general-purpose capabilities described above.</p><p>If you&#8217;re interested in contributing to this work, here are some possible ways to do it:</p><ul><li><p>Students interested in developing this with us should sign up for the upcoming SPAR program. 
We&#8217;re advising two projects: one is focused on outstanding design, engineering, and economics issues around using Gaia-based AI scientists for AI safety; the other is more conceptual and centers on formalizing and computationally testing the use of free energy-based causal systems models for measuring AI safety.</p></li><li><p>If you&#8217;re interested in helping design and build the Gaia Network and the surrounding infrastructure, protocols, decision machinery, etc., we gladly accept contributions of all kinds.</p></li><li><p>If you&#8217;d like to&nbsp;<em>use&nbsp;</em>the Gaia Network (or its precursors) in your own use cases, we can happily support standing up &#8220;testnets&#8221; and help design prototypes and proofs of concept.</p></li><li><p>If you have resources to help accelerate development, we can gladly accept grant funding or other forms of support.</p></li></ul><p>If you&#8217;re interested in any of the above, please reach out!</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Below, the Gaia Network and its applications are described in both the present and future tense, in different narration modes. To avoid confusion, note that the Gaia Network is <em>not</em> yet implemented and deployed on a large scale.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>In a simple structure like this, a single backward propagation is enough, but there are cases where we need to update iteratively (message passing). 
For those cases, think of the net flow that is obtained after propagating up and down enough times.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>For instance, in variational inference algorithms, the free energy (or stochastic estimates of it) is directly used as a minimization objective. Equivalently, its negative, the Evidence Lower Bound [ELBO], is maximized.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>There is an additional concept of free energy associated with decision-making, corresponding to the discrepancy between the veridical posterior justified by priors and observations and the one &#8220;desired&#8221; in light of a given reward function/model/distribution.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>If you already have information that comes from past experiments, or knowledge elicited by independent experts, you <em>can</em> also incorporate it into the priors. The challenge is how to keep track of the grounding behind all of this imported information. 
This is, in a sense, what the Gaia Network does algorithmically, as we&#8217;ll see.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>See <a href="https://meta-analysis.com/download/criticismsofmeta-analysis.pdf">Criticisms of Meta-Analysis</a>; <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC122061/">Meta-analysis: Neither quick nor easy</a>;&nbsp; <a href="https://www.sciencedirect.com/science/article/abs/pii/S0020138322004235">Meta-analysis. What have we learned?</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Actually, even for the parameters of interest in your study, there is a high value in having access to past studies&#8217; posteriors: after having your posteriors &#8220;in isolation&#8221;, you now want to compare them to previous results in the literature, to check for novelty or consistency.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Technically, a pooled distribution is not a weighted average of <em>distributions</em> (that would be a <a href="https://en.wikipedia.org/wiki/Mixture_distribution">mixture distribution</a>); instead, it&#8217;s a distribution whose <em>parameters</em> are a weighted average (or other combination) of the parameters of the original distributions. 
Just so we&#8217;re clear: here we&#8217;re talking about statistical parameters of the posteriors of scientific parameters; for instance, the mean and variance of the average yield.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>In practice, different studies often use different model structures and local ontologies. Sometimes these are just syntax differences, such as alternative parameterizations (ex: centered vs uncentered parameters, etc), but often they represent different semantics &#8211; different statistical constructs, reflecting differences in context and/or scientific methodology. To enable aggregations to happen between models with these differences, translations are required. To this end, Gaia contributors often publish <em>lens models</em> that perform data translation. As an added benefit of this approach, in cases when there are different semantics that inevitably lead to a loss in translation (as WVO Quine pointed out and Chris Fields has recently formalized), it&#8217;s useful for there to be a separate lens model that accounts for and &#8220;absorbs&#8221; that loss.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>This does mean your model is colored by using the informative Gaia posteriors as priors for a parameter of interest. 
But you can always turn off the annotations for those parameters to isolate the effects of the information contributed by your study (aka the likelihood).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>In this example, these are independent scalar parameters, but they could be any multidimensional array with any kind of internal correlation structure.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>This is unlike a blockchain, which is designed to ensure that all nodes are &#8220;almost always&#8221; in full consensus about the entire contents of the global state (which then requires hacks like &#8220;L2&#8221; chains to improve speed and flexibility).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>How? It depends on the parameterization used, but in most cases, partial pooling brings posterior means closer together. You can have parameterizations with multiple modes, like a Gaussian mixture distribution, but this tends to imply that your parameter is actually representing multiple categories instead of a scalar, and should be changed to reflect that.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>No matter how small, Gaia eventually propagates every nonzero update to every parameter on the network, so we can have eventual consistency. 
The protocol can choose to batch small updates for efficiency.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Of course, no one cares about an abstract quantity like FER; they care about concrete advancements in specific areas of science. But that&#8217;s the same as saying that no one cares about money, only about the goods and services they can buy with it.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-16" href="#footnote-anchor-16" class="footnote-number" contenteditable="false" target="_self">16</a><div class="footnote-content"><p>That primarily means we&#8217;re excluding &#8220;teach AI how to play videogames&#8221; or &#8220;decide which next token to generate for a user&#8221; types of scenarios.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-17" href="#footnote-anchor-17" class="footnote-number" contenteditable="false" target="_self">17</a><div class="footnote-content"><p>We could zoom even further out to tackle the domain of strategy consulting and ask more &#8220;meta&#8221; questions. What are the theories of change, how do they connect to each other, how well developed are they, and how much of the effort is in explore vs. exploit mode? 
We will explore these further in a follow-up article.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-18" href="#footnote-anchor-18" class="footnote-number" contenteditable="false" target="_self">18</a><div class="footnote-content"><p>There are some areas where current solutions aren&#8217;t fully adequate, but these are matters of incremental progress, not qualitative breakthroughs.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Institutional economics through the lens of scale-free regulative development, morphogenesis, and cognitive science]]></title><description><![CDATA[In this article, I highlight some of the ideas from Mushtaq Khan&#8217;s interview for the 80,000 Hours podcast about institutional economics, political economy, his &#8220;political settlement&#8221; framework, and the methodology of economics, and connect these ideas to the concepts in scale-free regulative development]]></description><link>https://engineeringideas.substack.com/p/institutional-economics-through-the</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/institutional-economics-through-the</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Tue, 23 Jan 2024 18:18:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In this article, I highlight some of the ideas from <a href="https://80000hours.org/podcast/episodes/mushtaq-khan-institutional-economics/">Mushtaq Khan&#8217;s interview for the 80,000 Hours podcast about institutional economics</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>, political economy, his 
&#8220;political settlement&#8221; framework, and the methodology of economics, and connect these ideas to the concepts in scale-free regulative development<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, morphogenesis<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, and cognitive science<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><h3>The history of the socioeconomy matters</h3><p>Khan points to the fact that socioeconomies are <em>non-ergodic</em> dynamical systems with hysteresis, i.e., <em>memories</em> of their own [1]:</p><blockquote><p>To say that all structures are created by individuals, and therefore if the structure of society in India is different from the one in the United States, then we have to look at the individual incentives that created those structures, I think is a non-starter. It confuses the path dependence of history and the complexity of how structures are built up. Individuals today in India may not have any capacity of changing that structure to look like the one in the U.S. or Norway, not because they have some information deficit or anything like that, but because a structure itself has a reality and a meaning which affects the way individuals behave.</p></blockquote><p>Here&#8217;s the exactly parallel observation that Levin [3] makes about morphogenesis:</p><blockquote><p>Development is thus incredibly reliable, producing bodies to very tight tolerance despite considerable deviations and noise at the level of gene expression and cellular activity (Gonze et al. 2018; Eritano et al. 2020; Simon, Hadjantonakis, and Schroter 2018). 
This robustness, and its occasional failure in the case of birth defects, immediately suggests teleonomic perspectives because only goal-directed agents can make mistakes; biophysics alone cannot make mistakes &#8211; every micro-scale process proceeds according to the laws of physics and chemistry. Developmental defects are mistakes relative to the correct outcome toward which they strive.</p></blockquote><p>This means that from the perspective of scale-free teleonomy, <strong>socioeconomies must be &#8220;goal-directed agents that can make mistakes&#8221;</strong>. The causal (or &#8220;creation&#8221;) link between <em>individual behaviour</em> and <em>institutions</em> (socioeconomic structures) is <em>bidirectional</em> rather than unidirectional from individual behaviour to institutions.</p><p>Khan&#8217;s position is that viewing the economy exclusively through the lens of individual choice (behaviour) and the emergence of supply, demand, and market prices is overly reductive and misses opportunities for intervention and explanation. This <strong>non-reductionist approach</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a> <strong>to economics</strong> is, of course, shared by Levin with his non-reductionist approach to biology and medicine [3]:</p><blockquote><p>These attempts to view morphogenesis as not merely an emergent physical process but a goal-directed control loop have led to many new discoveries and novel capabilities in the prediction and control of anatomical outcomes that had not been discovered from prior bottom-up approaches, and which offer numerous advantages for regenerative medicine.</p></blockquote><h2>Education vs. management know-how and genes vs. 
cellular control know-how</h2><p>In the passage quoted above, Khan also seems to suggest that just increasing the level of individual education in society, though it may help, could be insufficient to change the economic structure and build an economy full of high-capability organisations [1]:</p><blockquote><p>Individuals today in India may not have any capacity of changing that structure to look like the one in the U.S. or Norway, not because they have some information deficit or anything like that, but because a structure itself has a reality and a meaning which affects the way individuals behave.</p></blockquote><p>Individual education (economics) seems to correspond to the knowledge encoded in <em>genes</em> (biology) [3]:</p><blockquote><p>Because genomes encode micro-level protein hardware, not directly specifying growth and form, it is essential to understand not only molecular mechanisms necessary for morphogenesis, but also the information-processing dynamics that are sufficient for the swarm intelligence of cell groups to create, repair, and reconstruct large-scale anatomical features.</p></blockquote><p>However, Khan distinguishes public or university education from the actual know-how of how to organise a factory, a hospital, a logistics company, and so on, which cannot be learned in school and comes mostly through practice [1]:</p><blockquote><p>[&#8230;] this has nothing much to do with what you learn in business school. It is to do with how you organize a whole team of people to operate seamlessly as an organic whole. And it sounds to us to be rather obvious, but this is an incredibly difficult thing to achieve. Take the example of a hospital in a developing country. Hospitals in developing countries have doctors who are very skilled. In fact, most of them would love to leave and take a job in an advanced country where they would perform perfectly well. They have all the machines that you require for a hospital. 
They have the drugs, or many or most of them. And yet their capacity to deliver good health services is very poor. The reason is not the quality of the people or the quality of the machines. It&#8217;s how it&#8217;s organized. Are you doing the cleaning properly? Are you managing the flow of tests so that the right tests go at the right time to the right doctor for the right patient? Are you managing your entry so that the beds are kept just about full enough, but not overly full? Are you managing your quality control and your ordering of spare parts? And this is where it fails. This is where universities don&#8217;t work in some countries, hospitals don&#8217;t work in other countries. Not because they don&#8217;t have professors. I&#8217;m from a country where universities don&#8217;t work very well, but there are many people like me who are from that country who are quite good professors. But the problem is not the professor. The problem is not the machine or the desk or the whiteboard. The problem is the organization, and how all of this is put together.</p></blockquote><p>In biology, this organisational and managerial know-how corresponds to the <em>capabilities to perform complex, fine-tuned control</em> that neurons, immune cells, and other types of cells acquire in the organism in the process of development while being surrounded by other cells (which could serve as their affordances) and having certain control goals<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a>. 
This knowledge is not encoded in the genes!</p><h3>Policies and regulations work only when there are networks of horizontal checks and balances that make following the policy mutually advantageous to the players</h3><p>Khan [1]:</p><blockquote><p>The idea that a policy is a black box that the government just announces and everyone starts following it is a total mistake. A government is just one organization amongst many organizations in society. And there is an interplay going on between governments, political parties, opposition parties, trade unions, churches, mosques, the people, different kinds of agencies and forums&#8230; All of them are trying to influence this policy outcome, but also &#8212; and this is critical &#8212; implementing it on a daily basis. And if the vast majority of your organizations and society are happily violating the rules and not checking each other, then it&#8217;s not going to be an implementable policy.</p></blockquote><p>These horizontal networks (for which Khan also uses terms such as <em>institutions</em>, <em>social structures</em>, <em>power structures</em>, <em>social organizations</em>, <em>legal systems</em>, <em>incentives</em>, etc.) determine the <em>ecological niches</em> and <em>affordance landscapes</em> within which the players (economic entities, agents) compete and cooperate with each other.</p><p>In biological morphogenesis, the equivalents of these horizontal networks and institutions are biochemical and <em>bioelectric networks</em> of communication between cells, tissues, and organs. 
According to Levin, <strong>bioelectric networks are the medium for encoding, communicating, and negotiating the high-level morphological (also metabolic and physiological) goals of the organism that subsystems (organs, tissues, and cells) tap into</strong> [3]:</p><blockquote><p>Memory (implemented by bioelectric networks or other mechanisms) is central to teleonomy as a mechanism for encoding future goal states. More generally, however, bioelectric states are a medium that binds individual cells toward large-scale goals &#8211; it underlies scale-up (Figure 5) and emergence of higher-level teleonomic individuals (Levin 2019), much as it does to create brains with emergent unified mental content out of a collection of individual neuronal cells. This is why disruptions of bioelectric communication, in the absence of genetic alterations or carcinogens, can initiate cancer in vivo - a shrinking of the size of goals from morphogenetic activity of normal maintenance to unicellular goals of maximum proliferation and migration (metastasis) (Levin 2021b); conversely, forcing appropriate bioelectric communication can normalize cells despite strong expression of oncogenes that otherwise induce tumors (Chernet and Levin 2014, 2013). The framework focused on inflating or shrinking the scale of the teleonomic activity leads directly to novel capabilities, in this case in the context of the cancer problem (Levin 2021b; Moore, Walker, and Levin 2017).</p></blockquote><p>Once the high-level goal (setpoint) is achieved, the bioelectric and biochemical networks maintain the organism in homeostasis. Khan describes how the network of incentives helped to establish and then maintain the effectiveness of skill-training programmes in Bangladesh. 
This could be seen as <strong>maintaining a balance in &#8220;workforce metabolism&#8221; by the socioeconomy</strong>, thanks to the horizontal network of checks and balances.</p><h3>Advanced organisations and institutions need each other</h3><p>Another key point in Khan&#8217;s political settlement theory is that <strong>advanced institutions (such as corporate law) are only demanded by sufficiently advanced (and hence high-capability) organisations</strong>. However, these organisations also don&#8217;t emerge <em>without</em> the institutions: the latter are necessary for the development of advanced organisations. Thus, socioeconomic development is necessarily a gradual, spiral-like process [1]:</p><blockquote><p>[&#8230;] powerful organizations which need rules for their own reproduction. They need rules for complex contracting. They need rules to raise finance in complex ways. They need to organize large numbers of people who are not known to them. They&#8217;re nameless, faceless people. So you need to have a contract-based rule of law system.</p><p>And so in advanced countries, generally speaking, most powerful organizations want a rule of law. And the difference in developing countries is that the powerful organizations are networks, which are informal patronage networks, kin networks, clientelist networks, tribal networks, religious networks, or even companies which are not that capable themselves, and their interactions with each other do not require a rule of law. And therefore a lot of their activity of lobbying, pressuring, and so on is informal. So I don&#8217;t think this is very much to do with culture or other kinds of things like that, although they do matter. It&#8217;s largely to do with the very nature of development, that developing countries have a preponderance of organizations that don&#8217;t need a rule of law. 
And yet, the fact that they don&#8217;t need a rule of law can stop high-capability organizations from developing.</p></blockquote><p>The scale-free regulative development view on this is that <strong>economic entities embody </strong><em><strong>multiscale competence architectures</strong></em><strong> (MCAs) a.k.a. </strong><em><strong>Quantum Reference Frames</strong></em><strong> (QRFs) </strong>[2]. The more sophisticated QRFs are, and the better they predict their environments, the more compressed the messages they can communicate across the organisational boundary; but to do this effectively, they need sophisticated <em>languages</em> or protocols, i.e., institutions.</p><p>Also, clearly, advanced organisations need an environment of other, at least comparably advanced organisations to cooperate and compete with via the institutional mediums.</p><p>The morphological space <strong>(morphospace) of the socioeconomy is thus the space of organisation of economic and social activity, embodied by entities, sub-entities (agents, humans, and AIs), and the patterns of their interaction</strong>: institutions, horizontal networks, corporate, national, and trade boundaries.</p><h3>Subsidies and protectionism help with industrial development when organisations don&#8217;t have the political networks to capture the rent</h3><p>Khan describes how subsidies worked to develop Korean industry: after WWII, the high-level networks between organisations and political figures (i.e., QRFs that are high in the hierarchy) had been decimated, because they had previously been &#8220;implemented with&#8221; the Japanese, who were all gone [1]:</p><blockquote><p>[&#8230;] the real difference was the nature of Japanese colonialism in breaking up many horizontal political networks in Korea. And that was because Japanese colonialism was a very aggressive and oppressive form of colonialism. 
It didn&#8217;t rule through intermediate classes in its colonies, it ruled directly, and it ruled through great force and great viciousness, which is why you will not find any Korean today or any Chinese today who has a good word to say about Japanese colonialism, because it was very rough.</p><p>One consequence of that was that those horizontal networks &#8212; which businesses have with politics and other groups and unions and so on &#8212; were actually decimated in South Korea. So, <strong>the business groups that emerged in the post-Japanese period did not have the networks to protect their rents, did not have the connections with politics.</strong> So, now in the 1960s <a href="https://en.wikipedia.org/wiki/Park_Chung-hee">Park Chung-hee</a> comes on and he starts trying things which are, in a sense, quite obvious. We can&#8217;t produce these things, so why don&#8217;t we give some export subsidies? Why don&#8217;t we give some protection? Why don&#8217;t we give them some low-cost loans from the publicly owned banks? Things which every developing country has tried. It&#8217;s not rocket science. It&#8217;s obvious, you can&#8217;t produce these things, your productivity is low, let&#8217;s help our businesses.</p><p>The difference was not that the South Koreans had innovated something called industrial policy. Everybody and their dog was trying it at that time. In fact, the South Koreans learned a lot from Pakistan, which also had a military government at that time and was doing exactly the same things: export subsidies, import protection, low-cost loans to large business houses, et cetera. [&#8230;] Why don&#8217;t we share these rents and prevent anyone from taking it away? <strong>The South Koreans couldn&#8217;t do that, because these companies were not connected to the banks, to the politicians, and so on. 
And therefore, when the state gave these subsidies and they said to them, &#8220;You have to achieve these export targets,&#8221; there was no way they could protect their rents if they failed to achieve the targets.</strong></p><p>These companies were quite happy to give kickbacks, by the way, to Park Chung-hee, to the top leaders. We know this now because there&#8217;s a lot of evidence about the corruption in the system at that time, just as we know the corruption in the Chinese system in the 1980s, when it was growing rapidly. The difference is this, if you&#8217;re Park Chung-hee and you know this company is not meeting its export target, but is willing to give me a kickback from my subsidy, do I want that, or do I want to give this subsidy to a company which will meet the export target and therefore will make lots of profits and therefore will be able to give me a kickback which is much bigger? Again, it&#8217;s not rocket science. If you&#8217;re Park Chung-hee you will say, &#8220;This is a failed company, it&#8217;s just giving me back some of my own money as a kickback. Why should I take that? I&#8217;ll close it down and I&#8217;ll shift that subsidy, that protection, to some other company.&#8221;</p></blockquote><p>This phenomenon looks exactly like <a href="https://en.wikipedia.org/wiki/Metamorphosis">metamorphosis</a>. To build new advanced structures (high-capability organisations), it&#8217;s sometimes not enough just to provide the energy (subsidies): some of the existing advanced structures also need to be destroyed. This is completely obvious when we talk about &#8220;spatial&#8221; (&#8220;substance&#8221;) re-structuring (which we may think of as what is mostly going on in standard biological metamorphosis). 
The important insight here is that information exchange networks should be thought of as just as physical<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a> as biological cell, tissue, and organ connections, and changing them requires expending free energy.</p><h3>Methodological pluralism and pragmatism</h3><p>Khan doesn&#8217;t say that his political settlement framework, or institutional economics more generally, answers <em>all</em> interesting questions about the economy. Thus, he is against a totalising view that any respectable economic theory should be a &#8220;theory of everything&#8221;. Instead, he endorses pragmatically picking up whichever existing economic theory seems best equipped to answer the question at hand, very much in the &#8220;all models are wrong, but some are useful&#8221; style [1]:</p><blockquote><p>I don&#8217;t think it&#8217;s actually even feasible to have a political economy textbook, because it would be too big. There are too many different questions. Political economy is basically about how the world works. You can&#8217;t put that in a textbook. There are going to be different positions. Those different positions are not a problem. They are a good thing. Because nobody knows the truth. Nobody knows the right answer. But we each have our own angle, our own take. We can explain lots of things, but others can explain things in a slightly different way. And so I would say a good policymaker needs to have an awareness of different schools of thought, different methods, and so on. For some problems, the neoclassical approach might be actually quite good. For other approaches, it might be not only useless but dangerous. It might actually make things worse.</p><p>[&#8230;] I think&nbsp;<strong>each framework has some basic questions that they&#8217;re asking, and then they&#8217;re answering them in some ways. 
If you look at the political settlements framework, it is asking a number of questions about policy implementation, and the interface between organizational power and institutions, and then asking you to locate your policy in a way that will make incremental changes effective.</strong>&nbsp;Then everything else follows from that. You need lots of different building blocks to make sense of it. As you rightly say, you can draw on neoclassical economics, you can draw on other political economy frameworks.</p><p>[&#8230;] In the same way, if you look at Acemoglu, Johnson, and Robinson, their framing question is, &#8220;Are your institutions extractive or inclusive?&#8221; And then everything else is circling around that. Or is it a limited access order or an open access order? What are the different types of limited access orders, and what are the doorstep conditions?</p><p>[&#8230;] In neoclassical economics, the organizing principle is that people negotiate their own agreements at an individual level. The market is nothing but a set of contracts between people which are voluntarily made. You can explain quite a lot in terms of the voluntary contracts that people make, and the reasons why they can&#8217;t make those voluntary contracts. But at a deeper level, you&#8217;re not asking,&nbsp;<strong>&#8220;Why do people behave like that? Which contracts are enforced? Which contracts aren&#8217;t enforced?&#8221; As soon as you start asking that, you can&#8217;t just look at individuals. You&#8217;re looking at power structures. You&#8217;re looking at society.</strong>&nbsp;You&#8217;re looking at history.</p><p>In that sense, neoclassical economics is at the bottom of this food chain in asking about the individual contracting, which is really important and useful, but, actually, most of the interesting questions are about why those contracts aren&#8217;t enforced. 
That is how Douglass North began the journey of institutional economics in saying property rights and contracts may exist but they&#8217;re often not enforced.</p><p>Enforcement, I think, is the key. Enforcement takes you to all of those political economy questions which different people are cutting in different ways. But if you look at all the different political economy frameworks, just asking the same basic question, &#8220;How is the organization of society and of the collective organization affecting how individuals behave?&#8221; Whereas&nbsp;<strong>the neoclassical is starting from the other end, saying, &#8220;Let&#8217;s take preferences as given. Let&#8217;s take the constraints as given. This is how individuals behave.&#8221; And political economy is saying, &#8220;That&#8217;s trivial. You forgot the most important parts of the story of how we got there. How they finally make the contract is the most trivial part of the story.&#8221;</strong></p><p>I think all of these things are connected. But I think that&nbsp;<strong>the really important questions about how social organization/social power affect how individuals behave, contract, the belief systems they have, how they enforce rules on each other, how they punish each other, these are historically specific questions&nbsp;</strong><em><strong>to which you cannot have a general theory.</strong></em>&nbsp;[&#8230;]</p></blockquote><h2>Open questions for future research</h2><h3>Morphological intelligence of socioeconomies</h3><p><em>Morphological intelligence</em> is the ability of an organism to solve problems in the morphospace, such as metamorphosis and regeneration. This is an important concept in Levin&#8217;s framework. It sounds like it would be good for our socioeconomies to be highly morphologically intelligent and be able to reinvent themselves without revolutions or special circumstances, such as the withdrawal of an oppressive colonial regime, like the Japanese regime in Korea discussed above. 
Note that morphological intelligence is separate from <em>metacognition</em>, which seems to correspond to governance and political processes in socioeconomies.</p><p>As far as I understand, it&#8217;s not yet clear what permits organisms of some species, such as frogs<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a> and salamanders<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>, to be more morphologically intelligent than others, such as humans. However, if it is confirmed to be in some sense strongly correlated with the complexity of the signals that are processed and exchanged by the constituent elements of morphologically intelligent organisms, this could be bad news for the complexity of human intelligence and consciousness. Some food for analogical thought in this direction: colonies of eusocial insects can regenerate large parts of their structure and &#8220;headcount&#8221;, i.e., are morphologically intelligent, although bees are sometimes considered highly intelligent among insects, unlike ants. Though, of course, insects are vastly dumber than humans.</p><p>Note that this would not contradict the experience of the great collectivist, developmental, and organisational projects of the 20th century, such as the <a href="https://en.wikipedia.org/wiki/Industrialization_in_the_Soviet_Union">industrialisation in the Soviet Union</a>, which are commonly believed to have been bottlenecked (or made very inefficient) by the poor education of the people: the ability to <em>organise</em> (development) is not the same as the ability to <em>re</em>-organise (morphological intelligence)! 
Nor would this potential negative finding (of requiring relatively &#8220;dumb&#8221; units for achieving high morphological intelligence on the collective level) repeat the common idea in dystopian sci-fi literature that people need to be &#8220;dumbed down&#8221; to ensure control, stability, and the &#8220;common good&#8221;: actually, a morphologically intelligent socioeconomy would be <em>agile</em> and <em>resilient</em> to external challenges, rather than extremely static and brittle, as commonly depicted in dystopias. Still, there are definitely dystopian overtones to this idea, and I think it deserves serious thought. Cf. the last paragraph <a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency#Open_research_questions_and_risks">here</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a> and Beren Millidge&#8217;s &#8220;<a href="https://www.lesswrong.com/posts/d74pb97TAqNKwJkc5/bcis-and-the-ecosystem-of-modular-minds">BCIs and the ecosystem of modular&nbsp;minds</a>&#8221;.</p><p>The morphological intelligence of socioeconomic structures may also depend on the scale: at least some companies, such as Amazon, Microsoft, and Adobe, have historically demonstrated impressive agility and capacity to reorganise in response to a changing environment, and this depended on their highly intelligent employees! 
So, low morphological intelligence of states may tell us more about these constructs than about the future of human intelligence and consciousness.</p><h3>Hyper-developmental biology and the distribution of organisational capabilities</h3><p>Hyper-developmental biology has been <a href="https://thoughtforms.life/what-groups-of-embryos-know-toward-a-hyper-developmental-biology/">proposed by Levin</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a> to describe the collective intelligence of a group of embryos developing together and sharing boundaries. This must be closely related to Khan&#8217;s emphasis on <em>the</em> <em>distribution of organisational capabilities</em> as the key to economic development and prosperity itself [1]:</p><blockquote><p>The aim is to help the development of capabilities and organizations with capabilities that eventually results in higher levels of welfare for people, better lives.</p></blockquote><blockquote><p>[&#8230;] it comes back to what your normative goals of development are. And I&#8217;ve given you my normative goal. My normative goal is, I want to see a broader spread of organizational capabilities. Because I think that underpins almost everything else. Even if you wanted to fight poverty, I would ask how do I get poor people to have the capabilities to produce things for themselves and engage in activities for themselves, rather than just give handouts.</p></blockquote><p>The observation that advanced organisations need (or &#8220;want&#8221;) other advanced organisations to be their suppliers, partners, and consumers that I noted in the section &#8220;Advanced organisations and institutions need each other&#8221; is just the simplest one. But I think something more insightful could be learned from biology on this topic. Or, vice versa, something could be learned from economics and applied to biology? 
For example, could the surprising result that cross-embryo morphogenetic assistance is more effective when <em>all</em> embryos are exposed to a chemical that disturbs normal embryo development than when only <em>half</em> of the embryos are exposed<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-14" href="#footnote-14" target="_self">14</a> be interpreted in economic terms, e.g., that &#8220;self-interested&#8221; embryos, like companies, are usually not eager to help their competitors when the latter are suddenly affected by some calamity?</p><p>People have already started to discuss the economic analogy in the <a href="https://thoughtforms.life/what-groups-of-embryos-know-toward-a-hyper-developmental-biology/#comment-640">comments</a> on Levin&#8217;s post!</p><h3>Lessons for the AI transition of the socioeconomy</h3><p>I&#8217;m primarily interested in the regulative development framework that I connected with institutional economics above for what it tells us about the likely trajectory of the AI transition of socioeconomies, and most importantly, <strong>how to design policies that will achieve the intended developmental goals in the context of the AI transition</strong>, which is the main focus of Khan&#8217;s work.</p><ul><li><p>How to prevent pathological growth of the socioeconomy due to the influx of AI agents (a la Mustafa Suleyman&#8217;s ACIs)?</p></li><li><p>What kind of language will help autonomous organisations to <em>persuade</em> <em>each other rationally</em> [4] and negotiate <em>nuanced</em> and <em>ethical</em> collective goals? 
See also the discussion of this theme in &#8220;<a href="https://engineeringideas.substack.com/p/worrisome-misunderstanding-of-the">Worrisome misunderstanding of the core issues with AI transition</a>&#8221;.</p></li><li><p>How to prevent shrinkage and simplification of human participation in the economy (e.g., due to <a href="https://www.lesswrong.com/posts/oSPhmfnMGgGrpe7ib/properties-of-current-ais-and-some-predictions-of-the#Cognitive_globalisation">cognitive globalisation</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-15" href="#footnote-15" target="_self">15</a>) that will lead to the reduction of people&#8217;s bargaining power and will adversely affect and may even destabilise the political balance and governance institutions?</p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Khan, M. &amp; Wiblin, R. (2021). <a href="https://80000hours.org/podcast/episodes/mushtaq-khan-institutional-economics/">Mushtaq Khan on using institutional economics to predict effective government reforms</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Fields, C., &amp; Levin, M. (2022). <em>Regulative development as a model for origin of life and artificial life studies</em> [Preprint]. PsyArXiv. <a href="https://doi.org/10.31234/osf.io/rdt7f">https://doi.org/10.31234/osf.io/rdt7f</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Levin, M. (2023). <em>Collective Intelligence of Morphogenesis as a Teleonomic Process</em> (pp. 
175&#8211;198). <a href="https://doi.org/10.7551/mitpress/14642.003.0013">https://doi.org/10.7551/mitpress/14642.003.0013</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Levin, M. (2022). Technological Approach to Mind Everywhere: An Experimentally-Grounded Framework for Understanding Diverse Bodies and Minds. <em>Frontiers in Systems Neuroscience</em>, <em>16</em>, 768201. <a href="https://doi.org/10.3389/fnsys.2022.768201">https://doi.org/10.3389/fnsys.2022.768201</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Friston, K. J., Ramstead, M. J. D., Kiefer, A. B., Tschantz, A., Buckley, C. L., Albarracin, M., Pitliya, R. J., Heins, C., Klein, B., Millidge, B., Sakthivadivel, D. A. R., Smithe, T. S. C., Koudahl, M., Tremblay, S. E., Petersen, C., Fung, K., Fox, J. G., Swanson, S., Mapes, D., &amp; Ren&#233;, G. (2022). <em>Designing Ecosystems of Intelligence from First Principles</em> (arXiv:2212.01354). arXiv. <a href="http://arxiv.org/abs/2212.01354">http://arxiv.org/abs/2212.01354</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This approach is also called <a href="https://plato.stanford.edu/archives/win2023/entries/holism-social/">methodological holism</a>. 
However, Khan also recognises individualist explanations of the socioeconomic phenomena: see the &#8220;Methodological pluralism and pragmatism&#8221; section below.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Fields, C., Friston, K., Glazebrook, J. F., Levin, M., &amp; Marcian&#242;, A. (2022). The free energy principle induces neuromorphic development. <em>Neuromorphic Computing and Engineering</em>, <em>2</em>(4), 042002. <a href="https://doi.org/10.1088/2634-4386/aca7de">https://doi.org/10.1088/2634-4386/aca7de</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Fields, C., Fabrocini, F., Friston, K., Glazebrook, J. F., Hazan, H., Levin, M., &amp; Marcian&#242;, A. (2023). <em>Control flow in active inference systems</em>. OSF Preprints. <a href="https://doi.org/10.31219/osf.io/8e4ra">https://doi.org/10.31219/osf.io/8e4ra</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Fields, C., Glazebrook, J. F., &amp; Marciano, A. (2022). The Physical Meaning of the Holographic Principle. <em>Quanta</em>, <em>11</em>(1), 72&#8211;96. <a href="https://doi.org/10.12743/quanta.v11i1.206">https://doi.org/10.12743/quanta.v11i1.206</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p>Vandenberg, L. N., Adams, D. S., &amp; Levin, M. (2012). 
Normalized shape and location of perturbed craniofacial structures in the Xenopus tadpole reveal an innate ability to achieve correct morphology. <em>Developmental Dynamics: An Official Publication of the American Association of Anatomists</em>, <em>241</em>(5), 863&#8211;878. <a href="https://doi.org/10.1002/dvdy.23770">https://doi.org/10.1002/dvdy.23770</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Arenas G&#243;mez, C. M., &amp; Echeverri, K. (2021). Salamanders: The molecular basis of tissue regeneration and its relevance to human disease. <em>Current Topics in Developmental Biology</em>, <em>145</em>, 235&#8211;275. <a href="https://doi.org/10.1016/bs.ctdb.2020.11.009">https://doi.org/10.1016/bs.ctdb.2020.11.009</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>Kaufmann, R., &amp; Leventov, R. (2023). <em>Gaia Network: A practical, incremental pathway to Open Agency Architecture</em>. <a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency">https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Levin, M. (2024, January 17). <em>What groups of embryos know: Toward a hyper-developmental biology</em>. Forms of Life, Forms of Mind. 
<a href="https://thoughtforms.life/what-groups-of-embryos-know-toward-a-hyper-developmental-biology/">https://thoughtforms.life/what-groups-of-embryos-know-toward-a-hyper-developmental-biology/</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-14" href="#footnote-anchor-14" class="footnote-number" contenteditable="false" target="_self">14</a><div class="footnote-content"><p>Tung, A., Sperry, M. M., Clawson, W., Pavuluri, A., Bulatao, S., Yue, M., Flores, R. M., Pai, V. P., McMillen, P., Kuchling, F., &amp; Levin, M. (2024). Embryos assist morphogenesis of others through calcium and ATP signaling mechanisms in collective teratogen resistance. <em>Nature Communications</em>, <em>15</em>(1), Article 1. <a href="https://doi.org/10.1038/s41467-023-44522-2">https://doi.org/10.1038/s41467-023-44522-2</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-15" href="#footnote-anchor-15" class="footnote-number" contenteditable="false" target="_self">15</a><div class="footnote-content"><p>Leventov, R. (2022). <em>Properties of current AIs and some predictions of the evolution of AI from the perspective of scale-free theories of agency and regulative development</em>. 
<a href="https://www.lesswrong.com/posts/oSPhmfnMGgGrpe7ib/properties-of-current-ais-and-some-predictions-of-the">https://www.lesswrong.com/posts/oSPhmfnMGgGrpe7ib/properties-of-current-ais-and-some-predictions-of-the</a></p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Worrisome misunderstanding of the core issues with AI transition]]></title><description><![CDATA[This post is triggered by &#8220;Generative AI dominates Davos discussions as companies focus on accuracy&#8221; (CNBC) and &#8220;AI has a trust problem &#8212; meet the startups trying to fix it&#8221; (Sifted).]]></description><link>https://engineeringideas.substack.com/p/worrisome-misunderstanding-of-the</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/worrisome-misunderstanding-of-the</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Fri, 19 Jan 2024 10:05:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EJj2!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F55443e67-0a66-4164-8473-09b13a2e90ca_256x256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is triggered by &#8220;<a href="https://www.cnbc.com/2024/01/17/generative-ai-dominates-davos-discussions-as-companies-focus-on-accuracy.html">Generative AI dominates Davos discussions as companies focus on accuracy</a>&#8221; (CNBC) and &#8220;<a href="https://sifted.eu/articles/ai-has-a-trust-problem-meet-the-startups-trying-to-fix-it">AI has a trust problem &#8212; meet the startups trying to fix it</a>&#8221; (Sifted).</p><p>It's just remarkable (and worrying) how business leaders and journalists misunderstand the core issues with AI adoption and transition<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. 
All they talk about is "accuracy", "correctness", and "<em>proving</em> that AI is <em>actually</em> <em>right</em>"(!). The second piece has a hilarious passage &#8220;Cassar says this aspect of AI systems creates a trust issue because it goes against the <em>human instinct to make &#8216;rule-based&#8217; decisions</em>.&#8221;(!)</p><p>There are many short- and medium-term applications where this "rule-following and accuracy" framing of the issue is correct, but they are all, by necessity, about <strong>automating and greasing bureaucratic procedures</strong> and formal compliance with rule books: filing tax forms, checking compliance with the law, etc. But these applications are not intrinsically productive, and on a longer time scale, they may lead to a <a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons effect</a>: the cheaper bureaucratic compliance becomes, the more it is demanded, without actually making coordination, cooperation, and control more reliable and safe.</p><h2>"Factual accuracy" and hallucinations are the lowest-hanging pieces of context alignment</h2><p>Taking the viewpoints of information theory<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a>, philosophy of language, and institutional economics<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>, it's not the sophistication of bureaucracies that reduces the cumulative risk exposure and transaction costs of the interaction between humans, AIs, and organisations. Rather, it's <strong>building </strong><em><strong>shared reference frames</strong></em><strong> (shared </strong><em><strong>language</strong></em><strong>) for these agents to communicate about their preferences, plans, and risks</strong>. 
The sophistication of bureaucratic procedures sometimes does have this effect (new concepts are invented that increase the expressiveness of communication about preferences, plans, and risks), but this is only an accidental byproduct of the bureaucratisation process. And then, making AIs use language effectively to communicate with humans and each other is not an "accuracy" or "factual correctness" problem; it's the <em>context</em> <em>(meaning, intent, outer) alignment</em> problem.</p><p>Indeed, <strong>this is the core problem that Perplexity, Copilot, Bard, OpenAI, and other universal RAG helpers are facing: alignment with users&#8217; context, on a hierarchy of timescales: pre-training</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a><strong>, fine-tuning</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a><strong>, RAG dataset curation, and </strong><em><strong>online alignment through a dialogue with the user</strong></em><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-6" href="#footnote-6" target="_self">6</a>. Preventing outright hallucinations is just the lowest-hanging part of this problem. And <strong>&#8220;aligning LLMs with human values&#8221; is hardly a part of this problem at all</strong>. 
Perhaps the fact that this kind of &#8220;value alignment&#8221; is surprisingly ineffective in combating jailbreaks suggests that jailbreaks expose a deeper problem: <em>misunderstanding of the user&#8217;s context</em> (and therefore the user&#8217;s <em>intent</em>, which is <em>in the coupling between the user and their environment/context</em>, from the enactivist perspective).</p><p>Then, as far as scale-free biology and Active Inference agency are concerned<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-7" href="#footnote-7" target="_self">7</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-8" href="#footnote-8" target="_self">8</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-9" href="#footnote-9" target="_self">9</a>, there is no difference between <em>understanding</em> a context and <em>alignment with</em> a context, and hence we have the <strong><a href="https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post">Waluigi effect</a> that can only be addressed on the meta-cognitive level</strong> (output filters, meta-cognitive dialogue, and other approaches). Therefore, <strong>sharing arbitrarily capable &#8220;bare&#8221; LLMs in open source is inherently risky, and there is no way to fix this with pre-training or fine-tuning</strong>. 
Humans have evolved to have obligatory meta-cognition for a good reason!</p><h2>Real &#8220;safe and reliable reasoning&#8221; is compositional reasoning and provably correct computing</h2><p><strong>It's richer language, better context alignment, and better capacities for (compositional, collective) reasoning, bargaining, planning, and decision-making</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-10" href="#footnote-10" target="_self">10</a><strong>&nbsp;that make the economy more productive and civilisation (and </strong><em><strong>Gaia</strong></em><strong>) safer at the end of the day, not &#8220;better bureaucracies&#8221;.</strong> To a degree, we can also think about bureaucracies as scaffolding for "better reasoning, bargaining, planning, and decision-making". There is a grain of truth in this view, but again, nobody currently thinks about bureaucracies, rule books, and compliance in this way, and this only happens as an accidental side-effect of bureaucratisation.</p><p>In this sense, <strong>making LLMs "accurate" and "correct" followers of some formal rules hardly moves the needle of </strong><em><strong>reasoning correctness (accuracy)</strong></em><strong> forward</strong>. The right agenda for improving the correctness and accuracy of reasoning is scaffolding it in (or delegating it to) more "traditional" computing paradigms: symbolic and statistical, such as algorithms written in probabilistic programming languages (calling out to NN modules), or other neurosymbolic frameworks, and <strong>generating mathematical proofs of correctness for the algorithms</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-11" href="#footnote-11" target="_self">11</a>. 
The last two miles of safety on this agenda would be</p><ol><li><p>Proving the correctness of the NN components themselves by treating them as humongous but precise statistical algorithms to rule out some forms of <a href="https://www.lesswrong.com/tag/deceptive-alignment">deceptive alignment</a><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-12" href="#footnote-12" target="_self">12</a>, and</p></li><li><p>Generating correctness and tamper-safety proofs for the <em>hardware</em>&nbsp;that is going to run the above software (see Tegmark and Omohundro, 2023).</p></li></ol><h2>The bottom line: AI safety = context alignment + languages and protocols + provably correct computing + governance mechanisms, incentive design, and immune systems</h2><p><strong>AI safety =</strong><br><strong>Context alignment</strong> throughout pre-training, fine-tuning, and online inference&nbsp;<strong>+</strong><br><strong>Languages and protocols</strong> for context alignment and (collective) reasoning (negotiation, bargaining, coordination, planning, decision-making) about preferences, plans, and risk bounding to make them (alignment and reasoning) effective, precise, and compositional&nbsp;<strong>+</strong><br><strong>Provably correct computing</strong> <strong>+</strong><br>(Not covered in this post) <strong>governance mechanisms, incentive design, and immune systems</strong> to negotiate and encode high-level, collective preferences, goals, and plans and ensure that the collective sticks to the current versions of these.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-13" href="#footnote-13" target="_self">13</a></p><p>Note that <strong>&#8220;control&#8221;, &#8220;rule following&#8221; (a.k.a. bureaucratisation), &#8220;trust&#8221;, and &#8220;value alignment&#8221; are not parts of the above decomposition of the problem of making beneficial transformative AI</strong> (cf. 
Davidad&#8217;s <a href="https://www.lesswrong.com/posts/Wi4RAJCbh3qD9fynj/ai-neorealism-a-threat-model-and-success-criterion-for">AI Neorealism</a>). In some sense, they emerge or follow from the interaction of the components listed above.</p><p>In general, I&#8217;m a <a href="https://www.lesswrong.com/posts/FnwqLB7A9PenRdg4Z/for-alignment-we-should-simultaneously-use-multiple-theories">methodological pluralist</a> and open to the idea that &#8220;control&#8221; and &#8220;value alignment&#8221; frames capture <em>something</em> about AI safety and alignment that is not captured by the above decomposition. Still, I think that something is very small and not commensurate with the attention share that these frames receive from the public, key decision-makers, and even the AGI labs and the AI safety community. <strong>This is ineffective and could also instil dangerous overconfidence and delude decision-makers and the public about the actual progress on AI safety and the risks of the AI transition</strong>.</p><p>Even then, bureaucratisation is probably just net harmful.</p><p>&#8220;Trust&#8221;, while important from the sociotechnical perspective and for optimal adoption of the technology, should not result in oversimplification of algorithms and concepts so that people understand them: this would just increase the &#8220;alignment tax&#8221;, would ultimately be futile, and would <em>also</em> be unnecessary if we have mathematical proofs for the correctness of protocols and algorithms. 
So, I think that to address the trust issue, AI developers and the AI community will ultimately need to <strong>educate decision-makers and the public about the difference between &#8220;trust in science&#8221; (context alignment) and &#8220;trust in math&#8221; (algorithms and computing), being vigilant about the former, and not unduly questioning the latter.</strong></p><div><hr></div><p><em>This post was originally published on <a href="https://www.lesswrong.com/posts/xqQDsDovPaiYGhXnE/worrisome-misunderstanding-of-the-core-issues-with-ai">LessWrong</a>.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I realise that business leaders may also not be <em>interested</em> in this problem, but then it&#8217;s <em>our</em> problem, i.e., the AI (safety) community&#8217;s and the public&#8217;s, to influence the businesses to recognise the problem, or else businesses will externalise the risks onto all of us.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Fields, C., Fabrocini, F., Friston, K., Glazebrook, J. F., Hazan, H., Levin, M., &amp; Marcian&#242;, A. (2023). <em>Control flow in active inference systems</em>. OSF Preprints. <a href="https://doi.org/10.31219/osf.io/8e4ra">https://doi.org/10.31219/osf.io/8e4ra</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>Khan, M. &amp; Wiblin, R. (2021). 
<a href="https://80000hours.org/podcast/episodes/mushtaq-khan-institutional-economics/">Mushtaq Khan on using institutional economics to predict effective government reforms</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>This is what OpenAI&#8217;s <a href="https://www.lesswrong.com/posts/Hna4aoMwr6Qx9rHBs/linkpost-introducing-superalignment">Superalignment agenda</a> is about.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>This is what Stuart Armstrong and Rebecca Gorman&#8217;s <a href="https://buildaligned.ai/company">Aligned AI</a> seems to tackle.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-6" href="#footnote-anchor-6" class="footnote-number" contenteditable="false" target="_self">6</a><div class="footnote-content"><p>This online inference is usually called &#8220;in-context learning&#8221; for LLMs, though note that the meaning of the word &#8220;context&#8221; is very different in this phrase from the meaning of &#8220;context&#8221; in quantum free energy principle<sup> </sup>and information theory.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-7" href="#footnote-anchor-7" class="footnote-number" contenteditable="false" target="_self">7</a><div class="footnote-content"><p>Fields, C., Friston, K., Glazebrook, J. F., &amp; Levin, M. (2022). A free energy principle for generic quantum systems. <em>Progress in Biophysics and Molecular Biology</em>, <em>173</em>, 36&#8211;59. 
<a href="https://doi.org/10.1016/j.pbiomolbio.2022.05.006">https://doi.org/10.1016/j.pbiomolbio.2022.05.006</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-8" href="#footnote-anchor-8" class="footnote-number" contenteditable="false" target="_self">8</a><div class="footnote-content"><p>Pezzulo, G., Parr, T., Cisek, P., Clark, A., &amp; Friston, K. (2023). <em>Generating Meaning: Active Inference and the Scope and Limits of Passive AI</em> [Preprint]. PsyArXiv. <a href="https://doi.org/10.31234/osf.io/8xgzv">https://doi.org/10.31234/osf.io/8xgzv</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-9" href="#footnote-anchor-9" class="footnote-number" contenteditable="false" target="_self">9</a><div class="footnote-content"><p>Fields, C., &amp; Levin, M. (2020). How Do Living Systems Create Meaning? <em>Philosophies</em>, <em>5</em>. <a href="https://doi.org/10.3390/philosophies5040036">https://doi.org/10.3390/philosophies5040036</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-10" href="#footnote-anchor-10" class="footnote-number" contenteditable="false" target="_self">10</a><div class="footnote-content"><p><a href="https://www.lesswrong.com/tag/open-agency-architecture">Open Agency Architecture</a>, <a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency">Gaia Network</a>, Friston et al.&#8217;s <a href="https://arxiv.org/abs/2212.01354">Ecosystems of Intelligence</a>, <a href="https://www.lesswrong.com/tag/infra-bayesianism">Infra-Bayesianism</a>, and probably Conjecture&#8217;s <a href="https://www.lesswrong.com/posts/ngEvKav9w57XrGQnb/cognitive-emulation-a-naive-ai-safety-proposal">CoEms</a> are the agendas that I&#8217;m aware of that approach the design of effective, precise, and compositional (collective) reasoning languages and protocols.</p></div></div><div 
class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-11" href="#footnote-anchor-11" class="footnote-number" contenteditable="false" target="_self">11</a><div class="footnote-content"><p>Tegmark, M., &amp; Omohundro, S. (2023). <em>Provably safe systems: The only path to controllable AGI</em> (arXiv:2309.01933). arXiv. <a href="https://doi.org/10.48550/arXiv.2309.01933">https://doi.org/10.48550/arXiv.2309.01933</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-12" href="#footnote-anchor-12" class="footnote-number" contenteditable="false" target="_self">12</a><div class="footnote-content"><p>However, I don&#8217;t know how to deal with evolutionary and dynamic NN architectures, such as <a href="http://liquid.ai/">Liquid.ai</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-13" href="#footnote-anchor-13" class="footnote-number" contenteditable="false" target="_self">13</a><div class="footnote-content"><p>Governance mechanisms should also include secession protocols for hedging against value lock-in and meta-ethical opportunity cost, but this is far outside the scope of this post.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[AGI will be made of heterogeneous components]]></title><description><![CDATA[Transformer and Selective SSM blocks will be among them. 
AI Safety R&D and AI governance implications]]></description><link>https://engineeringideas.substack.com/p/agi-will-be-made-of-heterogeneous</link><guid isPermaLink="false">https://engineeringideas.substack.com/p/agi-will-be-made-of-heterogeneous</guid><dc:creator><![CDATA[Roman Leventov]]></dc:creator><pubDate>Wed, 27 Dec 2023 17:15:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Xt5F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is prompted by two recent pieces:</p><p>First, in the podcast "<a href="https://www.cognitiverevolution.ai/emergency-pod-mamba-memory-and-the-ssm-moment/">Emergency Pod: Mamba, Memory, and the SSM Moment</a>", Nathan Labenz described how he sees that we are entering the era of heterogeneity in AI architectures because currently we have not just one fundamental block that works very well (the Transformer block), but two kinds of blocks: the Selective SSM (Mamba) block has joined the party. These are natural opposites on the tradeoff scale between episodic cognitive capacity (Transformer's strong side) and long-term memorisation (selective SSM's strong side). 
So, we will probably quickly see the emergence of complicated hybrids between these two, trying to get the best from both types of blocks<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>.</p><p>This reminds me of John Doyle's <a href="https://docs.google.com/presentation/d/1reO-UK_RGrcVs4WY3-pYf9SexMerEvbTeTXckFXhUs8/edit#slide=id.g2478d5c2b63_0_5">architecture theory</a> that predicts that AI architectures will evolve towards modularisation and component heterogeneity, where the properties of different components (i.e., their positions at different tradeoff spectrums) will converge to reflect the statistical properties of heterogeneous objects (a.k.a. natural abstractions, patterns, "pockets of computational reducibility") in the environment.</p><p>Second, in <a href="https://systemsworld-club.translate.goog/t/strategii-praktiki-stili-metody-algoritmy-evolyuczii/8313?_x_tr_sl=auto&amp;_x_tr_tl=en&amp;_x_tr_hl=en-US&amp;_x_tr_pto=wapp&amp;_x_tr_hist=true">this article</a>, Anatoly Levenchuk rehearses the "no free lunch" theorem and enumerates some of the development directions in algorithms and computing that continue in the shadows of the currently dominant LLM paradigm, but are still going to be several orders of magnitude more computationally efficient than DNNs in some important classes of tasks: multi-physics simulations, discrete ("system 2") reasoning (planning, optimisation), theorem verification and SAT-solving, etc. All these diverse components are going to be plugged into some "<a href="https://twitter.com/karpathy/status/1707437820045062561">AI operating system</a>", Toolformer-style. 
Then Anatoly posits an important conjecture (slightly tweaked by me): <strong>as it doesn't make sense to discuss some person's "values" without considering (a) them in the context of their environment (family, community, humanity) and (b) their </strong><em><strong>education</strong></em><strong>, it's pointless to discuss the alignment properties and "values" of some "core" AGI agent architecture without considering the whole context of a quickly evolving "<a href="https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model">open agency</a>" of various tools and specialised components</strong><a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a><strong>.</strong></p><p>From these ideas, I derive the following conjectures about an "AGI-complete" architecture<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a>:</p><p><strong>1. 
AGI could be achieved by combining </strong><em><strong>just</strong></em><br>&nbsp; &nbsp;<strong>(a) about five core types of DNN blocks</strong> (Transformer and Selective SSM are two of these, and most likely some kind of <a href="https://arxiv.org/abs/2310.01267">Graph Neural Network with or without flexible/dynamic/"liquid" connections</a> is another one, and perhaps a few more)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-4" href="#footnote-4" target="_self">4</a>;<br>&nbsp; &nbsp;<strong>(b) a few dozen classical algorithms</strong> for <a href="https://www.lesswrong.com/tag/language-model-cognitive-architecture">LMAs</a> aka "<a href="https://arxiv.org/abs/2305.05364">LLM programs</a>" (better called "NN programs" in the more general case), from search and algorithms on graphs to dynamic programming, to orchestrate and direct the inference of the DNNs; and<br>&nbsp; &nbsp;<strong>(c) about a dozen or two key LLM tools required for generality</strong>, such as a multi-physics simulation engine like <a href="https://juliahub.com/products/juliasim/">JuliaSim</a>, a symbolic computation engine like Wolfram Engine, a theorem prover like <a href="https://github.com/leanprover/lean4">Lean</a>, etc.</p><p><strong>2. The AGI architecture described above will not be perfectly optimal, but it will probably be within an order of magnitude of the optimal compute efficiency on the tasks it is supposed to solve</strong><sup>(see footnote 3)</sup>, so, considering the investments in interpretability, monitoring, anomaly detection, red teaming, and other strands of R&amp;D about the incumbent types of DNN blocks and NN program/agent algorithms, as well as economic incentives of modularisation and component re-use (cf. "<a href="https://www.lesswrong.com/posts/d74pb97TAqNKwJkc5/bcis-and-the-ecosystem-of-modular-minds">BCIs and the ecosystem of modular minds</a>"), this will probably be a sufficient motivation to <strong>"lock in" the choices of the core types of DNN blocks</strong> that were used in the initial versions of AGI.</p><p><strong>3. In particular, the Transformer block is very likely here to stay until and beyond the first AGI architecture</strong> because of the enormous investment in it in terms of computing optimisation, specialisation to different tasks, R&amp;D know-how, and interpretability, and also, as I already noted above, because Transformer maximally optimises for episodic cognitive capacity, and from the perspective of the architecture theory, it's valuable to have a DNN building block that occupies an extreme position on some tradeoff spectrum. (Here, I pretty much repeat the idea of Nathan Labenz, who said in his podcast that we are entering the "Transformer+" era rather than a "post-Transformer" era.)</p><h2>Implications for AI Safety R&amp;D</h2><p>The three conjectures that I've posited above sharply contradict another view (which seems to me broadly held by a lot of people in the AI safety community) in which a complete overhaul of the AI architecture landscape is expected when some shiny new block architecture that beats all the incumbents is invented<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-5" href="#footnote-5" target="_self">5</a>.</p><p>It's hard for me to state the implications of taking one side in this crux in the abstract, but to take a more concrete example, I think this position informs my inference that working on <a href="https://www.lesswrong.com/posts/YN7PHizHnxinLsKvy/sociallm-a-language-model-design-for-personalised-apps#Architecture_and_training">an architecture that combines Transformer and Selective SSM blocks and training techniques to engineer an inductive bias for greater 
"self-other overlap"</a> is an R&amp;D agenda with a relatively high expected impact. Compare with <a href="https://www.lesswrong.com/posts/qAdDzcBuDBLexb4fC/the-neglected-approaches-approach-ae-studio-s-alignment?commentId=ysxE8qui5wczSWYSR#ysxE8qui5wczSWYSR">this inference by Marc Carauleanu</a> (note: I don't state that he necessarily expects a complete AI architecture overhaul at some point, but it seems that somebody who thought that would agree with him that working on combining Transformer and Selective SSM blocks for safety is of low expected impact because the AGI that might make a sharp left turn will contain neither Transformer nor Selective SSM blocks).</p><h3>System-level explanation and control frameworks, mechanism design</h3><p>Both Drexler's Open Agency Model and Conjecture's CoEms are modular and heterogeneous as I predict the AGI architecture will be anyway, but I remarked in the comments to both that component-level alignment and interpretability is not enough to claim that the system as a whole is aligned and interpretable (<a href="https://www.lesswrong.com/posts/5hApNw5f7uG8RXxGS/the-open-agency-model?commentId=bKedALuWyWzMnMkF4">1</a>, <a href="https://www.lesswrong.com/posts/ngEvKav9w57XrGQnb/cognitive-emulation-a-naive-ai-safety-proposal?commentId=sZEfPMCSheNPwpLKD">2</a>).</p><p>My conjectures above call for <strong>more work on scientific frameworks to explain the behaviour of intelligent systems made of heterogeneous components (NNs or otherwise), and engineering frameworks for steering and monitoring such systems.</strong>&nbsp;</p><p>On the scientific side, see <a href="https://www.lesswrong.com/tag/free-energy-principle">Free Energy Principle/Active Inference</a> in all of its guises, <a href="https://www.lesswrong.com/tag/infra-bayesianism">Infra-Bayesianism</a>, Vanchurin&#8217;s <a href="https://arxiv.org/abs/2004.09280">theory of machine learning</a> (2021), <a 
href="https://scholar.google.com/citations?hl=en&amp;user=D37XAbIAAAAJ&amp;view_op=list_works&amp;sortby=pubdate">James Crutchfield</a>'s "thermodynamic ML" (or, more generally, Bahri et al.&#8217;s review of <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-conmatphys-031119-050745">statistical mechanics of deep learning</a> (2022)), <a href="https://chrisfieldsresearch.com/">Chris Fields</a>' quantum information theory, <a href="https://www.lesswrong.com/tag/singular-learning-theory">singular learning theory</a>. (If you know more general frameworks like these, please post in the comments!)</p><p>On the engineering (but also research) side, see Doyle's <a href="https://arxiv.org/abs/1904.01634">system-level synthesis</a>, DeepMind's <a href="https://causalincentives.com/">causal incentives working group</a>, the <a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency">Gaia Network</a> agenda, and <a href="https://arxiv.org/abs/2108.05318">compositional game theory</a>. 
(If you know more agendas in this vein, please post in the comments!)</p><h2>Implications for AI policy and governance</h2><p>The view that AGI will emerge from a rapidly evolving ecosystem of heterogeneous building blocks and specialised components makes me think that <strong>"intelligence containment", especially through </strong><em><strong>compute</strong></em><strong> governance, will be very short-lived</strong>.</p><p>Then, if we assume that the "<a href="https://en.wikipedia.org/wiki/G_factor_(psychometrics)">G factor</a>" containment is probably futile, AI policy and governance folks should perhaps start paying more attention to the <strong>governance of </strong><em><strong>competence</strong></em><strong> through the control of the access to the </strong><em><strong>training data</strong></em><strong>.</strong> This is what I proposed in "<a href="https://www.lesswrong.com/posts/CqYaazaG6EkovspMT/open-agency-model-can-solve-the-ai-regulation-dilemma">Open Agency model can solve the AI regulation dilemma</a>".</p><p>In the <a href="https://www.lesswrong.com/posts/AKBkDNeFLZxaMqjQG/gaia-network-a-practical-incremental-pathway-to-open-agency">Gaia Network proposal</a>, this governance is supposed to happen at the arrow from "Gaia Network" to "Decision Engines" that is labelled "Data and models (for simulations and training)" (note that "Decision Engines" are exactly the "AGI-complete" parts of this architecture, <em>not</em> the Gaia agents):</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xt5F!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Xt5F!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png 424w, https://substackcdn.com/image/fetch/$s_!Xt5F!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png 848w, https://substackcdn.com/image/fetch/$s_!Xt5F!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png 1272w, https://substackcdn.com/image/fetch/$s_!Xt5F!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xt5F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png" width="1456" height="1525" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1525,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145755,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!Xt5F!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png 424w, https://substackcdn.com/image/fetch/$s_!Xt5F!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png 848w, https://substackcdn.com/image/fetch/$s_!Xt5F!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png 1272w, https://substackcdn.com/image/fetch/$s_!Xt5F!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1edf3eda-496c-4c2c-b22c-ef70a1c10791_1668x1747.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p></p><p>However, we haven't yet thought about a concrete governance mechanism for this and welcome collaborators to discuss it.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I've proposed one way to combine Transformer and Selective SSM in <a href="https://www.lesswrong.com/posts/YN7PHizHnxinLsKvy/sociallm-proposal-for-a-language-model-design-for?commentId=pzLEgKhPCwm6gKDYR">SociaLLM</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Anatoly also connects this trend towards AI component and tool diversification with the "<a href="https://quality-diversity.github.io/papers">Quality Diversity</a>" agenda that looks at this component and architecture diversity as intrinsically advantageous even for capabilities.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p>"AGI" is taken here from <a href="https://openai.com/charter">OpenAI's charter</a>: "highly autonomous systems that outperform humans at most economically valuable work". 
This is an important qualification: if we were to create an AI that should outperform all biological intelligence in all of its tasks in diverse <em>problem spaces</em> (such as protein folding, genetic expression, organismic morphology, immunity, etc.), much more component diversity would be needed than I conjecture below.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-4" href="#footnote-anchor-4" class="footnote-number" contenteditable="false" target="_self">4</a><div class="footnote-content"><p>Here, it's important to distinguish the block architecture from the training objective. Transformers are not obliged to be trained solely as auto-regressive next-token predictors; they can also be the workhorses of <a href="https://milayb.notion.site/The-GFlowNet-Tutorial-95434ef0e2d94c24aab90e69b30be9b3">GFlowNets</a> that have different training objectives.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-5" href="#footnote-anchor-5" class="footnote-number" contenteditable="false" target="_self">5</a><div class="footnote-content"><p>Additionally, it's sometimes assumed that this invention and the AI landscape overhaul will happen during the recursive self-improvement, a.k.a. autonomous takeoff, phase.</p><p></p></div></div>]]></content:encoded></item></channel></rss>