Introduction
In early 2026, while most companies still relied on data analysts to manually write SQL queries, OpenAI revealed an internal data analysis agent capable of reasoning independently and improving itself over time, cutting data query turnaround from days to minutes.
The Challenge of Data Queries
Data teams often struggle not because of insufficient computing power, but because of the sheer number of tables, inconsistent definitions, and scattered institutional knowledge. For instance, the term “active users” can mean completely different things in different tables. Even when the right table is selected, producing a result may require hundreds of lines of SQL, and a single incorrect join condition can invalidate the entire effort.
Internally, OpenAI has taken a radical step: using a Codex-driven data agent to manage the entire process of “finding tables, understanding tables, writing SQL, and validating results” through a six-layer contextual architecture. This approach enriches data semantics, integrates organizational knowledge, and consolidates experiential memory, allowing engineers to ask questions instead of performing manual tasks.


Automating Data Queries
“We have many structurally similar tables, and I spend a lot of time trying to understand their differences and which one to use,” lamented an OpenAI engineer, capturing the common plight of data workers. OpenAI’s internal data platform contains 600PB of data across 70,000 datasets. Imagine an engineer who needs to analyze ChatGPT user growth and faces dozens of similar user tables, each claiming to record “user activity” but defining it differently.
Choosing the wrong table can mean days of effort wasted, and worse, it could lead to critical decisions based on incorrect data.
Even when the correct table is chosen, producing accurate results can be challenging. A complex SQL statement of more than 180 lines can feel insurmountable; any minor error can render the entire analysis worthless.

With the Codex-driven intelligent agent, engineers no longer need to write hundreds of lines of SQL; they can simply ask questions and pull what they need from the data ocean, such as comparing active user counts at two different points in time.


Six-Layer Contextual Architecture
Many tools exist to convert natural language into SQL statements, but the core innovation of OpenAI’s internal data agent lies in its multi-layer contextual architecture.

The foundational layer consists of basic metadata, including table structures and column types, providing the skeleton for the data graph.
The next layer consists of human annotations crafted by domain experts, capturing intent, semantics, business meaning, and known caveats that cannot easily be inferred from patterns or historical queries. In effect, this layer gives the agent a primer on each table.

The subsequent Codex enhancement layer derives code-level definitions of tables, allowing the agent to gain deeper insights into the actual content of the data. This layer offers critical information about value uniqueness, data update frequency, and data range. Its introduction enables the agent to understand differences in table construction and updates.
Above this is the organizational knowledge layer, where the agent can access Slack, Google Docs, and Notion to obtain key company background information, such as product releases, reliability incidents, internal codenames, and definitions and calculation logic for key metrics.

With this externally derived background, the agent avoids obvious misreadings. For example, when a user asks, “Why did connector usage drop significantly in December?” the agent does not simply report the decline; it identifies the drop as primarily a measurement and logging issue tied to changes in data collection from the ChatGPT 5.1 release, rather than a real collapse in usage.
The fifth and most critical layer is learning and evolution, which gives the agent persistent memory. When it receives corrections from users or notices subtle discrepancies in the data, it retains those experiences for future use. Memories can also be created and edited manually by users, and applied globally or scoped to a specific user.

The top layer, runtime context, lets the agent issue live queries to inspect tables when the existing context is insufficient. It can also communicate with other data platform systems (metadata services, Airflow, Spark) to gather broader context.
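The post does not describe the agent's internals, but the layered design can be pictured as context assembled per table before the model sees a question. The sketch below is purely illustrative; every class name and field is a hypothetical stand-in for the six layers described above.

```python
from dataclasses import dataclass, field

@dataclass
class TableContext:
    """Hypothetical container for the six context layers of one table."""
    name: str
    metadata: dict                # layer 1: schema, column types
    annotations: list[str]        # layer 2: expert notes on intent and caveats
    code_insights: list[str]      # layer 3: Codex-derived facts (freshness, ranges)
    org_knowledge: list[str]      # layer 4: launches, incidents, metric definitions
    memories: list[str] = field(default_factory=list)  # layer 5: learned corrections

def build_prompt_context(table: TableContext, live_sample: str) -> str:
    """Flatten the layers into one text block for the model.

    `live_sample` stands in for layer 6 (runtime context), e.g. the result
    of a quick `SELECT ... LIMIT 10` issued at question time.
    """
    sections = [
        f"# Table: {table.name}",
        f"Schema: {table.metadata}",
        "Expert annotations:\n" + "\n".join(table.annotations),
        "Derived from code:\n" + "\n".join(table.code_insights),
        "Org knowledge:\n" + "\n".join(table.org_knowledge),
        "Learned memories:\n" + "\n".join(table.memories),
        "Live sample:\n" + live_sample,
    ]
    return "\n\n".join(sections)
```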

Dynamic Switching Between Offline Retrieval and Online Queries
How do these six layers work together?
The process divides into offline and online stages. Each morning, the agent systematically scans the previous day's actual usage and query patterns across thousands of tables, absorbs the annotations and insights left by data experts, and invokes Codex to interpret the logic buried in code, deriving the richer business semantics behind each table. All of these scattered knowledge fragments are merged into a unified, standardized knowledge graph.
This information is then transformed and compressed by an OpenAI embedding model into vector embeddings stored in a high-speed retrieval index, establishing a readily available "data memory palace" for the agent.
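A minimal sketch of that offline embedding step might look like the following, using the OpenAI Python SDK. The choice of `text-embedding-3-large` and the in-memory result dictionary are assumptions; the post does not name the model or the retrieval store used internally.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_table_docs(table_docs: dict[str, str]) -> dict[str, np.ndarray]:
    """Embed one consolidated description per table.

    `table_docs` maps table name -> the merged text produced by the nightly
    job (schema + annotations + Codex-derived semantics). The returned
    vectors would normally go into a vector index rather than a dict.
    """
    names = list(table_docs)
    resp = client.embeddings.create(
        model="text-embedding-3-large",          # assumed model choice
        input=[table_docs[n] for n in names],
    )
    return {n: np.array(d.embedding) for n, d in zip(names, resp.data)}
```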

When a user’s question arrives, the agent no longer needs to dive into the vast sea of metadata for time-consuming manual retrieval. Instead, it employs retrieval-augmented generation techniques to precisely locate and extract the most relevant data tables for the current question. This process is fast, scalable, and has low latency.
For requests requiring the latest data, the agent simultaneously activates a real-time query channel, directly querying the data warehouse. This achieves both the immediacy of runtime context and deep integration with offline knowledge. Consequently, a complex business question can be transformed into clear insights available in seconds through the collaboration of offline memory’s “lightning retrieval” and real-time data’s “precise guidance.”
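At question time, the flow described above amounts to a similarity search over the offline index plus an optional live query for freshness. The orchestration below is a simplified sketch under those assumptions; `embed` and `run_sql` are injected placeholder helpers, not a published API.

```python
import numpy as np

def top_k_tables(question_vec: np.ndarray,
                 index: dict[str, np.ndarray],
                 k: int = 5) -> list[str]:
    """Return the k tables whose offline embeddings best match the question."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted(index, key=lambda name: cosine(question_vec, index[name]),
                    reverse=True)
    return ranked[:k]

def answer(question: str, index, embed, run_sql, needs_fresh_data: bool):
    """Hypothetical orchestration: offline retrieval first, live query if needed."""
    candidates = top_k_tables(embed(question), index)
    live_rows = None
    if needs_fresh_data:
        # Runtime context: peek at the most relevant table in the warehouse.
        live_rows = run_sql(f"SELECT * FROM {candidates[0]} LIMIT 10")
    return candidates, live_rows
```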

Paradigm Shift from Static Tools to Dynamic Team Members
What is most impressive about this intelligent agent is not its technical complexity, but how it integrates into daily workflows, becoming a true “teammate.” Unlike traditional “question-and-answer” tools, OpenAI’s data analysis agent is designed to be a “teammate with whom one can reason.” It is conversational, always online, capable of handling quick answers as well as iterative exploration.
Imagine a scenario where a product manager's question is unclear or incomplete; the agent proactively asks clarifying questions. If there is no response, it applies reasonable defaults to keep the work moving. For example, if a user asks about business growth without specifying a date range, it might assume the last seven or thirty days. This lets the agent balance making progress on its own against collaborating with the user toward a more accurate result.
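That "ask first, then fall back to a default" behavior is easy to picture in code. The sketch below is illustrative only; the 30-day fallback is an assumption based on the example in the text, not a documented setting.

```python
from datetime import date, timedelta

def resolve_date_range(user_range: tuple[date, date] | None,
                       clarification: tuple[date, date] | None) -> tuple[date, date]:
    """Pick a date range the way the post describes: ask, then default.

    Use an explicit range if the user gave one, otherwise the answer to a
    clarifying question, otherwise fall back to the last 30 days.
    """
    if user_range:
        return user_range
    if clarification:
        return clarification
    today = date.today()
    return today - timedelta(days=30), today
```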
To prevent the ever-evolving agent from going off track during its learning process, the OpenAI team employs the Evals API to provide a strict overseer for the agent. Each significant question is paired with manually crafted queries serving as “gold standards,” and the agent’s performance is continuously monitored and rated.

These evaluations check not only the correctness of SQL syntax but also compare the accuracy of result data. When the agent “misbehaves,” the system immediately raises an alert, ensuring issues are identified and resolved before impacting users.
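The post pairs important questions with hand-written "gold standard" queries and grades the agent on both SQL validity and result accuracy. A minimal local version of such a check might look like this; it is a simplification (single numeric result, relative-error threshold), not the actual Evals API integration, and `run_sql` is a placeholder warehouse client.

```python
def grade_answer(run_sql, agent_sql: str, gold_sql: str,
                 tolerance: float = 0.01) -> dict:
    """Compare the agent's query against a hand-written gold query.

    Checks (1) that the agent's SQL executes at all and (2) that its numeric
    result is within `tolerance` relative error of the gold result.
    """
    try:
        agent_value = run_sql(agent_sql)
    except Exception as err:
        return {"passed": False, "reason": f"SQL failed: {err}"}
    gold_value = run_sql(gold_sql)
    relative_error = abs(agent_value - gold_value) / max(abs(gold_value), 1e-9)
    return {"passed": relative_error <= tolerance, "relative_error": relative_error}
```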
In terms of data security, the agent ensures that users can only query tables they have permission to access. When access rights are missing, it marks this point or falls back to alternative datasets that the user is authorized to use.
To ensure transparency in the data analysis process, the agent summarizes assumptions and execution steps alongside each answer to expose its reasoning process. When a query is executed, it directly links to the underlying results, allowing users to check the original data and verify each step of the analysis.

Building a Data Analysis Agent
OpenAI’s data analysis agent is not open-source, but if you want to build a similar agent, OpenAI’s engineers have shared some pitfalls they encountered.

Initially, the agent had access to every dataset, which quickly led to confusion among overlapping tables. Restricting which tables the agent could query reduced ambiguity and made its results markedly more reliable.
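One practical form of that restriction is an explicit allowlist checked before any query runs. This is a sketch of the idea under that assumption, with hypothetical table names; the post does not say how OpenAI enforces the restriction.

```python
# Hypothetical curated set of tables the agent is allowed to touch.
ALLOWED_TABLES = {"analytics.daily_active_users", "analytics.signups"}

def check_tables(referenced_tables: set[str]) -> None:
    """Refuse to run a query that touches tables outside the curated set."""
    blocked = referenced_tables - ALLOWED_TABLES
    if blocked:
        raise PermissionError(
            f"Query references non-allowlisted tables: {sorted(blocked)}")
```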
Another pitfall arose from highly structured system prompts provided by developers. While many questions share similar analytical shapes, the details vary enough that rigid instructions can backfire. Focusing on the effects in real usage and allowing the agent to determine how to achieve results rather than relying on system-level prompts makes the agent more robust and produces better outcomes.
The most important realization is that the true meaning of data lives in code rather than in expert annotations of tables. Compared with query histories, the code that constructs a dataset describes its shape and intended usage more accurately, capturing assumptions and business intent that never surface in the SQL or the metadata. By using Codex to crawl the codebase, the agent can understand how datasets are actually constructed and better infer what each table really contains. This yields more accurate answers to questions like "What is in this table?" and "When can I use it?" than simply retrieving information from the data warehouse.
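A crude first approximation of "crawling the codebase" is simply mapping each table to the source files that appear to build it, then handing that code to the model to summarize. The regex scan below is a simplified stand-in for that mapping step, not how OpenAI's system works.

```python
import re
from pathlib import Path

CREATE_RE = re.compile(r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+([\w.]+)", re.IGNORECASE)

def map_tables_to_code(repo_root: str) -> dict[str, list[str]]:
    """Map each table name to the SQL files that appear to construct it.

    A real system would parse ETL DAGs and let the model read the pipeline
    code; this keyword scan only shows where such a mapping could come from.
    """
    mapping: dict[str, list[str]] = {}
    for path in Path(repo_root).rglob("*.sql"):
        for table in CREATE_RE.findall(path.read_text(errors="ignore")):
            mapping.setdefault(table, []).append(str(path))
    return mapping
```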
As enterprise data environments become increasingly complex, tools like OpenAI’s data agent may become standard configurations for future enterprise data analysis, driving the industry towards a more efficient and intelligent data-driven decision-making paradigm.
The goal of these agents is not to replace data analysts but to enhance their capabilities, freeing them from tedious query writing and debugging to focus on higher-level tasks such as defining metrics, validating hypotheses, and making data-driven decisions.