In the new world of agentic AI, the discussion has revolved around data: governance, storage, and compute. But what about metadata — the data about data?
Metadata has been a second-class citizen, according to Junping (JP) Du, founder and CEO of Datastrato, a data and AI infrastructure company.
AI is changing how data — and metadata — is consumed, understood, and governed, so Datastrato created Apache Gravitino, an open source project that serves as a high-performance, geo-distributed, federated metadata lake.
The project is designed to be a single, engine-neutral control plane for metadata and governance, tailored to the needs of multimodal, multi-engine AI workloads.
Last year was a big one for Gravitino. In June, it graduated as an Apache Top-Level Project. In December, it delivered its first major stable release, version 1.1.0. At the start of 2026, it joined the brand-new Agentic AI Foundation.
Gravitino, Du says in this episode of The New Stack Makers, is a “catalog of catalogs, because we try to solve the problems of running the data and AI platforms more safely and consistently.”
In the age of AI, Du says, “We need more engine-friendly or agent-friendly metadata and try to unify everything together and [provide] the technical metadata to the engine support as a first-class citizen.”
Gravitino builds a unified data catalog regardless of whether the underlying data is traditional structured data or multimodal.
“We all take [these] kind of data formats, and we allow the multiple engines to access this kind of data, so there’s no data silo anymore,” Du says. “And also it can be easy to consume by AI agents — instead of previously having to [build] everything [in] the data warehouse and consume from there.”
Tackling metadata’s governance problem
Du — who spent about 15 years building data infrastructure for the Apache Hadoop project — and Jerry (Saisai) Shao, co-founder and CTO of Datastrato, leaned on their long experience building cloud data warehouses and lakehouses when creating Gravitino.
As data and AI systems grew in complexity, engineers encountered recurring problems. “The first [problem] is actually data: It’s spread across multiple engines like Spark, Trino, or even some runtimes like Ray, PyTorch.
“And another problem is the metadata … It’s a siloed catalog instead of a unified catalog to know everything. So, that means the governance, access controls, and even the semantics are hard to build in efficient ways.”
Metadata, Du adds, can be duplicated or inconsistent.
AI makes the problem worse, he says, “especially for unstructured data, because it’s hard to manage in a typical way.” In a production environment, especially at enterprise scale, he adds, it’s hard to find a single point of truth to define what data exists, how it can be accessed, and how it can be governed.
Gravitino was designed to solve those issues. It was built with Java, but supports Python clients.
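To make that concrete, here is a minimal sketch of what querying that unified catalog can look like from Python. It talks to Gravitino’s REST API directly rather than through the official client library, and the port, endpoint paths, and metalake name below are assumptions for illustration, not details from the episode.

# A minimal sketch of talking to a Gravitino server over its REST API.
# Assumptions (not from the episode): the server runs locally on the
# default port 8090 and exposes metalake and catalog listings under
# /api/metalakes; adjust the base URL for a real deployment.
import requests

BASE_URL = "http://localhost:8090/api"  # assumed local Gravitino endpoint

def list_metalakes():
    # A metalake is Gravitino's top-level namespace; it groups the catalogs
    # federated from underlying sources.
    resp = requests.get(f"{BASE_URL}/metalakes", timeout=10)
    resp.raise_for_status()
    return resp.json()

def list_catalogs(metalake):
    # The "catalog of catalogs" view: every catalog registered in one
    # metalake, reachable through a single endpoint.
    resp = requests.get(f"{BASE_URL}/metalakes/{metalake}/catalogs", timeout=10)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(list_metalakes())
    # print(list_catalogs("demo_metalake"))  # "demo_metalake" is a hypothetical name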
The use cases for Gravitino include multi-cloud data consolidation, Du says. One of Datastrato’s customers is among the largest internet technology companies in the United States.
“They have tons of data,” he says, including a lot of abstracted data. “The data is distributed on-prem and to public clouds. So their compute resources, especially a GPU resource, are distributed over, you know, several clouds and regions. They want the same data, right? It’s available for all these kinds of clouds and regions, so then they can trigger the training jobs or inference jobs or their applications anywhere.”
Therefore, “A unified data catalog is very critical, right, in this case, to make sure all this data is secure and consistent right across all the locations.”
Check out the full episode to learn more about Gravitino’s use cases, how it fits into the existing commercial and open source tooling landscape, and why the project’s founders decided to donate it to the Agentic AI Foundation.