Security is baked into the architecture. The solution is containerized and deployed within your own Virtual Private Cloud (VPC). This means your sensitive documents never leave your environment to be indexed by, or used to train, a public model. Only the extracted metadata and text chunks are stored in a local Postgres database under your control.
With the recent acquisition of Deasy Labs, Collibra is now offering a new module for unstructured data governance. Here’s all you need to know about Collibra Unstructured AI.
Unstructured data has long been the “dark matter” of the enterprise – present everywhere, yet notoriously difficult to track or control. For years, organizations have struggled with sprawling SharePoint folders and dumping grounds of PDFs, Word docs, screenshots, and call transcripts. And it doesn’t look like that’s going to change any time soon.
Which means – as an organization, you need to have a process in place to account for unstructured data. Especially when it also feeds AI models. As generative AI moves from experimental labs to production environments, managing this kind of data is an absolute must.
Collibra’s acquisition of Deasy Labs and the launch of its Unstructured AI module offer a way to do it, helping organizations handle the enormous amount of corporate data that traditionally lives outside of databases.
Key takeaways
- Manual data discovery is the biggest bottleneck in AI projects. Collibra Unstructured AI automates the work of filtering and tagging, turning weeks or even months of manual labor into hours of automated ingestion.
- Unlike traditional catalogs that only look at file names, the system uses LLMs to read documents, extracting specific tags, PII, and custom taxonomies at the chunk level.
- The output is a data slice – a curated, governed set of data ready to be pushed directly into vector databases for RAG applications.
- The solution is containerized and deployed within the customer’s private VPC, ensuring sensitive data never leaves the corporate perimeter while being processed.
- The roadmap points toward a single layer where structured and unstructured data quality are managed under the same Collibra standards.
The problem with unstructured data
Unstructured data constitutes 70% to 90% of data in organizations today, according to Gartner. Until now, it has been virtually impossible to catalog metadata from sources like emails, PDFs, and call transcripts using standard connectors. The only option was to do it manually, which does not scale – especially for AI programs.
Modern AI projects are frequently delayed by four to six weeks simply because teams are manually sifting through dumping grounds of old files to find relevant context for a RAG system. And that’s just the beginning.
Most unstructured data can’t be used because of low quality, poor discoverability, and a lack of contextual metadata. The files are just objects you know nothing about – not their structure, quality, or importance, nor even where to find them.
And there are two, or maybe even three sides to this risk:
- Compliance exposure – if you don’t know which files contain PII or sensitive legal terms, you can’t safely feed them into an LLM.
- “Garbage in, garbage out”, aka stale or irrelevant data leading to untrustworthy chatbots. If a customer service bot relies on a two-year-old policy PDF because it was the easiest to find, the entire AI initiative loses credibility.
- Lost revenue – when an AI initiative loses credibility and teams spend weeks digging through unstructured data, the company also misses revenue opportunities.
The solution, aka how to structure unstructured data
Tools for unstructured data discovery and classification have been around for a while, but scale has been the problem.
Or as Reece Griffiths, the CEO of Deasy Labs, puts it:
“Our customers don’t have millions of files; they have billions of files. Everything we’re doing here has to scale to typically the billions of files or petabyte scale. Otherwise, those methods just break down immediately without having huge costs.”
Of course, there’s a reason we mention Deasy Labs, as the acquisition of this “context engine for unstructured data” is what enabled the creation of the latest Collibra module: Unstructured AI.
Here’s what Felix Van de Maele, Collibra Founder and CEO, said about the Deasy Labs acquisition:
“As organizations scale their use of AI, the ability to unlock the value of unstructured data becomes critical. Deasy Labs gives us the ability to tag, filter, and enrich this dark data at scale, automatically turning unstructured files into structured, meaningful and trusted data assets ready for AI.”
And that’s exactly what Collibra Unstructured AI does. So, let’s take a closer look.

Deasy Labs working inside Collibra to extract metadata from unstructured data. Source: Collibra
How Collibra Unstructured AI works
Collibra Unstructured AI moves beyond extrinsic metadata (like file location, size, or author) and focuses on content-derived metadata (aka intrinsic metadata, e.g., document type, summary, business descriptions, or the presence of sensitive data).
And it follows a three-step pipeline:
1. Access and ingest
The system connects to sources like SharePoint, S3, or ADLS Gen2. Using OCR technology, it breaks files down into manageable chunks. It can even handle complex formats like multi-tab XMLs or tables buried within deep PDF structures.
2. Analyze (= the LLM magic)
This is where the transformation happens. Using a mix of LLM-based classification, pattern-based tagging (RegEx), and custom taxonomies, the system generates metadata at the file, page, and chunk level.
To save costs and improve accuracy, the system uses if-then logic. For example: “If this is identified as a financial report, only then look for specific financial policy IDs.”
3. Action: Creating knowledge products
The output is what Collibra calls a data slice or knowledge product – a curated set of structured metadata and chunks ready to be exported to a vector database, an SDK for data scientists, or back into the original SharePoint source to enrich the files themselves.
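The if-then logic from the Analyze step can be sketched in a few lines. This is only an illustration under stated assumptions: `classify_document`, `tag_document`, `POLICY_ID_PATTERN`, and the `FP-1234` ID format are hypothetical names, not part of Collibra's product or SDK, and a keyword check stands in for the LLM classifier so the example runs without any external service.

```python
import re

# Hypothetical policy ID format (e.g. "FP-1234"); an assumption for illustration only.
POLICY_ID_PATTERN = re.compile(r"\bFP-\d{4}\b")


def classify_document(text: str) -> str:
    """Stand-in for an LLM-based classifier: a simple keyword check."""
    lowered = text.lower()
    if "balance sheet" in lowered or "quarterly revenue" in lowered:
        return "financial_report"
    return "other"


def tag_document(text: str) -> dict:
    """Classify first, then run the dependent tag only if it applies."""
    tags = {"document_type": classify_document(text)}
    # The policy ID extraction runs only for financial reports,
    # so other documents incur no extra processing.
    if tags["document_type"] == "financial_report":
        tags["policy_ids"] = sorted(set(POLICY_ID_PATTERN.findall(text)))
    return tags


report = "Quarterly revenue summary. See policies FP-1001 and FP-2040, then FP-1001 again."
memo = "Team lunch is on Friday. Reference FP-9999 is mentioned in passing."

print(tag_document(report))
print(tag_document(memo))
```

The memo contains something that looks like a policy ID, but because it is not classified as a financial report, the dependent check never runs. That is the cost-saving idea: the more expensive or more specific step is gated behind the cheaper, broader one.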
Here’s what the process looks like in detail:

How Unstructured AI processes unstructured data. Source: Collibra
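To make the idea of a data slice more concrete, here is a rough sketch that filters chunk-level metadata records down to a governed subset. The record fields (`doc_type`, `contains_pii`, `last_modified`) and the `build_slice` helper are assumptions made for illustration, not Collibra's actual schema or API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Chunk:
    """Hypothetical chunk record: text plus content-derived metadata."""
    source: str
    text: str
    doc_type: str
    contains_pii: bool
    last_modified: date


def build_slice(chunks, doc_type, modified_since):
    """Keep chunks of the requested type that are PII-free and recent enough."""
    return [
        c for c in chunks
        if c.doc_type == doc_type
        and not c.contains_pii
        and c.last_modified >= modified_since
    ]


chunks = [
    Chunk("policies/returns_2025.pdf", "Returns are accepted within 30 days.",
          "policy", False, date(2025, 3, 1)),
    Chunk("policies/returns_2022.pdf", "Returns are accepted within 14 days.",
          "policy", False, date(2022, 6, 15)),
    Chunk("hr/contract_jane.docx", "Employee home address on file.",
          "contract", True, date(2025, 1, 10)),
]

policy_slice = build_slice(chunks, doc_type="policy", modified_since=date(2024, 1, 1))
for chunk in policy_slice:
    print(chunk.source)
```

Only the current returns policy survives the filter: the outdated policy is dropped by the date condition and the contract by the type and PII conditions. In a real pipeline, the resulting slice would then be exported to a vector database or handed to data scientists.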
Key use cases for Collibra Unstructured AI
Below are five primary use cases where Collibra Unstructured AI delivers immediate ROI:
- AI input governance
Automate the validation and enrichment of unstructured data used in AI pipelines for responsible AI adoption and reduced regulatory risk. This also helps enforce strict metadata standards across the entire AI lifecycle.
- RAG and generative AI optimization
Use semantic metadata to improve retrieval-augmented generation (RAG) systems. By adding contextual tags, you enhance retrieval accuracy through smarter routing strategies and embedding enhancements.
- Enterprise search
Build structured datasets from unstructured files to support high-accuracy search applications. This prevents knowledge search degradation that typically happens when scaling to thousands or millions of documents.
- Data product curation
Curate data slices by leveraging metadata to filter millions of documents. Teams can then identify the most relevant subset for specific business or data science use cases in hours rather than months.
- Compliance and risk mitigation
Classify and tag unstructured content at scale to proactively detect sensitive, PII, or non-compliant content before it ever enters an AI workflow or a public-facing chatbot.
What’s more, according to Collibra, the business impact of the feature includes:
- A 3%-20% increase in total enterprise data usable for AI with a semantic tagging layer
- Time used to tag, organize, and filter a corpus of 10,000 files reduced from a month to 20 minutes
- 78%-92% increase in search accuracy when adding semantic metadata into enterprise search systems
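The pattern-based side of the compliance use case above can be illustrated with a short sketch. The two regular expressions are deliberately simplified, and the `detect_pii` and `split_by_pii` helpers are hypothetical; production-grade PII detection needs far broader coverage and, as described in this article, is combined with LLM-based classification.

```python
import re

# Simplified patterns for illustration; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def detect_pii(text):
    """Return the sorted names of all PII patterns found in the text."""
    return sorted(name for name, pattern in PII_PATTERNS.items() if pattern.search(text))


def split_by_pii(chunks):
    """Separate chunks into those safe to pass on and those needing review."""
    safe, flagged = [], []
    for chunk in chunks:
        (flagged if detect_pii(chunk) else safe).append(chunk)
    return safe, flagged


chunks = [
    "Refunds are processed within five business days.",
    "Customer SSN on file: 123-45-6789.",
    "Escalations go to jane.doe@example.com for approval.",
]
safe, flagged = split_by_pii(chunks)
print(len(safe), "safe;", len(flagged), "flagged for review")
```

Running the check before chunks reach an AI workflow means flagged content can be reviewed or excluded, rather than discovered after a chatbot has already surfaced it.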
The advantage of Collibra Unstructured AI
Traditionally, unstructured data discovery, classification, and cataloging in Collibra have been done with the help of Ohalo Data X-Ray. You can read more about this in detail in our article series on unstructured data:
Unstructured data discovery with Collibra and Ohalo Data X-Ray
Unstructured data classification – how to do it at scale
Unstructured data cataloging for AI and compliance
But while traditional tools often rely heavily on pattern-based tagging for compliance and cleanup, Collibra’s focus is on AI-readiness.
Collibra acts as a bridge, providing an SDK and direct vector database integration so developers can pull trusted, timely, and relevant data directly into their AI pipelines.
A standout feature of Collibra’s approach is its deployment model. The system is fully containerized and sits within the customer’s private cloud (VPC).
This way, data stays local, and the files themselves do not move. Only the extracted text and metadata are stored in a local database within the customer’s environment.
Of course, processing billions of files requires surgical precision. Collibra optimizes token usage by only sending relevant chunks to the LLM. If a 200-page document only has two pages relevant to a specific tag, only those two pages are processed, drastically reducing API costs.
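The cost effect is easy to see with a small sketch. It assumes a cheap keyword pre-filter decides which pages are relevant to a tag; the `select_relevant_pages` helper and the keyword approach are illustrative assumptions, since the source does not describe how Collibra actually selects chunks.

```python
def select_relevant_pages(pages, keywords):
    """Return (page_number, text) pairs whose text mentions any keyword."""
    lowered_keywords = [k.lower() for k in keywords]
    return [
        (number, text)
        for number, text in enumerate(pages, start=1)
        if any(k in text.lower() for k in lowered_keywords)
    ]


# A 200-page document in which only pages 17 and 142 discuss indemnification.
pages = ["General terms and conditions."] * 200
pages[16] = "Indemnification: the supplier shall hold the customer harmless."
pages[141] = "Limits on indemnification are set out in Schedule B."

relevant = select_relevant_pages(pages, keywords=["indemnification"])
share = len(relevant) / len(pages)

print([number for number, _ in relevant])
print(f"{share:.0%} of pages sent to the LLM")
```

With two relevant pages out of 200, only 1% of the document would be sent for LLM processing. Multiplied across billions of files, that kind of pre-selection is what keeps token spend manageable.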
Here’s a full list of benefits:
- Collibra Unstructured AI provides a single platform to govern both structured and unstructured data (like PDFs, emails, and transcripts), eliminating silos and providing a single source of truth.
- The system automatically turns dark data into structured, meaningful, and trusted assets by tagging, filtering, and enriching files at scale.
- It creates a high-quality knowledge base specifically designed for GenAI assistants, semantic search, and RAG pipelines.
- It replaces manual tagging and taxonomy work with AI-powered workflows that generate business-relevant metadata and model suggestions.
- By automating classification and enrichment, it eliminates the need for expensive manual labeling and custom AI pipelines, accelerating time to insight.
- It enables high-performance enterprise search across massive volumes of data by adding contextual metadata to unstructured content.
- The platform automatically identifies and filters sensitive information, so that AI models only access governed and compliant data.
- It prevents knowledge decay by continuously updating metadata as files change, so that AI responses remain accurate and reliable over time.
- It allows teams to rapidly scan file repositories to identify relevant content using semantic tagging, which speeds up the delivery of new AI use cases.
And here’s a quick comparison between Collibra’s new tool and traditional tools used for unstructured data discovery:
Collibra Unstructured AI vs. Traditional discovery tools
| Feature | Traditional Discovery (e.g., Ohalo, BigID) | Collibra Unstructured AI |
| --- | --- | --- |
| Primary Goal | Risk reduction & PII detection. | AI-readiness & knowledge creation. |
| Analysis Method | Pattern-matching (RegEx) & Basic NLP. | AI-native classification: LLM-native tagging for richer, customizable metadata beyond RegEx. |
| Platform Integration | Compliance silos. | Unified platform: Unites structured and unstructured data across catalog, marketplace, and quality (from Q1 2026). |
| Trust Mechanism | Automated “black box” scans. | Human-in-the-loop: Supports review and fine-tuning for accurate, auditable, and traceable metadata. |
| Consistency | Ad-hoc or siloed classification. | Governed metadata: Reusable taxonomies with automated workflows for continual scanning. |
What’s next for unstructured data?
The 2026 roadmap for Collibra Unstructured AI includes a unified experience where structured and unstructured data live in a single Collibra catalog, governed by the same standards and workflows.
And the next frontier for this technology is unstructured data quality. Just as we measure the health of an SQL table, we will soon be able to generate quality scores for SharePoint sites based on:
- Freshness: Is this the latest version of the contract?
- Duplication: How many copies of “Draft_v2” are cluttering the system?
- Completeness: Are the required tags and classification levels present?
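As a rough illustration of how such scores could be computed, consider the sketch below. The field names, the one-year freshness window, the required tags, and the equal weighting are all assumptions, since the source does not describe Collibra's scoring method.

```python
from datetime import date

REQUIRED_TAGS = {"document_type", "classification_level"}  # assumed required metadata


def quality_score(files, today, max_age_days=365):
    """Average three 0-1 ratios: freshness, uniqueness, and tag completeness."""
    if not files:
        raise ValueError("at least one file is required")
    total = len(files)
    fresh = sum((today - f["last_modified"]).days <= max_age_days for f in files)
    unique = len({f["content_hash"] for f in files})
    complete = sum(REQUIRED_TAGS.issubset(f["tags"]) for f in files)
    scores = {
        "freshness": fresh / total,
        "uniqueness": unique / total,  # 1.0 means no duplicate content
        "completeness": complete / total,
    }
    scores["overall"] = sum(scores.values()) / 3
    return scores


files = [
    {"name": "contract_final.pdf", "last_modified": date(2025, 11, 1), "content_hash": "a1",
     "tags": {"document_type": "contract", "classification_level": "confidential"}},
    {"name": "Draft_v2.docx", "last_modified": date(2023, 2, 1), "content_hash": "b2",
     "tags": {"document_type": "contract"}},
    {"name": "Draft_v2 (copy).docx", "last_modified": date(2023, 2, 3), "content_hash": "b2",
     "tags": {}},
    {"name": "pricing.xlsx", "last_modified": date(2025, 9, 15), "content_hash": "c3",
     "tags": {"document_type": "price_list", "classification_level": "internal"}},
]

scores = quality_score(files, today=date(2026, 1, 1))
print(scores)
```

In this toy site, half the files are fresh, one of four is a duplicate, and half carry the required tags. A real implementation would likely weight the dimensions differently per use case, but the principle of turning each question into a measurable ratio stays the same.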
From 6 weeks of searching to production-ready RAG in a fraction of the time
Collibra Unstructured AI helps turn a liability (unmanaged, unstructured data) into an asset: a knowledge product. And as we move toward a future where AI is only as good as the data it consumes, Collibra is providing the infrastructure organizations need to trust the data they discover.
If you want to talk about managing unstructured data at your organization, our Collibra experts are happy to help.
Unstructured AI FAQs
Can Collibra Unstructured AI handle scanned documents and images?
Yes. The platform uses advanced OCR to extract text from images and scanned documents. It is specifically designed to handle messy data, such as complex tables in PDFs or multi-tab XML files, ensuring that the context of the data is preserved even when the format is difficult.
Does Collibra Unstructured AI replace vector databases?
Not at all – it feeds them. Think of Collibra Unstructured AI as the “ETL for RAG.” It handles the discovery, cleaning, and chunking of your data, then provides an SDK or direct integration to push those high-quality, governed knowledge products into your vector database (like Pinecone, Milvus, or Weaviate).
How is Collibra Unstructured AI different from traditional discovery tools?
While traditional tools are great at finding patterns (like a Social Security number) for compliance cleanup, Collibra is focused on AI-readiness. It uses LLMs to understand the meaning of the content. This allows you to create complex taxonomies and quality scores, making sure that your AI is compliant, accurate, and helpful.
