Unstructured data classification – how to do it at scale

Discovering unstructured data is just the first step in managing it. The next is proper classification – here’s how to do it at scale.

Karolina Fox

Published on: Jan 26, 2026

Key takeaways

Unstructured data classification goes beyond visibility: it bridges the gap between discovery and operational governance.
Classification strengthens data security, boosts data quality, and helps organizations stay compliant by revealing what’s sensitive and where it lives.
AI and machine learning are essential – but only when embedded in a clear governance framework.
Collibra + Ohalo X‑Ray enable scalable classification when aligned with the organization’s operating model.
Getting classification right sets the stage for the final step: building a trustworthy unstructured data catalog (which we’re going to talk about in the next piece).

Unstructured data shows up everywhere in an organization – documents, PDFs, chat messages, screenshots, presentations, emails, social media posts, and files tucked away in forgotten drives. It’s one of the richest sources of insight companies have, yet also one of the most overwhelming to manage.

If you read the previous article on discovering unstructured data, you already know how complex this landscape is. But discovery is only the beginning. Once you know where all this content lives, the next challenge is figuring out what it actually is. So, let’s talk about unstructured data classification.

Understanding unstructured data, and why governance teams lose sleep over it

A quick reminder (especially if you haven’t read the previous piece about unstructured data discovery).

If you’ve ever opened a shared folder and found 12 versions of a document, three screenshots with identical names, and a PDF of a document someone photographed sideways, you’ve seen unstructured data chaos in action.

Unstructured data includes:

text data in documents, PDFs, and emails,
images and scans,
presentations,
chats and messaging threads,
social media posts,
pretty much anything that isn’t sitting in a neat table inside a database.

Unlike structured data, unstructured data has no predefined schema: no neat columns, fields, or labels. It grows quietly across cloud platforms, shared drives, collaboration tools, and customer‑facing applications, often without ownership, policy enforcement, or visibility. Given the number of unstructured files companies tend to have, even basic data management would take years if done manually.

For data governance teams, though, the real problem is uncertainty – and the risk that comes with it:

Sensitive data buried in the wrong places
No clear way to apply retention or access policies
Compliance exposure discovered only during audits or incidents
Manual remediation efforts that don’t scale

Without classification, unstructured data remains a black box – rich in potential value, but extremely high‑risk to operate.

What unstructured data classification actually means in practice

To quote Ohalo’s LinkedIn page:

“Unstructured data isn’t outside governance. It’s just been outside your data catalog.”

And classification is the next step to get it into your data catalog, after it’s been successfully discovered.

Essentially, you want to give your content a structure it didn’t originally have. You’re not changing the files – you’re changing how you understand them.

In governance terms, classification answers four critical questions:

What is this data?
Why does it exist?
How sensitive is it?
Which policies apply to it?

But with unstructured data, the process is more nuanced, and it requires more than keyword matching. A typical, scalable classification process looks something like this:

Discover unstructured data across shared drives, cloud storage, collaboration platforms, and legacy repositories.
Analyze content using AI, natural language processing, and machine learning to understand context and meaning.
Apply classification labels to the metadata aligned to the organization’s data classification policy (e.g., public, internal, confidential, regulated). This is where you’ll also classify sensitive data.
Trigger governance actions such as access controls, retention rules, stewardship workflows, and security controls.
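To make the four steps more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Document` class, the two regex "detectors" (a toy stand-in for the AI and NLP analysis a real platform performs), and the action names are hypothetical and do not come from any specific tool.

```python
import re
from dataclasses import dataclass, field

# Label taxonomy borrowed from the example policy levels in the text.
LABELS = ("public", "internal", "confidential", "regulated")

# Toy detectors (assumption): a real platform would use NLP/ML models instead.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CONFIDENTIAL_HINT = re.compile(r"\b(confidential|salary|contract)\b", re.IGNORECASE)


@dataclass
class Document:
    path: str
    text: str
    metadata: dict = field(default_factory=dict)


def analyze(doc: Document) -> str:
    """Step 2: choose a label based on the content."""
    if EMAIL.search(doc.text):
        return "regulated"       # personal data detected
    if CONFIDENTIAL_HINT.search(doc.text):
        return "confidential"
    return "internal"            # conservative default, never "public"


def apply_label(doc: Document, label: str) -> None:
    """Step 3: record the label in metadata; the file content is not changed."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    doc.metadata["classification"] = label


def governance_actions(doc: Document) -> list[str]:
    """Step 4: derive follow-up actions from the label (hypothetical names)."""
    actions = {
        "regulated": ["restrict_access", "apply_retention_rule", "notify_steward"],
        "confidential": ["restrict_access"],
    }
    return actions.get(doc.metadata["classification"], [])


def classify_all(docs: list[Document]) -> dict[str, list[str]]:
    results = {}
    for doc in docs:
        apply_label(doc, analyze(doc))
        results[doc.path] = governance_actions(doc)
    return results


# Step 1 (discovery) is represented here by a small in-memory list.
docs = [
    Document("hr/offer.txt", "Offer letter for jane.doe@example.com"),
    Document("legal/nda.txt", "CONFIDENTIAL: draft contract terms"),
    Document("misc/lunch.txt", "Team lunch menu for Friday"),
]
actions = classify_all(docs)
```

Note that the label lives in `metadata` and the text is never modified, which mirrors the idea that classification changes how you understand a file, not the file itself.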

When classification is consistent and policy‑aligned, you can:

Identify sensitive information before it becomes a liability
Apply controls proactively instead of reactively
Support audits with evidence, not assumptions
Make unstructured data usable without increasing risk

The role of AI, and where governance discipline is still required

Manually reviewing unstructured data at enterprise scale is simply unrealistic. It would literally take a lifetime (or several, and we’re not even exaggerating). AI and machine learning are what make modern classification possible.

They recognize patterns across huge volumes of data.
They learn from past classification decisions.
They identify sensitive data automatically, even when the wording is not uniform.
They understand the context in text data, thanks to natural language processing.
They reduce errors, improving overall data quality.

AI-driven classification tools can look at a contract, an HR file, an email thread, or a scanned form and understand what’s going on in a way rules-based systems simply can’t.

And because those tools operate at machine speed, they can analyze massive amounts of unstructured data long before a human could make it through a single folder.

There is a but, though.

AI does not replace governance design.

Without clear classification definitions, validation workflows, and accountability, AI outputs will quickly become inconsistent, untrusted, or ignored.
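One common way to keep a human in the loop is to route AI predictions by confidence. The sketch below assumes a model that returns a label and a score between 0 and 1. The `Prediction` class, the `route` function, and the 0.90 threshold are hypothetical; a real threshold would be set by your governance policy and tuned against validated samples.

```python
from dataclasses import dataclass

# Hypothetical threshold (assumption): set by governance policy in practice.
AUTO_ACCEPT_THRESHOLD = 0.90

# Labels defined in the organization's classification policy.
ALLOWED_LABELS = {"public", "internal", "confidential", "regulated"}


@dataclass
class Prediction:
    path: str
    label: str
    confidence: float  # model score in [0, 1]


def route(pred: Prediction) -> str:
    """Decide what happens to a single AI-generated classification."""
    if pred.label not in ALLOWED_LABELS:
        return "reject"            # label is not part of the agreed taxonomy
    if pred.confidence >= AUTO_ACCEPT_THRESHOLD:
        return "auto_accept"       # applied automatically, still auditable
    return "steward_review"        # a data steward validates or corrects it


queue = [
    Prediction("contracts/msa.pdf", "confidential", 0.97),
    Prediction("scans/form_12.png", "regulated", 0.62),
    Prediction("chat/export.json", "top_secret", 0.99),
]
decisions = {p.path: route(p) for p in queue}
```

The third prediction is rejected despite its high score, because a confident answer outside the agreed definitions is exactly the kind of output that erodes trust.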

Classification as the foundation of security and compliance

From a data governance perspective, classification is the control point that enables everything else.

Once you know what data is sensitive, regulated, or business‑critical, you can:

Enforce least‑privilege access
Apply retention and deletion rules
Support GDPR, DORA, and records‑management requirements
Reduce breach impact by limiting exposure
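As a rough illustration of how a label can drive access and retention decisions, consider the sketch below. The role names and retention periods are invented for the example (they are not legal or regulatory guidance), and the `Policy` class and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Policy:
    allowed_roles: frozenset
    retention_days: Optional[int]  # None means no retention limit is defined


# Hypothetical policy table (assumption): roles and periods are illustrative.
POLICIES = {
    "public": Policy(frozenset({"everyone"}), None),
    "internal": Policy(frozenset({"employee", "hr", "compliance"}), 1825),
    "confidential": Policy(frozenset({"hr", "compliance"}), 2555),
    "regulated": Policy(frozenset({"compliance"}), 3650),
}


def can_access(label: str, role: str) -> bool:
    """Least privilege: unknown or missing labels are denied by default."""
    policy = POLICIES.get(label)
    if policy is None:
        return False
    return "everyone" in policy.allowed_roles or role in policy.allowed_roles


def is_due_for_deletion(label: str, age_days: int) -> bool:
    """A file is due for deletion once it outlives its retention period."""
    policy = POLICIES.get(label)
    if policy is None or policy.retention_days is None:
        return False
    return age_days > policy.retention_days
```

Denying access when a label is unknown is a deliberate design choice here: unclassified content is treated as risky until someone has looked at it.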

Classification also exposes data that should not exist at all – duplicates, outdated files, and forgotten repositories that silently increase risk (and storage costs).
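Exact duplicates and stale files are the easiest of these to spot programmatically. The following sketch uses only the Python standard library, with content hashing for duplicates and a last-modified cutoff for stale files. The function names and the sample data are hypothetical, and near-duplicates (for example, the same document saved in two formats) would need a different technique.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timedelta


def find_duplicates(files: dict[str, bytes]) -> list[list[str]]:
    """Group paths whose content is byte-for-byte identical."""
    by_hash = defaultdict(list)
    for path, content in files.items():
        by_hash[hashlib.sha256(content).hexdigest()].append(path)
    return sorted(sorted(paths) for paths in by_hash.values() if len(paths) > 1)


def find_stale(last_modified: dict[str, datetime], now: datetime,
               max_age_days: int) -> list[str]:
    """Return paths that have not been modified within max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(path for path, ts in last_modified.items() if ts < cutoff)


# Sample data (assumption): in practice this comes from the discovery scan.
files = {
    "drive/report_final.pdf": b"quarterly report",
    "drive/report_final_v2.pdf": b"quarterly report",
    "drive/notes.txt": b"meeting notes",
}
modified = {
    "drive/report_final.pdf": datetime(2019, 3, 1),
    "drive/notes.txt": datetime(2025, 11, 20),
}
duplicates = find_duplicates(files)
stale = find_stale(modified, now=datetime(2026, 1, 26), max_age_days=365 * 3)
```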

Collibra + Ohalo X-Ray: a unified solution for unstructured data classification

Classifying unstructured data at scale requires both technical capability and governance context.

When you use an unstructured data discovery and classification platform like Ohalo Data X-Ray in tandem with a data governance tool like Collibra, the process becomes streamlined and efficient while staying connected to the business context.

How Ohalo X-Ray works

Ohalo X-Ray scans unstructured data sources – documents, PDFs, images, text data, and more – and analyzes them using AI, natural language processing, and machine learning algorithms.

Its capabilities include:

identifying sensitive information and even automatically redacting PII in files
detecting patterns and meaning in content
recognizing data types and aspects of data
categorizing content automatically
working across on-prem and cloud-based data sources

How Ohalo Data X-Ray works. Source: Ohalo

Data X-Ray first discovers the data, then classifies it using automatic token analysis at a rate of hundreds of thousands of words per second. It then uses generative AI to contextualize the files.

How Collibra operationalizes the classification results

Once X-Ray applies classification labels, Collibra brings the governance layer:

mapping results to business terms
reviewing and validating classified data
applying governance workflows
improving data quality
managing storage, retention, and stewardship processes

Collibra puts the classification into context, helping people understand not just what the data is, but how it fits into the organization’s data lifecycle. We’ll talk more about that in the final part of the unstructured data series on cataloging unstructured data.

How does that work? Data X-Ray supports native and bespoke connectors and hundreds of file types. It auto-generates metadata to link and maintain the Data Catalog, and it automatically propagates business glossary terms, data asset descriptions, and hierarchies defined within Collibra.

Implementation reality check: where classification initiatives fail

Most unstructured data classification initiatives fail because of execution gaps, not the technology. Once again, it pays to go back to the governance design, because common failure points include:

Classification labels that don’t align with enterprise policies
No ownership for validation and exception handling
Over‑automation without data steward oversight
Disconnected tooling that doesn’t integrate cleanly with Collibra’s metamodel
Inconsistent rollout across business units

For a Data Governance Manager, failures like these translate directly into personal and organizational risk: missed deadlines, audit findings, and low adoption. And as usual, they’re easier to prevent than to fix later.

Practical examples of how organizations classify unstructured data

There are three key use cases for unstructured data classification that come up most often:

Regulatory records and file retention – to understand your sensitive data and make sure your organization complies with relevant regulatory requirements. Unstructured data is particularly notorious for introducing chaos in this area.
System and on-prem-to-cloud migrations – because unstructured data lacks a predefined data model, migrating it successfully is particularly challenging. By uncovering stale data and legacy files before you migrate, you not only cover regulatory compliance needs but also prevent inflated storage costs, and you get the data ready for cataloging.
AI governance – according to Collibra, managing data remains the top challenge for 72% of businesses that want to scale AI. Classifying and managing unstructured data for AI is a particular challenge these days, made harder by the fact that humans and AI agents need different metadata.

A Murdio case study in unstructured data classification

Here’s our own example from one of our recent client projects in the banking industry, where classification was part of a wider unstructured data management project.

Our client had unstructured, critical data hidden across PDFs, emails, and file shares, with severely limited visibility into PII and confidential information. They also used Collibra as a data catalog, but it couldn’t natively scan and classify content inside PDF files.

This is exactly why we implemented the Collibra + Ohalo Data X-Ray integration, enabling automated discovery, classification, and governance of unstructured data:

Ohalo Data X-Ray used OCR (Optical Character Recognition) and AI to scan PDFs, emails, images, and network drives.
Detected entities were mapped to data classes/terms (e.g., PII categories) and synced into the Collibra Data Catalog as findings and technical assets, then mapped to business categories and physical locations.
This provided end-to-end visibility across documents and systems (who/where/what), enabling policy and remediation.
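The mapping step in the middle can be pictured as a simple lookup from detected entity types to governance data classes, aggregated per location. The sketch below is a generic illustration and not the actual integration: the entity type names, the data class names, and the shape of the resulting records are all assumptions, and no real Ohalo or Collibra API is called.

```python
from collections import Counter

# Hypothetical mapping (assumption): entity and data class names are invented.
ENTITY_TO_DATA_CLASS = {
    "EMAIL_ADDRESS": "PII / Contact details",
    "IBAN": "PII / Financial identifiers",
    "PERSON_NAME": "PII / Identity",
}


def build_findings(detections):
    """Turn (file_path, entity_type) pairs into per-location findings.

    Entity types without a mapping are returned separately so that a data
    steward can decide how to classify them, instead of being dropped.
    """
    counts = Counter()
    unmapped = set()
    for path, entity_type in detections:
        data_class = ENTITY_TO_DATA_CLASS.get(entity_type)
        if data_class is None:
            unmapped.add(entity_type)
            continue
        counts[(path, data_class)] += 1
    findings = [
        {"location": path, "data_class": data_class, "occurrences": n}
        for (path, data_class), n in sorted(counts.items())
    ]
    return findings, sorted(unmapped)


detections = [
    ("share/loan_application.pdf", "IBAN"),
    ("share/loan_application.pdf", "IBAN"),
    ("mail/inbox_export.eml", "EMAIL_ADDRESS"),
    ("mail/inbox_export.eml", "PASSPORT_NUMBER"),
]
findings, unmapped = build_findings(detections)
```

Records shaped like `findings` are the kind of thing that would then be synced into a catalog and linked to business categories.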

To learn more, read the entire case study: Discovering, classifying and cataloging unstructured data for a European bank.

Why classification is essential before cataloging

If discovery tells you where your data lives, and classification tells you what it is, cataloging is what finally makes unstructured data usable across the business.

But cataloging doesn’t work without classification first. You need:

categories
sensitivity labels
metadata
context
connections to business definitions

Classification provides the structure that cataloging relies on. Without it, you’d end up with a catalog full of files and no real way to understand or trust them.
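One way to picture this dependency is a readiness check that runs before a file is admitted to the catalog. The field names below simply mirror the list above, and the function and sample entries are hypothetical.

```python
# Field names mirror the list above (assumption: this is not a real catalog schema).
REQUIRED_FIELDS = (
    "category",
    "sensitivity_label",
    "metadata",
    "context",
    "business_terms",
)


def missing_for_catalog(entry: dict) -> list[str]:
    """Return the required fields that are absent or empty for this entry."""
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]


classified = {
    "category": "Contract",
    "sensitivity_label": "confidential",
    "metadata": {"path": "legal/msa.pdf", "pages": 14},
    "context": "Master service agreement with a supplier",
    "business_terms": ["Supplier contract"],
}
unclassified = {"metadata": {"path": "drive/scan_0042.png"}}

ready = missing_for_catalog(classified)
gaps = missing_for_catalog(unclassified)
```

An entry with gaps is just a file with a path, which is the "catalog full of files" problem in miniature.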

Murdio tip: Classification only works when aligned with an organization’s data governance framework, policies, and operating model – not as a standalone technical exercise.

Preparing for the next stage: cataloging unstructured data

Once classification is complete, organizations can move on to the final step: building a searchable, governed catalog of unstructured data.

Our next article will dive into how Collibra brings everything together to turn unstructured content into a fully governed, business-ready asset.

For now, remember this: classification makes unstructured data understandable. Once you understand it, you can secure it, govern it, and actually put it to work.

And if you’re struggling with unstructured data management in your organization, our data governance experts are here to help.
