Key takeaways
- Unstructured data classification bridges the gap between discovery and operational governance, not just visibility.
- Classification strengthens data security, boosts data quality, and helps organizations stay compliant by revealing what’s sensitive and where it lives.
- AI and machine learning are essential – but only when embedded in a clear governance framework.
- Collibra + Ohalo X‑Ray enable scalable classification when aligned with the organization’s operating model.
- Getting classification right sets the stage for the final step: building a trustworthy unstructured data catalog (which we’re going to talk about in the next piece).
Unstructured data shows up everywhere in an organization – documents, PDFs, chat messages, screenshots, presentations, emails, social media posts, and files tucked away in forgotten drives. It’s one of the richest sources of insight companies have, yet also one of the most overwhelming to manage.
If you read the previous article on discovering unstructured data, you already know how complex this landscape is. But discovery is only the beginning. Once you know where all this content lives, the next challenge is figuring out what it actually is. So, let’s talk about unstructured data classification.
Understanding unstructured data, and why governance teams lose sleep over it
A quick reminder (especially if you haven’t read the previous piece about unstructured data discovery).
If you’ve ever opened a shared folder and found 12 versions of a document, three screenshots with identical names, and a PDF of a document someone photographed sideways, you’ve seen unstructured data chaos in action.
Unstructured data includes:
- text data in documents, PDFs, and emails,
- images and scans,
- presentations,
- chats and messaging threads,
- social media posts,
- pretty much anything that isn’t sitting in a neat table inside a database.
Unlike structured data, unstructured data has no predefined schema: no neat columns, fields, or labels. It grows quietly across cloud platforms, shared drives, collaboration tools, and customer‑facing applications, often without ownership, policy enforcement, or visibility. At the file volumes most companies hold, even basic data management would take years if done manually.
For data governance teams, the real problem, though, is uncertainty – and risk:
- Sensitive data buried in the wrong places
- No clear way to apply retention or access policies
- Compliance exposure discovered only during audits or incidents
- Manual remediation efforts that don’t scale.
Without classification, unstructured data remains a black box – rich in potential value, but extremely high‑risk to operate.
What unstructured data classification actually means in practice
To quote Ohalo’s LinkedIn page:
“Unstructured data isn’t outside governance. It’s just been outside your data catalog.”
And classification is the next step to get it into your data catalog, after it’s been successfully discovered.
Essentially, you want to give your content a structure it didn’t originally have. You’re not changing the files – you’re changing how you understand them.
In governance terms, classification answers four critical questions:
- What is this data?
- Why does it exist?
- How sensitive is it?
- Which policies apply to it?
But with unstructured data, the process is more nuanced, and it requires more than keyword matching. A typical, scalable classification process looks something like this:
- Discover unstructured data across shared drives, cloud storage, collaboration platforms, and legacy repositories.
- Analyze content using AI, natural language processing, and machine learning to understand context and meaning.
- Apply classification labels to the metadata aligned to the organization’s data classification policy (e.g., public, internal, confidential, regulated). This is where you’ll also classify sensitive data.
- Trigger governance actions such as access controls, retention rules, stewardship workflows, and security controls.
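The four steps above can be sketched as a small pipeline. This is only an illustration, not how Data X-Ray or Collibra work internally. The regex detectors are a deliberately simple stand-in for the AI/NLP analysis step, and the names `DETECTORS`, `ENTITY_LABEL`, `ACTIONS`, and `classify`, as well as the default label of "internal", are assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical detectors: simple regexes standing in for AI/NLP content analysis.
DETECTORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

# Policy labels from least to most sensitive (the examples named in the list above).
LABEL_ORDER = ["public", "internal", "confidential", "regulated"]

# Hypothetical rule: which label each detected entity implies.
ENTITY_LABEL = {"email": "confidential", "iban": "regulated"}

# Hypothetical governance actions triggered by each label.
ACTIONS = {
    "public": [],
    "internal": ["apply_retention_rule"],
    "confidential": ["apply_retention_rule", "restrict_access"],
    "regulated": ["apply_retention_rule", "restrict_access", "open_stewardship_task"],
}


@dataclass
class Classification:
    path: str
    entities: list = field(default_factory=list)
    label: str = "internal"
    actions: list = field(default_factory=list)


def classify(path: str, text: str) -> Classification:
    """Analyze content, pick the most sensitive implied label, attach actions."""
    entities = sorted(name for name, rx in DETECTORS.items() if rx.search(text))
    # Assumption: content with no detected entities defaults to "internal".
    labels = [ENTITY_LABEL[e] for e in entities] or ["internal"]
    label = max(labels, key=LABEL_ORDER.index)
    return Classification(path, entities, label, list(ACTIONS[label]))


result = classify(
    "hr/offer.txt",
    "Contact jane.doe@example.com, IBAN DE44500105175407324931",
)
print(result.label, result.actions)
```

Taking the most sensitive label when several entities are found is one possible design choice; a real policy might instead combine labels or weigh the surrounding context.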
When classification is consistent and policy‑aligned, you can:
- Identify sensitive information before it becomes a liability
- Apply controls proactively instead of reactively
- Support audits with evidence, not assumptions
- Make unstructured data usable without increasing risk
The role of AI, and where governance discipline is still required
Manually reviewing unstructured data at enterprise scale is simply unrealistic; at typical volumes it would take a lifetime, or several. AI and machine learning are what make modern classification possible:
- They recognize patterns across huge volumes of data.
- They learn from past classification decisions.
- They identify sensitive data automatically, even when the wording is not uniform.
- They understand the context in text data, thanks to natural language processing.
- They reduce errors, improving overall data quality.
AI-driven classification tools can look at a contract, an HR file, an email thread, or a scanned form and understand what’s going on in a way rules-based systems simply can’t.
And because those tools operate at machine speed, they can analyze massive amounts of unstructured data long before a human could make it through a single folder.
There is a but, though.
AI does not replace governance design.
Without clear classification definitions, validation workflows, and accountability, AI outputs will quickly become inconsistent, untrusted, or ignored.
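One common way to build a validation workflow around AI output is to route each predicted label by confidence. The sketch below assumes the classifier returns a score between 0 and 1. The thresholds, the `Prediction` type, and the rule that "regulated" content always goes to a steward are hypothetical choices, not features of any specific tool.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real values would be set by the governance policy.
AUTO_ACCEPT = 0.90
AUTO_REJECT = 0.40


@dataclass
class Prediction:
    path: str
    label: str
    confidence: float  # model score in [0, 1]


def route(pred: Prediction) -> str:
    """Decide whether a predicted label is applied, reviewed, or discarded."""
    if pred.label == "regulated":
        # Assumption: the riskiest class always gets a human review.
        return "steward_review"
    if pred.confidence >= AUTO_ACCEPT:
        return "auto_apply"
    if pred.confidence < AUTO_REJECT:
        return "discard"
    return "steward_review"


for p in [
    Prediction("contracts/nda.pdf", "confidential", 0.97),
    Prediction("hr/payroll.xlsx", "regulated", 0.99),
    Prediction("misc/scan_0012.png", "internal", 0.55),
]:
    print(p.path, "->", route(p))
```

The point of the routing is accountability: every label either meets an agreed threshold or lands with a named reviewer, so outputs do not drift into the "untrusted or ignored" category.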
Classification as the foundation of security and compliance
From a data governance perspective, classification is the control point that enables everything else.
Once you know what data is sensitive, regulated, or business‑critical, you can:
- Enforce least‑privilege access
- Apply retention and deletion rules
- Support GDPR, DORA, and records‑management requirements
- Reduce breach impact by limiting exposure
Classification also exposes data that should not exist at all – duplicates, outdated files, and forgotten repositories that silently increase risk (and storage costs).
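As a rough illustration of how duplicates and outdated files can be surfaced, the sketch below groups byte-identical files by SHA-256 hash and flags files by last-modified time. It uses only the Python standard library. The function names and the age-based staleness rule are assumptions for the example, not a description of any vendor's method.

```python
import hashlib
import time
from collections import defaultdict
from pathlib import Path


def content_hash(path: Path) -> str:
    """SHA-256 of the file's bytes, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list:
    """Group files under root whose content is byte-identical."""
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[content_hash(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]


def find_stale(root: Path, max_age_days: int) -> list:
    """Files not modified within max_age_days (a hypothetical staleness rule)."""
    cutoff = time.time() - max_age_days * 86400
    return [
        p for p in sorted(root.rglob("*"))
        if p.is_file() and p.stat().st_mtime < cutoff
    ]
```

Exact hashing only catches identical copies. Near-duplicates, such as several edited versions of the same document, need content-aware comparison, which is where a classification platform comes in.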
Collibra + Ohalo X-Ray: a unified solution for unstructured data classification
Classifying unstructured data at scale requires both technical capability and governance context.
When you pair an unstructured data discovery and classification platform like Ohalo Data X-Ray with a data governance tool like Collibra, the process becomes streamlined and efficient while staying connected to the business context.
How Ohalo X-Ray works
Ohalo X-Ray scans unstructured data sources – documents, PDFs, images, text data, and more – and analyzes them using AI, natural language processing, and machine learning algorithms.
Its capabilities include:
- identifying sensitive information, and even automated redaction of PII in files
- detecting patterns and meaning in content
- recognizing data types and aspects of data
- categorizing content automatically
- working across on-prem and cloud-based data sources
Figure: How Ohalo Data X-Ray works. Source: Ohalo
Data X-Ray first discovers the data, then classifies it using automatic token analysis at a rate of hundreds of thousands of words per second. It then uses generative AI to contextualize the files.
How Collibra operationalizes the classification results
Once X-Ray applies classification labels, Collibra brings the governance layer:
- mapping results to business terms
- reviewing and validating classified data
- applying governance workflows
- improving data quality
- managing storage, retention, and stewardship processes
Collibra puts the classification into context, helping people understand not just what the data is, but how it fits into the organization’s data lifecycle. We’ll talk more about that in the final part of the unstructured data series on cataloging unstructured data.
How does that work? Data X-Ray supports native and bespoke connectors and hundreds of file types. It auto-generates metadata to link and maintain the Data Catalog, and it automatically propagates business glossary terms, data asset descriptions, and hierarchies defined within Collibra.
Implementation reality check: where classification initiatives fail
Most unstructured data classification initiatives fail because of execution gaps, not the technology, which is why governance design matters again here. Common failure points include:
- Classification labels that don’t align with enterprise policies
- No ownership for validation and exception handling
- Over‑automation without data steward oversight
- Disconnected tooling that doesn’t integrate cleanly with Collibra’s metamodel
- Inconsistent rollout across business units
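The first failure point, labels that don't align with enterprise policy, is one that can be checked mechanically before rollout. The sketch below is a minimal example under assumed names: `POLICY_LABELS` reuses the example labels from earlier in this article, while `TOOL_TO_POLICY` and the tool-side label names are invented for illustration.

```python
# Policy labels, taken from the example classification policy used in this article.
POLICY_LABELS = {"public", "internal", "confidential", "regulated"}

# Hypothetical mapping from a classification tool's output labels to policy labels.
TOOL_TO_POLICY = {
    "PII": "regulated",
    "Financial": "confidential",
    "Marketing": "public",
    "HR-Sensitive": "restricted",  # deliberately wrong: "restricted" is not in the policy
}


def check_alignment(mapping, policy_labels, observed_tool_labels):
    """Return tool labels with no mapping, and mappings that point outside the policy."""
    unmapped = sorted(set(observed_tool_labels) - set(mapping))
    invalid = sorted(k for k, v in mapping.items() if v not in policy_labels)
    return unmapped, invalid


unmapped, invalid = check_alignment(
    TOOL_TO_POLICY, POLICY_LABELS, ["PII", "Legal", "Marketing"]
)
print("No mapping defined for:", unmapped)
print("Mapped to a non-policy label:", invalid)
```

Both lists should be empty before results are pushed into the catalog. Anything left over needs a decision from whoever owns the classification policy.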
For a Data Governance Manager, failures like these translate directly into personal and organizational risk: missed deadlines, audit findings, and low adoption. As usual, they are easier to prevent than to fix later.
Practical examples of how organizations classify unstructured data
There are three key use cases for unstructured data classification that come up most often:
- Regulatory records and file retention – to understand your sensitive data and make sure your organization complies with relevant regulatory requirements. Unstructured data is particularly notorious for introducing chaos in this area.
- System and on-prem to cloud migrations. Because unstructured data lacks a predefined data model, migrating it is particularly challenging. Uncovering stale data and legacy files before the move not only covers regulatory compliance needs but also prevents inflated storage costs, and it leaves the data ready for cataloging.
- AI governance. According to Collibra, managing data remains the top challenge for 72% of businesses that want to scale AI. Classifying and managing unstructured data for AI is particularly challenging, and the fact that humans and AI agents need different metadata complicates it even further.
A Murdio case study in unstructured data classification
Here’s our own example from one of our recent client projects in the banking industry, where classification was part of a wider unstructured data management project.
Our client had unstructured, critical data hidden across PDFs, emails, and file shares, with severely limited visibility into PII and confidential information. They also used Collibra as a data catalog, but it couldn’t natively scan and classify content inside PDF files.
This is exactly why we implemented the Collibra + Ohalo Data X-Ray integration, enabling automated discovery, classification, and governance of unstructured data:
- Ohalo Data X-Ray used OCR (Optical Character Recognition) and AI to scan PDFs, emails, images, and network drives.
- Detected entities were mapped to data classes/terms (e.g., PII categories) and synced into the Collibra Data Catalog as findings and technical assets, then mapped to business categories and physical locations.
- This provided end-to-end visibility across documents and systems (who/where/what), enabling policy and remediation.
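To make the mapping step more concrete, here is a simplified sketch of turning raw detections for one file into catalog-ready findings. It does not reflect the actual Data X-Ray output format or the Collibra API. The entity names, the `ENTITY_TO_CLASS` table, the `Finding` structure, and the example path are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping: detected entity type -> (data class, business category).
ENTITY_TO_CLASS = {
    "person_name": ("PII", "Customer identity"),
    "national_id": ("PII", "Customer identity"),
    "iban": ("Financial", "Payment details"),
}


@dataclass(frozen=True)
class Finding:
    location: str           # where: file path or share
    entity_type: str        # what was detected
    data_class: str         # e.g. a PII category
    business_category: str  # business-facing grouping


def to_findings(location, detected_entities):
    """Map detections for one file to findings; return unknown entity types separately."""
    findings, unmapped = [], []
    for entity in sorted(set(detected_entities)):
        if entity in ENTITY_TO_CLASS:
            data_class, category = ENTITY_TO_CLASS[entity]
            findings.append(Finding(location, entity, data_class, category))
        else:
            unmapped.append(entity)  # left for a steward to map, not silently dropped
    return findings, unmapped


findings, unmapped = to_findings(
    "//share/loans/application.pdf",
    ["iban", "person_name", "iban", "signature"],
)
for f in findings:
    print(f.location, f.entity_type, f.data_class, f.business_category)
print("Needs mapping:", unmapped)
```

Keeping location, entity, and category together in each record is what gives the "who/where/what" view described above.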
To learn more, read the entire case study: Discovering, classifying and cataloging unstructured data for a European bank.
Why classification is essential before cataloging
If discovery tells you where your data lives, and classification tells you what it is, cataloging is what finally makes unstructured data usable across the business.
But cataloging doesn’t work without classification first. You need:
- categories
- sensitivity labels
- metadata
- context
- connections to business definitions
Classification provides the structure that cataloging relies on. Without it, you’d end up with a catalog full of files and no real way to understand or trust them.
Murdio tip: Classification only works when aligned with an organization’s data governance framework, policies, and operating model – not as a standalone technical exercise.
Preparing for the next stage: cataloging unstructured data
Once classification is complete, organizations can move on to the final step: building a searchable, governed catalog of unstructured data.
Our next article will dive into how Collibra brings everything together to turn unstructured content into a fully governed, business-ready asset.
For now, remember this: classification makes unstructured data understandable. Once you understand it, you can secure it, govern it, and actually put it to work.
And if you’re struggling with unstructured data management in your organization, our data governance experts are here to help.
