A data quality rule is a formal, documented specification that defines the conditions your data must meet to be considered accurate, complete, and fit for its intended purpose.
That definition matters because it draws a line most teams never draw. A data quality rule is not the same as a data quality check, a data quality policy, or a data quality metric – and confusing these four concepts is one of the most common reasons enterprise data quality programs stall.
Here is how they differ:
| Concept | What it is | Example |
| --- | --- | --- |
| Data quality rule | A formal specification – the condition data must satisfy | “Every customer record must have a valid email address in RFC 5321 format” |
| Data quality check | The technical implementation that tests a rule | A SQL query or Python script that flags rows where email fails regex validation |
| Data quality policy | A high-level governance principle that drives multiple rules | “Customer data must be complete and accurate before use in any marketing campaign” |
| Data quality metric | A measurement of rule compliance over time | “98.7% of customer email addresses are currently valid” |
The rule sits in the middle. It translates a business policy into a precise, testable condition – and it gives the engineering team something concrete to implement as a check.
Think of it this way: a policy says what the organization requires. A rule says what a specific data element must satisfy. A check says how to test that. A metric says how well you are currently doing.
Most organizations have checks. Fewer have rules that are formally documented. Almost none have rules that are actively governed – with owners, thresholds, and a defined lifecycle. That gap is exactly where enterprise data quality programs break down.
Why the distinction between rules and checks matters
If your team treats rules and checks as the same thing, you will run into a predictable set of problems.
First, checks written without a formal rule behind them tend to be technically correct but practically arbitrary. An engineer writes a null check on a column because it seemed important – but there is no documentation explaining which business process depends on that column, what the acceptable failure rate is, or who should be notified when it fails. When the engineer leaves, institutional knowledge leaves with them.
Second, without documented rules, it is impossible to audit your data quality program. A regulator asking “how do you ensure the accuracy of your customer data?” needs an answer that goes deeper than “we have some SQL scripts in our pipeline.” Rules give you that answer.
Third, checks without rules cannot be prioritized. If every check is equally important, nothing is. A formal rule forces you to define severity – and severity determines whether a failure blocks a downstream process or just generates a warning.
The short version: checks are how you test data. Rules are what you are testing for – and why.
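To make the separation concrete, here is a minimal Python sketch. It assumes a handful of in-memory customer records and a deliberately simplified email pattern (not a full RFC 5321 validator); the dictionary layout and function names are illustrative, not taken from any particular tool.

```python
import re

# Rule: the documented condition (hypothetical, simplified record).
RULE = {
    "name": "CUST_EMAIL_FORMAT_001",
    "description": "Active customers must have a validly formatted email address",
    "max_failure_rate": 0.005,  # the threshold lives in the rule, not in the check
}

# Simplified pattern for illustration only; it is NOT a full RFC 5321 validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def check_email_format(rows):
    """Check: the technical test. Returns the rows that violate the rule."""
    return [r for r in rows if not EMAIL_PATTERN.match(r.get("email") or "")]


def pass_rate(rows, failures):
    """Metric: the share of rows that satisfy the rule."""
    return 1.0 if not rows else 1 - len(failures) / len(rows)


customers = [
    {"id": 1, "email": "ana@example.com"},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": None},
    {"id": 4, "email": "li@example.org"},
]

failures = check_email_format(customers)
rate = pass_rate(customers, failures)
breached = (1 - rate) > RULE["max_failure_rate"]
print(f"{RULE['name']}: pass rate {rate:.1%}, threshold breached: {breached}")
```

The check can be rewritten in SQL or moved to another platform without touching the rule; the rule record is what stays stable and auditable.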
Key takeaways
- A data quality rule is a formal, documented specification that defines the conditions data must meet to be considered fit for purpose – separate from the technical check that tests it.
- Rules have six core types: completeness, validity, uniqueness, referential integrity, business/domain-specific, and timeliness.
- Every rule needs an owner, a threshold, a severity level, and a defined remediation path – without these four attributes, a rule is just a wish.
- Managing rules at enterprise scale requires a central repository, a formal lifecycle, and a governance platform that connects rules to the data assets they protect.
The anatomy of a data quality rule
Most organizations that struggle with data quality have the same underlying problem: their rules are incomplete. They define what a rule tests but skip everything else – who owns it, what a pass looks like, what happens when it fails. An incomplete rule cannot be governed, and an ungoverned rule cannot be trusted.
A well-formed data quality rule has nine attributes. Each one serves a specific purpose, and leaving any of them out creates a gap that will surface later – usually at the worst possible moment.
| Attribute | What it defines | Example |
| --- | --- | --- |
| Rule name | A unique, human-readable identifier following a consistent naming convention | CUST_EMAIL_FORMAT_001 |
| Description | What the rule verifies, in plain business language | “The email address field for every active customer record must conform to RFC 5321 format” |
| Data domain | The business area the rule belongs to | Customer Master Data |
| Quality dimension | Which data quality dimension the rule enforces | Validity |
| Scope | The exact table, column, or data asset being tested | customers.email_address WHERE status = ‘active’ |
| Threshold | The acceptable failure rate before the rule triggers an alert | < 0.5% of records |
| Severity | What happens when the threshold is breached | Critical – blocks downstream campaign exports |
| Owner | The person accountable for the rule – not the engineer who implements it, but the business stakeholder who needs it | Data Steward, Marketing |
| Remediation | The defined action when a failure occurs | Quarantine affected records, alert data steward, log in issue tracker |
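One way to keep all nine attributes together is a small typed record. The sketch below is an assumption about how such a record might look in Python; the class name, field names, and the `missing_attributes` helper are hypothetical, and the values are the examples from the table.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DataQualityRule:
    name: str
    description: str
    data_domain: str
    quality_dimension: str
    scope: str
    max_failure_rate: float  # threshold, expressed as a fraction of records
    severity: str
    owner: str
    remediation: str

    def missing_attributes(self):
        """Names of text attributes left blank or marked 'TBD'."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, str) and value.strip().upper() in ("", "TBD"):
                missing.append(f.name)
        return missing


rule = DataQualityRule(
    name="CUST_EMAIL_FORMAT_001",
    description="The email address field for every active customer record "
                "must conform to RFC 5321 format",
    data_domain="Customer Master Data",
    quality_dimension="Validity",
    scope="customers.email_address WHERE status = 'active'",
    max_failure_rate=0.005,
    severity="Critical",
    owner="TBD",  # deliberately incomplete to show the gap being caught
    remediation="Quarantine affected records, alert data steward, log in issue tracker",
)
print(rule.missing_attributes())  # ['owner']
```

A record like this can be rejected at definition time if `missing_attributes()` is non-empty, which stops incomplete rules from reaching deployment.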
Why each attribute matters
- Rule name and description sound like administrative overhead. They are not. In a large enterprise with hundreds of rules spread across dozens of data domains, a name like check_email tells you nothing six months later. A convention like CUST_EMAIL_FORMAT_001 tells you the domain (customer), the field (email), the type of check (format), and the version – without opening any documentation.
- Threshold is where most teams make their first mistake. Requiring a 100% pass rate – zero tolerance for failures – sounds rigorous. In practice, it means your alerting system cries wolf constantly, your team starts ignoring alerts, and the one genuine failure that matters gets buried in noise. The right threshold is a business decision, not a technical one. It should reflect the actual tolerance of the downstream process that depends on this data.
- Owner is the attribute most often left blank. When a rule has no owner, it has no accountability. If the rule fails, nobody is responsible for investigating. If the underlying data model changes, nobody updates the rule. Ownership must be a named person – not a team, not a system, and not “TBD.”
“In almost every implementation we walk into, the rules exist in some form – in pipeline code, in Jira tickets, in someone’s head. What is almost never there is ownership. The moment you ask ‘who is responsible for this rule?’, the room goes quiet. That silence is the actual data quality problem.”
— Sebastian Chalot, Data Governance Consultant, Murdio
- Remediation is the attribute that turns a rule from a monitoring tool into an operational control. A rule that fires an alert but has no defined response path is only marginally better than no rule at all. The remediation step should answer three questions: what happens to the failing records, who is notified, and what is the expected resolution time.
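How threshold and severity interact can be sketched in a few lines. The function below is a hypothetical illustration: the action names (`pass`, `warn`, `block`) and the choice to treat the threshold as inclusive are assumptions, not a standard.

```python
def evaluate_run(failed, total, max_failure_rate, severity):
    """Decide what one check run means, given the rule's threshold and severity."""
    if total == 0:
        return "no_data"  # nothing to evaluate; worth surfacing separately
    failure_rate = failed / total
    if failure_rate <= max_failure_rate:  # boundary treated as passing (assumption)
        return "pass"
    # Threshold breached: severity decides whether downstream work is blocked.
    return "block" if severity == "critical" else "warn"


print(evaluate_run(3, 1000, 0.005, "critical"))  # within tolerance
print(evaluate_run(8, 1000, 0.005, "critical"))  # breach on a critical rule
print(evaluate_run(8, 1000, 0.005, "warning"))   # breach on a lower-severity rule
print(evaluate_run(1, 1000, 0.0, "critical"))    # zero tolerance: one bad row blocks
```

The last call shows why a zero-tolerance threshold is noisy: a single failing record in a thousand is enough to block the downstream process.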
Types of data quality rules
Not all data quality rules work the same way or protect the same thing. Grouping them by type gives your team a shared vocabulary, makes it easier to assign ownership, and helps you spot gaps in your coverage – domains or quality dimensions where you have no rules at all.
There are six core types. Most enterprise data quality programs need all of them.
| Rule type | Quality dimension | What it checks | Example |
| --- | --- | --- | --- |
| Completeness rules | Completeness | Whether required fields contain a value | Every active contract record must have a non-null counterparty_id |
| Validity rules | Validity | Whether values conform to a defined format, pattern, or set of accepted values | Transaction currency must be a valid ISO 4217 code |
| Uniqueness rules | Uniqueness | Whether values that must be unique – such as primary keys or identifiers – contain duplicates | No two records in clients may share the same tax_identification_number |
| Referential integrity rules | Consistency | Whether relationships between datasets are intact | Every account_id in the transactions table must exist in the accounts table |
| Business rules | Accuracy / Consistency | Whether data satisfies domain-specific or regulatory conditions that go beyond format | A loan application flagged as “approved” must have a non-null credit_score above the institution’s minimum threshold |
| Timeliness rules | Timeliness | Whether data is sufficiently fresh for its intended use | The positions table must be updated within 15 minutes of market close |
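The four structural rule types translate fairly directly into SQL. The sketch below runs them from Python against an in-memory SQLite database; the table layout, sample rows, and the three-currency list are assumptions made for illustration (a real validity check would use the full ISO 4217 code list). Business and timeliness rules are left out because they depend on context that a schema alone does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER);
    CREATE TABLE transactions (txn_id INTEGER, account_id INTEGER, currency TEXT);
    INSERT INTO accounts VALUES (1), (2);
    INSERT INTO transactions VALUES
        (10, 1, 'EUR'),
        (11, 2, 'usd'),   -- lower-case code: fails validity
        (11, 2, 'USD'),   -- repeated txn_id: fails uniqueness
        (12, 3, 'GBP'),   -- account 3 does not exist: fails referential integrity
        (13, NULL, 'EUR'); -- missing account_id: fails completeness
""")

# Each query counts violations for one rule type.
CHECKS = {
    "completeness": "SELECT COUNT(*) FROM transactions WHERE account_id IS NULL",
    "validity": "SELECT COUNT(*) FROM transactions "
                "WHERE currency NOT IN ('EUR', 'USD', 'GBP')",
    "uniqueness": "SELECT COUNT(*) FROM (SELECT txn_id FROM transactions "
                  "GROUP BY txn_id HAVING COUNT(*) > 1)",
    "referential_integrity": """
        SELECT COUNT(*) FROM transactions t
        LEFT JOIN accounts a ON a.account_id = t.account_id
        WHERE t.account_id IS NOT NULL AND a.account_id IS NULL
    """,
}

results = {name: conn.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}
for name, violations in results.items():
    print(f"{name}: {violations} violation(s)")
```

Note that the referential integrity query excludes null keys on purpose, so a missing account_id is counted once (as a completeness failure) rather than twice.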
How to read the table
The quality dimension column maps each rule type to the data quality dimension it primarily enforces. A completeness rule enforces completeness. A validity rule enforces validity. This mapping matters because it tells you which dimension your rule library is covering – and which it is not.
In practice, a large enterprise rule library will have completeness and validity rules in abundance, because they are the easiest to write. Timeliness rules and business rules tend to be underrepresented, because they require closer collaboration between data engineers and business stakeholders to define correctly. If your inventory of rules skews heavily toward completeness and validity, that is a signal worth paying attention to.
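A quick way to see that skew is to count rules per dimension. The inventory below is hypothetical (the rule names are invented for illustration); the point is only the shape of the calculation.

```python
from collections import Counter

DIMENSIONS = ["Completeness", "Validity", "Uniqueness",
              "Consistency", "Accuracy", "Timeliness"]

# Hypothetical inventory: (rule name, primary quality dimension).
rule_library = [
    ("CUST_EMAIL_FORMAT_001", "Validity"),
    ("CUST_EMAIL_PRESENT_001", "Completeness"),
    ("CONTRACT_COUNTERPARTY_PRESENT_001", "Completeness"),
    ("TXN_CURRENCY_ISO_001", "Validity"),
    ("CLIENT_TAX_ID_UNIQUE_001", "Uniqueness"),
]

counts = Counter(dimension for _, dimension in rule_library)
gaps = [d for d in DIMENSIONS if counts[d] == 0]

for d in DIMENSIONS:
    print(f"{d:<13} {counts[d]}")
print("No rules for:", gaps)
```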
Business rules deserve special attention
Business rules are the most powerful type and the hardest to write well. Unlike the other five types – which are largely structural and can be defined by a data engineer working from a schema – business rules encode domain knowledge. They express conditions that only make sense in the context of a specific process, regulation, or industry.
In financial services, a business rule might require that any position exceeding a certain notional value has an associated hedge on record – a condition derived directly from risk policy, not from the data model. In insurance, a rule might state that a claim cannot be marked as settled if the associated policy has an active dispute flag. These conditions are invisible to anyone who does not understand the business.
This is why business rules require a different workflow. The data steward or business analyst defines the condition. A data engineer translates it into a testable specification. A compliance or risk owner signs off. Skipping any of those three steps produces rules that are either technically untestable or practically meaningless.
For organizations operating under regulatory frameworks such as DORA and GDPR in the EU, or the Basel Committee's BCBS 239, a significant portion of business rules will be driven directly by regulatory requirements. Those rules carry an additional attribute worth tracking: the specific regulatory article or control they map to. When an auditor asks which rules enforce your BCBS 239 data lineage obligations, you want to be able to answer in seconds, not weeks.
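As an illustration, the loan approval example from the rule types table could be expressed like this. The minimum score, the rule name, and the regulatory reference are all placeholders; in particular, the reference string is not a real citation, and "above" is read strictly (a score equal to the minimum fails).

```python
MIN_CREDIT_SCORE = 620  # hypothetical institutional minimum, for illustration only

LOAN_RULE = {
    "name": "LOAN_APPROVAL_SCORE_001",  # hypothetical rule name
    "dimension": "Accuracy / Consistency",
    "regulatory_reference": "EXAMPLE-FRAMEWORK, control X.Y",  # placeholder, not a real citation
}


def violates_loan_rule(application, min_score=MIN_CREDIT_SCORE):
    """True when an approved application lacks a credit score above the minimum."""
    if application["status"] != "approved":
        return False  # the rule only applies to approved applications
    score = application.get("credit_score")
    return score is None or score <= min_score


applications = [
    {"id": "A1", "status": "approved", "credit_score": 710},
    {"id": "A2", "status": "approved", "credit_score": None},
    {"id": "A3", "status": "approved", "credit_score": 600},
    {"id": "A4", "status": "rejected", "credit_score": None},
]
violations = [a["id"] for a in applications if violates_loan_rule(a)]
print(violations)  # ['A2', 'A3']


def rules_for_framework(rules, framework):
    """Answer the auditor's question: which rules map to this framework?"""
    return [r["name"] for r in rules if r["regulatory_reference"].startswith(framework)]


print(rules_for_framework([LOAN_RULE], "EXAMPLE-FRAMEWORK"))
```

The condition itself is trivial to code. What the sketch cannot supply is the minimum score and the decision that the rule applies only to approved applications; those come from the business owner.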
The data quality rule lifecycle
A data quality rule is not a one-time artifact. It has a lifecycle – from the moment a need is identified to the moment the rule is retired. Managing that lifecycle deliberately is what separates organizations with a functioning data quality program from those with a graveyard of stale checks nobody maintains.
The lifecycle has seven stages.
1. Define
The process starts with identifying a specific data quality need – a failed audit, a broken report, a regulatory requirement, or a proactive gap analysis. The data steward or business analyst documents the rule using the nine-attribute template from the anatomy section above. At this stage, the rule exists as a specification only. No code has been written.
The most common mistake at this stage is skipping scope. A rule defined as “email addresses must be valid” is not actionable. A rule defined as “the email_address field in customers where status = ‘active’ must conform to RFC 5321 format, with a failure threshold of 0.5%” is.
2. Validate
The draft rule is reviewed jointly by the business owner and a data engineer. The business owner confirms the condition reflects the actual business requirement. The data engineer confirms the rule is technically testable against the available data model. If either review fails, the rule goes back to definition.
This step exists to catch two failure modes early: rules that are business-correct but untestable, and rules that are technically clean but describe the wrong condition.
3. Approve
The validated rule goes through formal sign-off – typically the data governance council, a CDO, or the designated data owner for that domain. Approval creates an audit trail. It also forces a prioritization decision: what severity level does this rule carry, and what resources are allocated to monitor and remediate it?
In organizations with a data governance platform, approval is tracked directly in the platform against the rule record.
4. Deploy
An approved rule is implemented as one or more data quality checks – the SQL queries, Python scripts, or platform-native tests that actually run against the data. The check is the technical execution of the rule. One rule may produce several checks: a completeness check, a format check, and a range check can all flow from a single well-defined business rule.
At deployment, the rule is linked to the data asset it protects in the governance catalog. This linkage is what makes the rule traceable – you can navigate from any data asset to the rules that govern it, and from any rule to the assets it covers.
5. Monitor
Once deployed, the rule runs on a defined schedule and produces a pass rate over time. Monitoring is not just about catching failures – it is about tracking trends. A rule with a 99.2% pass rate today that was at 99.8% three months ago is telling you something. The degradation may be slow enough to stay below your alert threshold while still signaling a structural problem upstream.
Effective monitoring requires a dashboard that shows pass rates per rule, per domain, and per dimension – not just raw alert counts. This is covered in more detail in the data quality report guide.
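The difference between a threshold breach and a slow decline can be shown with a short sketch. The pass rate history, the 99% alert floor, and the 0.5-point drift tolerance are assumed values chosen to mirror the example above.

```python
def breaches_threshold(pass_rates, min_pass_rate):
    """True when the latest pass rate is below the alert floor."""
    return pass_rates[-1] < min_pass_rate


def is_drifting(pass_rates, max_drop=0.005):
    """True when the pass rate fell by more than max_drop across the window."""
    if len(pass_rates) < 2:
        return False
    return (pass_rates[0] - pass_rates[-1]) > max_drop


# Hypothetical monthly pass rates for one rule, oldest first.
history = [0.998, 0.997, 0.995, 0.992]

print("alert:", breaches_threshold(history, min_pass_rate=0.99))
print("drift:", is_drifting(history))
```

Here the rule never alerts, because 99.2% is still above the floor, yet the drift check flags it. Comparing only the first and last points is the simplest possible trend test; a production setup would more likely use a rolling average or a fitted slope.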
6. Review
Rules should be reviewed on a regular cadence – at minimum annually, and immediately when the underlying data model or business process changes. A rule written against last year’s schema may be silently passing because it is testing a column that no longer exists or has been renamed.
Review questions to ask for each rule:
- Is the condition still accurate relative to the current business process?
- Is the threshold still appropriate, given observed pass rates?
- Is the owner still the right person?
- Has the scope changed – new tables, new systems, new data flows?
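Part of the scope question can be automated by comparing each rule's scope with the live schema. The sketch below uses SQLite's `PRAGMA table_info` from Python; the rule names and the renamed column are assumptions for illustration, and other databases expose the same information through their own catalog views.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Current schema: suppose email_address was renamed to email at some point.
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, status TEXT)")

# Hypothetical rule scopes: rule name -> (table, column).
rule_scopes = {
    "CUST_EMAIL_FORMAT_001": ("customers", "email_address"),
    "CUST_STATUS_VALID_001": ("customers", "status"),
}


def existing_columns(conn, table):
    """Column names of a table. PRAGMA table_info yields one row per column;
    index 1 of each row is the column name. Table names here come from the
    trusted rule registry, not from user input."""
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


stale = [
    name
    for name, (table, column) in rule_scopes.items()
    if column not in existing_columns(conn, table)
]
print(stale)  # ['CUST_EMAIL_FORMAT_001']
```

This only answers whether the scope still exists. Whether the condition, threshold, and owner are still right remains a human review.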
7. Retire
Rules have an end of life. A rule tied to a decommissioned system, a discontinued product line, or a superseded regulation should be formally retired – not deleted, but marked as inactive with a documented reason and date. Deletion destroys audit history. Retirement preserves it.
Retired rules also carry institutional value. When a new team member asks why a certain check exists, a trail of retired rules from the same domain tells the story of how the data quality program evolved.
The lifecycle as a governance artifact
The seven stages above are not just a workflow – they are the evidence trail that demonstrates your data quality program is real and operational. Each transition between stages should be timestamped and attributed to a named person. That record is what you present to a regulator, an auditor, or a new CDO who wants to understand the state of data governance in the organization.
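A minimal sketch of such a record, assuming the seven stages map to states and that a review can send a rule back to monitoring, back to definition, or on to retirement. The transition table and class are illustrative, not a description of any specific platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed moves between lifecycle stages (an assumed model of the seven stages).
ALLOWED = {
    "define": {"validate"},
    "validate": {"approve", "define"},  # a failed review goes back to definition
    "approve": {"deploy"},
    "deploy": {"monitor"},
    "monitor": {"review"},
    "review": {"monitor", "define", "retire"},
    "retire": set(),  # retired rules are kept, never moved or deleted
}


@dataclass
class RuleRecord:
    name: str
    stage: str = "define"
    history: list = field(default_factory=list)

    def move_to(self, new_stage, actor):
        """Record a timestamped, attributed transition, or refuse an invalid one."""
        if new_stage not in ALLOWED[self.stage]:
            raise ValueError(f"{self.stage} -> {new_stage} is not an allowed transition")
        self.history.append((self.stage, new_stage, actor, datetime.now(timezone.utc)))
        self.stage = new_stage


record = RuleRecord("CUST_EMAIL_FORMAT_001")
record.move_to("validate", actor="business.owner")  # hypothetical actor names
record.move_to("approve", actor="governance.council")

try:
    record.move_to("monitor", actor="engineer")  # skipping deployment is rejected
except ValueError as err:
    print(err)

for old, new, actor, when in record.history:
    print(f"{old} -> {new} by {actor} at {when.isoformat()}")
```

The `history` list is the evidence trail: every entry names who moved the rule and when.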
Organizations that skip the lifecycle – writing checks directly in pipelines without a defined-approved-monitored rule behind them – have data quality testing. They do not have data quality governance. The distinction matters more than most teams realize until they face an audit.
Managing data quality rules at enterprise scale
A single well-formed rule with a named owner, a defined threshold, and a monitored check is straightforward to manage. The challenge arrives at scale – when a large enterprise has hundreds of data domains, dozens of source systems, and thousands of rules spread across teams in multiple countries.
At that point, a spreadsheet stops working. Not because spreadsheets are inherently bad tools, but because rule management at scale has requirements that spreadsheets cannot satisfy: version control, cross-domain search, lineage linkage, workflow routing, and real-time pass rate tracking. Attempting to do all of that in Excel produces a rule inventory that is always slightly out of date, always slightly wrong, and trusted by nobody.
What enterprise rule management actually requires
The operational requirements for managing rules at scale break down into five categories:
- A central rule repository – a single place where every rule is documented, searchable, and linked to the data asset it governs. Without this, rules live in pipeline code, Confluence pages, and individuals’ heads. You cannot audit what you cannot find.
- Ownership and stewardship assignment – rules must be assigned to named individuals, not teams or roles. When the organizational structure changes – and it will – ownership needs to be actively reassigned, not lost. The repository should surface rules with no current owner as a first-class operational alert.
- Version history – rules change. Thresholds get adjusted, scopes change as data models evolve, conditions are refined after a business process update. Every change should be versioned and attributed. If a rule was passing at 99.5% last quarter and is now failing at 94%, the version history tells you whether something in the data changed or whether someone changed the rule itself – its condition, its scope, or its threshold.
- Cross-environment propagation – rules defined and tested in a development environment need to follow the data into staging and production. In organizations without a governed propagation process, production environments routinely have fewer rules than dev – because deployments happen faster than rule documentation.
- Consolidated reporting – individual rule pass rates are operational data. Aggregated across a domain, a dimension, or a regulatory framework, they become strategic intelligence. A CDO needs to know not just that a single rule is failing, but that completeness coverage across the customer domain has degraded by 3 percentage points over two quarters.
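The consolidated reporting requirement comes down to rolling individual check results up by a chosen key. The sketch below assumes each check run reports how many records it tested and how many failed; the rule names and figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical latest run per rule.
runs = [
    {"rule": "CUST_EMAIL_PRESENT_001", "domain": "Customer", "dimension": "Completeness",
     "checked": 1000, "failed": 10},
    {"rule": "CUST_EMAIL_FORMAT_001", "domain": "Customer", "dimension": "Validity",
     "checked": 3000, "failed": 30},
    {"rule": "TXN_CURRENCY_ISO_001", "domain": "Finance", "dimension": "Validity",
     "checked": 500, "failed": 0},
]


def aggregate_pass_rate(runs, key):
    """Record-weighted pass rate per value of `key` (for example domain or dimension)."""
    totals = defaultdict(lambda: [0, 0])  # key value -> [checked, failed]
    for run in runs:
        totals[run[key]][0] += run["checked"]
        totals[run[key]][1] += run["failed"]
    return {k: 1 - failed / checked for k, (checked, failed) in totals.items() if checked}


for domain, rate in aggregate_pass_rate(runs, "domain").items():
    print(f"{domain}: {rate:.2%}")
for dimension, rate in aggregate_pass_rate(runs, "dimension").items():
    print(f"{dimension}: {rate:.2%}")
```

Weighting by records checked is one design choice among several; an unweighted average of rule pass rates would treat a rule over 500 rows the same as a rule over 5 million.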
How Collibra handles rule management
Collibra treats data quality rules as first-class citizens in its data catalog. Rules are stored as assets – the same way tables, columns, and reports are stored – which means they carry full metadata, ownership assignments, and relationship linkages.
The practical implication is that a data steward navigating a data asset in Collibra can see, directly in the asset view, which rules govern that asset, what their current pass rates are, and who owns them. The rule is not a separate artifact maintained in a separate system – it is part of the same knowledge graph as the data itself.
Collibra DQ – the platform’s dedicated data quality module, built on the Owl Analytics acquisition – adds the execution layer. Rules defined in the catalog are operationalized as DQ jobs that run against the actual data on a defined schedule. Results feed back into the catalog in real time, updating pass rates and triggering workflow notifications when thresholds are breached.
For organizations managing regulatory obligations under frameworks like BCBS 239 or DORA, this linkage between rules, assets, and lineage is particularly valuable. When a regulator asks for evidence that your critical data elements are governed by defined quality rules with documented owners and monitored pass rates, Collibra provides that evidence as a native export – not as a manually assembled spreadsheet.
This is also where the data quality automation story connects to governance. Automation without a rule library is a pipeline with tests. Automation with a governed rule library – where every check traces back to a documented, approved, owned rule – is a data quality program.
Scaling from 10 rules to 1,000
Most organizations do not start with a complete rule inventory. They start with the ten or twenty most critical rules – typically the ones tied to a specific audit finding, a failed report, or a regulatory deadline. That is the right approach. Trying to define all rules for all domains before implementing anything produces analysis paralysis.
The practical path looks like this:
- Start with a pilot domain – pick one data domain that is high-visibility and has a willing business owner. Define rules end-to-end: documented, approved, deployed, monitored. Use it as a template for every subsequent domain.
- Prioritize by downstream impact – rules that protect data feeding into regulatory reports, financial consolidation, or AI training sets should come before rules for data that is rarely used. Impact determines priority.
- Reuse rule patterns – completeness and validity rules for similar data types follow predictable patterns. A naming convention and a rule template library mean that writing the hundredth rule takes a fraction of the time the first one did.
- Track coverage as a metric – the percentage of critical data elements covered by at least one active, monitored rule is itself a data quality KPI. It should appear on the same dashboard as pass rates.
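The coverage KPI from the last step is a simple set calculation. The list of critical data elements and the rule inventory below are assumptions made up for the example.

```python
# Hypothetical list of critical data elements.
critical_elements = {
    "customers.email_address",
    "clients.tax_identification_number",
    "transactions.account_id",
    "positions.updated_at",
}

# Hypothetical rule inventory: which asset each rule protects, and its status.
rules = [
    {"name": "CUST_EMAIL_FORMAT_001", "asset": "customers.email_address", "status": "active"},
    {"name": "CLIENT_TAX_ID_UNIQUE_001", "asset": "clients.tax_identification_number", "status": "active"},
    {"name": "TXN_ACCOUNT_REF_001", "asset": "transactions.account_id", "status": "retired"},
]

# Only active rules count toward coverage; retired ones stay in the inventory for audit.
covered = {r["asset"] for r in rules if r["status"] == "active"} & critical_elements
uncovered = sorted(critical_elements - covered)
coverage = len(covered) / len(critical_elements)

print(f"coverage: {coverage:.0%}")
print("uncovered:", uncovered)
```

The `uncovered` list is often more useful than the percentage itself, because it tells the team which rules to write next.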
Organizations that follow this path consistently report the same outcome: the first domain takes months, the second takes weeks, the third takes days. The bottleneck is never the tooling – it is the process of aligning business and technical stakeholders on what “good data” actually means. Solve that once, and the rest scales.
“The teams that struggle with Collibra are usually the ones that tried to configure the platform before defining their rules. You cannot automate governance you have not designed. We always start with a single domain, a whiteboard, and the business owner in the room – not with the tool.”
— Sebastian Chalot, Collibra Implementation Lead, Murdio
Common mistakes when defining data quality rules
Most data quality programs fail not because the technology is wrong, but because the rules themselves are poorly constructed. These are the mistakes that appear most consistently across enterprise implementations.
1. Defining rules without a business owner
A rule written by a data engineer in isolation reflects a technical judgment, not a business requirement. Without a named business owner who has confirmed the condition and accepted accountability for failures, the rule has no authority. When it fires, there is nobody to decide whether the result constitutes a real problem or acceptable noise. Every rule needs a business owner at the point of definition – not after deployment.
2. Setting thresholds at 100%
Zero tolerance sounds disciplined. In practice it is a recipe for alert fatigue. Real data in large enterprises almost never achieves 100% compliance across all rules simultaneously. When every alert is a critical failure, teams learn to ignore the alerting system entirely – and miss the genuine failures buried in the noise. Thresholds should reflect the actual tolerance of the downstream process, set through a deliberate conversation between the data owner and the business stakeholder.
3. Writing rules with no scope
“Email addresses must be valid” is not a rule – it is a principle. A rule requires a precise scope: which table, which column, under which conditions, in which environment. A rule without scope cannot be implemented unambiguously, and it cannot be audited. Two engineers given the same scopeless rule will often implement different checks.
4. Treating rules and checks as the same thing
When a team writes checks directly in a pipeline without a formal rule behind them, the check exists but the governance does not. There is no documented condition, no owner, no threshold, no approval trail. The check may be technically correct and still be completely invisible to any audit or compliance review. Rules and checks need to exist as separate, linked artifacts.
5. No defined remediation path
A rule that detects a failure and stops there has done half the job. If the response to a failure is “someone will figure it out,” the failure will either be ignored or handled inconsistently across different teams. The remediation path – quarantine, alert, escalate, reject, flag for manual review – should be defined at the time the rule is approved, not improvised when the first failure occurs.
6. Not versioning rules
Rules change. A threshold adjusted after a review, a scope expanded to cover a new data source, a condition updated to reflect a regulatory change – each of these is a material modification. Without version history, there is no way to determine whether a change in pass rate reflects a change in data quality or a change in the rule itself. Version control for rules is not optional in a governed environment.
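A version history only needs a few fields to answer that question. In the sketch below the version records, author names, and dates are invented for illustration; the idea is to check whether the rule itself changed inside the period where the pass rate moved.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class RuleVersion:
    version: int
    changed_on: date
    changed_by: str
    max_failure_rate: float
    scope: str


# Hypothetical history for one rule.
history = [
    RuleVersion(1, date(2024, 1, 15), "steward.a", 0.005, "customers WHERE status = 'active'"),
    RuleVersion(2, date(2024, 5, 2), "steward.b", 0.005, "customers"),  # scope widened
]


def changes_between(history, start, end):
    """Rule versions introduced after `start` and up to `end` (inclusive)."""
    return [v for v in history if start < v.changed_on <= end]


def explain_shift(history, start, end):
    """First question to ask when a pass rate moves: did the rule change?"""
    changed = changes_between(history, start, end)
    if changed:
        versions = ", ".join(f"v{v.version} by {v.changed_by}" for v in changed)
        return f"rule changed in this period ({versions}); compare like with like first"
    return "rule unchanged in this period; investigate the data"


print(explain_shift(history, date(2024, 4, 1), date(2024, 6, 30)))
print(explain_shift(history, date(2024, 2, 1), date(2024, 3, 31)))
```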
7. Letting rules go stale
A rule written against a data model that no longer exists will either always pass – because the column it tested has been renamed and is now untested – or always fail – because a value set has changed and the rule was never updated. Stale rules are worse than no rules, because they create a false sense of coverage. Rules need to be reviewed on a regular cadence and updated whenever the underlying data model or business process changes.
Frequently asked questions