Data quality checks are your first line of defense against costly business mistakes, skewed analytics, and failed AI projects, all of which often stem from inaccurate data and a failure to maintain data integrity.
While understanding concepts like data quality frameworks and metrics is crucial, this guide is a hands-on manual focused on one thing: the specific, practical data quality checks you can implement today. We’ll provide the ‘what,’ ‘why,’ and ‘how’ for each essential check, complete with code you can adapt for your own projects.
We’ll cover the most common types of checks with real-world examples to help you find and fix data errors before they cause damage. If you need a refresher on the fundamentals of data quality, we’ll link you to our dedicated guides along the way, but for now, let’s get our hands dirty and start building trust in your data.
See our guide: What is Data Quality?
Before you start writing code, it’s crucial to understand what you’re looking for. This often starts with initial data profiling to get a high-level overview of your dataset. A data quality check isn’t a random test; it’s a targeted question you ask to verify your data adheres to defined quality standards.
This section provides the foundational context you need to perform checks that are both effective and efficient.
Every data quality check is designed to test against a core standard or “dimension” of quality.
Think of these as the categories for your tests. While the theory runs deep, you only need to know the basics to get started: completeness, uniqueness, validity, consistency, timeliness (freshness), and accuracy.
These dimensions give your checks a clear purpose. Instead of just “checking the data,” you’re asking a specific question like, “Is the order_date column valid?” or “Is the customer_id field unique?”
To explore the theory and specific metrics behind each dimension, see our detailed guide.
Running the right check at the wrong time is ineffective. To maximize your impact, it’s best to align your checks with the data’s journey through your systems. This ensures you catch errors at the most logical and cost-effective point.
Stage | Purpose & importance | Common checks performed
At ingestion | Act as the first gatekeeper. The goal is to prevent data errors at the source, stopping malformed or unexpected data from ever entering your data warehouse. | Schema change checks, format checks, null checks on critical fields
During transformation | Ensure the accuracy of business logic. This is where most detailed checks occur after data has been cleaned, joined, and modeled. | Uniqueness checks, referential integrity checks, accepted values checks, cross-column business rules
Before serving | Act as the final quality gate. This protects BI dashboards, ML models, and end-users from consuming inaccurate or stale information. | Freshness checks, row count anomaly checks, range checks on key metrics
This is the heart of the guide – a practical playbook of the ten most critical and effective data quality checks.
Each entry follows a simple What, Why, and How format, complete with a copy-paste code snippet in either SQL or Python with Pandas.
Use these examples as a starting point to build a robust testing suite for your most important data assets.
What: This is the most fundamental check. It scans a column for NULL or empty values to ensure no critical information is missing.
Why: A missing email prevents you from contacting a customer; a missing sales figure breaks your financial reports. Ensuring the completeness of essential data is the first step toward building trust.
How (SQL): This query counts the number of rows in your orders table where the customer_id – a critical field for linking to customer information – is missing.
SQL
-- Find rows where a critical column is empty
SELECT count(*)
FROM orders
WHERE customer_id IS NULL;
What: This check verifies that all values in a column that should be unique, like a primary key or an order ID, have no duplicates.
Why: Duplicate data can severely skew your metrics, leading to inflated customer counts, inaccurate sales figures, and broken downstream processes that rely on a unique identifier for joins.
How (SQL): This query groups your orders table by order_id and then uses a HAVING clause to find any IDs that appear more than once.
SQL
-- Find duplicate order IDs that should be unique
SELECT
order_id,
count(*)
FROM orders
GROUP BY order_id
HAVING count(*) > 1;
What: This check uses a regular expression (RegEx) to ensure that data in a field conforms to a standard pattern, like an email address or a phone number.
Why: Data that isn’t in the correct format is often unusable. An email address without an “@” symbol is just a useless string of text, and a date in the wrong format can cause your application or analytics query to fail.
How (Python with Pandas): This snippet finds all rows in a DataFrame where the email column doesn’t match the standard local-part@domain email format.
Python
# Find email addresses that DON'T match the standard format
# str.fullmatch requires the entire string to match the pattern
invalid_emails = df[~df['email'].str.fullmatch(r'[^@]+@[^@]+\.[^@]+', na=False)]
print(invalid_emails)
What: This check ensures that a value in one table (a foreign key) has a corresponding, valid entry in another table. For example, every product_id in your sales table must exist in your products table.
Why: This check is essential for data integrity and data consistency between tables. When this is broken, you get “orphaned” records and inconsistent data, which can lead to failed transactions and confusing analytics.
How (SQL): This query finds all sales records where the product_id doesn’t have a matching id in the products table.
SQL
-- Find sales records with no matching product in the products table
SELECT *
FROM sales
WHERE product_id NOT IN (SELECT id FROM products);
What: A freshness check verifies that your data is up-to-date by checking the most recent timestamp in a table or data source.
Why: Decisions based on stale data are unreliable and dangerous. A marketing team looking at last week’s customer data or a trading algorithm using 15-minute-old prices will make poor, ineffective choices.
How (SQL): This query retrieves the timestamp of the most recent event from your user_events table. In an automated workflow, you would compare this value to the current time to see if it’s within your acceptable threshold (e.g., “updated within the last hour”).
SQL
-- Check the latest timestamp in a table to ensure it’s recent
SELECT max(event_timestamp) AS last_updated
FROM user_events;
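In practice, a scheduler runs this query and compares the result to the current time. Below is a minimal Python sketch of that comparison, assuming the last_updated value has already been fetched and that a one-hour threshold applies (both are illustrative assumptions, not part of the query above).
Python
from datetime import datetime, timedelta, timezone

# Illustrative threshold: data must have been updated within the last hour
FRESHNESS_THRESHOLD = timedelta(hours=1)

def is_fresh(last_updated: datetime) -> bool:
    """Return True if the latest event timestamp is within the threshold."""
    return datetime.now(timezone.utc) - last_updated <= FRESHNESS_THRESHOLD

# Hypothetical value returned by the query above
last_updated = datetime(2025, 1, 1, 9, 30, tzinfo=timezone.utc)
if not is_fresh(last_updated):
    print("ALERT: user_events has not been updated within the last hour")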
What: This check verifies that values in a numerical column fall within an expected and logical range.
Why: A range check is a simple but powerful way to spot obvious errors, typos, or data corruption. An age of 150, a discount_percentage over 100, or a negative item_quantity are all clear indicators of flawed data accuracy.
How (SQL): This query finds any products in your products table where the price is either negative or an unreasonably high number (e.g., over $10,000 for a standard retail item).
SQL
-- Find products with prices outside the logical range of $0 to $10,000
SELECT product_name, price
FROM products
WHERE price < 0 OR price > 10000;
What: This check ensures that values in a categorical column are from a predefined list of accepted values.
Why: This is crucial for maintaining consistency in fields that drive business logic, like order_status or customer_segment. An unexpected value like “Shippedd” (a typo) can cause the record to be excluded from reports, leading to inaccurate totals.
Real-world example: “For a Swiss bank we worked with, defining the accepted values for sensitive data elements was the first step toward building a trustworthy data catalog. This simple check prevented invalid data from ever entering their critical reports.” See the full case study here: Management and Cataloging Sensitive, Critical Data Elements in a Swiss Bank
How (SQL): This query finds any orders where the status is not one of the three approved values: ‘shipped’, ‘pending’, or ‘returned’.
SQL
-- Find orders with a status that isn’t in the approved set of values
SELECT order_id, status
FROM orders
WHERE status NOT IN ('shipped', 'pending', 'returned');
What: Instead of checking individual values, this check monitors the total number of rows in a table after a data load and compares it to a historical average or an expected count.
Why: A sudden, drastic drop in row count (e.g., your daily sales table suddenly has 90% fewer rows than usual) is a strong signal that an upstream data source failed or the data processing job was interrupted. This check acts as a high-level “smoke test” for your data pipelines.
How (Conceptual): This is typically done with a data observability tool or a custom script. The logic is: IF todays_row_count < (average_daily_row_count * 0.5) THEN send_alert.
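To make that rule concrete, here is a minimal Python sketch, assuming you already have today’s row count and a short history of recent daily counts (the 50% threshold mirrors the rule above and should be tuned to your own tables).
Python
# Flag a load whose row count falls below 50% of the recent daily average
def row_count_ok(todays_count: int, recent_counts: list[int],
                 threshold: float = 0.5) -> bool:
    """Return True if today's row count is within the expected range."""
    average = sum(recent_counts) / len(recent_counts)
    return todays_count >= average * threshold

# Hypothetical numbers for illustration
history = [10250, 9980, 10410, 10120, 10300]  # last five daily loads
if not row_count_ok(950, history):
    print("ALERT: daily row count dropped sharply versus the recent average")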
What: This check verifies that the structure – or schema – of your data hasn’t changed unexpectedly. It looks for changes like a column being renamed, a data type being altered (e.g., from INTEGER to STRING), or a column being dropped entirely.
Why: Unannounced schema changes are a primary cause of broken dashboards and failed data models. A report that expects a column named revenue will fail instantly if a well-meaning engineer renames it to total_revenue.
How (Conceptual): This check is usually automated. A script or tool takes a snapshot of the schema (column_name, data_type) and compares it against the previous day’s snapshot. If there’s a difference, it triggers an alert.
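As an illustration, here is a minimal Python sketch of that comparison, assuming each snapshot is stored as a simple column-name-to-data-type mapping (for example, pulled from information_schema.columns).
Python
# Compare yesterday's schema snapshot with today's and report any differences
def diff_schemas(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return human-readable descriptions of schema changes."""
    changes = []
    for column in previous.keys() - current.keys():
        changes.append(f"Column dropped: {column}")
    for column in current.keys() - previous.keys():
        changes.append(f"Column added: {column}")
    for column in previous.keys() & current.keys():
        if previous[column] != current[column]:
            changes.append(f"Type changed for {column}: {previous[column]} -> {current[column]}")
    return changes

# Hypothetical snapshots for illustration
yesterday = {"order_id": "INTEGER", "revenue": "NUMERIC"}
today = {"order_id": "INTEGER", "total_revenue": "NUMERIC"}
for change in diff_schemas(yesterday, today):
    print("ALERT:", change)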
What: This is a more advanced check that validates a custom business rule based on relationships between different columns.
Why: This check ensures your data makes logical sense according to your business’s rules. For example, a customer’s signup_date should never be after their first_purchase_date, or for a given order, the total_price should always equal item_price * quantity.
How (SQL): This query finds any orders where the recorded total_price doesn’t match the calculated total based on the item price and quantity, indicating a potential data entry or processing error.
SQL
-- Find orders where the total price doesn’t match the calculated total
SELECT
order_id,
total_price,
(item_price * quantity) AS calculated_total
FROM order_items
WHERE total_price != (item_price * quantity);
Running these checks manually is a great start, but it doesn’t scale. To truly improve data quality for the long term, you need to operationalize your checks. This means embedding your checks into an automated, governed workflow that catches errors systematically, freeing your team to focus on innovation instead of firefighting. This section covers the best practices and tools to help you make that transition.
An effective automation strategy is about more than just scheduling scripts; it’s a disciplined approach. It begins with ruthless prioritization. Instead of trying to check everything, focus on your most critical data assets – the tables that feed your most important dashboards or support regulatory reporting. A few well-placed checks on essential data will deliver more value than a hundred checks on low-impact tables.
Furthermore, every check you implement must be actionable. A test that fails silently is a complete failure. Ensure each check is connected to a notification system that alerts the specific person or data team responsible for that data asset. Finally, treat your checks as living documentation. Anyone should be able to understand what business rule a check enforces and why it matters, making the entire system easier to maintain and troubleshoot.
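As a simple illustration of an actionable check, the sketch below posts a JSON alert whenever a check finds offending rows; the webhook URL and the duplicate count are hypothetical placeholders for your own notification system and query results.
Python
import json
import urllib.request

# Hypothetical endpoint - replace with your team's alerting webhook
WEBHOOK_URL = "https://hooks.example.com/data-quality-alerts"

def send_alert(check_name: str, details: str) -> None:
    """Post a small JSON payload so a failing check never fails silently."""
    payload = json.dumps({"check": check_name, "details": details}).encode("utf-8")
    request = urllib.request.Request(
        WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(request)

# Example: suppose the uniqueness query above returned 3 duplicate order IDs
duplicate_count = 3
if duplicate_count > 0:
    send_alert("orders.order_id uniqueness", f"{duplicate_count} duplicate IDs found")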
To learn how to structure this at scale, explore our complete guide: How to Build a Data Quality Framework
While you can build a system with custom scripts, specialized tools can accelerate your efforts and provide more robust capabilities. For maximum control, data engineering teams often prefer in-code libraries like Great Expectations, which allow them to define sophisticated checks directly within their data pipelines. Teams already working in dbt can leverage its native transformation layer tools to write tests that validate data as it’s being modeled.
Hint: dbt (Data Build Tool) is a popular open-source tool that lets data teams transform data in their warehouse using simple SQL. Its built-in testing feature is powerful because it allows you to write quality checks directly alongside the code that creates your data models, integrating data quality directly into the development workflow.
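Returning to the in-code approach, here is a minimal sketch using Great Expectations’ classic Pandas API (available in pre-1.0 releases; newer versions use a different, context-based entry point, so treat this as illustrative rather than canonical).
Python
import great_expectations as ge
import pandas as pd

# Hypothetical orders data for illustration
orders = pd.DataFrame({"order_id": [1, 2, 2], "customer_id": [10, None, 12]})

# Wrap the DataFrame so expectation methods become available (classic 0.x API)
ge_orders = ge.from_pandas(orders)

# Two of the checks from this guide, expressed as expectations
completeness = ge_orders.expect_column_values_to_not_be_null("customer_id")
uniqueness = ge_orders.expect_column_values_to_be_unique("order_id")

# Each result reports whether the expectation passed and which values failed
print(completeness)
print(uniqueness)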
For those seeking a more hands-off approach, data observability platforms use machine learning to automate anomaly detection, reducing the need to write hundreds of manual rules.
For a deep-dive on how to choose your tool, read our guide: How to choose a data quality platform?
The scripts in this guide are powerful for finding issues. But as you scale, the real challenge emerges: How do you manage hundreds of checks across dozens of systems? How do you document these rules? How do you assign ownership? How do you ensure your AI models are being trained on certified, high-quality data?
This is where a dedicated data governance platform like Collibra becomes essential. It provides a central, enterprise-wide system for defining, managing, and monitoring your data quality rules, connecting them directly to your business glossary and data catalog.
For one global bank, establishing this level of control was key to strengthening their AI governance, ensuring their models were built on a foundation of trust. For a leading Swiss bank, it was the only way to effectively manage their most sensitive and critical data elements at scale.
A powerful platform is only as good as its implementation. At Murdio, we are expert Collibra partners who translate data governance theory into business reality. If you’re ready to move beyond ad-hoc scripts and build a truly scalable data quality system, talk to one of our Collibra experts today.
Ultimately, effective data quality checks are the fundamental building blocks of data trust. They are the practical, everyday actions that transform data quality from an abstract goal into a tangible reality. By starting with the ten essential checks in this guide, you can begin to systematically find and fix the errors that erode confidence and lead to poor decisions.
Remember, the journey doesn’t end with a single script. It evolves from manual checks to an automated, governed system that protects your data at every stage. Starting with even one or two of these checks is a massive and meaningful step toward creating a reliable data foundation, empowering your team to spend less time questioning the data and more time using it to drive innovation.
Now that you can identify and check for issues, the next step is tracking your progress. Learn how by building a visual summary of your data’s health: How to Create a Data Quality Dashboard.