Data quality testing cheat sheet: 10 effective data quality checks

In the world of data, these checks are your first line of defense against common data quality issues. Costly business mistakes, skewed analytics, and failed AI projects often stem from inaccurate data and a failure to maintain data integrity.

While understanding concepts like data quality frameworks and metrics is crucial, this guide is a hands-on manual focused on one thing: the specific, practical data quality checks you can implement today. We’ll provide the ‘what,’ ‘why,’ and ‘how’ for each essential check, complete with code you can adapt for your own projects.

We’ll cover the most common types of checks with real-world examples to help you find and fix data errors before they cause damage. If you need a refresher on the fundamentals of data quality, we’ll link you to our dedicated guides along the way, but for now, let’s get our hands dirty and start building trust in your data.

Related guide: What is Data Quality?

Key takeaways

  • Begin by implementing foundational checks like Null, Uniqueness, and Format tests to find the most common data quality issues.
  • Leverage the copy-paste SQL and Python snippets in this guide for immediate, hands-on data quality testing.
  • Implement data checks at critical stages – Ingestion, Transformation, and Serving – to catch errors early and prevent data problems.
  • Move from manual, ad-hoc checks to an automated system using a data governance platform to manage your quality rules effectively.

The anatomy of a data quality check

Before you start writing code, it’s crucial to understand what you’re looking for. This often starts with initial data profiling to get a high-level overview of your dataset. A data quality check isn’t a random test; it’s a targeted question you ask to verify your data adheres to defined quality standards.

This section provides the foundational context you need to perform checks that are both effective and efficient.

Understanding data quality dimensions: what to check for

Every data quality check is designed to test against a core standard or “dimension” of quality.

Think of these as the categories for your tests. While the theory runs deep, you only need to know the basics to get started:

  • Accuracy: Is the data correct and true to its source?
  • Completeness: Is all required data present, or are you dealing with missing data?
  • Uniqueness: Are there any duplicate records that could inflate your numbers?
  • Validity: Does the data follow the rules and formatting it’s supposed to?
  • Timeliness: Is the data fresh enough to be useful for decision-making?

These dimensions give your checks a clear purpose. Instead of just “checking the data,” you’re asking a specific question like, “Is the order_date column valid?” or “Is the customer_id field unique?”

To explore the theory and specific metrics behind each dimension, see our detailed guide.

When and where to perform checks: a data lifecycle view

Running the right check at the wrong time is ineffective. To maximize your impact, it’s best to align your checks with the data’s journey through your systems. This ensures you catch errors at the most logical and cost-effective point.

At ingestion
Purpose & importance: Act as the first gatekeeper. The goal is to prevent data errors at the source, stopping malformed or unexpected data from ever entering your data warehouse.
Common checks performed:
  • Schema validation
  • File format & type checks
  • Column count & naming

During transformation
Purpose & importance: Ensure the accuracy of business logic. This is where most detailed checks occur, after data has been cleaned, joined, and modeled.
Common checks performed:
  • Uniqueness & referential integrity
  • Business rule validation
  • Range & set checks

Before serving
Purpose & importance: Act as the final quality gate. This protects BI dashboards, ML models, and end-users from consuming inaccurate or stale information.
Common checks performed:
  • Data freshness/timeliness checks
  • Aggregate & anomaly checks
  • Final completeness checks

A playbook of data quality testing methods: 10 essential checks

This is the heart of the guide – a practical playbook of the ten most critical and effective data quality checks.

Each entry follows a simple What, Why, and How format, complete with a copy-paste code snippet in either SQL or Python with Pandas.

Use these examples as a starting point to build a robust testing suite for your most important data assets.

1. The null check (completeness)

What: This is the most fundamental check. It scans a column for NULL or empty values to ensure no critical information is missing.

Why: A missing email prevents you from contacting a customer; a missing sales figure breaks your financial reports. Ensuring the completeness of essential data is the first step toward building trust.

How (SQL): This query counts the number of rows in your orders table where the customer_id – a critical field for linking to customer information – is missing.

SQL

-- Find rows where a critical column is empty
SELECT count(*)
FROM orders
WHERE customer_id IS NULL;

2. The uniqueness check (uniqueness)

What: This check verifies that all values in a column that should be unique, like a primary key or an order ID, have no duplicates.

Why: Dealing with duplicate data can severely skew your metrics, leading to inflated customer counts, inaccurate sales figures, and broken downstream processes that rely on a unique identifier for joins.

How (SQL): This query groups your orders table by order_id and then uses a HAVING clause to find any IDs that appear more than once.

SQL

-- Find duplicate order IDs that should be unique
SELECT
  order_id,
  count(*)
FROM orders
GROUP BY order_id
HAVING count(*) > 1;

3. The format check with regex (validity)

What: This check uses a regular expression (RegEx) to ensure that data in a field conforms to a standard pattern, like an email address or a phone number.

Why: Data that isn’t in the correct format is often unusable. An email address without an “@” symbol is just a useless string of text, and a date in the wrong format can cause your application or analytics query to fail.

How (Python with Pandas): This snippet finds all rows in a DataFrame where the email column doesn’t match the standard local-part@domain email format.

Python

# Find email addresses that DON'T match the standard format
invalid_emails = df[~df['email'].str.contains(r'[^@]+@[^@]+\.[^@]+', na=False)]
print(invalid_emails)

4. The referential integrity check (consistency)

What: This check ensures that a value in one table (a foreign key) has a corresponding, valid entry in another table. For example, every product_id in your sales table must exist in your products table.

Why: This check is essential for data integrity and data consistency between tables. When this is broken, you get “orphaned” records and inconsistent data, which can lead to failed transactions and confusing analytics.

How (SQL): This query finds all sales records where the product_id doesn’t have a matching id in the products table.

SQL

-- Find sales records with no matching product in the products table
SELECT *
FROM sales
WHERE product_id NOT IN (SELECT id FROM products);

5. The freshness check (timeliness)

What: A freshness check verifies that your data is up-to-date by checking the most recent timestamp in a table or data source.

Why: Decisions based on stale data are unreliable and dangerous. A marketing team looking at last week’s customer data or a trading algorithm using 15-minute-old prices will make poor, ineffective choices.

How (SQL): This query retrieves the timestamp of the most recent event from your user_events table. In an automated workflow, you would compare this value to the current time to see if it’s within your acceptable threshold (e.g., “updated within the last hour”).

SQL

-- Check the latest timestamp in a table to ensure it's recent
SELECT max(event_timestamp) AS last_updated
FROM user_events;
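
To automate the comparison described above, a small script can flag the table as stale whenever the latest timestamp falls outside your threshold. Here is a minimal Python sketch; the one-hour threshold and the sample last_updated value (which you would normally fetch with the query above) are illustrative assumptions.

Python

from datetime import datetime, timedelta, timezone

# Illustrative freshness threshold: adjust to your own SLA
MAX_AGE = timedelta(hours=1)

def is_fresh(last_updated: datetime, max_age: timedelta = MAX_AGE) -> bool:
    """Return True if the latest timestamp is within the acceptable age."""
    return datetime.now(timezone.utc) - last_updated <= max_age

# Example value standing in for the result of the SQL query above
last_updated = datetime.now(timezone.utc) - timedelta(minutes=30)

if not is_fresh(last_updated):
    print("ALERT: user_events has not been updated within the last hour")
else:
    print("Freshness check passed")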

6. Ensure data accuracy: the range check

What: This check verifies that values in a numerical column fall within an expected and logical range.

Why: A range check is a simple but powerful way to spot obvious errors, typos, or data corruption. An age of 150, a discount_percentage over 100, or a negative item_quantity are all clear indicators of flawed data accuracy.

How (SQL): This query finds any products in your products table where the price is either negative or an unreasonably high number (e.g., over $10,000 for a standard retail item).

SQL

-- Find products with prices outside the logical range of $0 to $10,000
SELECT product_name, price
FROM products
WHERE price < 0 OR price > 10000;

7. The set check (validity)

What: This check ensures that values in a categorical column are from a predefined list of accepted values.

Why: This is crucial for maintaining consistency in fields that drive business logic, like order_status or customer_segment. An unexpected value like “Shippedd” (a typo) can cause the record to be excluded from reports, leading to inaccurate totals.

Real-world example: “For a Swiss bank we worked with, defining the accepted values for sensitive data elements was the first step toward building a trustworthy data catalog. This simple check prevented invalid data from ever entering their critical reports.” See the full case study here: Management and Cataloging Sensitive, Critical Data Elements in a Swiss Bank

How (SQL): This query finds any orders where the status is not one of the three approved values: ‘shipped’, ‘pending’, or ‘returned’.

SQL

-- Find orders with a status that isn't in the approved set of values
SELECT order_id, status
FROM orders
WHERE status NOT IN ('shipped', 'pending', 'returned');

8. Monitoring data processing: the row count anomaly check

What: Instead of checking individual values, this check monitors the total number of rows in a table after a data load and compares it to a historical average or an expected count.

Why: A sudden, drastic drop in row count (e.g., your daily sales table suddenly has 90% fewer rows than usual) is a strong signal that an upstream data source failed or the data processing job was interrupted. This check acts as a high-level “smoke test” for your data pipelines.

How (Conceptual): This is typically done with a data observability tool or a custom script. The logic is: IF todays_row_count < (average_daily_row_count * 0.5) THEN send_alert.
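
To make that logic concrete, here is a minimal Python sketch using pandas. The daily_row_counts history and the 50% threshold are assumptions for illustration; in practice you would pull the counts from your pipeline’s audit log or warehouse metadata.

Python

import pandas as pd

# Hypothetical history of daily row counts from a load-audit log
daily_row_counts = pd.Series([10_250, 9_980, 10_410, 10_120, 9_875])
todays_row_count = 4_300  # today's load

average_daily_row_count = daily_row_counts.mean()

# Alert if today's volume drops below 50% of the recent average
if todays_row_count < average_daily_row_count * 0.5:
    print(f"ALERT: row count anomaly. Got {todays_row_count}, "
          f"expected around {average_daily_row_count:.0f}.")
else:
    print("Row count within the expected range")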

9. The schema change check (consistency)

What: This check verifies that the structure – or schema – of your data hasn’t changed unexpectedly. It looks for changes like a column being renamed, a data type being altered (e.g., from INTEGER to STRING), or a column being dropped entirely.

Why: Unannounced schema changes are a primary cause of broken dashboards and failed data models. A report that expects a column named revenue will fail instantly if a well-meaning engineer renames it to total_revenue.

How (Conceptual): This check is usually automated. A script or tool takes a snapshot of the schema (column_name, data_type) and compares it against the previous day’s snapshot. If there’s a difference, it triggers an alert.
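
As an illustration, the minimal Python sketch below compares two schema snapshots represented as simple column-to-type mappings. The snapshot values are made up; in a real setup you would read them from your warehouse’s information_schema (or equivalent) and persist the previous day’s snapshot.

Python

# Hypothetical schema snapshots: {column_name: data_type}
yesterday = {"order_id": "INTEGER", "revenue": "NUMERIC", "created_at": "TIMESTAMP"}
today = {"order_id": "INTEGER", "total_revenue": "NUMERIC", "created_at": "DATE"}

dropped = set(yesterday) - set(today)
added = set(today) - set(yesterday)
retyped = {col for col in set(yesterday) & set(today) if yesterday[col] != today[col]}

if dropped or added or retyped:
    print("ALERT: schema change detected")
    print(f"  Dropped columns: {sorted(dropped)}")
    print(f"  Added columns:   {sorted(added)}")
    print(f"  Type changes:    {sorted(retyped)}")
else:
    print("Schema unchanged")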

10. The business logic check (accuracy)

What: This is a more advanced check that validates a custom business rule based on relationships between different columns.

Why: This check ensures your data makes logical sense according to your business’s rules. For example, a customer’s signup_date should never be after their first_purchase_date, or for a given order, the total_price should always equal item_price * quantity.

How (SQL): This query finds any orders where the recorded total_price doesn’t match the calculated total based on the item price and quantity, indicating a potential data entry or processing error.

SQL

-- Find orders where the total price doesn't match the calculated total
SELECT
  order_id,
  total_price,
  (item_price * quantity) AS calculated_total
FROM order_items
WHERE total_price != (item_price * quantity);

Automating data quality: from manual checks to effective data quality management

Running these checks manually is a great start, but it doesn’t scale. To truly improve data quality for the long term, you need to operationalize your checks. This means embedding your checks into an automated, governed workflow that catches errors systematically, freeing your team to focus on innovation instead of firefighting. This section covers the best practices and tools to help you make that transition.

Best practices for implementing checks

An effective automation strategy is about more than just scheduling scripts; it’s a disciplined approach. It begins with ruthless prioritization. Instead of trying to check everything, focus on your most critical data assets – the tables that feed your most important dashboards or support regulatory reporting. A few well-placed checks on essential data will deliver more value than a hundred checks on low-impact tables.

Furthermore, every check you implement must be actionable. A test that fails silently is a complete failure. Ensure each check is connected to a notification system that alerts the specific person or data team responsible for that data asset. Finally, treat your checks as living documentation. Anyone should be able to understand what business rule a check enforces and why it matters, making the entire system easier to maintain and troubleshoot.

To learn how to structure this at scale, explore our complete guide: How to Build a Data Quality Framework

Choosing your data quality tools: a guide for the data team

While you can build a system with custom scripts, specialized tools can accelerate your efforts and provide more robust capabilities. For maximum control, data engineering teams often prefer in-code libraries like Great Expectations, which allow them to define sophisticated checks directly within their data pipelines. Teams already working in dbt can leverage its native transformation layer tools to write tests that validate data as it’s being modeled.

Hint: dbt (Data Build Tool) is a popular open-source tool that lets data teams transform data in their warehouse using simple SQL. Its built-in testing feature is powerful because it allows you to write quality checks directly alongside the code that creates your data models, integrating data quality directly into the development workflow.
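
To give a taste of the in-code approach mentioned above, here is a minimal sketch using Great Expectations’ older pandas-flavored API (ge.from_pandas). Method names and workflow differ in newer releases, which use a context-based setup, so treat this as an illustrative assumption rather than a drop-in recipe.

Python

import pandas as pd
import great_expectations as ge

# Toy data with one missing customer_id and one duplicate order_id
df = pd.DataFrame({"order_id": [1, 2, 2], "customer_id": [10, None, 12]})

# Wrap the DataFrame so expectation methods become available (classic API)
gdf = ge.from_pandas(df)

null_check = gdf.expect_column_values_to_not_be_null("customer_id")
unique_check = gdf.expect_column_values_to_be_unique("order_id")

print(null_check.success)    # False: a customer_id is missing
print(unique_check.success)  # False: order_id 2 appears twice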

For those seeking a more hands-off approach, data observability platforms use machine learning to automate anomaly detection, reducing the need to write hundreds of manual rules.

For a deep-dive on how to choose your tool, read our guide: How to choose a data quality platform?

Scaling up with enterprise data governance

The scripts in this guide are powerful for finding issues. But as you scale, the real challenge emerges: How do you manage hundreds of checks across dozens of systems? How do you document these rules? How do you assign ownership? How do you ensure your AI models are being trained on certified, high-quality data?

This is where a dedicated data governance platform like Collibra becomes essential. It provides a central, enterprise-wide system for defining, managing, and monitoring your data quality rules, connecting them directly to your business glossary and data catalog.

For one global bank, establishing this level of control was key to strengthening their AI governance, ensuring their models were built on a foundation of trust. For a leading Swiss bank, it was the only way to effectively manage their most sensitive and critical data elements at scale.

A powerful platform is only as good as its implementation. At Murdio, we are expert Collibra partners who translate data governance theory into business reality. If you’re ready to move beyond ad-hoc scripts and build a truly scalable data quality system, talk to one of our Collibra experts today.

Conclusion: checks are your first line of defense

Ultimately, effective data quality checks are the fundamental building blocks of data trust. They are the practical, everyday actions that transform data quality from an abstract goal into a tangible reality. By starting with the ten essential checks in this guide, you can begin to systematically find and fix the errors that erode confidence and lead to poor decisions.

Remember, the journey doesn’t end with a single script. It evolves from manual checks to an automated, governed system that protects your data at every stage. Starting with even one or two of these checks is a massive and meaningful step toward creating a reliable data foundation, empowering your team to spend less time questioning the data and more time using it to drive innovation.

Now that you can identify and check for issues, the next step is tracking your progress. Learn how by building a visual summary of your data’s health: How to Create a Data Quality Dashboard.
