Snowflake Custom Technical Lineage for Collibra (Case Study Included)

July 22, 2025

In data governance, one of the key elements is being able to track data as it moves across systems and gets transformed, so that everyone can be certain they’re using the right data. Here’s how technical lineage works in Snowflake and how to use it with your Collibra data catalog.

Data lineage vs. technical lineage

Let’s start with the basics and clear up some definitions around Snowflake and technical lineage, as the different types of lineage often get confused.

(We talk about types of lineage in detail in our article “Data catalog vs. data lineage.”) 

The terms data lineage and technical lineage are closely related, but they refer to slightly different scopes of metadata tracing within a data ecosystem.

Here’s the difference in a nutshell:

Data lineage

Data lineage refers to the life cycle of data – how it moves through the system from origin to destination, including transformations, processes, and business logic applied along the way.

The focus of data lineage includes:

  • Business-level understanding of data flow
  • Where the data came from (data sources)
  • How it’s transformed (ETL/ELT logic, data quality rules)
  • Where it’s going (reports, dashboards, models)

Technical lineage

Technical lineage is a subset of data lineage and dives into the underlying technical components. In other words, it shows how data flows across tables, columns, files, pipelines, scripts, and job executions at a more granular and system-specific level.

It focuses on:

  • Column-level or table-level tracing
  • Code-level dependencies (e.g., SQL joins, transformation scripts)
  • ETL pipeline execution details
  • Tools, systems, and infrastructure involved
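
To make column-level tracing concrete, here’s a small, hypothetical Snowflake SQL transformation (all table and column names are made up for illustration). From a statement like this, technical lineage can record that each column of the target table derives from specific columns of the source tables:

  -- Hypothetical transformation: every target column can be traced back
  -- to specific source columns, which is what column-level lineage captures
  CREATE OR REPLACE TABLE analytics.customer_revenue AS
  SELECT
      c.customer_id,                     -- from raw.customers.customer_id
      c.email,                           -- from raw.customers.email
      SUM(o.amount)     AS total_amount, -- from raw.orders.amount (aggregated)
      COUNT(o.order_id) AS order_count   -- from raw.orders.order_id
  FROM raw.customers c
  JOIN raw.orders o ON o.customer_id = c.customer_id
  GROUP BY c.customer_id, c.email;

Table-level lineage would only note that analytics.customer_revenue depends on raw.customers and raw.orders; column-level lineage keeps the per-column mapping shown in the comments.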

What’s ETL in lineage?

The ETL acronym stands for Extract, Transform, Load. It’s a process used to integrate and move data from multiple sources into a single destination, convert it into a usable format, and store it in a central repository like a data warehouse.

Here’s a very simple breakdown of each stage:

  • Extract: Data is gathered from various sources, which can include databases, applications, flat files, and more. 
  • Transform: The extracted data is cleaned, formatted, and restructured to meet the needs of the target system. This might involve filtering, sorting, aggregating, joining, deduplicating, and validating data. 
  • Load: The transformed data is loaded into the target system, such as a data warehouse, data lake, or other target database. 
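
As a rough sketch of these stages in Snowflake SQL (an ELT-style variant, with a hypothetical stage and hypothetical table and column names):

  -- Extract: land raw CSV files from an external stage into a staging table
  COPY INTO staging.orders_raw
  FROM @orders_stage/daily/
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

  -- Transform and Load: type-cast, deduplicate, and filter on the way into the warehouse table
  INSERT INTO warehouse.orders (order_id, customer_id, order_date, amount)
  SELECT DISTINCT
      order_id,
      customer_id,
      TRY_TO_DATE(order_date) AS order_date,
      TRY_TO_NUMBER(amount)   AS amount
  FROM staging.orders_raw
  WHERE order_id IS NOT NULL;

Statements like these are exactly what lineage tools read to reconstruct the flow: the COPY shows where data enters, and the INSERT ... SELECT shows how it is transformed on its way to the destination.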

Technical lineage in Snowflake 

Now, with definitions out of the way, let’s take a look at Snowflake and technical lineage. 

Or rather, let’s clarify one more definition first.

Snowflake is a cloud-based data platform providing a data warehouse as a service (DWaaS). It allows businesses to store, manage, analyze, and share data, offering a scalable and efficient solution for different data workloads. 

It’s known for its unique architecture that separates storage and compute, enabling independent scaling and high concurrency. And it uses standard SQL for data querying and manipulation, making it accessible to a wide range of users and tools. 

Snowflake also offers lineage, allowing you to track data flow and transformations within your Snowflake environment and providing insights into how data is created, modified, and consumed. It helps with data governance, impact analysis, debugging, and documentation by visualizing the relationships between Snowflake objects. 
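
If you want to peek at some of what Snowflake tracks out of the box, one place to look is the SNOWFLAKE.ACCOUNT_USAGE share. A minimal sketch, assuming your role can read it:

  -- Object-to-object dependencies, e.g. a view and the tables it reads from
  SELECT
      referencing_database || '.' || referencing_schema || '.' || referencing_object_name AS downstream_object,
      referenced_database  || '.' || referenced_schema  || '.' || referenced_object_name  AS upstream_object
  FROM snowflake.account_usage.object_dependencies;

The ACCESS_HISTORY view in the same share (available in higher Snowflake editions) goes further, recording which objects and columns each query read and wrote, which is the raw material for column-level lineage.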

Lineage is obviously crucial for data analytics and reporting, as it supports accurate insights and decision-making. With a clear view of data sources and transformations, analysts can trust the data they work with. 

Snowflake and Collibra working together

Collibra provides complete visibility and transparency of all the data stored in Snowflake Data Cloud. Technical lineage, including column-level lineage and transformations, gives you a complete picture of how data transforms, flows, and is used from system to system.

End-to-end lineage is, more often than not, a very complex process that involves connecting to BI and ETL tools, as well as SQL data sources. Collibra offers automatic stitching capabilities for most of its technical lineage sources, creating a connection between the data objects in your technical lineage and the relevant assets in the Collibra Data Catalog. 

To see how Collibra and Snowflake work together in action, take a look at this presentation from The Data Intelligence Conference.

Snowflake custom technical lineage for Collibra: A case study

In one of our latest projects for a leading company in the pharmaceutical industry, we worked on applying Snowflake’s custom technical lineage in Collibra. 

The goal

We needed to bring information about data objects, attributes, and ETL processes from Snowflake into the Collibra data catalog. We also wanted to establish and streamline the flow of information between the data objects.

Translated into business goals, this is really important for:

  • Generating reports to know which sources to check
  • Optimizing operations by clearly seeing how time-consuming data processes are, so they can be made more efficient
  • Having full knowledge of all the systems that take part in data processing
  • Making sure data is safe and secure, tracking the flow of sensitive data, and making it possible to identify potential vulnerabilities – essential for regulatory compliance
  • Enhancing trust in the data by understanding the source and target objects and columns

How we did it

In general, there are several data ingestion methods when it comes to technical lineage:

  1. SQL-based ingestion method. The easiest, but also the least robust lineage, with no information about dependencies.
  2. SQL-API ingestion method. A more robust method, but still not covering all dependencies and all ETL operations. (Collibra is still working to improve it.)
  3. Shared storage, requiring you to place your SQL files in your Shared Storage connection folder. This can give you very precise lineage, but it’s the most time-consuming of all the methods (see the sample file after this list).
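
For the shared storage method, the SQL files are simply the statements that create or load your objects. Here’s a hypothetical example of a file you might drop into the Shared Storage connection folder (the file and object names are made up):

  -- load_customer_revenue.sql, placed in the Shared Storage connection folder
  MERGE INTO analytics.customer_revenue AS tgt
  USING (
      SELECT customer_id, SUM(amount) AS total_amount
      FROM raw.orders
      GROUP BY customer_id
  ) AS src
  ON tgt.customer_id = src.customer_id
  WHEN MATCHED THEN UPDATE SET tgt.total_amount = src.total_amount
  WHEN NOT MATCHED THEN INSERT (customer_id, total_amount)
      VALUES (src.customer_id, src.total_amount);

Because the full statement is parsed, this method can capture dependencies and ETL operations (like the MERGE above) that the other methods may miss – which is also why it takes the most time to prepare.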

You can use a mix of ingestion methods, and what we ended up doing was combining the SQL-API method with the shared storage method.

But it wasn’t smooth sailing at first.

At first, we ran into issues with data parsing in Collibra when using the SQL-API method. We reached out to Collibra, and they responded quickly to make the necessary updates in the system.

Collibra also shared Python scripts on their marketplace for the shared storage method. You collect data object information in a CSV file, and then the script parses it and adds it to the Edge connector. 

What then happens in Collibra is stitching, connecting the data objects in your technical lineage to the relevant assets in the Collibra Data Catalog. As data attributes are mapped in the catalog, you gain insight into how data flows and a complete understanding of the data landscape and all essential metadata.

The result

Using a combination of the two methods – SQL-API and shared storage – we managed to process 65k queries for the client in just 3 hours.

The Snowflake custom technical lineage in Collibra turned out to be a way to optimize business processes, with no need to ever log in to Snowflake and with all metadata flows easily visible in the data catalog as a logical schema.

And while Collibra experts are currently working on making the ingestion methods more robust and easier to apply, we’ve managed to devise a way to apply Snowflake’s technical lineage in Collibra more efficiently at scale. 

Which, by the way, is a huge advantage of working with external Collibra experts, who know what to ask for and where to improve. We’ve also submitted several ideas on Collibra’s Ideation Platform that are meant to make this process less complex in the near future.

The key is up-to-date lineage information

As data sources and business processes change, it’s key to regularly update your technical lineage information in your data catalog. This is exactly why revisiting the process is so important as you introduce new data sources and modify existing ones.

It’s how you can make sure that stakeholders have access to reliable and timely data lineage information, understand the impact of data changes across the organization, and can make informed decisions based on that data.

If you need help connecting your Snowflake lineage to your Collibra data catalog, reach out to our experts – we’ll find the best way to set up the process for your organization’s needs.
