Observable Data Quality with Elementary and DataHub

Tom Swann
Inside Business Insider
Jul 27, 2023 · 6 min read


We on the Insider Data team are big fans of dbt and our engineers, analysts, and scientists use it extensively to handle data transformation within our pipelines.

Regardless of the specific technologies used, all ELT (Extract Load Transform) data pipelines tend to follow a similar, well-worn pattern. Data is firstly extracted from various data sources — such as an external database, API, or file-based storage system.

This data is then loaded into internal storage, in the raw, where it is modified through a sequence of transformations — cleansing, normalizing, joining, and modeling it for later use in reports, dashboards, data science, or enrichment in user-facing applications.

There are a number of different architectural concerns at play here: data discovery, acquisition, ingestion, and transformation. The last of these is the role of dbt; it is very much the ‘T’ in ELT.

Diagram showing the order of operations for a dbt-based ELT pipeline. Run each dbt model in order and persist the results to Snowflake.
Fig. 1 — Example ELT pipeline with dbt transformations

Part of the reason for dbt’s popularity is that it leans on SQL templates as the means by which users define their transformations — an approach that a wide range of people are comfortable with.

The ‘template’ part comes from Jinja, which allows for a generous dash of creativity when it comes to developing SQL transformations that can be easily generalized, parameterized, and thus made reusable.
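To make that concrete, here is a minimal sketch of a parameterized dbt model. The model, source, and column names are invented for illustration, and count_if is a Snowflake-specific function:

-- models/daily_event_counts.sql (illustrative model, not from our codebase)
{% set event_types = ['page_view', 'click', 'share'] %}

select
    event_date
    {% for event_type in event_types %}
    , count_if(event_name = '{{ event_type }}') as {{ event_type }}_count
    {% endfor %}
from {{ source('raw_events', 'events') }}
group by event_date

Because the column list is generated from a single variable, adding a new event type is a one-line change; this is exactly the kind of reuse that plain SQL on its own makes awkward.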

Testing Data Pipelines at “Batch Time”

All software architecture ultimately comes down to a trade-off between the simplest way to solve the problem here and now, and anticipating what is likely to change — and designing accordingly.

It is the art of just-enough-thinking-ahead such that we are able to adapt to change cost effectively whilst avoiding falling into the trap of over-engineering. This is not easy, and it requires tooling and process to support it.

Engineering teams have adopted many practices to help insulate themselves from the impact of change, such as continuous integration and automated testing.

In recognising that requirements and context will shift underneath us, having automated unit, integration, and performance tests gives us the confidence to make changes. To do otherwise is to operate without a safety net in place!

In data-intensive systems, it is not only the source code, schemas, and infrastructure that change, but also the data itself. This presents an additional axis of change which data engineering teams need to handle and ideally observe at “batch time” — the time when the pipeline is actually scheduled to run in production.

For example, a compile-time CI pipeline task could determine whether any schemas have been modified, or it could run unit tests to check that a function returns expected outputs. However, if the inputs to a live system fail or change in unpredictable ways, that creates a gap in the resilience of the process that CI checks alone cannot account for.

Enter dbt test

In addition to intuitive SQL-based transformations, dbt has another nice feature which addresses this issue of batch time data observability — dbt test.

A dbt test is a data quality check which executes against a data pipeline at run time.

You can either use the out-of-the-box tests which dbt provides, or make use of third-party test libraries provided by packages such as dbt-expectations and dbt-utils. It is also possible to write completely bespoke tests in SQL, in a manner similar to writing transformations, using all the power of Jinja templating.
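As an illustration of what the built-in and package-provided tests look like side by side (the model and column names below are invented, and the package tests assume dbt-utils and dbt-expectations have been added to packages.yml), a schema file might contain:

# models/schema.yml (illustrative)
version: 2

models:
  - name: daily_event_counts
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1
    columns:
      - name: event_date
        tests:
          - not_null
          - unique
      - name: page_view_count
        tests:
          - dbt_utils.accepted_range:
              min_value: 0

When dbt test runs, each entry is compiled into a SQL query against the model, and any query that returns rows is reported as a failure.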

Let’s look at our data pipeline when modified to include a dbt test step:

Modified dbt data pipeline which runs dbt test after the model has been called.
Fig. 2 — Adding run-time data quality tests

The dbt test step will execute all of the tests we have defined for our model after it runs and the Snowflake tables have been populated with new data.

Below is an example of a simple test which checks that new rows exist for a given run of the pipeline (run_date is a variable passed in to the dbt task instance from Airflow):

tests:
  - rows_exist_for_date:
      date_field: 'EVENT_DATE'
      date_value: "{{ var('run_date') }}"  # run_date is supplied by the Airflow task
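The rows_exist_for_date entry above is a custom generic test. A minimal sketch of how such a test could be defined (our production version may differ in detail) looks like this:

-- tests/generic/rows_exist_for_date.sql (illustrative definition)
{% test rows_exist_for_date(model, date_field, date_value) %}

-- dbt treats any returned row as a failure, so this query returns
-- a single row only when no records exist for the given date
select count(*) as row_count
from {{ model }}
where {{ date_field }} = '{{ date_value }}'
having count(*) = 0

{% endtest %}

On the Airflow side, the variable can be supplied when the task shells out to dbt, for example with something along the lines of dbt test --vars '{"run_date": "2023-07-26"}'.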

Elementary Data

dbt test is great, but it lacks some of the features required of a more comprehensive monitoring and observability solution — features like tracking results, anomaly detection, alerts and notifications.

This is where Elementary comes in!

There are a few things I particularly love about Elementary:

  • It has high-quality documentation which is well organized, easy to understand, and focused on clearly explaining the core use cases of the tool and getting new users up and running quickly.
  • It extends dbt run and dbt test in a transparent way. Elementary is itself a dbt package and integrates with the execution of dbt run and dbt test using hooks. This means the deployment is ‘non-invasive’: you don’t have to modify existing pipelines to get Elementary metrics captured ‘for free’ if they already execute those commands.
  • The only thing you need to decide is where and when to call the edr monitor command, which determines how frequently, and to which destination (email or a Slack channel — see below), the alerts it generates are sent. A sketch of this setup follows below.
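For a rough idea of what that setup involves, the sketch below shows the package entry that pulls Elementary into the dbt project and the commands that follow. The version range is only an example, and the exact flags and configuration for alert destinations are covered in Elementary’s documentation:

# packages.yml: add the Elementary dbt package (version range is an example)
packages:
  - package: elementary-data/elementary
    version: [">=0.9.0", "<1.0.0"]

# one-off setup: install the package and build Elementary's own models
dbt deps
dbt run --select elementary

# after each pipeline run: scan the collected results and send any alerts
edr monitor

The Slack webhook or token is supplied through Elementary’s own configuration (or command-line flags) rather than anything in the pipeline code, which is what keeps the existing dbt models untouched.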

Now our pipeline includes monitoring, achieved through dbt hooks (so no modifications to the Airflow job or dbt models were required!), and alert generation via the addition of the new Elementary component:

Pipeline now calls edr monitor after the tests have been run. This sends alerts to Slack on failure.
Fig. 3 — Adding transparent monitoring with Elementary

Note that Elementary also stores all of its metrics in the data warehouse (Snowflake in our case) within its own ELEMENTARY schemas.
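Because these are ordinary warehouse tables, they can be queried directly. A hypothetical query over the elementary_test_results model (the database and schema names depend on your dbt target, and the column names shown follow the Elementary package’s documentation) might look like:

-- Recent non-passing tests, read straight from the warehouse (names are illustrative)
select
    test_name,
    table_name,
    status,
    detected_at
from analytics.elementary.elementary_test_results
where status != 'pass'
  and detected_at >= dateadd(day, -7, current_timestamp())
order by detected_at desc;

This is handy for ad-hoc digging into failures beyond what the alert message itself contains.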

Putting it all together

Instrumenting data flows with dbt test and Elementary provides detailed diagnostic information that helps us rule out the lines of investigation that lead down the wrong path, and so get to the root cause of problems more efficiently.

Triggering alerts when test conditions fail means that the engineering team receives this diagnostic information in a Slack channel at the moment the pipeline fails. This puts us in a position to react to problems closer to when they occur and to get ahead of issues before our users notice them.

A final piece of the puzzle is how we can use the evidence of this rigorous data quality testing to give our community of data consumers better visibility of which data sets are in good condition.

DataHub

DataHub is an open-source metadata platform integrating features such as a data catalog, data pipeline lineage graphs and — crucially for this use case — observability into the individual pipeline tools, including data quality metrics from dbt.

Whilst we proactively direct notifications to our engineers using Elementary, by pushing our dbt test results to DataHub we can also give analysts, business stakeholders, and other interested parties access to full historical information on a dataset’s quality over its entire lifetime.

This includes version information for the dbt model, which is very useful for tracing periods of test failure and low quality back to a specific model deployment.

Our final pipeline, with a full end-to-end data quality monitoring solution in place, then looks something like this:

Our final version of the pipeline also sends dbt test results to DataHub. This allows analysts to have visibility of data quality metrics.
Fig. 4 — Integrating DataHub metadata collection

datahub ingest is a configuration-driven CLI tool which pushes the test results to the DataHub API endpoint.
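A minimal recipe for the dbt source might look something like the following; the paths, server address, and exact config field names (which have changed between DataHub releases) should be checked against the current integration docs:

# dbt_recipe.yml: illustrative DataHub ingestion recipe
source:
  type: dbt
  config:
    manifest_path: target/manifest.json
    catalog_path: target/catalog.json
    test_results_path: target/run_results.json   # surfaces dbt test outcomes as assertions
    target_platform: snowflake
sink:
  type: datahub-rest
  config:
    server: http://datahub-gms.internal:8080     # placeholder endpoint

Running datahub ingest -c dbt_recipe.yml after each pipeline run then pushes the latest lineage and test results into DataHub.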

Fig. 5 — Viewing data quality metrics in DataHub

The results of our dbt tests will now be visible for each dataset in the DataHub UI, allowing downstream consumers of our processes to see not only the full pipeline lineage, but also the fact that a series of quality control rules have been applied (and are hopefully passing!).

That’s all for our foray into data pipeline quality and metadata collection — hopefully you found some inspiration to consider for your own projects!

Additional Resources

Below are some resources which I have found very helpful in thinking about this problem space.

  • Integrating dbt + Airflow: Our architecture relies on Airflow (scheduling) and dbt (transformations) working together in close concert. This article contains some useful documentation from dbt on the topic
  • Elementary Tutorial — A really useful walkthrough on getting started with the elementary dbt extensions
  • DataHub Integrations Catalog — a key consideration for metadata tools is obviously “Does it integrate with my stuff?” The answer to that question is here.
