Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best practices for ensuring data quality in batch pipelines

Greg_c
New Contributor II

Hello everyone,

I couldn't find a topic on this - what are your best practices for ensuring data quality in batch pipelines?

I've got a big pipeline processing data once per day. We thought about going with either DBT or DLT, but DLT seems more geared toward streaming than daily batches. DBT is a good fit for batches, though it adds additional complexity.

Can you share your thoughts on this and how you do it?

4 REPLIES

Alberto_Umana
Databricks Employee

Greg_c
New Contributor II

Thank you @Alberto_Umana - however, it would be great if you could share some personal experience.

saurabh18cs
Valued Contributor III

How about adding data profiling or data consistency checks to your batch pipeline?

A few examples:

from pyspark.sql.functions import col, count, lower, when

# Data profiling
df.describe().show()

# Data consistency checks

# Check for null values
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()

# Remove duplicates
df = df.dropDuplicates()

# Handle missing values
df = df.fillna({"age": 0, "email": "unknown@example.com"})

# Standardize data
df = df.withColumn("email", lower(col("email")))

# Data quality metrics
total_records = df.count()
null_email_count = df.filter(col("email").isNull()).count()
print(f"Total records: {total_records}, Null email count: {null_email_count}")

Isi
New Contributor II

Hey Greg_c,

I use DBT daily for batch data ingestion, and I believe it's a great option. However, it's important to consider that adopting DBT introduces additional complexity, and the team should carefully evaluate the impact of adding a new tool to their development process.

If you are already satisfied with your current ingestion tool and your main concern is ensuring data quality, I would recommend a few key approaches. From what I understand, the issue arises when your batch process runs the entire pipeline from raw to gold, and only at the BI layer do you realize the data is incorrect, forcing a reprocessing effort.

To prevent such issues, you could implement:

1. Constraints at the table level: Ensure that if data doesn't meet specific conditions (e.g., values below a threshold or unexpected nulls), the ingestion fails, preventing bad data from propagating (see the sketch after this list).

2. SQL Alerts: Set up alerts that notify you if incorrect values appear in your data, enabling proactive intervention.

3. Lakehouse monitoring dashboards: Use dashboards to monitor data quality both as snapshots and as time-series trends, which can help identify anomalies over time.

4. Data validation at the source: The most effective approach is to implement validation controls before loading data into the data lake, ensuring data integrity from the start.
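To illustrate point 1, Delta tables support declared constraints, so any write that violates them fails before bad data lands. A hedged sketch (the table, column, and constraint names are made up):

# Hypothetical Delta table constraints; violating writes will fail
spark.sql("ALTER TABLE silver.orders ALTER COLUMN order_id SET NOT NULL")
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT non_negative_amount CHECK (amount >= 0)")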

Regarding DLT (Delta Live Tables), I believe it can also be used for batch processes depending on the configurations you set up. However, in my experience, it tends to be more expensive compared to using open-source solutions or the options mentioned above.
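For reference, if you do evaluate DLT for a daily batch, its expectations are the built-in data quality mechanism, and the pipeline can run in triggered mode rather than continuously. A minimal sketch (table names and rules are illustrative):

import dlt

# Illustrative DLT table with expectations; in a triggered pipeline this runs once per update
@dlt.table(name="orders_clean")
@dlt.expect_or_drop("non_null_order_id", "order_id IS NOT NULL")
@dlt.expect_or_fail("non_negative_amount", "amount >= 0")
def orders_clean():
    return spark.read.table("bronze.orders")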

There are numerous ways to manage data quality in batch pipelines, but a proactive approach with source validation and monitoring is often the most effective.

If you find this answer helpful, feel free to mark it as resolved or give it a 👍!

🙂
