Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Best practices for ensuring data quality in batch pipelines

Greg_c
New Contributor II

Hello everyone,

I couldn't find a topic on this - what are your best practices for ensuring data quality in batch pipelines?

I've got a big pipeline processing data once per day. We thought about going with either DBT or DLT, but DLT seems more geared toward streaming than daily batches. DBT is a good fit for batches, though it adds additional complexity.

Can you share your thoughts on this and how you do it?

4 REPLIES

Alberto_Umana
Databricks Employee

Greg_c
New Contributor II

Thank you @Alberto_Umana - however, it would be great if you could share some personal experience.

saurabh18cs
Valued Contributor III

How about adding data profiling or data consistency checks to your batch pipeline?

A few examples:

from pyspark.sql.functions import col, count, lower, when

# Data profiling
df.describe().show()

# Data consistency checks

# Check for null values
null_counts = df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns])
null_counts.show()

# Remove duplicates
df = df.dropDuplicates()

# Handle missing values
df = df.fillna({"age": 0, "email": "unknown@example.com"})

# Standardize data
df = df.withColumn("email", lower(col("email")))

# Data quality metrics
total_records = df.count()
null_email_count = df.filter(col("email").isNull()).count()
print(f"Total records: {total_records}, Null email count: {null_email_count}")

Isi
New Contributor II

Hey Greg_c,

I use DBT daily for batch data ingestion, and I believe it's a great option. However, it's important to consider that adopting DBT introduces additional complexity, and the team should carefully evaluate the impact of adding a new tool to their development process.

If you are already satisfied with your current ingestion tool and your main concern is ensuring data quality, I would recommend a few key approaches. From what I understand, the issue arises when your batch process runs the entire pipeline from raw to gold, and only at the BI layer do you realize the data is incorrect, forcing a reprocessing effort.

To prevent such issues, you could implement:

1. Constraints at the table level: Ensure that if data doesn't meet specific conditions (e.g., values below a threshold or unexpected nulls), the ingestion fails, preventing bad data from propagating (see the sketch after this list).

2. SQL Alerts: Set up alerts that notify you if incorrect values appear in your data, enabling proactive intervention.

3. Lakehouse monitoring dashboards: Use dashboards to monitor data quality both as snapshots and as time-series trends, which can help identify anomalies over time.

4. Data validation at the source: The most effective approach is to implement validation controls before loading data into the data lake, ensuring data integrity from the start.
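To illustrate point 1, Delta tables support declared constraints, so any write that violates them fails before bad data lands. A hedged sketch (the table, column, and constraint names are made up):

# Hypothetical Delta table constraints; violating writes will fail
spark.sql("ALTER TABLE silver.orders ALTER COLUMN order_id SET NOT NULL")
spark.sql("ALTER TABLE silver.orders ADD CONSTRAINT non_negative_amount CHECK (amount >= 0)")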

Regarding DLT (Delta Live Tables), I believe it can also be used for batch processes depending on the configurations you set up. However, in my experience, it tends to be more expensive compared to using open-source solutions or the options mentioned above.
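For reference, if you do evaluate DLT for a daily batch, its expectations are the built-in data quality mechanism, and the pipeline can run in triggered mode rather than continuously. A minimal sketch (table names and rules are illustrative):

import dlt

# Illustrative DLT table with expectations; in a triggered pipeline this runs once per update
@dlt.table(name="orders_clean")
@dlt.expect_or_drop("non_null_order_id", "order_id IS NOT NULL")
@dlt.expect_or_fail("non_negative_amount", "amount >= 0")
def orders_clean():
    return spark.read.table("bronze.orders")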

There are numerous ways to manage data quality in batch pipelines, but a proactive approach with source validation and monitoring is often the most effective.

If you find this answer helpful, feel free to mark it as resolved or give it a 👍!

🙂
