Hey Greg_c,
I use dbt daily for batch data ingestion, and I think it's a great option. That said, adopting dbt introduces additional complexity, so the team should carefully weigh the impact of adding a new tool to their development process.
If you are already satisfied with your current ingestion tool and your main concern is ensuring data quality, I would recommend a few approaches. From what I understand, the issue is that your batch process runs the entire pipeline from raw to gold, and only at the BI layer do you discover that the data is incorrect, forcing a reprocessing effort.
To prevent such issues, you could implement:
1. Constraints at the table level: Ensure that if data doesn't meet specific conditions (e.g., values below a threshold or unexpected nulls), the ingestion fails, preventing bad data from propagating (see the constraint sketch after this list).
2. SQL alerts: Set up alerts that notify you if incorrect values appear in your data, enabling proactive intervention.
3. Lakehouse monitoring dashboards: Use dashboards to monitor data quality both as snapshots and as time-series trends, which can help identify anomalies over time.
4. Data validation at the source: The most effective approach is to implement validation checks before loading data into the data lake, ensuring data integrity from the start (a small validation sketch also follows this list).
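
For point 1, if your tables are Delta tables you can add CHECK and NOT NULL constraints so that any write violating them fails fast. Here is a minimal sketch, assuming a Databricks environment; the table and column names (`bronze.orders`, `order_id`, `amount`) are just placeholders:

```python
# Minimal sketch of Delta table constraints (point 1).
# Table and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

# Reject rows with a null business key.
spark.sql("ALTER TABLE bronze.orders ALTER COLUMN order_id SET NOT NULL")

# Reject rows that fall below an expected threshold.
spark.sql("""
    ALTER TABLE bronze.orders
    ADD CONSTRAINT amount_is_positive CHECK (amount > 0)
""")

# From now on, any INSERT/MERGE that violates a constraint fails the whole write,
# so bad batches never propagate to silver or gold.
```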
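
For point 4, a lightweight form of source validation is to run a few checks on the raw DataFrame and abort before anything is written to the lake. A minimal sketch, where the input path, schema, and rules are assumptions for illustration:

```python
# Minimal sketch of validating a batch before it lands in the lake (point 4).
# The input path and the validation rules are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # provided automatically on Databricks

raw = spark.read.format("json").load("/mnt/landing/orders/2024-06-01/")

# Count rows that break the rules we care about.
bad_rows = raw.filter(
    F.col("order_id").isNull() | (F.col("amount") <= 0)
).count()

if bad_rows > 0:
    # Fail the job before any write, so nothing downstream needs reprocessing.
    raise ValueError(f"{bad_rows} rows failed validation; aborting load")

# Only clean batches reach the bronze table.
raw.write.mode("append").saveAsTable("bronze.orders")
```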
Regarding DLT (Delta Live Tables), I believe it can also be used for batch processing, depending on how you configure the pipeline. However, in my experience it tends to be more expensive than open-source solutions or the options mentioned above.
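
If you do end up trying DLT, expectations are its built-in way to enforce quality rules declaratively. A minimal sketch, keeping in mind it only runs inside a DLT pipeline and that the table/column names and rules here are assumptions:

```python
# Minimal DLT sketch: expectations enforce data quality declaratively.
# This only runs inside a Delta Live Tables pipeline; names are illustrative.
import dlt


@dlt.table(comment="Orders with basic quality rules enforced")
@dlt.expect_or_fail("valid_order_id", "order_id IS NOT NULL")  # fail the update on violation
@dlt.expect_or_drop("positive_amount", "amount > 0")           # or silently drop bad rows
def clean_orders():
    return spark.read.table("bronze.orders")
```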
There are numerous ways to manage data quality in batch pipelines, but a proactive approach with source validation and monitoring is often the most effective.
If you find this answer helpful, feel free to mark it as resolved or give it a thumbs up!