04-02-2023 02:30 AM
Hello Databricks community,
I'm working on a pipeline and would like to implement a common use case using Delta Live Tables. The pipeline should include the following steps:
The motivation behind this implementation is to handle new data as a Spark batch, because Spark Structured Streaming does not support many commonly required aggregations. Additionally, this approach is intended to handle pipeline failures caused by new deployments or unexpected changes in the data. Such changes can break the transformations or processing, resulting in downtime. After deploying fixes, the pipeline should recover by loading and processing the failed batches without recomputing everything historically. This recovery mechanism helps avoid incurring huge costs when dealing with large volumes of data.
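The recovery bookkeeping described above can be sketched independently of any particular engine. The following is a minimal, hypothetical illustration in plain Python: failed batch ids are persisted to a small state file, and the next run processes them together with the new batches instead of recomputing history. The file path and integer batch-id scheme are assumptions for illustration, not part of the original post.

```python
import json
import os

STATE_FILE = "failed_batches.json"  # hypothetical location for the state

def load_failed(path: str = STATE_FILE) -> list:
    # Return the list of batch ids previously recorded as failed (empty if none).
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return []

def record_failed(batch_id: int, path: str = STATE_FILE) -> None:
    # Persist a failed batch id so a later run can retry it after a fix is deployed.
    failed = load_failed(path)
    if batch_id not in failed:
        failed.append(batch_id)
    with open(path, "w") as f:
        json.dump(failed, f)

def batches_to_process(new_ids, path: str = STATE_FILE) -> list:
    # The next run handles previously failed batches plus new ones, oldest first,
    # rather than reprocessing the full history.
    return sorted(set(load_failed(path)) | set(new_ids))
```

In a real pipeline the same pattern would live in a transactional store rather than a local file, but the control flow (record on failure, merge with new work on the next run) is the part that matters.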
I am seeking guidance on the best practices for implementing this scenario using Delta Live Tables. In particular, how can I ensure that the pipeline correctly handles previously failed batches and processes them along with new data, while also providing a robust recovery mechanism?
Any help or insights would be greatly appreciated!
Thank you in advance!
04-04-2023 06:35 PM
@Valentin Rosca:
Delta Live Tables can be used to implement the scenario you described in the following way:
By following these best practices and utilizing the features of Delta Live Tables, you can implement a robust pipeline that handles new data as a Spark batch, recovers from failures, and provides reliable data processing for your use case.
04-05-2023 01:32 AM
By following these best practices and utilizing the features of Delta Live Tables, you can implement a robust pipeline that handles new data as a Spark batch, recovers from failures, and provides reliable data processing for your use case. -> We would love to but it seems we are very limited in what we can do with best practices alone, especially if the documentation lacks proper examples for said best practices.
04-04-2023 11:43 PM
Hi @Valentin Rosca
Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? If not, please let us know if you need more help.
We'd love to hear from you.
Thanks!
04-05-2023 01:34 AM
Hello, no, I did not find a solution to this (I am experimenting with something for now) and there is no answer yet. More help is always welcome.
04-19-2023 10:46 AM
One way to do this is to build your own checkpointing logic and load data incrementally based on your own updated_at (or similar) field, or on Delta table versions via readChangeFeed, although I have not tested the latter yet. The checkpointing logic would have to be wired into DLT by reading the dependencies with limit 0, which is a hack, so it is better to stick to non-DLT implementations if you require this. If I continue down this route I will make sure to provide some code as well.
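A rough sketch of the "build your own checkpointing" idea mentioned above: persist the last Delta table version you processed, then read only newer changes using Delta's Change Data Feed (`readChangeFeed` with `startingVersion`). The checkpoint file path and table name are placeholder assumptions, and the source table would need CDF enabled (`delta.enableChangeDataFeed = true`); treat this as an untested sketch, consistent with the caveat above.

```python
import json
import os

def read_last_version(path: str) -> int:
    # Return the last Delta version committed to our own checkpoint file,
    # or -1 if nothing has been processed yet.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["last_version"]
    return -1

def commit_version(path: str, version: int) -> None:
    # Write the checkpoint atomically (write temp file, then rename) so a
    # crash mid-write cannot corrupt the recorded version.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_version": version}, f)
    os.replace(tmp, path)

def read_new_changes(spark, table: str, ckpt_path: str):
    # Read only rows changed after the last committed version via
    # Delta Change Data Feed (batch read, not streaming).
    start = read_last_version(ckpt_path) + 1
    return (spark.read.format("delta")
            .option("readChangeFeed", "true")
            .option("startingVersion", start)
            .table(table))
```

After a successful batch, you would call `commit_version` with the highest version you consumed; if the batch fails, the checkpoint is untouched and the next run retries from the same version.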
09-21-2024 10:45 AM
I totally agree that this is a gap in the Databricks solution. The gap sits between a static read and real-time streaming. My problem (and I suspect there are many similar use cases) is that I have slowly changing data arriving in structured folders as CSV files that are updated monthly. Most of the documentation covers either a static read of that data or overly complex stream processing with watermarks, micro-batches, etc. at minute, second, or real-time granularity.
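For the monthly-CSV case, one middle ground between a static read and an always-on stream is Auto Loader combined with an `availableNow` trigger: each run picks up only the files it has not yet seen, processes them in one batch-like pass, and stops. The sketch below is illustrative only; the function name, schema, paths, and table name are assumptions, not from this thread.

```python
def run_monthly_ingest(source_dir: str, target_table: str, ckpt: str) -> None:
    # Incremental ingestion of CSV drops using Auto Loader ("cloudFiles").
    # The checkpoint tracks which files were already ingested, and
    # trigger(availableNow=True) drains the backlog and then terminates,
    # so this can be run as a scheduled monthly job rather than a stream.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (StructType, StructField,
                                   StringType, DoubleType)

    spark = SparkSession.builder.getOrCreate()
    schema = StructType([                      # assumed example schema
        StructField("id", StringType()),
        StructField("value", DoubleType()),
    ])
    stream = (spark.readStream
              .format("cloudFiles")            # Auto Loader source
              .option("cloudFiles.format", "csv")
              .schema(schema)
              .load(source_dir))
    query = (stream.writeStream
             .option("checkpointLocation", ckpt)
             .trigger(availableNow=True)       # process what's there, then stop
             .toTable(target_table))
    query.awaitTermination()
```

This keeps the streaming machinery (exactly-once file tracking) while behaving like a batch job operationally, which seems to match the monthly cadence described above.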
With that said, here are a few thoughts for you:
Just my two cents.
Lee