08-03-2022 10:07 AM
What was the established architectural pattern for doing streaming ETL with Delta Lake before DLT was a thing? And incidentally, what approach would you take with delta-oss today? The pipeline definitions would not have had to be declarative (as in DLT); I'm asking about approaches in general.
One solution I am aware of is to rely on Structured Streaming with Trigger.Once, combined with an external orchestrator that executes the processing steps between the Delta layers. What I'm interested in, however, are use cases with end-to-end (bronze -> silver -> gold pipeline) latencies of less than a minute, which rules out at least some orchestrators.
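For illustration, a minimal sketch of a single bronze -> silver hop under that approach (PySpark; paths and columns are hypothetical, and bronze is assumed to already be a Delta table):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

query = (spark.readStream
    .format("delta")
    .load("/delta/bronze/events")                      # hypothetical source path
    .withColumn("processed_at", F.current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_chk/silver_events")
    .trigger(once=True)        # drain the available data once, then stop
    .start("/delta/silver/events"))                    # hypothetical sink path

query.awaitTermination()       # block so the orchestrator knows when the hop is done
```

An orchestrator would launch one such job per hop, and the per-run startup overhead is one reason sub-minute end-to-end latency is hard to reach this way.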
So, to summarize: how did/do people tackle these types of scenarios?
08-19-2022 02:49 PM
@Veli-Jussi Raitila - Travelling back in the timeline to before DLT, the documentation below describes these scenarios and how they were tackled: https://www.databricks.com/discover/getting-started-with-delta-lake-tech-talks/beyond-lambda-introdu...
08-21-2022 10:26 PM
Thank you for the link. This one from Denny Lee is indeed very good.
However, it suffers from the same issue as so many other presentations on the subject: namely, that it glosses over the implementation of an actual processing "chain" from bronze, through silver, to gold.
Many takes on the topic do mention that actual use cases involve multiple processing layers/steps, deduplication, joins and other intermediate hops (Denny also used the term "materializing data frames"), but for some reason they choose not to demonstrate them.
This happens here as well. Two concurrent writes (and a read) to a single table are shown in order to showcase the ACID guarantees of Delta, but no example of the "Delta Architecture" with a continuous stream through bronze, silver and gold is actually presented.
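For reference, a minimal sketch of the kind of chain I mean, with each hop running as its own always-on stream that reads the previous Delta table (paths, columns and the watermark are all made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hop 1: bronze -> silver, deduplicating on event id within a watermark window.
(spark.readStream.format("delta").load("/delta/bronze")
    .withWatermark("event_time", "10 minutes")
    .dropDuplicates(["event_id", "event_time"])
    .writeStream.format("delta")
    .option("checkpointLocation", "/delta/_chk/silver")
    .start("/delta/silver"))

# Hop 2: silver -> gold, a streaming aggregation.
(spark.readStream.format("delta").load("/delta/silver")
    .groupBy("customer_id")
    .agg(F.count("*").alias("event_count"))
    .writeStream.format("delta")
    .outputMode("complete")    # streaming aggregations need complete (or update) mode
    .option("checkpointLocation", "/delta/_chk/gold")
    .start("/delta/gold"))
```

Note that these are two completely independent queries; nothing ties them together operationally, which leads to my questions below.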
Do sessions exist which concentrate on this aspect?
I'm particularly interested in the operational characteristics of such a solution: how to "synchronize" the steps within a continuous processing chain, what happens if one of the (essentially parallel) streaming jobs in the middle fails, how to visualize or otherwise inspect the dependencies between said jobs, and so on. Especially once there are more than just one to three such streaming pipelines to manage and understand.
EDIT: To make this even more concrete: in batch mode, e.g. with the help of an external orchestrator, one can break a processing pipeline down into staging -> bronze, bronze -> silver and silver -> gold steps. The whole chain can then be visualized as a DAG and understood. One can immediately see the dependencies between the processing steps, monitor them, pinpoint the issue if one of them fails, understand the context in which they are executed, and so on.
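To illustrate, a hypothetical Airflow DAG (operator choice, task names and scripts are all invented) wiring the batch hops together:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="delta_medallion_batch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    staging_to_bronze = BashOperator(
        task_id="staging_to_bronze",
        bash_command="spark-submit staging_to_bronze.py",
    )
    bronze_to_silver = BashOperator(
        task_id="bronze_to_silver",
        bash_command="spark-submit bronze_to_silver.py",
    )
    silver_to_gold = BashOperator(
        task_id="silver_to_gold",
        bash_command="spark-submit silver_to_gold.py",
    )
    # The explicit dependency chain is what gives you the DAG view,
    # per-step monitoring and failure isolation described above.
    staging_to_bronze >> bronze_to_silver >> silver_to_gold
```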
DLT brings these elements to a streaming pipeline: dependencies are formed declaratively and are even visualized as a (DLT-specific) DAG, as the sketch below illustrates.
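A minimal DLT sketch (Python API; table, column and path names are illustrative) showing how the dependency graph falls out of the declarations:

```python
import dlt
import pyspark.sql.functions as F

@dlt.table
def bronze_events():
    # Ingest raw JSON files as a stream; `spark` is provided by the DLT runtime.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/events"))

@dlt.table
def silver_events():
    # Referencing bronze_events is what declares the dependency;
    # DLT derives (and visualizes) the DAG from these references.
    return dlt.read_stream("bronze_events").dropDuplicates(["event_id"])

@dlt.table
def gold_event_counts():
    return (dlt.read("silver_events")
            .groupBy("customer_id")
            .agg(F.count("*").alias("event_count")))
```

But how did people cater to these needs before DLT?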
09-07-2022 12:25 AM
Hi @Veli-Jussi Raitila
Does @Shanmugavel Chandrakasu's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?
We'd love to hear from you.
Thanks!