08-03-2022 10:07 AM
What was the established architectural pattern for doing streaming ETL with Delta Lake before DLT was a thing? And incidentally, what approach would you take with delta-oss today? The pipeline definitions would not have had to be declarative (as in DLT); I'm asking about approaches in general.
One solution I am aware of is to rely on Structured Streaming with Trigger.Once, combined with an external orchestrator that executes the processing steps between the Delta layers. What I'm interested in, however, are use cases with end-to-end (bronze -> silver -> gold pipeline) latencies of less than a minute, which rules out at least some orchestrators.
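For illustration, a minimal sketch of a single bronze -> silver hop under that approach (PySpark; paths and columns are hypothetical, and bronze is assumed to already be a Delta table):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

query = (spark.readStream
    .format("delta")
    .load("/delta/bronze/events")                      # hypothetical source path
    .withColumn("processed_at", F.current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/delta/_chk/silver_events")
    .trigger(once=True)        # drain the available data once, then stop
    .start("/delta/silver/events"))                    # hypothetical sink path

query.awaitTermination()       # block so the orchestrator knows when the hop is done
```

An orchestrator would launch one such job per hop, and the per-run startup overhead is one reason sub-minute end-to-end latency is hard to reach this way.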
So, to summarize: how did/do people tackle these types of scenarios?
08-19-2022 02:49 PM
@Veli-Jussi Raitila - Travelling back in the timeline to before DLT, the documentation below describes these scenarios and how they were tackled: https://www.databricks.com/discover/getting-started-with-delta-lake-tech-talks/beyond-lambda-introdu...
08-21-2022 10:26 PM
Thank you for the link. This one from Denny Lee is indeed very good.
However, it suffers from the same issue as so many other presentations on the subject: namely, that it glosses over the implementation of an actual processing "chain" from bronze, through silver, to gold.
Many takes on the topic do mention that actual use cases involve multiple processing layers/steps, deduplication, joins and other intermediate hops (Denny also used the term "materializing data frames"), but for some reason they choose not to demonstrate them.
This happens here as well. Two concurrent writes (and a read) to a single table are shown in order to showcase the ACID guarantees of Delta, but no example of the "Delta Architecture" with a continuous stream through bronze, silver and gold is actually presented.
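For reference, a minimal sketch of the kind of chain I mean, with each hop running as its own always-on stream that reads the previous Delta table (paths, columns and the watermark are all made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hop 1: bronze -> silver, deduplicating on event id within a watermark window.
(spark.readStream.format("delta").load("/delta/bronze")
    .withWatermark("event_time", "10 minutes")
    .dropDuplicates(["event_id", "event_time"])
    .writeStream.format("delta")
    .option("checkpointLocation", "/delta/_chk/silver")
    .start("/delta/silver"))

# Hop 2: silver -> gold, a streaming aggregation.
(spark.readStream.format("delta").load("/delta/silver")
    .groupBy("customer_id")
    .agg(F.count("*").alias("event_count"))
    .writeStream.format("delta")
    .outputMode("complete")    # streaming aggregations need complete (or update) mode
    .option("checkpointLocation", "/delta/_chk/gold")
    .start("/delta/gold"))
```

Note that these are two completely independent queries; nothing ties them together operationally, which leads to my questions below.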
Do sessions exist which concentrate on this aspect?
I'm particularly interested in the operational characteristics of such a solution: how to "synchronize" the steps within a continuous processing chain, what happens if one of the (essentially parallel) streaming jobs in the middle fails, how to visualize or otherwise inspect the dependencies between said jobs, and so on. Especially once there are more than just one to three such streaming pipelines to manage and understand.
EDIT: To make this even more concrete: in batch mode, e.g. with the help of an external orchestrator, one can break a processing pipeline down into staging -> bronze, bronze -> silver and silver -> gold steps. The whole chain can then be visualized as a DAG and understood. One can immediately see the dependencies between the processing steps, monitor them, pinpoint the issue if one of them fails, understand the context in which they are executed, and so on.
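To illustrate, a hypothetical Airflow DAG (operator choice, task names and scripts are all invented) wiring the batch hops together:

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="delta_medallion_batch",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    staging_to_bronze = BashOperator(
        task_id="staging_to_bronze",
        bash_command="spark-submit staging_to_bronze.py",
    )
    bronze_to_silver = BashOperator(
        task_id="bronze_to_silver",
        bash_command="spark-submit bronze_to_silver.py",
    )
    silver_to_gold = BashOperator(
        task_id="silver_to_gold",
        bash_command="spark-submit silver_to_gold.py",
    )
    # The explicit dependency chain is what gives you the DAG view,
    # per-step monitoring and failure isolation described above.
    staging_to_bronze >> bronze_to_silver >> silver_to_gold
```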
DLT brings these elements to a streaming pipeline: dependencies are formed declaratively and are even visualized as a (DLT-specific) DAG, as the sketch below illustrates.
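A minimal DLT sketch (Python API; table, column and path names are illustrative) showing how the dependency graph falls out of the declarations:

```python
import dlt
import pyspark.sql.functions as F

@dlt.table
def bronze_events():
    # Ingest raw JSON files as a stream; `spark` is provided by the DLT runtime.
    return (spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            .load("/raw/events"))

@dlt.table
def silver_events():
    # Referencing bronze_events is what declares the dependency;
    # DLT derives (and visualizes) the DAG from these references.
    return dlt.read_stream("bronze_events").dropDuplicates(["event_id"])

@dlt.table
def gold_event_counts():
    return (dlt.read("silver_events")
            .groupBy("customer_id")
            .agg(F.count("*").alias("event_count")))
```

But how did people cater to these needs before DLT?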
09-07-2022 12:25 AM
Hi @Veli-Jussi Raitila
Does @Shanmugavel Chandrakasu's response answer your question? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?
We'd love to hear from you.
Thanks!