Clarity on usage STREAM while defining DLT tables

lokeshr — Tue, 02 Aug 2022 16:51:52 GMT

Hi, I am currently trying to learn Databricks and going through tutorials and learning materials. I came across this link https://databricks.com/discover/pages/getting-started-with-delta-live-tables

While I get most of what is described on the page, I find it hard to understand why, when building the silver tier, one of the bronze tables, sales_orders_raw, is referenced with the keyword STREAM while the other bronze table, customers, only uses the marker LIVE. Shouldn't both be marked with STREAM as well as LIVE? Is this a typo?

Regards,

Lokesh

Re: Clarity on usage STREAM while defining DLT tables

tomasz — Wed, 03 Aug 2022 18:57:14 GMT

This is because, in the example, the "sales_orders" data is streamed, left-joined to customers, and appended to the silver layer table. When a sales order comes in from a customer that was inserted some time ago (rather than in the current micro-batch being processed), the entire customer table has to be loaded to find that customer's id and name. Therefore, using LIVE.customers without "STREAM" allows the join to be a stream-batch join (as described here).

Essentially, because you only need the most recent records coming in from "sales_orders", you can use the "STREAM" keyword there, but the join requires the entire customer table to be loaded, hence the lack of the "STREAM" keyword on customers.

On the other side of the coin, you need to update the silver layer table only when a new sales order comes in, not when a new customer is streamed into the bronze layer. That's another reason why you only need STREAM on the "sales_orders" table.
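
To make the behaviour above concrete, here is a minimal plain-Python sketch of the stream-batch join semantics. It is only a simulation, not DLT or Spark code: the names `customers`, `silver`, and `process_micro_batch` are hypothetical stand-ins for the tables and the pipeline step discussed in this thread, and the sample rows are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-ins for the tables in this thread (not the DLT API).
# "customers" plays the role of LIVE.customers: read in full on every batch.
customers: Dict[int, str] = {}
# "silver" plays the role of the append-only silver table.
silver: List[Tuple[int, int, Optional[str]]] = []


def process_micro_batch(orders: List[Tuple[int, int]]) -> None:
    """Left-join only the NEW orders against the ENTIRE current customers table."""
    for order_id, customer_id in orders:
        # dict.get returns None when there is no match, like a left join.
        silver.append((order_id, customer_id, customers.get(customer_id)))


# A customer arrives long before any of their orders.
# Nothing is appended to silver, because customers is not the streamed side.
customers[1] = "Alice"
rows_after_customer_insert = len(silver)

# Micro-batch 1: the order still finds the customer inserted earlier,
# because the whole customers table is consulted, not just recent rows.
process_micro_batch([(100, 1)])

# Micro-batch 2: an order for an unknown customer keeps a None name.
process_micro_batch([(101, 2)])

print(rows_after_customer_insert)
print(silver)
```

In this sketch, only calls to `process_micro_batch` (new orders) add rows to `silver`, while inserting a customer changes what later orders can match against without producing output on its own. That mirrors why STREAM is applied to the orders table only.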

Re: Clarity on usage STREAM while defining DLT tables

jose_gonzalez — Tue, 30 Aug 2022 17:18:59 GMT

Hi @Lokesh Raju,

Just a friendly follow-up. Did Tomasz's response help you resolve your question? If it did, please mark it as best.