cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

When should I use STREAM() when defining a DLT table?

Mado
Valued Contributor II

Hi,

I am a little confused when I should use STREAM() when we define a table based on a DLT table.

There is a pattern explained in the documentation.

CREATE OR REFRESH STREAMING LIVE TABLE streaming_bronze
 
AS SELECT * FROM cloud_files(
 
  "s3://path/to/raw/data", "json"
 
)
 
 
 
CREATE OR REFRESH STREAMING LIVE TABLE streaming_silver
 
AS SELECT * FROM STREAM(LIVE.streaming_bronze) WHERE...
 
 
 
CREATE OR REFRESH LIVE TABLE live_gold
 
AS SELECT count(*) FROM LIVE.streaming_silver GROUP BY user_id

In the above code, "live_gold" is a live able. Since "streaming_silver" is a streaming live table, I expected the last line of the code to be:

AS SELECT count(*) FROM STREAM(LIVE.streaming_silver) GROUP BY user_id

However, STREAM() is not used in the definition of "live_gold".

I was wondering when I should use STREAM() when defining a live table.

1 ACCEPTED SOLUTION

Accepted Solutions

LandanG
Databricks Employee
Databricks Employee

@Mohammad Saber​ I think I see what you're saying. You only need to include the STREAM() function when declaring a STREAMING LIVE TABLE, it doesn't matter the source of data.

So since the live_gold table is just a LIVE TABLE and not a STREAMING LIVE TABLE you do not need to include that function. Hopefully, that makes sense

View solution in original post

3 REPLIES 3

LandanG
Databricks Employee
Databricks Employee

Hi @Mohammad Saber​ ,

You use STREAM() when creating a STREAMING LIVE TABLE and reading other tables in the pipeline. The key difference is the STREAMING LIVE TABLE vs. LIVE TABLE.

The definitions of each tables and when to use them are in the docs here.

Mado
Valued Contributor II

Thanks @Landan George​ 

Since "streaming_silver" is a streaming live table, I expected the last line of the code to be:

AS SELECT count(*) FROM STREAM(LIVE.streaming_silver) GROUP BY user_id

But, as you can see the "live_gold" is defined by:

AS SELECT count(*) FROM LIVE.streaming_silver GROUP BY user_id

LandanG
Databricks Employee
Databricks Employee

@Mohammad Saber​ I think I see what you're saying. You only need to include the STREAM() function when declaring a STREAMING LIVE TABLE, it doesn't matter the source of data.

So since the live_gold table is just a LIVE TABLE and not a STREAMING LIVE TABLE you do not need to include that function. Hopefully, that makes sense

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group