Hi, I am writing my first streaming pipeline and trying to ensure it is setup to work as a "Lakeflow" pipeline.
It is connecting an external Oracle database with some external Azure Blob storage data (all managed in the same Unity Catalog). The pipeline is just a simple join, to create a gold level table for analysts to use.
My first attempt at this looks like:
%sql
CREATE OR REPLACE STREAMING TABLE final_streaming_table_name
USING DELTA
AS
SELECT *final_cols*
FROM oracle_table_1 o1
JOIN (SELECT *some_cols* FROM azure_table) a ON a.join_col = o1.join_col_1
JOIN (SELECT *some_cols* FROM oracle_table_2) o2 ON o2.join_col = o1.join_col_2
But I also saw some other resources saying that I need to specify JOIN STREAM like this:
%sql
CREATE OR REPLACE STREAMING TABLE final_streaming_table_name
USING DELTA
AS
SELECT *final_cols*
FROM STREAM(azure_table) a
JOIN STREAM(oracle_table_1) o1 ON o1.join_col_1 = a.join_col
JOIN STREAM(oracle_table_2) o2 ON o2.join_col = o1.join_col_2
The external Oracle databases have no regular update schedule, and the underlying values can change at any time. I would always like my `final_streaming_table_name` to have the freshest values.
Is one way right? Or would they both work, and if both work what are the benefits or downsides to using a particular strategy?
After experimentation, it appears you cannot do it the first way 😅 Consider this CLOSED.