
Using Delta Live Tables Structured Streaming for small batches?

CVogel
New Contributor

Hi Databricks community, 

I have a blob storage folder that will receive file drops, with 3 files in each distinct drop: e.g., files A1, B1, C1 are one drop and A2, B2, C2 are the next. The DLT pipeline I've set up has a lot of joins and aggregations, currently using dlt.read [and not read_stream]. The joins only need to consider data from files within the same drop (i.e., A1 data would never need to be merged with B2 data). Roughly, the current pipeline looks like this (table names, the source path, file format, and the drop_id join key are simplified placeholders):
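```python
import dlt

# Bronze: incremental ingest of new file drops via Auto Loader (streaming read).
@dlt.table
def bronze_a():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")   # placeholder file format
        .load("/mnt/landing/a/")              # placeholder drop folder
    )

# (bronze_b and bronze_c are defined the same way for the other two files.)

# Silver: joins currently use dlt.read(), i.e. a full batch read of bronze.
@dlt.table
def silver_joined():
    a = dlt.read("bronze_a")
    b = dlt.read("bronze_b")
    return a.join(b, "drop_id")               # placeholder join key
```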

I do initially read the new files into a bronze layer table with a streaming read, but I'm unsure what the best method is from there. As I understand it, dlt.read() will read all data in the 3 bronze tables (which will contain A1...AN, B1...BN, etc.), which seems inefficient at scale and would just re-read already processed data. So I was thinking a streaming read would be the method to use - but I'd have to choose a large watermark interval (say, a day), since we could get multiple datasets dropped at a time and they are fairly large. The streaming variant I'm considering would look something like this (again with placeholder names - ingest_time is an assumed timestamp column - and the one-day watermark mentioned above):
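```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def silver_joined_streaming():
    # Stream-stream joins need watermarks on both sides so Spark can bound
    # its state; a wide interval (one day) tolerates late files from the
    # same drop but keeps more state buffered in the meantime.
    a = dlt.read_stream("bronze_a").withWatermark("ingest_time", "1 day").alias("a")
    b = dlt.read_stream("bronze_b").withWatermark("ingest_time", "1 day").alias("b")
    # Join only rows from the same drop (drop_id is a placeholder), plus a
    # time-range condition so old state can be evicted.
    return a.join(
        b,
        F.expr("""
            a.drop_id = b.drop_id AND
            b.ingest_time BETWEEN a.ingest_time - INTERVAL 1 DAY
                              AND a.ingest_time + INTERVAL 1 DAY
        """),
    )
```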

Is a streaming read with a watermark the right method for these incremental file drops, or is there some other design I should be considering?

Thanks!!

0 REPLIES
