
Using Delta Live Tables Structured Streaming for small batches?

CVogel
New Contributor

Hi Databricks community, 

I have a blob storage folder that will receive file drops, with three files in each distinct drop: e.g., files A1, B1, and C1 form one drop, and A2, B2, and C2 form the next. The DLT pipeline I've set up has a lot of joins and aggregations, currently using dlt.read() (not dlt.read_stream()). The joins only need to consider data from files within the same drop (i.e., A1 data would never need to be merged with B2 data).
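For concreteness, here's roughly the shape of the pipeline today; the table and column names (bronze_a/b/c, drop_id) are placeholders, and the real pipeline has more joins and aggregations than this:

```python
import dlt

@dlt.table
def silver_combined():
    a = dlt.read("bronze_a")  # batch read: sees A1...AN on every update
    b = dlt.read("bronze_b")  # batch read: sees B1...BN on every update
    c = dlt.read("bronze_c")  # batch read: sees C1...CN on every update
    # Rows only ever join within the same drop, keyed here by a
    # hypothetical drop_id column.
    return a.join(b, "drop_id").join(c, "drop_id")
```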

I do initially read the new files into bronze layer tables with a streaming read, but I'm unsure of the best method from there. As I understand it, dlt.read() will read all the data in the three bronze tables (which will contain A1...AN, B1...BN, etc.), which seems inefficient at scale and would just re-read already-processed data. So I was thinking a streaming read would be the method to use, but I'd have to choose a large watermark delay (say, a day), since we could get multiple datasets dropped at a time and they are fairly large.
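The bronze ingest looks something like this: an incremental streaming read of the landing folder with Auto Loader (the path and file format below are placeholders):

```python
import dlt

@dlt.table
def bronze_a():
    # `spark` is provided by the DLT runtime; cloudFiles = Auto Loader,
    # which only picks up files it hasn't processed before.
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/mnt/landing/a/")
    )
```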

Is the streaming read with watermark the method to use here for these incremental file drops? Or is there some other design I should be considering? 
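Something like the sketch below is what I have in mind: stream-stream joins with watermarks, plus a time-bounded join condition so state can be cleaned up. This assumes each bronze table carries a drop_id key and an ingest_ts timestamp column (both hypothetical names), and the third table would follow the same pattern:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def silver_combined_streaming():
    a = dlt.read_stream("bronze_a").withWatermark("ingest_ts", "1 day").alias("a")
    b = dlt.read_stream("bronze_b").withWatermark("ingest_ts", "1 day").alias("b")
    # The time bound lets Spark eventually discard old join state; files
    # within one drop should land close together in time.
    return a.join(
        b,
        F.expr("""
            a.drop_id = b.drop_id
            AND b.ingest_ts BETWEEN a.ingest_ts - INTERVAL 1 HOUR
                                AND a.ingest_ts + INTERVAL 1 HOUR
        """),
    )
```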

Thanks!!

