
Using Delta Live Tables Structured Streaming for small batches?

CVogel
New Contributor

Hi Databricks community, 

I have a blob storage folder that will receive file drops, with three files in each distinct drop: e.g., files A1, B1, and C1 form one drop, and A2, B2, and C2 form the next. The DLT pipeline I've set up has a lot of joins and aggregations, currently using dlt.read() (not dlt.read_stream()). The joins only need to consider data from files within the same drop (i.e., A1 data would never need to be merged with B2 data).
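For concreteness, here's roughly the shape of the pipeline today; the table and column names (bronze_a/b/c, drop_id) are placeholders, and the real pipeline has more joins and aggregations than this:

```python
import dlt

@dlt.table
def silver_combined():
    a = dlt.read("bronze_a")  # batch read: sees A1...AN on every update
    b = dlt.read("bronze_b")  # batch read: sees B1...BN on every update
    c = dlt.read("bronze_c")  # batch read: sees C1...CN on every update
    # Rows only ever join within the same drop, keyed here by a
    # hypothetical drop_id column.
    return a.join(b, "drop_id").join(c, "drop_id")
```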

I do initially read the new files into bronze layer tables with a streaming read, but I'm unsure of the best method from there. As I understand it, dlt.read() will read all the data in the three bronze tables (which will contain A1...AN, B1...BN, etc.), which seems inefficient at scale and would just re-read already-processed data. So I was thinking a streaming read would be the method to use, but I'd have to choose a large watermark delay (say, a day), since we could get multiple datasets dropped at a time and they are fairly large.
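The bronze ingest looks something like this: an incremental streaming read of the landing folder with Auto Loader (the path and file format below are placeholders):

```python
import dlt

@dlt.table
def bronze_a():
    # `spark` is provided by the DLT runtime; cloudFiles = Auto Loader,
    # which only picks up files it hasn't processed before.
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .load("/mnt/landing/a/")
    )
```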

Is the streaming read with watermark the method to use here for these incremental file drops? Or is there some other design I should be considering? 
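Something like the sketch below is what I have in mind: stream-stream joins with watermarks, plus a time-bounded join condition so state can be cleaned up. This assumes each bronze table carries a drop_id key and an ingest_ts timestamp column (both hypothetical names), and the third table would follow the same pattern:

```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def silver_combined_streaming():
    a = dlt.read_stream("bronze_a").withWatermark("ingest_ts", "1 day").alias("a")
    b = dlt.read_stream("bronze_b").withWatermark("ingest_ts", "1 day").alias("b")
    # The time bound lets Spark eventually discard old join state; files
    # within one drop should land close together in time.
    return a.join(
        b,
        F.expr("""
            a.drop_id = b.drop_id
            AND b.ingest_ts BETWEEN a.ingest_ts - INTERVAL 1 HOUR
                                AND a.ingest_ts + INTERVAL 1 HOUR
        """),
    )
```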

Thanks!!

