
Using Delta Live Tables Structured Streaming for small batches?

CVogel
New Contributor

Hi Databricks community, 

I have a blob storage folder that will receive file drops, with 3 files in each distinct drop: e.g., files A1, B1, C1 are one drop and A2, B2, C2 are the next. The DLT pipeline I've set up has a lot of joins and aggregations, currently using dlt.read [and not read_stream]. The joins only need to consider data from files within the same drop (i.e., A1 data would never need to be merged with B2 data). Roughly, the current pipeline looks like this (table names, the source path, file format, and the drop_id join key are simplified placeholders):
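```python
import dlt

# Bronze: incremental ingest of new file drops via Auto Loader (streaming read).
@dlt.table
def bronze_a():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")   # placeholder file format
        .load("/mnt/landing/a/")              # placeholder drop folder
    )

# (bronze_b and bronze_c are defined the same way for the other two files.)

# Silver: joins currently use dlt.read(), i.e. a full batch read of bronze.
@dlt.table
def silver_joined():
    a = dlt.read("bronze_a")
    b = dlt.read("bronze_b")
    return a.join(b, "drop_id")               # placeholder join key
```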

I do initially read the new files into a bronze layer table with a streaming read, but I'm unsure what the best method is from there. As I understand it, dlt.read() will read all data in the 3 bronze tables (which will contain A1...AN, B1...BN, etc.), which seems inefficient at scale and would just re-read already processed data. So I was thinking a streaming read would be the method to use - but I'd have to choose a large watermark interval (say, a day), since we could get multiple datasets dropped at a time and they are fairly large. The streaming variant I'm considering would look something like this (again with placeholder names - ingest_time is an assumed timestamp column - and the one-day watermark mentioned above):
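```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def silver_joined_streaming():
    # Stream-stream joins need watermarks on both sides so Spark can bound
    # its state; a wide interval (one day) tolerates late files from the
    # same drop but keeps more state buffered in the meantime.
    a = dlt.read_stream("bronze_a").withWatermark("ingest_time", "1 day").alias("a")
    b = dlt.read_stream("bronze_b").withWatermark("ingest_time", "1 day").alias("b")
    # Join only rows from the same drop (drop_id is a placeholder), plus a
    # time-range condition so old state can be evicted.
    return a.join(
        b,
        F.expr("""
            a.drop_id = b.drop_id AND
            b.ingest_time BETWEEN a.ingest_time - INTERVAL 1 DAY
                              AND a.ingest_time + INTERVAL 1 DAY
        """),
    )
```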

Is a streaming read with a watermark the right method for these incremental file drops, or is there some other design I should be considering?

Thanks!!

0 REPLIES
