Incremental Loads from a Catalog/DLT

lprevost · ‎06-26-2024

The Databricks guide outlines several use cases for loading data via Delta Live Tables:

https://docs.databricks.com/en/delta-live-tables/load.html

This includes Auto Loader from cloud files, Kafka messages, small static datasets, etc.

But one use case it doesn't cover is incremental loading from a Hive catalog table whose partitions are being incrementally added by another process. This table points to partitioned raw CSV files holding a very large (terabytes), incrementally growing dataset. Partitions are added as the source data arrive.

I'd like to create a DLT pipeline that handles this the way Auto Loader would, using an available-now trigger and a max-bytes parameter to ingest one batch at a time. So, instead of checkpointing filenames, which is how Auto Loader would handle this, I would prefer to checkpoint partitions and batch load each new partition after initially loading them all.
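To make the idea concrete, here is a minimal sketch in plain Python of what I mean by checkpointing partitions rather than filenames. It is only an illustration of the bookkeeping, not a DLT or Spark API: the function names, the partition names, and the in-memory `processed` set (standing in for a durable checkpoint) are all hypothetical.

```python
from typing import Dict, List, Set


def next_partition_batch(
    partition_sizes: Dict[str, int],
    processed: Set[str],
    max_bytes: int,
) -> List[str]:
    """Pick unprocessed partitions, in sorted order, up to max_bytes per batch."""
    batch: List[str] = []
    total = 0
    for name in sorted(partition_sizes):
        if name in processed:
            continue
        size = partition_sizes[name]
        # Always admit at least one partition so an oversized one cannot stall progress.
        if batch and total + size > max_bytes:
            break
        batch.append(name)
        total += size
    return batch


def run_available_now(
    partition_sizes: Dict[str, int],
    processed: Set[str],
    max_bytes: int,
) -> List[List[str]]:
    """Drain everything currently pending, one batch at a time, then stop."""
    batches: List[List[str]] = []
    while True:
        batch = next_partition_batch(partition_sizes, processed, max_bytes)
        if not batch:
            break
        # A real pipeline would load the batch here, then persist the checkpoint.
        processed.update(batch)
        batches.append(batch)
    return batches
```

The first run would work through every existing partition in size-capped batches. Later runs would only see partitions that the other process has added to the catalog since the last checkpoint.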

I've considered just trying:

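spark.readStream.option("maxBytesPerTrigger", "20g").table("my_big_catalog_table"), written out with .writeStream.trigger(availableNow=True), to use pseudocode. But I'm unsure whether a size limit like that applies to a CSV-backed catalog table, or how the checkpointing would work.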

I realize that I could use Auto Loader to read the source CSV files, but in this case the source files sit in a deeply nested directory structure, and the globbing logic needed to read them is complex. That problem has already been solved by another process, which updates the catalog table's partitions and source directory information.