Hi there, I have a question about the recommended incremental ingestion approach when using DLT to pull raw landing data into bronze and then silver.
The original approach I've been considering is to have raw CSV files arrive in a DBFS landing path and ingest them into a bronze `streaming` table (even though it's only triggered to run 1-2 times a day). This bronze table would hold ALL the raw data ever submitted, duplicates included. Immediately downstream, a silver `streaming` table would deduplicate the data and enforce the correct data types. Below is the code for a single DLT bronze `streaming` table as I intend to write it:
import dlt

@dlt.table
def bronze_table_name():
    return (
        spark.readStream.format("cloudFiles")  # Auto Loader: only new files are picked up per run
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        # Auto Loader controls CSV type inference through this option
        # (the batch reader's "inferSchema" option is not the cloudFiles one)
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.partitionColumns", "project_id")
        .load(f"{dataset_path}/{table_name}")
        .select("*", "_metadata.file_name")  # keep the source file name for lineage/debugging
    )
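For context, this is roughly what I have in mind for the downstream silver table. It's just a sketch; the column names (event_id, submitted_at) are placeholders for my real schema:

import dlt
from pyspark.sql import functions as F

@dlt.table
def silver_table_name():
    return (
        dlt.read_stream("bronze_table_name")
        # cast explicitly rather than trusting inference
        .select(
            F.col("event_id").cast("string"),
            F.col("submitted_at").cast("timestamp"),
            F.col("project_id"),
            F.col("file_name"),
        )
        # drop re-submitted rows; dedup state grows without a watermark,
        # which I assume is acceptable at 1-2 triggered runs a day
        .dropDuplicates(["event_id"])
    )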
Alternatively, I've seen a slightly different pattern where bronze is a view rather than a table, and both deduplication and data type enforcement are handled in the silver table, something like the sketch below.
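If I understand that pattern correctly, bronze would be a (batch) view over the raw files, so nothing is persisted and the raw CSVs are re-read on every pipeline update (again, placeholder names throughout):

import dlt
from pyspark.sql import functions as F

# Bronze as a view: not materialized, recomputed from the raw files
# each time the silver table is updated.
@dlt.view
def bronze_view_name():
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(f"{dataset_path}/{table_name}")
        .select("*", "_metadata.file_name")
    )

@dlt.table
def silver_table_name():
    return (
        dlt.read("bronze_view_name")
        # placeholder cast; dedup and type enforcement both live in silver here
        .withColumn("submitted_at", F.col("submitted_at").cast("timestamp"))
        .dropDuplicates()
    )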
I would appreciate feedback on this. In my case we have some very large tables, so I'm not sure if or when the second approach would make sense for us: I'm assuming a bronze view would take longer and longer to query as the raw datasets grow, whereas my original approach only ever processes the new raw data on each daily run rather than re-reading the entire dataset.