Community Platform Discussions
DLT Medallion Incremental Ingestion Pattern Approach

ChristianRRL
Valued Contributor

Hi there, I have a question: what would be the recommended incremental ingestion approach when using DLT to pull raw landing data into bronze and then into silver?

 

The original approach I've been considering is to have raw CSV files arrive in a landing DBFS path and ingest them into a bronze `streaming` table (even though it's triggered to run only 1-2 times a day). This bronze table would hold ALL the raw data ever submitted, duplicates included. Immediately downstream, a silver `streaming` table would deduplicate the data and ensure the data types are set accordingly. Below is the code for a single bronze DLT `streaming` table as I intend to ingest it:

 

```python
@dlt.table
def bronze_table_name():
    return (
        spark.readStream.format("cloudFiles")
        .option("header", "true")
        .option("cloudFiles.format", "csv")
        .option("inferSchema", "true")
        .option("cloudFiles.partitionColumns", "project_id")
        .load(f"{dataset_path}/{table_name}")
        .select("*", "_metadata.file_name")
    )
```
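For reference, the downstream silver step described above (deduplicate and enforce types) could look roughly like the sketch below. This is only an illustration, not a definitive implementation: the cast column, the dedup key, and the table names are hypothetical placeholders, and `dlt`/`spark` exist only inside a Databricks DLT pipeline, so a no-op stub is included purely so the snippet imports cleanly elsewhere.

```python
# Sketch of a silver streaming table that dedupes and casts the bronze feed.
# NOTE: `dlt` and `spark` are provided by the Databricks DLT runtime; the stub
# below only lets this file be imported (and sanity-checked) outside Databricks.
try:
    import dlt  # available inside a DLT pipeline
except ImportError:
    class _DltStub:
        @staticmethod
        def table(fn=None, **kwargs):
            # Supports both @dlt.table and @dlt.table(comment=...)
            return fn if callable(fn) else (lambda f: f)

        @staticmethod
        def read_stream(name):
            raise RuntimeError("only available inside a DLT pipeline")
    dlt = _DltStub()


@dlt.table(comment="Deduplicated, typed copy of the bronze feed")
def silver_table_name():
    # Imported lazily so the module loads even without pyspark installed.
    from pyspark.sql import functions as F

    return (
        dlt.read_stream("bronze_table_name")
        # Hypothetical type enforcement; replace with your real schema.
        .withColumn("project_id", F.col("project_id").cast("bigint"))
        # Streaming dropDuplicates keeps state across micro-batches; consider
        # a watermark if the dedup key space grows without bound.
        .dropDuplicates(["project_id", "file_name"])
    )
```

With a triggered pipeline running once or twice a day, a silver table defined this way should only process records that arrived since the last update, which matches the incremental behavior described above.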

 

Alternatively, I've noticed a slightly different pattern that defines bronze as a view rather than a table, with both deduplication and data type enforcement handled in the silver table.
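That view-based variant could be sketched roughly as follows. Same caveats as before: names, casts, and the dedup key are hypothetical placeholders, and `dlt`/`spark` are only available inside a DLT pipeline (the stub exists solely so the snippet imports outside Databricks).

```python
# Sketch of the alternative pattern: bronze as a view, dedup + typing in silver.
try:
    import dlt  # available inside a DLT pipeline
except ImportError:
    class _DltStub:
        @staticmethod
        def table(fn=None, **kwargs):
            return fn if callable(fn) else (lambda f: f)
        view = table  # same no-op behavior for @dlt.view

        @staticmethod
        def read_stream(name):
            raise RuntimeError("only available inside a DLT pipeline")
    dlt = _DltStub()


@dlt.view
def bronze_vw():
    # Still an incremental Auto Loader (cloudFiles) source: downstream
    # streaming reads pick up newly arrived files, not the whole history.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load(f"{dataset_path}/{table_name}")
    )


@dlt.table
def silver_table_name():
    from pyspark.sql import functions as F

    return (
        dlt.read_stream("bronze_vw")
        .withColumn("project_id", F.col("project_id").cast("bigint"))  # hypothetical cast
        .dropDuplicates(["project_id"])  # hypothetical dedup key
    )
```

If I understand DLT correctly, the view is not materialized, but the silver stream that reads it still checkpoints the underlying Auto Loader source, so it processes new files incrementally rather than rescanning the full dataset.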

 

I would appreciate feedback on this. In my case we have some very large tables, so I'm not sure if or when the second approach would make sense for us: I'm assuming a bronze view would take more and more "querying" time as the raw datasets keep growing, whereas with my original approach each daily run would only process the new raw data rather than querying the entire dataset.

1 REPLY

Hey @Retired_mod, thank you for the feedback. I think this is very helpful, and I really like the CDC example. I would like to double-check one thing, though: did you mean to send the same link for both examples? I think the CDC example link applies more to your first paragraph than the second (correct me if I'm wrong). If you meant to provide a different link for the second paragraph, could you please edit your last comment or add a new one below? Also, I can't tell if the last two links are meant to be other pages as well.

Thanks again!
