Community Platform Discussions
DLT Medallion Incremental Ingestion Pattern Approach

ChristianRRL
Valued Contributor

Hi there, I have a question: what would be the recommended incremental ingestion approach when using DLT to pull raw landing data into bronze and then into silver?

 

The original approach I've been considering is to have raw CSV files arrive in a landing DBFS path and ingest them into a bronze `streaming` table (even though it's triggered to run only 1-2 times a day). This bronze table would hold ALL the raw data ever submitted, duplicates included. Immediately downstream, a silver `streaming` table would deduplicate the data and ensure the data types are set accordingly. Below is the code for a single bronze DLT `streaming` table as I intend to ingest it:

 

```python
@dlt.table
def bronze_table_name():
    return (
        spark.readStream.format("cloudFiles")
        .option("header", "true")
        .option("cloudFiles.format", "csv")
        .option("inferSchema", "true")
        .option("cloudFiles.partitionColumns", "project_id")
        .load(f"{dataset_path}/{table_name}")
        .select("*", "_metadata.file_name")
    )
```
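For reference, the downstream silver step described above (deduplicate and enforce types) could look roughly like the sketch below. This is only an illustration, not a definitive implementation: the cast column, the dedup key, and the table names are hypothetical placeholders, and `dlt`/`spark` exist only inside a Databricks DLT pipeline, so a no-op stub is included purely so the snippet imports cleanly elsewhere.

```python
# Sketch of a silver streaming table that dedupes and casts the bronze feed.
# NOTE: `dlt` and `spark` are provided by the Databricks DLT runtime; the stub
# below only lets this file be imported (and sanity-checked) outside Databricks.
try:
    import dlt  # available inside a DLT pipeline
except ImportError:
    class _DltStub:
        @staticmethod
        def table(fn=None, **kwargs):
            # Supports both @dlt.table and @dlt.table(comment=...)
            return fn if callable(fn) else (lambda f: f)

        @staticmethod
        def read_stream(name):
            raise RuntimeError("only available inside a DLT pipeline")
    dlt = _DltStub()


@dlt.table(comment="Deduplicated, typed copy of the bronze feed")
def silver_table_name():
    # Imported lazily so the module loads even without pyspark installed.
    from pyspark.sql import functions as F

    return (
        dlt.read_stream("bronze_table_name")
        # Hypothetical type enforcement; replace with your real schema.
        .withColumn("project_id", F.col("project_id").cast("bigint"))
        # Streaming dropDuplicates keeps state across micro-batches; consider
        # a watermark if the dedup key space grows without bound.
        .dropDuplicates(["project_id", "file_name"])
    )
```

With a triggered pipeline running once or twice a day, a silver table defined this way should only process records that arrived since the last update, which matches the incremental behavior described above.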

 

Alternatively, I've noticed a slightly different pattern that defines bronze as a view rather than a table, with both deduplication and data type enforcement handled in the silver table.
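That view-based variant could be sketched roughly as follows. Same caveats as before: names, casts, and the dedup key are hypothetical placeholders, and `dlt`/`spark` are only available inside a DLT pipeline (the stub exists solely so the snippet imports outside Databricks).

```python
# Sketch of the alternative pattern: bronze as a view, dedup + typing in silver.
try:
    import dlt  # available inside a DLT pipeline
except ImportError:
    class _DltStub:
        @staticmethod
        def table(fn=None, **kwargs):
            return fn if callable(fn) else (lambda f: f)
        view = table  # same no-op behavior for @dlt.view

        @staticmethod
        def read_stream(name):
            raise RuntimeError("only available inside a DLT pipeline")
    dlt = _DltStub()


@dlt.view
def bronze_vw():
    # Still an incremental Auto Loader (cloudFiles) source: downstream
    # streaming reads pick up newly arrived files, not the whole history.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load(f"{dataset_path}/{table_name}")
    )


@dlt.table
def silver_table_name():
    from pyspark.sql import functions as F

    return (
        dlt.read_stream("bronze_vw")
        .withColumn("project_id", F.col("project_id").cast("bigint"))  # hypothetical cast
        .dropDuplicates(["project_id"])  # hypothetical dedup key
    )
```

If I understand DLT correctly, the view is not materialized, but the silver stream that reads it still checkpoints the underlying Auto Loader source, so it processes new files incrementally rather than rescanning the full dataset.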

 

I would appreciate feedback on this. In my case we have some very large tables, so I'm not sure if or when the second approach would make sense for us: I'm assuming a bronze view would take more and more "querying" time as the raw datasets keep growing, whereas with my original approach each daily run would only process the new raw data rather than querying the entire dataset.

1 REPLY

Hey @Retired_mod, thank you for the feedback. I think this is very helpful, and I really like the CDC example. I would like to double-check one thing, though: did you mean to send the same link for both examples? I think the CDC example link applies more to your first paragraph than the second (correct me if I'm wrong). If you meant to provide a different link for the second paragraph, could you please edit your last comment or add a new one below? Also, I can't tell if the last two links are meant to be other pages as well.

Thanks again!
