
DLT Medallion Incremental Ingestion Pattern Approach

ChristianRRL
Contributor III

Hi there, I have a question about the "recommended" incremental ingestion approach when using DLT to pull raw landing data into bronze and then silver.

 

The original approach I've been considering is to have raw CSV files arrive in a DBFS landing path and ingest them into a bronze `streaming` table (even though it's triggered to run only 1-2 times a day). This bronze table would hold ALL the raw data ever submitted, duplicates included. Immediately downstream, a silver `streaming` table would deduplicate the data and ensure the data types are set correctly. Below is the code for a single DLT bronze `streaming` table as I intend to ingest it:

 

import dlt

@dlt.table
def bronze_table_name():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        # Auto Loader infers CSV column types via cloudFiles.inferColumnTypes
        # (plain inferSchema is the batch CSV reader's option)
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.partitionColumns", "project_id")
        .load(f"{dataset_path}/{table_name}")
        # keep the source file name from the file metadata column
        .select("*", "_metadata.file_name")
    )
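For context, the downstream silver table in this pattern might look something like the sketch below. The table name, key columns, and casts are hypothetical placeholders; it assumes dropDuplicates-based dedup on a streaming read from bronze (note that stateful streaming dedup can accumulate state, so a watermark may be needed to bound it):

import dlt
from pyspark.sql.functions import col

@dlt.table
def silver_table_name():
    return (
        dlt.read_stream("bronze_table_name")           # incremental read from the bronze streaming table
        .dropDuplicates(["project_id", "record_id"])   # hypothetical natural key for dedup
        .select(
            col("project_id").cast("string"),          # hypothetical columns illustrating
            col("record_id").cast("bigint"),           # explicit type enforcement in silver
            col("reading_value").cast("double"),
            col("file_name"),
        )
    )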

 

Alternatively, I've noticed a slightly different pattern that defines bronze as a view rather than a table, with both deduplication and data type enforcement handled in the silver table.
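A minimal sketch of that alternative, using the same hypothetical columns as above: bronze is just a view over the raw files (a batch read, re-scanned on every pipeline update), and silver does all the cleanup:

import dlt
from pyspark.sql.functions import col

@dlt.view
def bronze_vw():
    # Batch read over the full landing path; recomputed on each pipeline update
    return (
        spark.read.format("csv")
        .option("header", "true")
        .load(f"{dataset_path}/{table_name}")
    )

@dlt.table
def silver_table_name():
    return (
        dlt.read("bronze_vw")
        .dropDuplicates(["project_id", "record_id"])   # hypothetical key
        .select(
            col("project_id").cast("string"),
            col("record_id").cast("bigint"),
            col("reading_value").cast("double"),
        )
    )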

 

I would appreciate feedback on this. In our case we have some very large tables, so I'm not sure if/when the second approach would make sense for us: I'm assuming a bronze view would take more and more querying time as the raw datasets keep growing, whereas with my original approach each daily run would only process the new raw data rather than re-querying the entire dataset.

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @ChristianRRL, your original approach of using a bronze streaming table to ingest raw CSV files and a silver streaming table to deduplicate the data and enforce data types is a common pattern. This approach is beneficial when dealing with large datasets, as it only processes new raw data on each run rather than reprocessing the entire dataset.

 

However, the alternative approach of using a bronze view and handling both deduplication and data type enforcement in the silver table can also be effective. This approach might be more suitable when the raw datasets are not too large and the overhead of querying the full raw dataset on each pipeline update is acceptable.

 

In terms of incremental ingestion, Change Data Capture (CDC) is a process that identifies and captures changes made to data in a source system. DLT allows users to ingest CDC data seamlessly using SQL and Python. The APPLY CHANGES INTO operation in DLT pipelines automatically handles out-of-order records, so you don't have to write complex logic to sequence them yourself.
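For reference, a minimal Python sketch of the APPLY CHANGES pattern follows; the target name, key column, and sequencing column are hypothetical:

import dlt
from pyspark.sql.functions import col

# Target streaming table that APPLY CHANGES creates and maintains
dlt.create_streaming_table("silver_cdc_target")

dlt.apply_changes(
    target="silver_cdc_target",
    source="bronze_table_name",    # the CDC feed; here, the bronze streaming table
    keys=["record_id"],            # hypothetical primary key
    sequence_by=col("ingest_ts"),  # hypothetical ordering column used to resolve out-of-order records
    stored_as_scd_type=1,          # SCD type 1: keep only the latest row per key
)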

 

 

ChristianRRL
Contributor III

Hey @Kaniz_Fatma, thank you for the feedback. I think this is very helpful, and I really like the CDC example. I would like to double-check though: did you mean to send the same link for both examples? I think the CDC example link applies more to your first paragraph than the second (correct me if I'm wrong). If you meant to provide a different link for the second paragraph, can you please edit your last comment or add a new one below? Also, I can't tell if the last two links are meant to be other pages as well.

Thanks again!
