Hi All,
I have a daily Spark job that reads and joins 3-4 source tables and writes the resulting DataFrame out in Parquet format. The DataFrame has 100+ columns. Since the job runs daily, our deduplication logic identifies the latest record from each source table, joins them, and then overwrites the existing Parquet output in full.
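For context, the current shape of the job is roughly the following (table names, keys, and the updated_at column are placeholders, not our real schema):

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("daily_join_job").getOrCreate()

# Placeholder sources -- the real job joins 3-4 tables.
orders = spark.read.table("src.orders")
customers = spark.read.table("src.customers")

# Dedup: keep only the latest record per key from each source.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
latest_orders = (orders
                 .withColumn("rn", F.row_number().over(w))
                 .filter(F.col("rn") == 1)
                 .drop("rn"))

joined = latest_orders.join(customers, "customer_id", "left")

# Today: full overwrite of the existing Parquet output, every run.
joined.write.mode("overwrite").parquet("/data/daily_output")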
My question: is there a way to implement an incremental write, so that rows are written only when a record is new or when values in an existing record have changed, rather than overwriting the whole file every day?
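Since plain Parquet files can't be updated in place, I'm guessing this would need a table format that supports merges, such as Delta Lake. What I'm imagining is something like the sketch below (continuing from the snippet above; the row_hash column and the order_id key are hypothetical, just to illustrate change detection over 100+ columns):

from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Hypothetical: hash all columns so changed rows can be detected cheaply.
# (concat_ws skips nulls, so a real implementation would need null handling.)
incoming = joined.withColumn(
    "row_hash", F.sha2(F.concat_ws("||", *joined.columns), 256))

# Assumes the output has been converted to a Delta table, not plain Parquet.
target = DeltaTable.forPath(spark, "/data/daily_output")

(target.alias("t")
 .merge(incoming.alias("s"), "t.order_id = s.order_id")
 .whenMatchedUpdateAll(condition="t.row_hash <> s.row_hash")  # changed rows only
 .whenNotMatchedInsertAll()                                   # new rows only
 .execute())

Is something along these lines the right approach, or is there a way to achieve this while staying on plain Parquet?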