topic Re: Incremental write in Data Engineering

Incremental write

Nazar — Thu, 23 Sep 2021 22:06:15 GMT

Hi All,

I have a daily Spark job that reads and joins 3-4 source tables and writes the resulting DataFrame in Parquet format. The DataFrame has 100+ columns. Because the job runs daily, our deduplication logic identifies the latest record from each source table, joins them, and then overwrites the existing Parquet file.

My question: is there a way to write incrementally, so that only new records and existing records whose values have changed are written, instead of overwriting the whole file?

Re: Incremental write

Nazar — Fri, 24 Sep 2021 18:19:24 GMT

Thanks, I appreciate the quick response.

Re: Incremental write

-werners- — Mon, 27 Sep 2021 11:09:08 GMT

The MERGE functionality of Delta Lake is what you are looking for.

https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-merge-into
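
As a rough illustration of the idea, here is a minimal PySpark sketch using the delta-spark Python API. It assumes the output is already stored as a Delta table (an existing Parquet directory can be converted with CONVERT TO DELTA) and that the joined data has a single key column. The function names, the path, and the column names are hypothetical and only there for illustration; adapt them to your own job.

```python
def changed_condition(columns, target="t", source="s"):
    """Build a SQL predicate that is true when any listed column differs.

    Uses Spark SQL's null-safe equality operator (<=>) so that
    NULL vs. NULL counts as "unchanged".
    """
    return " OR ".join(
        f"NOT ({target}.{c} <=> {source}.{c})" for c in columns
    )


def incremental_merge(spark, latest_df, target_path, key_col, value_cols):
    """Upsert latest_df into the Delta table stored at target_path.

    All arguments are placeholders for illustration:
    latest_df is the deduplicated, joined DataFrame from the daily job.
    """
    # Imported here so changed_condition() can be used without delta-spark.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, target_path)
    (
        target.alias("t")
        .merge(latest_df.alias("s"), f"t.{key_col} = s.{key_col}")
        # Rewrite an existing row only if at least one value column changed.
        .whenMatchedUpdateAll(condition=changed_condition(value_cols))
        # Insert rows whose key is not yet present in the target.
        .whenNotMatchedInsertAll()
        .execute()
    )
```

A call might look like incremental_merge(spark, deduped_df, "/path/to/delta/table", "id", ["name", "amount"]). With 100+ columns, it may be cheaper to compare a precomputed hash of the value columns rather than every column individually, but that is a design choice to evaluate against your data.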

Re: Incremental write

Nazar — Mon, 27 Sep 2021 21:55:33 GMT

Thanks, werners.