09-23-2021 03:06 PM
Hi All,
I have a daily Spark job that reads and joins 3-4 source tables and writes the resulting DataFrame in Parquet format. The DataFrame has 100+ columns. Since the job runs daily, our deduplication logic identifies the latest record from each source table, joins them, and then overwrites the existing Parquet file (a sketch of the current approach is shown below).
My question: is there a way to write incrementally instead, so that only new records, or existing records whose values have changed, are written to the file?
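For context, here is a minimal sketch of the current daily overwrite job as described above. The table names, the key column `id`, the timestamp column `updated_at`, and the output path are hypothetical placeholders, not the actual job's names.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read the source tables (names are assumptions for illustration).
src_a = spark.table("source_a")
src_b = spark.table("source_b")
src_c = spark.table("source_c")

# Keep only the latest record per key in each source
# (the same pattern would apply to the other sources).
w = Window.partitionBy("id").orderBy(F.col("updated_at").desc())
latest_a = (src_a
            .withColumn("rn", F.row_number().over(w))
            .filter("rn = 1")
            .drop("rn"))

# Join the deduplicated sources and overwrite the existing Parquet output.
df = latest_a.join(src_b, "id").join(src_c, "id")
df.write.mode("overwrite").parquet("/mnt/output/daily_snapshot")
```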
Accepted Solutions
09-27-2021 04:09 AM
The MERGE functionality of Delta Lake is what you are looking for.
https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-merge-into.html
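A minimal sketch of what that could look like in PySpark, assuming the target has been converted to a Delta table (for example via `CONVERT TO DELTA` or by rewriting the Parquet output in Delta format). The path, the key column `id`, the `updated_at` change-detection condition, and the `daily_updates` source name are assumptions for illustration only.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The deduplicated, joined DataFrame produced by the daily job
# (placeholder source name).
updates_df = spark.table("daily_updates")

# The existing output, now stored as a Delta table instead of plain Parquet.
target = DeltaTable.forPath(spark, "/mnt/output/daily_snapshot_delta")

(target.alias("t")
 .merge(updates_df.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll(condition="t.updated_at < s.updated_at")  # update only changed rows
 .whenNotMatchedInsertAll()                                      # insert new rows
 .execute())
```

With this pattern only new or changed rows are rewritten, instead of overwriting the whole file each day, and the Delta transaction log keeps the table consistent while the merge runs.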
09-24-2021 11:19 AM
Thanks, appreciate the quick response.
09-27-2021 02:55 PM
Thanks, werners.

