PySpark: merge Parquet and Delta files

alesventus
New Contributor III

Is it possible to use the merge command when the source file is Parquet and the destination file is Delta? Or must both files be Delta files?

Currently, I'm using this code: I transform the Parquet file into Delta first, and it works. But I want to avoid this transformation.

Thanks

from delta.tables import *
 
# Destination: a Delta table
deltaTablePeople = DeltaTable.forPath(spark, 'abfss://destination-delta')
# Source: the Parquet data, already converted to Delta (the step I want to avoid)
deltaTablePeopleUpdates = DeltaTable.forPath(spark, 'abfss://source-parquet')
 
dfUpdates = deltaTablePeopleUpdates.toDF()
 
deltaTablePeople.alias('people') \
  .merge(
    dfUpdates.alias('updates'),
    'people.id = updates.id'
  ) \
  .whenMatchedUpdate(set =...

2 REPLIES

Kaniz
Community Manager

Hi @Ales ventus, yes, it is possible. In a Delta Lake merge, only the destination has to be a Delta table; the source can be any Spark DataFrame, including one read straight from a Parquet file.

In your code snippet, you convert the Parquet file to Delta before performing the merge, but that step isn't required. Instead of loading the source with DeltaTable.forPath, read the Parquet file into a DataFrame with spark.read.parquet and pass that DataFrame to merge.

Here's an updated version of your code that merges a Parquet source into a Delta destination without the conversion step:

from delta.tables import DeltaTable
 
# Destination must be a Delta table
deltaTablePeople = DeltaTable.forPath(spark, 'abfss://destination-delta')
 
# Source can stay in Parquet: read it as an ordinary DataFrame
dfUpdates = spark.read.parquet('abfss://source-parquet')
 
deltaTablePeople.alias('people') \
  .merge(
    dfUpdates.alias('updates'),
    'people.id = updates.id'
  ) \
  .whenMatchedUpdate(set=...) \
  .whenNotMatchedInsert(values=...) \
  .execute()

Make sure to replace set = ... and values = ... with the appropriate update and insert operations you want to perform during the merge.
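For illustration, here is a sketch assuming hypothetical columns id, name, and amount; substitute the columns your tables actually contain:

deltaTablePeople.alias('people') \
  .merge(
    dfUpdates.alias('updates'),
    'people.id = updates.id'
  ) \
  .whenMatchedUpdate(set={
      'name': 'updates.name',        # overwrite matched rows with the incoming values
      'amount': 'updates.amount'
  }) \
  .whenNotMatchedInsert(values={
      'id': 'updates.id',            # insert rows not yet present in the target
      'name': 'updates.name',
      'amount': 'updates.amount'
  }) \
  .execute()

If the source and target schemas match exactly, whenMatchedUpdateAll() and whenNotMatchedInsertAll() achieve the same result without listing each column.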

On a Databricks cluster, Delta Lake and Parquet support are preconfigured. In a standalone Spark environment, remember to add the Delta Lake dependency and session configuration yourself.
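A minimal sketch of that standalone setup, assuming Delta Lake 2.4.0 on Spark 3.4 with Scala 2.12 (match the delta-core version to your Spark build; this is not needed on Databricks):

from pyspark.sql import SparkSession
 
spark = (
    SparkSession.builder
    .appName('parquet-to-delta-merge')
    # Pull in the Delta Lake package; 2.4.0 is an assumed version for Spark 3.4
    .config('spark.jars.packages', 'io.delta:delta-core_2.12:2.4.0')
    # Enable Delta's SQL extension and catalog implementation
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)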

Anonymous
Not applicable

Hi @Ales ventus

We haven't heard from you since the last response from @Kaniz Fatma, and I was checking back to see if her suggestions helped you.

If you found another solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
