PySpark dataframe empties after it has been saved to Delta Lake

Christine
Contributor

Hi,

I am facing a problem that I hope someone can help me understand.

I have created a function that is supposed to check whether the input data already exists in a saved Delta table; if not, it should perform some calculations and append the new data to the table.

When I run the code without saving the data, I can display the dataframe (display(dataframe)). However, after appending the data from the dataframe to the Delta table, a new run of display(dataframe) suddenly indicates that the dataframe is empty. Can somebody help me understand why the dataframe is displayed as empty, when the only change is that the data has been saved to the Delta table? Does "display" somehow run the join again?

Thank you!

Simplified code:

# Load the existing Delta table
deltaDF = spark.read.format("delta").load(filePath)

# Remove any row that is identical to already existing data
condition = [<relevant column comparisons>]
noexistingDF = DF.join(deltaDF, on=condition, how="left_anti")

# Add some additional columns based on the data already present

display(noexistingDF)  # successfully displays data

# Save the data to the Delta table
noexistingDF.write.format("delta").mode("append").save(fileDestination)

display(noexistingDF)  # suddenly the dataframe is empty


8 REPLIES

-werners-
Esteemed Contributor III

Yes. Spark is lazily evaluated, meaning it executes code only when an action is called. display() is such an action, and so is write.

So Spark first executes the query for the write (read, transform, write), and then executes the same query a second time for the display (read, transform, display). On the second run, the left_anti join returns nothing, because the data has already been appended to the Delta table.

If you read from and wrote to different tables, the dataframe would not be empty (as the table being read has not changed).
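To make the lazy evaluation concrete, here is the flow from the question annotated with what is lazy and what triggers execution (a sketch reusing the names from the simplified code above, assuming filePath and fileDestination point at the same Delta table):

# Nothing runs yet: these lines only build a query plan
deltaDF = spark.read.format("delta").load(filePath)
noexistingDF = DF.join(deltaDF, on=condition, how="left_anti")

# Action 1: the whole plan executes (read -> join -> write),
# appending the new rows to the Delta table
noexistingDF.write.format("delta").mode("append").save(fileDestination)

# Action 2: the SAME plan executes again from scratch; the read now sees
# the freshly appended rows, the anti join filters everything out,
# and the displayed result is empty
display(noexistingDF)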

Christine
Contributor

Okay, thank you! Do you know if there is a way to copy the table as a workaround, so that display does not re-run the transformation against the read table but shows the data as it was before saving?

-werners-
Esteemed Contributor III

There are several ways.

But they all come down to the same thing: writing the df to disk.

So if you write noexistingDF to disk (via spark.write or checkpoint) and then read it back, you're there.

Copying the Delta table itself seems overkill (although it can be done).
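For example, a sketch of both approaches (assuming a writable checkpoint directory; stagingPath is a made-up location used only for illustration):

# Option 1: checkpoint() materializes the dataframe and cuts its lineage,
# so later actions no longer re-read the original Delta table
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
materializedDF = noexistingDF.checkpoint()
materializedDF.write.format("delta").mode("append").save(fileDestination)
display(materializedDF)  # still shows the rows that were just appended

# Option 2: write to a separate staging location first, then read it back
noexistingDF.write.format("parquet").mode("overwrite").save(stagingPath)
stagedDF = spark.read.format("parquet").load(stagingPath)
stagedDF.write.format("delta").mode("append").save(fileDestination)
display(stagedDF)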

Anonymous
Not applicable

Hi @Christine Pedersen, please let us know if @Werner Stinckens' answer helped you in mitigating the issue, or if you need any further help on this.

Christine
Contributor

Hi @Chetan Kardekar, the replies did answer my question, so I do not need more information, thank you.

Anonymous
Not applicable

Hey there @Christine Pedersen,

Hope everything is going great!

Would you be happy to circle back and mark an answer as best? It would help other members find the solution more quickly.

Cheers!

Christine
Contributor

Hey @Vartika Nain,

Yes of course.

Cheers!

SharathE
New Contributor II

Hi, I'm also having a similar issue. Does creating a temp view and reading it again after saving to a table work?

 
