Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

PySpark dataframe empties after it has been saved to Delta Lake

Christine
Contributor II

Hi,

I am facing a problem that I hope to get some help understanding.

I have created a function that is supposed to check whether the input data already exists in a saved Delta table; if not, it should perform some calculations and append the new data to the table.

When I run the code without saving the data, it is possible to display the dataframe (display(dataframe)). However, after appending the data from the dataframe to the Delta table, a new run of display(dataframe) suddenly indicates that the dataframe is empty. Can somebody help me understand why the dataframe is displayed as empty, when the only change is that the data has been saved to the Delta table? Does "display" somehow run the join again?

Thank you!

Simplified code:

# Load the existing Delta table
deltaDF = spark.read.format("delta").load(filePath)

# Remove any row that is identical to already existing data
condition = [<relevant column comparisons>]
noexistingDF = DF.join(deltaDF, on=condition, how="left_anti")

# Add some additional columns to the data based on the already present data

display(noexistingDF)  # successfully displays data

# Save the data to the Delta table
noexistingDF.write.format("delta").mode("append").save(fileDestination)

display(noexistingDF)  # suddenly the dataframe is empty

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

Yes. Spark is lazily evaluated, meaning it executes code only when an action is triggered; display() is such an action, and so is write.

So Spark first executes the query for the write (read, transform, write), and then runs the same query a second time for the display (read, transform, display). On that second run the left_anti join returns nothing, because the data has already been appended to the Delta table.

If you read from and wrote to different tables, the df would not be empty (as the table being read has not changed).
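As an illustration, here is a minimal sketch (reusing the variable names from the simplified code above) that pins the anti-join result in memory before the write, so the later display() reuses the materialized rows instead of re-running the join against the updated table. Note that cache() is best-effort: if the cached blocks are evicted, Spark can still recompute the plan, so writing the df to disk (discussed below) is the more robust fix.

# Sketch: materialize noexistingDF before appending (names reused from above)
noexistingDF = DF.join(deltaDF, on=condition, how="left_anti")
noexistingDF.cache()   # mark the result for in-memory caching
noexistingDF.count()   # an action, forces the cached plan to execute now

noexistingDF.write.format("delta").mode("append").save(fileDestination)

display(noexistingDF)  # served from the cache, so it still shows the rows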


9 REPLIES

Christine
Contributor II

Okay, thank you! Do you know if there is a way to copy the table to work around it, so that display does not re-run the transformation against the read table but shows the data as it was before saving?

-werners-
Esteemed Contributor III

There are several ways, but they all come down to the same thing: writing the df to disk.

So if you write noexistingDF to disk (via spark.write or checkpoint) and then read it back, you're there.

Copying the Delta table itself seems overkill (although it can be done).
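For illustration, a minimal sketch of the checkpoint variant (the checkpoint directory path is hypothetical). DataFrame.checkpoint() eagerly writes the partitions to the checkpoint directory and returns a dataframe that reads them back, cutting the lineage to the Delta table:

# Sketch: snapshot the dataframe to disk before the append
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical path

snapshotDF = noexistingDF.checkpoint()  # eager by default: writes, then re-reads

snapshotDF.write.format("delta").mode("append").save(fileDestination)

display(snapshotDF)  # reads the checkpointed files, unaffected by the append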

mayur_05
New Contributor II

But what if I want to do some transformations after writing noexistingDF into the table, and then use that df later in my code?
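One pattern consistent with the suggestion above (a sketch; the added column is hypothetical): checkpoint the df before writing, then run the later transformations on the checkpointed snapshot rather than on the original lazy plan:

from pyspark.sql.functions import lit

# Sketch: persist first, then keep transforming the persisted dataframe
noexistingDF = noexistingDF.checkpoint()  # requires a checkpoint dir (see above)
noexistingDF.write.format("delta").mode("append").save(fileDestination)

# Later transformations now start from the checkpointed snapshot,
# not from a re-evaluated anti-join against the updated table.
laterDF = noexistingDF.withColumn("processed", lit(True))  # hypothetical column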

Anonymous
Not applicable

Hi @Christine Pedersen, please let us know if @Werner Stinckens' answer helped you in mitigating the issue, or whether you need any further help on this.

Christine
Contributor II

Hi @Chetan Kardekar, the replies did answer my question, so I do not need more information. Thank you.

Anonymous
Not applicable

Hey there @Christine Pedersen

Hope everything is going great!

Would you be happy to circle back and mark an answer as best? It would be really helpful for the other members to find the solution more quickly.

Cheers!

Christine
Contributor II

Hey @Vartika Nain

Yes of course.

Cheers!

SharathE
New Contributor III

Hi, I'm also having a similar issue. Does creating a temp view and reading it again after saving to a table work?

 
