05-24-2022 11:42 PM
Hi,
I am facing a problem that I hope to get some help understanding.
I have created a function that is supposed to check if the input data already exists in a saved delta table; if not, it should perform some calculations and append the new data to the table.
When I run the code without saving the data, I can display the dataframe (display(dataframe)). However, after appending the data from the dataframe to the delta table, a new run of display(dataframe) suddenly indicates that the dataframe is empty. Can somebody help me understand why the dataframe is displayed as empty, when the only change is that the data has been saved to the delta table? Does "display" somehow run the join again?
Thank you!
Simplified code
#Load existing delta table
deltaDF = spark.read.format('delta').load(filePath)
#Remove any row that is identical to already existing data
condition = [<relevant column comparisons>]
noexistingDF = DF.join(deltaDF, on=condition, how="left_anti")
#Adding some additional columns to the data based on the already present data
display(noexistingDF)  #successfully displays data
#Saving data to delta table
noexistingDF.write.format("delta").mode("append").save(fileDestination)
display(noexistingDF)  #Suddenly the dataframe is empty
05-25-2022 12:02 AM
Yes. Spark is lazily evaluated, meaning it only executes code when an action is triggered. display() is such an action, and so is write.
So Spark first executes the query for the write (read, transform, write), and then does the same a second time for the display (read, transform, display). By then the new rows have been appended to the delta table, so the left_anti join returns nothing.
If you read from and wrote to different tables, the df would not be empty (as the read table would not have changed).
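For illustration, here is a minimal sketch of that re-execution; incomingDF, condition and the path "/tmp/demo_delta" are hypothetical placeholders standing in for the variables in the original post:
#Hypothetical placeholders: incomingDF is the new input data, condition the join keys,
#and "/tmp/demo_delta" a delta table used for both the read and the write
deltaDF = spark.read.format("delta").load("/tmp/demo_delta")
newDF = incomingDF.join(deltaDF, on=condition, how="left_anti")
newDF.count()  #action 1: reads the table and runs the anti-join
newDF.write.format("delta").mode("append").save("/tmp/demo_delta")  #action 2: re-reads, re-joins, appends
newDF.count()  #action 3: re-reads the now-updated table, every row matches, 0 rows remain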
05-25-2022 12:25 AM
Okay, thank you! Do you know if there is a way to copy the table to work around this, so that display does not re-run the transformation against the read table but shows the data as it was before saving?
05-25-2022 12:33 AM
There are several ways.
But they all come down to the same thing: writing the df to disk.
So if you write noexistingDF to disk (via spark.write or checkpoint) and then read it back, you're there.
Copying the delta table itself seems like overkill (although it can be done).
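For example, here is a sketch of the checkpoint route; the checkpoint directory below is a hypothetical location, adjust it to your workspace:
#Pick a reliable location for checkpoints (hypothetical path)
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
#checkpoint() is eager by default: it materializes the dataframe to disk and returns
#a new DataFrame whose plan reads from the checkpoint instead of re-running the join
pinnedDF = noexistingDF.checkpoint()
pinnedDF.write.format("delta").mode("append").save(fileDestination)
display(pinnedDF)  #still shows the pre-append rows, since nothing re-reads the delta table
The same effect can be had by writing noexistingDF to a temporary location with spark.write and reading it back before the append.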
05-09-2024 11:18 AM
But what if I want to do some transformations after writing noexistingDF into the table, and then use that df later in my code?
05-31-2022 11:57 PM
Hi @Christine Pedersen, please let us know if @Werner Stinckens' answer helped you mitigate the issue, or if you need any further help on this.
06-01-2022 06:49 AM
Hi @Chetan Kardekar, the replies did answer my question, so I do not need more information, thank you.
07-22-2022 08:09 AM
Hey there @Christine Pedersen
Hope everything is going great!
Would you be happy to circle back and mark an answer as best? It would be really helpful for the other members to find the solution more quickly.
Cheers!
07-31-2022 11:45 PM
Hey @Vartika Nain
Yes of course.
Cheers!
09-23-2023 11:04 AM
Hi, I'm also having a similar issue. Does creating a temp view and reading it again after saving to a table work?