
Data in DataFrame also gets deleted when we delete records from the underlying table

nikhilprajapati
New Contributor
Hi, we are trying to load data from a Delta table into a DataFrame (as a copy of the original table). Initially the Delta table has a count of 911, and the DataFrame the data is loaded into shows the same count.

Now we delete some records from the Delta table. After the delete, the count in the Delta table is 878. However, the DataFrame should still have 911 records, because we loaded it before deleting records from the table. Instead, the data in the DataFrame gets "deleted" as well. We need to keep a copy of the original table before processing, and this approach is not working.
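In code, the behavior looks roughly like this (a minimal sketch; the table name and delete predicate are illustrative):

df = spark.read.table("source_table")   # hypothetical Delta table with 911 rows
print(df.count())                       # 911

# Delete some rows from the underlying Delta table.
spark.sql("DELETE FROM source_table WHERE status = 'obsolete'")  # illustrative predicate

print(df.count())                       # 878 -- the DataFrame lazily re-reads the table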

[Four screenshots from the original post showing the notebook cells and counts described above.]

2 REPLIES

Hkesharwani
Contributor II

Hi, there is a way to retain a copy of the DataFrame even if the data in the underlying table is manipulated, but it is a memory-expensive operation, so be careful when using it.


# Rebuild the DataFrame via its RDD so the copy no longer re-reads the source table.
df1 = spark.createDataFrame(df.rdd.map(lambda x: x), schema=df.schema)

Here we create a new DataFrame from the existing one with the help of its RDD. An RDD is the fundamental data structure in Spark: an immutable, distributed collection of objects. This matters because a DataFrame read from a table is only a lazy query plan, so every action (such as count()) re-reads the table's current state; rebuilding through the RDD detaches the copy from that plan.
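A quick way to check the fix (a sketch reusing the counts from the original post; the table name and predicate are illustrative, and cache()/count() is an optional extra that forces the copy to be materialized before the table is touched):

df = spark.read.table("source_table")                        # hypothetical table name
df1 = spark.createDataFrame(df.rdd.map(lambda x: x), schema=df.schema)
df1.cache()                                                  # optional: pin the copy in memory
print(df1.count())                                           # 911 -- materializes the copy

spark.sql("DELETE FROM source_table WHERE status = 'obsolete'")  # illustrative predicate

print(spark.read.table("source_table").count())              # 878
print(df1.count())                                           # still 911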

Harshit Kesharwani
Data engineer at Rsystema

I want to thank you for this, because I was going crazy over my DataFrame suddenly becoming empty.

I used this method to remove duplicates from a Unity Catalog table: with PySpark I de-duplicated only the filtered DataFrame containing the duplicates, deleted the duplicates from the table (first and all other occurrences included), and appended the clean DataFrame back (which, before the fix, was ending up empty as described in this thread). A sketch of the flow follows below.

I can't believe that all the answers I got from "professionals" for removing duplicates involved OVERWRITING the table. My table is huge; imagine how inefficient that would be just to remove a couple of hundred rows or fewer.
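For reference, here is a sketch of that flow (the table name and key column are illustrative; step 2 is the RDD round-trip from the reply above, plus an optional cache to materialize the copy before the delete):

from pyspark.sql import functions as F

tbl = "main.default.events"   # hypothetical Unity Catalog table
key = "event_id"              # hypothetical column that identifies duplicates

df = spark.read.table(tbl)

# 1. Find the keys that occur more than once, and the rows carrying them.
dup_keys = df.groupBy(key).count().filter(F.col("count") > 1).select(key)
dups = df.join(dup_keys, key, "inner")

# 2. Decouple the duplicates from the live table before mutating it.
dups = spark.createDataFrame(dups.rdd.map(lambda x: x), schema=dups.schema)
dups.cache()
dups.count()   # materialize the copy

# 3. De-duplicate the copy with PySpark.
clean = dups.dropDuplicates([key])

# 4. Delete ALL occurrences of the duplicated keys from the table (first included).
dup_keys.createOrReplaceTempView("dup_keys")
spark.sql(f"DELETE FROM {tbl} WHERE {key} IN (SELECT {key} FROM dup_keys)")

# 5. Append the cleaned rows back -- no full-table overwrite needed.
clean.write.format("delta").mode("append").saveAsTable(tbl)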
