Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Data in DataFrame is also getting deleted when we delete records from the underlying table

nikhilprajapati
New Contributor

 

 

Hi, we are trying to load data from a Delta table into a DataFrame (as a copy of the original table). Initially the Delta table has a count of 911, and the DataFrame loaded from it shows the same count.

Now we delete some records from the Delta table. After the delete, the count in the Delta table is 878. The DataFrame should still have 911 records, because we loaded it before deleting records from the table. But the count in the DataFrame drops as well. We need to keep a copy of the original table before processing, and this approach is not working.
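A minimal sketch of this sequence, for anyone who wants to reproduce it; the table name and delete predicate are illustrative, not from the original post:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Load the Delta table into a DataFrame, intended as a "copy".
    df = spark.read.table("catalog.schema.source_table")  # illustrative name
    print(df.count())  # 911

    # Delete some records from the underlying Delta table.
    spark.sql("DELETE FROM catalog.schema.source_table WHERE flag = 'obsolete'")

    # A DataFrame is a lazily evaluated query plan, not a snapshot: this
    # action re-reads the table and now reflects the delete.
    print(df.count())  # 878, not the expected 911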

[Screenshots: notebook cells showing the table and DataFrame counts before and after the delete]

2 REPLIES

Hkesharwani
Contributor II

Hi, there is a way to retain a copy of the DataFrame even if the data in the underlying table is manipulated, but it is a memory-expensive operation, so be careful when using it.


# Rebuild the DataFrame via an RDD round-trip so it keeps its own copy of the rows:
df1 = spark.createDataFrame(df.rdd.map(lambda x: x), schema=df.schema)

Here we are creating a new DataFrame from the existing one with the help of an RDD. An RDD is the fundamental data structure in Spark: an immutable, distributed collection of objects. The root cause of the behavior above is that a DataFrame is a lazily evaluated query plan, not a snapshot; every action re-reads the table's current state, so deletes on the table show up in the DataFrame too. Rebuilding the DataFrame through an RDD round-trip breaks that direct link to the table scan.
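A sketch of how that one-liner might be verified end to end. The cache() and eager count() are additions of mine, not part of the reply, to force materialization before the table changes; the table name and predicate are illustrative:

    # Start from a DataFrame over the (illustrative) Delta table.
    df = spark.read.table("catalog.schema.source_table")

    # Copy via RDD round-trip, as in the reply above.
    df1 = spark.createDataFrame(df.rdd.map(lambda x: x), schema=df.schema)

    # Additions: pin the copy in memory and force an action, so the rows
    # are materialized before the underlying table is mutated.
    df1.cache()
    print(df1.count())  # 911

    # Mutate the underlying Delta table.
    spark.sql("DELETE FROM catalog.schema.source_table WHERE flag = 'obsolete'")

    print(df.count())   # 878 -- the original DataFrame re-reads the table
    print(df1.count())  # 911 -- the materialized copy is unaffected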

Harshit Kesharwani
Self-taught Data Engineer | Seeking Remote Full-time Opportunities

I want to thank you for this, because I was going crazy over my DataFrame suddenly becoming empty.

I used this method to easily delete duplicates from a Unity Catalog table: I de-duplicated, with PySpark, only the filtered DataFrame containing the duplicate rows, deleted the duplicates from the table (the first occurrence along with all others), and appended the clean DataFrame back (which, before this fix, was coming up empty, as seen in this thread).
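A rough sketch of that workflow under stated assumptions: the table name and the duplicate-key column ("id") are illustrative, spark is the ambient session of a Databricks notebook, and the DELETE-with-subquery syntax is the Databricks SQL form:

    from pyspark.sql import functions as F

    table = "catalog.schema.events"  # illustrative Unity Catalog table
    keys = ["id"]                    # illustrative duplicate-key column(s)

    df = spark.read.table(table)

    # Keys that occur more than once, and the duplicated rows themselves.
    dup_keys = df.groupBy(*keys).count().filter(F.col("count") > 1).select(*keys)
    dups = df.join(dup_keys, on=keys, how="inner")

    # Materialize one clean copy of each duplicated row BEFORE touching the
    # table, using the RDD round-trip from this thread.
    clean = spark.createDataFrame(
        dups.dropDuplicates(keys).rdd.map(lambda x: x),
        schema=dups.schema,
    )
    clean.cache()
    clean.count()  # force evaluation

    # Delete every occurrence (first and all others) of the duplicated keys...
    dup_keys.createOrReplaceTempView("dup_keys")
    spark.sql(f"DELETE FROM {table} WHERE id IN (SELECT id FROM dup_keys)")

    # ...then append the clean rows back, without rewriting the whole table.
    clean.write.mode("append").saveAsTable(table)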

 

I can't believe all the answers I got from "professionals" for removing duplicates involved OVERWRITING the table. My table is huge; imagine how inefficient that would be just for a couple of hundred rows or less.
