Data in the DataFrame is also deleted when we delete records from the underlying table

New Contributor



Hi, we are trying to load data from a Delta table into a DataFrame (as a copy of the original table). Initially the Delta table has a count of 911. The DataFrame into which the data is loaded has the same count.

Now we delete some records from the Delta table. After the delete, the count in the Delta table is 878. We expected the DataFrame to still have 911 records, because we loaded the data into it before deleting records from the table. However, the data in the DataFrame gets deleted as well. We need to keep a copy of the original table before processing, but this approach does not give us one.










Contributor II

Hi, there is a way to retain a copy of the DataFrame even if the data in the underlying table is modified, but it is a memory-expensive operation, so be careful when using it.


df1 = spark.createDataFrame(df.rdd.map(lambda x: x), schema=df.schema)

Here we are creating a new DataFrame from the existing DataFrame with the help of its RDD.
RDD is the fundamental data structure in Spark that represents an immutable distributed collection of objects.
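
For anyone who wants to reuse this, here is the same idea wrapped in a small helper. This is only a sketch: `detached_copy` is a name made up for this post, `spark` is assumed to be an active SparkSession, and `df` is assumed to be the DataFrame that was loaded from the table. It makes the same `df.rdd` and `spark.createDataFrame` calls as the one-liner above, nothing more.

```python
def detached_copy(spark, df):
    """Rebuild `df` from its RDD, keeping the original schema.

    `detached_copy` is a hypothetical helper name; `spark` is assumed to be
    an active SparkSession and `df` a DataFrame loaded from the table.
    """
    # df.rdd exposes the DataFrame's rows as an RDD; the identity map
    # returns every row unchanged.
    rows = df.rdd.map(lambda row: row)
    # Passing the original schema preserves column names and types.
    return spark.createDataFrame(rows, schema=df.schema)


# Example usage (table name is hypothetical), run before any delete:
#   df = spark.read.table("my_schema.my_table")
#   df1 = detached_copy(spark, df)
```

Make the copy before running the delete on the table. In this thread the copy kept its rows after the delete, but it is worth confirming with a `count()` on your own cluster.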

I want to thank you for this, because I was going crazy when my DataFrame suddenly became empty.

I used this method to easily delete duplicates from a Unity Catalog table: I de-duplicated with PySpark only the filtered DataFrame containing the duplicates, deleted the duplicates from the table (the first and all other occurrences), and appended the clean DataFrame (which previously ended up empty, as described in this thread).


I can't believe all the answers I got from "professionals" for removing duplicates involved OVERWRITING the table. My table is huge; imagine how inefficient that would be for just a couple of hundred rows or less.

