
Data in DataFrame also gets deleted when we delete records from the underlying table

nikhilprajapati
New Contributor
Hi, we are trying to load data from a Delta table into a DataFrame (as a copy of the original table). Initially the Delta table has a count of 911, and the DataFrame the data is loaded into shows the same count.

Now we delete some records from the Delta table. After the delete, the count in the Delta table is 878. However, the DataFrame should still have 911 records, because we loaded it before deleting records from the table. Instead, the data in the DataFrame gets "deleted" as well. We need to keep a copy of the original table before processing, and this approach is not working.
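In code, the behavior looks roughly like this (a minimal sketch; the table name and delete predicate are illustrative):

df = spark.read.table("source_table")   # hypothetical Delta table with 911 rows
print(df.count())                       # 911

# Delete some rows from the underlying Delta table.
spark.sql("DELETE FROM source_table WHERE status = 'obsolete'")  # illustrative predicate

print(df.count())                       # 878 -- the DataFrame lazily re-reads the table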

[Four screenshots from the original post showing the notebook cells and counts described above.]

2 REPLIES

Hkesharwani
Contributor II

Hi, there is a way to retain a copy of the DataFrame even if the data in the underlying table is manipulated, but it is a memory-expensive operation, so be careful when using it.


# Rebuild the DataFrame via its RDD so the copy no longer re-reads the source table.
df1 = spark.createDataFrame(df.rdd.map(lambda x: x), schema=df.schema)

Here we create a new DataFrame from the existing one with the help of its RDD. An RDD is the fundamental data structure in Spark: an immutable, distributed collection of objects. This matters because a DataFrame read from a table is only a lazy query plan, so every action (such as count()) re-reads the table's current state; rebuilding through the RDD detaches the copy from that plan.
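A quick way to check the fix (a sketch reusing the counts from the original post; the table name and predicate are illustrative, and cache()/count() is an optional extra that forces the copy to be materialized before the table is touched):

df = spark.read.table("source_table")                        # hypothetical table name
df1 = spark.createDataFrame(df.rdd.map(lambda x: x), schema=df.schema)
df1.cache()                                                  # optional: pin the copy in memory
print(df1.count())                                           # 911 -- materializes the copy

spark.sql("DELETE FROM source_table WHERE status = 'obsolete'")  # illustrative predicate

print(spark.read.table("source_table").count())              # 878
print(df1.count())                                           # still 911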

Harshit Kesharwani
Data engineer at Rsystema

I want to thank you for this, because I was going crazy over my DataFrame suddenly becoming empty.

I used this method to remove duplicates from a Unity Catalog table: with PySpark I de-duplicated only the filtered DataFrame containing the duplicates, deleted the duplicates from the table (first and all other occurrences included), and appended the clean DataFrame back (which, before the fix, was ending up empty as described in this thread). A sketch of the flow follows below.

I can't believe that all the answers I got from "professionals" for removing duplicates involved OVERWRITING the table. My table is huge; imagine how inefficient that would be just to remove a couple of hundred rows or fewer.
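For reference, here is a sketch of that flow (the table name and key column are illustrative; step 2 is the RDD round-trip from the reply above, plus an optional cache to materialize the copy before the delete):

from pyspark.sql import functions as F

tbl = "main.default.events"   # hypothetical Unity Catalog table
key = "event_id"              # hypothetical column that identifies duplicates

df = spark.read.table(tbl)

# 1. Find the keys that occur more than once, and the rows carrying them.
dup_keys = df.groupBy(key).count().filter(F.col("count") > 1).select(key)
dups = df.join(dup_keys, key, "inner")

# 2. Decouple the duplicates from the live table before mutating it.
dups = spark.createDataFrame(dups.rdd.map(lambda x: x), schema=dups.schema)
dups.cache()
dups.count()   # materialize the copy

# 3. De-duplicate the copy with PySpark.
clean = dups.dropDuplicates([key])

# 4. Delete ALL occurrences of the duplicated keys from the table (first included).
dup_keys.createOrReplaceTempView("dup_keys")
spark.sql(f"DELETE FROM {tbl} WHERE {key} IN (SELECT {key} FROM dup_keys)")

# 5. Append the cleaned rows back -- no full-table overwrite needed.
clean.write.format("delta").mode("append").saveAsTable(tbl)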
