Spark dataframe performing poorly

qwerty3
New Contributor III

I have huge datasets. Transformations, display, print, and show all work well on this data when it is read into a pandas DataFrame. But once the same data is converted to a Spark DataFrame, it takes minutes to display even a single row and hours to write the data to a Delta table.
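For reference, a minimal sketch of the conversion being described (assuming the Spark DataFrame comes from spark.createDataFrame; the column names are illustrative, not from the original post):

# Sketch: converting a pandas DataFrame to a Spark DataFrame
import pandas as pd

# Arrow-based conversion avoids serializing rows one at a time through
# the JVM; it is enabled by default in recent Databricks runtimes.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pdf = pd.DataFrame({"id": range(5), "value": ["a", "b", "c", "d", "e"]})
df = spark.createDataFrame(pdf)  # builds a plan; nothing runs yet

df.show(1)  # an action: this is what actually triggers computation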

21 REPLIES

gchandra
Databricks Employee

Can you please share the code snippet?



~

qwerty3
New Contributor III
df.write.format("delta").saveAsTable("test_db.test_spark_df")
This took 3 minutes to complete with just 5 rows and 4 columns.

The actual datasets are not getting written at all.


gchandra
Databricks Employee

Three minutes to write 5 rows is not right.

Are you running this on a shared cluster with many other jobs? Would it be possible to test this on a personal cluster to isolate the issue?

Try displaying the DataFrame in one cell with display(df), and saving it in another cell.
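A minimal sketch of that two-cell split, reusing the table name from the snippet above:

# Cell 1: display alone, to see whether reading/transforming is the slow part
display(df)

# Cell 2: write alone, to see whether the write itself is the slow part
df.write.format("delta").saveAsTable("test_db.test_spark_df")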



~

qwerty3
New Contributor III

The cluster I was using to execute this was not performing any other tasks, although the Azure quota for this cluster family's CPUs was at 83% at the time. I created a new cluster from a family that had all of its cores available, and Spark is working well there. But even at 83% quota utilization, should the earlier cluster (the high-memory one) perform so poorly?

gchandra
Databricks Employee

It's good to hear it worked on the new cluster family.

If the quota is already at 83%, the number of nodes your cluster needs matters. If Azure cannot provision that many resources, performance will suffer.

To verify this, reduce the number of nodes so your cluster can start the job and complete it.



~

qwerty3
New Contributor III

Earlier I was using the EA family of clusters, which are memory optimized. Now that I have shifted to general-purpose compute, the same data is getting written in seconds. Are the EA memory-optimized clusters not very performant for Spark operations?

gchandra
Databricks Employee

For processing 5 rows, EA vs. non-EA doesn't matter.

As you mentioned before, it could be non-availability of cluster capacity within your quota.



~

qwerty3
New Contributor III

But even with general-purpose compute (256 GB memory, 64 cores, 8 max worker nodes, working solely on one task, i.e. one notebook), I am not able to write one DataFrame as a Delta table. It contains geospatial data and must have rows in the lakhs (hundreds of thousands).

gchandra
Databricks Employee

It could be skew, your partitioning, anything.

Without seeing the script, the schema, the number of rows, and the Spark UI output, it's hard to say what is wrong.
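As a rough sketch, one way to look for skew before digging into the Spark UI (the partition count of 64 is an arbitrary illustrative value, and mode("overwrite") is assumed so the rewrite succeeds):

from pyspark.sql.functions import spark_partition_id

# How many partitions is the DataFrame split into?
print(df.rdd.getNumPartitions())

# Row count per partition: one partition far larger than the rest suggests skew
df.groupBy(spark_partition_id().alias("pid")).count().orderBy("pid").show()

# If partitions are badly unbalanced, rebalancing before the write can help
df.repartition(64).write.format("delta").mode("overwrite").saveAsTable("test_db.test_spark_df")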



~

qwerty3
New Contributor III

[Screenshot attachment: qwerty3_0-1727365368492.png, showing a count() call on the DataFrame]

gchandra
Databricks Employee

🙂 count() is just the action.

What transformations are you doing on the DataFrame? Approximately how many columns and rows are you anticipating?
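To illustrate the lazy-evaluation point (a sketch; the column names are illustrative):

# Transformations are lazy: these lines only build a query plan
filtered = df.filter(df["value"].isNotNull())
projected = filtered.select("id", "value")

# An action runs the whole plan, so a slow count() usually means the
# upstream transformations or the input read are slow, not count() itself
projected.count()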



~

qwerty3
New Contributor III

I want to write that data to a table, but it always gets stuck. It has 12 columns. Since the write kept getting stuck, I wanted to see the count of the data first.

gchandra
Databricks Employee

One last time: please share the entire script for the DataFrame so I can see how I can help.



~
