yesterday
I have huge datasets. Transformation, display, and print operations all work well on this data when it is read into a pandas DataFrame. But the same DataFrame, once converted to a Spark DataFrame, takes minutes to display even a single row and hours to write the data to a Delta table.
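A minimal sketch of the conversion path described above, assuming a Databricks notebook where a `spark` session exists; the frame `pdf`, the helper `target_partitions`, and the table name `events_delta` are all illustrative, not from the thread. One common cause of slow Delta writes from a converted pandas frame is a badly sized partition layout, so the sketch repartitions before writing:

```python
import pandas as pd

# Placeholder data standing in for the "huge dataset" from the question.
pdf = pd.DataFrame({"id": range(1_000), "value": [x * 0.5 for x in range(1_000)]})

# Hypothetical helper: aim for ~128 MB per Spark partition, a common rule of
# thumb for Delta writes (an assumption, not something stated in the thread).
def target_partitions(total_bytes: int, partition_bytes: int = 128 * 1024 * 1024) -> int:
    return max(1, -(-total_bytes // partition_bytes))  # ceiling division

n_parts = target_partitions(int(pdf.memory_usage(deep=True).sum()))

# The Spark/Delta steps need a Databricks (or delta-spark) runtime, so they
# are shown here but not executed:
# sdf = spark.createDataFrame(pdf).repartition(n_parts)
# sdf.write.format("delta").mode("overwrite").saveAsTable("events_delta")
```

For a frame this small, `target_partitions` returns 1; the repartition step only matters once the data is large enough to span multiple 128 MB chunks.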
yesterday
Can you please share the code snippet?
yesterday
3 mins to write 5 rows is no good.
Are you running this on a shared cluster alongside many other jobs? Would it be possible to test this on a personal cluster to isolate the issue?
Try displaying the DataFrame in one cell with display(df), and save it in a separate cell.
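One way to act on this suggestion, sketched under the assumption of a Databricks notebook (`display` and `df` live notebook-side, and the `timed` wrapper is a hypothetical addition for comparing the cost of the two steps):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    """Print how long a cell-sized step takes, to separate display cost from write cost."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f}s")

# Cell 1 - display only (Databricks-only call, so commented out here):
# with timed("display"):
#     display(df.limit(5))

# Cell 2 - write only ("my_table" is a placeholder name):
# with timed("delta write"):
#     df.write.format("delta").mode("append").saveAsTable("my_table")
```

Splitting the two actions into separate cells shows whether the minutes are spent materializing rows for display or in the Delta write itself.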
8 hours ago
The cluster I was using to execute this was not performing any other tasks, although the Azure quota for that cluster family's CPUs was at 83% at the time. I created a new cluster from a family that had all of its cores available, and there Spark works well. But even at 83% utilization, should the earlier cluster (the high-memory one) have performed so poorly?
7 hours ago
It's good to hear it worked on the new cluster family.
If the quota is already at 83%, the number of nodes your cluster requests matters: if Azure cannot provision that many resources, the cluster can end up under-provisioned and perform poorly.
To test this, reduce the number of nodes so your cluster can start the job and complete it within the remaining quota.
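A sketch of what "reduce the number of nodes" could look like as a cluster spec, expressed as a Python dict in the shape of a Databricks Clusters API payload; the node type, worker counts, and cores-per-node figure are illustrative assumptions:

```python
# Shrunken cluster spec so the request fits inside the remaining Azure core quota.
# Field names follow the Databricks Clusters API; the values are assumptions.
cluster_spec = {
    "cluster_name": "quota-test",          # placeholder name
    "node_type_id": "Standard_DS3_v2",     # 4 cores per node (illustrative)
    "autoscale": {
        "min_workers": 1,                  # start small ...
        "max_workers": 2,                  # ... and cap low so provisioning succeeds
    },
}

# Rough peak core demand: max workers plus one driver node, 4 cores each.
cores_needed = (cluster_spec["autoscale"]["max_workers"] + 1) * 4
```

If the job completes on this reduced footprint, the earlier slowness points at Azure failing to provision the larger node count rather than at Spark itself.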