Performance issues using shared compute access mode in Scala

-werners-
Esteemed Contributor III

On our dev environment I created a cluster using the shared access mode for our developers to use (instead of separate single-user clusters).

What I notice is that the performance of this cluster is terrible. And I mean really terrible: notebook cells without any action, i.e. just DataFrame definitions, take minutes to complete, even though nothing has to be computed (lazy evaluation in Spark).

When I disable shared compute (i.e. switch to single user), performance is reasonable again.

Any ideas?
At the moment I am the only user using the cluster, so it can't be the cluster load.
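One way to put numbers on this is to time a DataFrame definition separately from an action, once per access mode. The cell below is only a minimal sketch: it assumes a Databricks Scala notebook where `spark` is predefined, uses a synthetic `spark.range` dataset rather than the real tables, and `timed` is a hypothetical helper name, not part of any library.

```scala
// Assumption: runs in a Databricks Scala notebook cell, where `spark` is predefined.
import org.apache.spark.sql.functions.col

// Hypothetical helper: runs a block, prints its wall-clock time,
// and returns the block's result together with the elapsed seconds.
def timed[T](label: String)(block: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = block
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$label: $seconds%.2f s")
  (result, seconds)
}

// Definition only: these transformations are lazy, so no Spark job should start here.
val (df, defineSecs) = timed("define DataFrame") {
  spark.range(0L, 1000000L)
    .withColumn("bucket", col("id") % 10)
    .groupBy("bucket")
    .count()
}

// Action: this is the first point where data is actually processed.
val (rowCount, actionSecs) = timed("count()") {
  df.count()
}
```

Running the same cell on the shared cluster and on the single user cluster should show whether the extra time is spent in the definition step or in the action.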

-werners-
Esteemed Contributor III

Thanks for the answer!

It seems that using shared access mode adds overhead.  The nodes/driver are not stressed at all (cpu/ram/network).
We use UC only.
The cluster seems to be configured correctly (running the same cluster in single user mode changes performance drastically).
Calculating a query plan should not take more than 5 minutes imo.
Just printing the query plan takes about 40 secs in single user mode, but over 5 minutes in shared mode.
And the only thing that has changed is the access mode.
So my tentative conclusion is that shared mode adds a massive overhead.
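For anyone who wants to reproduce the plan-printing comparison, a rough sketch follows. It assumes a Databricks Scala notebook with `spark` predefined and that `spark.time` is available in the chosen access mode; `planDf` is a synthetic stand-in, so absolute timings will differ from the real workload.

```scala
// Assumption: runs in a Databricks Scala notebook cell, where `spark` is predefined.
import org.apache.spark.sql.functions.col

// Synthetic stand-in for the real DataFrame whose plan is slow to print.
val planDf = spark.range(0L, 1000000L)
  .withColumn("bucket", col("id") % 10)
  .groupBy("bucket")
  .count()

// explain(true) prints the logical and physical plans; it does not run the query.
// spark.time executes the block and prints how long it took.
spark.time {
  planDf.explain(true)
}
```

Run it once in shared mode and once in single user mode; only the access mode should differ between the two runs.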

prakharcode
New Contributor III

I can confirm this behaviour. When we run the same job on a shared cluster in "USER_ISOLATION" mode, with no changes to the job definition or source data, the performance drop is significant. So much so that we would need to radically change how we process data.

vr
Valued Contributor

I am experiencing a huge performance difference between shared and dedicated compute with spark.createDataFrame(pandas_df). Same code, same data, but it completes in 6 s on a dedicated cluster and takes 6+ minutes on the shared cluster. That is a more than 60x difference!