performance issues using shared compute access mode in scala

-werners- — Wed, 21 Feb 2024 15:16:38 GMT

On our dev environment I created a cluster using the shared access mode for our devs to use (instead of separate single-user clusters).

What I notice is that the performance of this cluster is terrible. And I mean really terrible: notebook cells without any action, so just DataFrame definitions, take minutes to complete, even though nothing has to be computed (Spark evaluates lazily).

When I disable shared compute (so change to single user), performance is reasonable again.

Any ideas?
At the moment I am the only user using the cluster, so it can't be the cluster load.

Re: performance issues using shared compute access mode in scala

-werners- — Thu, 22 Feb 2024 13:20:51 GMT

Thanks for the answer!

It seems that using shared access mode adds overhead. The nodes/driver are not stressed at all (CPU/RAM/network).
We use UC only.
The cluster seems to be configured correctly (using the same cluster in single user mode changes performance drastically).
Calculating a query plan should not take more than 5 minutes imo.
Physically printing the query plan takes about 40 secs in single user mode, but takes over 5 minutes in shared.
And the only thing that has changed is the access mode.
So my tentative conclusion is that shared mode adds a massive overhead.
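
For anyone who wants to put numbers on this, here is a minimal sketch of how the definition step and the plan-printing step could be timed separately. It is written in Python purely as an illustration: `time_phases` and `plan_phases` are hypothetical helper names, `spark` is assumed to be the notebook's session, and the table name is a placeholder. `DataFrame.explain()` exists in the Scala API as well, so the same idea carries over.

```python
import time


def time_phases(phases):
    """Run each (label, zero-argument callable) pair in order; return {label: seconds}."""
    timings = {}
    for label, fn in phases:
        start = time.perf_counter()
        fn()
        timings[label] = time.perf_counter() - start
    return timings


def plan_phases(spark, table_name):
    """Build the two phases to compare; `spark` is assumed to be the notebook session."""
    state = {}

    def define():
        # Transformation only: defining the DataFrame should not run a job.
        state["df"] = spark.table(table_name).filter("1 = 1")

    def explain():
        # Prints the query plan; the query itself is not executed.
        state["df"].explain()

    return [("define", define), ("explain", explain)]


# In a notebook (the table name is a placeholder, not a real object):
# print(time_phases(plan_phases(spark, "main.dev.some_table")))
```

Running the same cell once in single user mode and once in shared mode would show which of the two steps accounts for the extra time.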

Re: performance issues using shared compute access mode in scala

prakharcode — Tue, 10 Sep 2024 10:00:26 GMT

I can confirm this behaviour. Running the same job on a shared cluster in "USER_ISOLATION" mode, with nothing changed in the job definition or the source data, the performance drop is significant. So much so that it would require a radical change in how we process data.

Re: performance issues using shared compute access mode in scala

vr — Fri, 13 Jun 2025 22:46:59 GMT

I am experiencing a huge performance difference between shared and dedicated compute with spark.createDataFrame(pandas_df). Same code, same data, but it completes in 6 s on a dedicated cluster and takes 6+ minutes on the shared cluster. That is a more than 60x difference!
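
A rough sketch of how the comparison could be measured, assuming a notebook where `spark` and an existing pandas DataFrame `pandas_df` are available. The helper name `time_create_dataframe` and its `repeats` parameter are made up for illustration. It times only the `spark.createDataFrame` call and repeats it a few times to smooth out one-off effects.

```python
import time


def time_create_dataframe(spark, pandas_df, repeats=3):
    """Time spark.createDataFrame(pandas_df) several times; return seconds per run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        spark.createDataFrame(pandas_df)
        durations.append(time.perf_counter() - start)
    return durations


# In a notebook, once on the dedicated cluster and once on the shared one:
# runs = time_create_dataframe(spark, pandas_df)
# print(f"best {min(runs):.1f} s, worst {max(runs):.1f} s")
```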