performance issues using shared compute access mode in scala

-werners-
Esteemed Contributor III

On our dev environment I created a cluster using shared access mode, for our devs to use (instead of separate single-user clusters).

What I notice is that the performance of this cluster is terrible. And I mean really terrible: notebook cells without any action, so just DataFrame definitions, take minutes to complete, even though nothing has to be computed (lazy evaluation in Spark).

When I disable shared compute (so change to single user), performance is reasonable again.

Any ideas?
At the moment I am the only user using the cluster, so it can't be the cluster load.

2 REPLIES

Kaniz
Community Manager

Hi @-werners-, thank you for sharing your experience with the shared access mode cluster in your development environment. It’s essential to address performance issues promptly, especially when working with Spark clusters.

Let’s explore some potential reasons for the sluggish performance and discuss possible solutions:

  1. Shared Access Mode Limitations:

    • When using shared access mode (USER_ISOLATION), there are certain limitations that might impact performance. For instance:
      • Access Control: Shared clusters provide good user isolation, preventing unauthorized access to data. However, this can lead to additional checks and permissions when accessing files in DBFS or ADLS.
      • Unity Catalog (UC): If you’re using Unity Catalog with shared clusters, consider the following:
        • External Locations: For ADLS data accessed via abfss, create external locations and grant necessary permissions to users.
        • Unity Catalog Volumes: Instead of DBFS, encourage users to use Unity Catalog Volumes for unstructured data, configuration files, and libraries.
    • Ensure that your users have the appropriate permissions to access data sources.
  2. DBFS vs. Unity Catalog:

    • DBFS (Databricks File System) lacks fine-grained access control, while Unity Catalog provides better control over data access.
    • Consider migrating away from DBFS for non-temporary data and leverage Unity Catalog Volumes where possible.
  3. Cluster Configuration:

    • Verify that your cluster configuration is suitable for your workload. You mentioned that you’re the only user currently, so it’s unlikely to be cluster load-related.
    • Check the following settings:
      • Number of Workers: Ensure it’s sufficient for your tasks.
      • Spark Version: Keep it up-to-date.
      • Node Types: Choose an appropriate node type.
      • Spark Environment Variables: Set them appropriately.
      • Data Security Mode: Confirm it’s set to USER_ISOLATION.
      • Other Parameters: Review any other relevant settings.
  4. Monitoring and Profiling:

    • Use Databricks monitoring tools to identify bottlenecks. Check resource utilization, query execution plans, and any long-running tasks.
    • Profile your notebook cells to understand where the delays occur.
  5. Network Latency and Data Movement:

    • If your data resides in remote storage (e.g., ADLS), network latency can impact performance.
    • Optimize data movement and minimize shuffling.
  6. Driver Node Performance:

    • The driver node plays a crucial role in notebook execution. Ensure it has sufficient resources (CPU, memory).
  7. Review Spark Code:

    • Even though you mentioned lazy computing, review your Spark code within the notebook cells. Ensure there are no unintentional expensive operations.
  8. Cluster Restart:

    • Sometimes a cluster restart can resolve performance issues caused by resource fragmentation or other factors.

Remember that shared access mode provides strong user isolation, which is beneficial for security and data control. However, it requires thoughtful management of permissions and data access.
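One way to isolate the access mode as the only variable is to keep two cluster definitions that are identical except for the data security mode. The sketch below is only an illustration in plain Python: the field names follow the Databricks Clusters API, but the cluster name, runtime version, node type, worker count, and user name are placeholders (assumptions, not values from this thread).

```python
import copy

# Placeholder cluster definition; every value here is hypothetical.
base_spec = {
    "cluster_name": "dev-access-mode-test",
    "spark_version": "<runtime-version>",
    "node_type_id": "<node-type>",
    "num_workers": 2,
}


def with_access_mode(spec, mode, single_user_name=None):
    """Return a copy of spec with the given data security mode applied."""
    out = copy.deepcopy(spec)
    out["data_security_mode"] = mode
    if single_user_name is not None:
        out["single_user_name"] = single_user_name
    return out


def diff_keys(a, b):
    """List the keys whose values differ between two specs."""
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


shared_spec = with_access_mode(base_spec, "USER_ISOLATION")
single_spec = with_access_mode(
    base_spec, "SINGLE_USER", single_user_name="dev@example.com"
)

# Confirm the two definitions differ only in the access-mode fields.
print(diff_keys(shared_spec, single_spec))
```

If the difference list contains anything beyond the access-mode fields, the two clusters are not a clean comparison.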

-werners-
Esteemed Contributor III

Thanks for the answer!

It seems that using shared access mode adds overhead.  The nodes/driver are not stressed at all (cpu/ram/network).
We use UC only.
The cluster seems to be configured correctly (using the same cluster in single user mode changes performance drastically).
Calculating a query plan should not take more than 5 minutes imo.
Physically printing the query plan takes about 40 secs in single user mode, but takes over 5 minutes in shared.
And the only thing that has changed is the access mode.
So my tentative conclusion is that shared mode adds a massive overhead.
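For anyone who wants to reproduce the comparison, a small timing wrapper makes the numbers easy to collect in both access modes. This is a minimal sketch in plain Python, not the code used for the measurements above; `timed` and `fake_plan` are hypothetical names, and `fake_plan` only stands in for the real planning call.

```python
import time


def timed(label, fn, *args, **kwargs):
    """Run fn, print the wall-clock time it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


def fake_plan():
    """Stand-in for a call that triggers query planning."""
    time.sleep(0.05)
    return "plan"


result, seconds = timed("explain", fake_plan)
```

In a notebook the stand-in could be replaced with something like `lambda: df.explain()`, where `df` is whatever DataFrame the cell defines. The same pattern carries over to Scala with `System.nanoTime()`. Running the identical cell on the shared and the single user cluster then gives directly comparable timings.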
