Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Performance issues using shared compute access mode in Scala

-werners-
Esteemed Contributor III

In our dev environment, I created a cluster using shared access mode for our devs to use (instead of separate single-user clusters).

What I notice is that the performance of this cluster is terrible. And I mean really terrible: notebook cells without any action, just DataFrame definitions, take minutes to complete, even though nothing has to be computed (lazy evaluation in Spark).

When I disable shared compute (so change to single user), performance is reasonable again.

Any ideas?
At the moment I am the only user using the cluster, so it can't be the cluster load.

3 REPLIES

Kaniz_Fatma
Community Manager

Hi @-werners-, thank you for sharing your experience with the shared access mode cluster in your development environment. It's essential to address performance issues promptly, especially when working with Spark clusters.

Let's explore some potential reasons for the sluggish performance and discuss possible solutions:

  1. Shared Access Mode Limitations:

    • When using shared access mode (USER_ISOLATION), there are certain limitations that might impact performance. For instance:
      • Access Control: Shared clusters provide good user isolation, preventing unauthorized access to data. However, this can lead to additional checks and permissions when accessing files in DBFS or ADLS.
      • Unity Catalog (UC): If you're using Unity Catalog with shared clusters, consider the following:
        • External Locations: For ADLS data accessed via abfss, create external locations and grant necessary permissions to users.
        • Unity Catalog Volumes: Instead of DBFS, encourage users to use Unity Catalog Volumes for unstructured data, configuration files, and libraries.
    • Ensure that your users have the appropriate permissions to access data sources.
  2. DBFS vs. Unity Catalog:

    • DBFS (Databricks File System) lacks fine-grained access control, while Unity Catalog provides better control over data access.
    • Consider migrating away from DBFS for non-temporary data and leverage Unity Catalog Volumes where possible.
  3. Cluster Configuration:

    • Verify that your cluster configuration is suitable for your workload. You mentioned that you're the only user currently, so it's unlikely to be cluster load-related.
    • Check the following settings:
      • Number of Workers: Ensure it's sufficient for your tasks.
      • Spark Version: Keep it up-to-date.
      • Node Types: Choose an appropriate node type.
      • Spark Environment Variables: Set them appropriately.
      • Data Security Mode: Confirm it's set to USER_ISOLATION.
      • Other Parameters: Review any other relevant settings.
  4. Monitoring and Profiling:

    • Use Databricks monitoring tools to identify bottlenecks. Check resource utilization, query execution plans, and any long-running tasks.
    • Profile your notebook cells to understand where the delays occur.
  5. Network Latency and Data Movement:

    • If your data resides in remote storage (e.g., ADLS), network latency can impact performance.
    • Optimize data movement and minimize shuffling.
  6. Driver Node Performance:

    • The driver node plays a crucial role in notebook execution. Ensure it has sufficient resources (CPU, memory).
  7. Review Spark Code:

    • Even though you mentioned lazy computing, review your Spark code within the notebook cells. Ensure there are no unintentional expensive operations.
  8. Cluster Restart:

    • Sometimes a cluster restart can resolve performance issues caused by resource fragmentation or other factors.

Remember that shared access mode provides strong user isolation, which is beneficial for security and data control. However, it requires thoughtful management of permissions and data access.

-werners-
Esteemed Contributor III

Thanks for the answer!

It seems that using shared access mode adds overhead. The nodes/driver are not stressed at all (CPU/RAM/network).
We use UC only.
The cluster seems to be configured correctly (using the same cluster in single user mode changes performance drastically).
Calculating a query plan should not take more than 5 minutes imo.
Physically printing the query plan takes about 40 secs in single user mode, but takes over 5 minutes in shared.
And the only thing that has changed is the access mode.
So my tentative conclusion is that shared mode adds a massive overhead.
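
To make a comparison like the one above repeatable across access modes, a small timing wrapper can help. This is only a minimal sketch and not something taken from the posts: `PlanTiming` and `timed` are hypothetical names, and the table name in the commented usage is a placeholder. The commented lines assume a Scala notebook where `spark` is already defined; `explain()` is the standard Spark `Dataset` method that prints the physical plan.

```scala
object PlanTiming {
  /** Runs `block` once and returns its result together with the
    * elapsed wall-clock time in milliseconds. */
  def timed[A](block: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }
}

// Hypothetical notebook usage (assumes `spark` exists; table name is a placeholder):
// val (df, defineMs) = PlanTiming.timed {
//   spark.table("my_catalog.my_schema.my_table").filter("id > 0")
// }
// val (_, explainMs) = PlanTiming.timed { df.explain() }
// println(s"define: $defineMs ms, explain: $explainMs ms")
```

Running the same cell once in single user mode and once in shared mode, with nothing else changed, would give two pairs of numbers to compare. Timing the definition and the `explain()` call separately shows whether the delay appears before any plan is printed.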

prakharcode
Visitor

I can confirm this behaviour. Running the same job on a shared cluster in "USER_ISOLATION" mode, with nothing changed in the job definition or source data, the performance drop is significant. So much so that it would require a radical change in how we process data.
