Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unity Catalog Shared compute Issues

Zume
New Contributor II

Am I the only one experiencing challenges in migrating to Databricks Unity Catalog? I observed that in Unity Catalog-enabled compute, the "Shared" access mode is still tagged as a Preview feature. This means it is not yet safe for use in production workloads. Having a compute resource that can be shared in production is crucial because various developers and service principals need to be able to execute queries on the cluster. I'm wondering how others are working around this issue since it is a major blocker to effectively migrating all workloads to Unity Catalog.

Additionally, when I tested my code on Shared access mode compute, I noticed that it gets stuck when trying to read a file stored in an external location into a DataFrame.
Watch this video for a demo of the issue: https://www.youtube.com/watch?v=J1bn6P7elKI&ab_channel=AfroInfoTech
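
Roughly, the read that hangs looks like this (a sketch only; the storage account, container, and file names are placeholders, not my actual paths):

# Reading a Parquet file from a Unity Catalog external location by its
# cloud URI; per the description above, this is the call that never returns
# on Shared access mode compute.
df = spark.read.parquet(
    "abfss://raw@examplestorage.dfs.core.windows.net/landing/my_file.parquet"
)
display(df)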

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @Zume

  • You’re right that the “Shared” access mode is currently in Preview. While it’s not recommended for production workloads, there are ways to work around this limitation:
    • Workspace ACLs (Access Control Lists): Use workspace ACLs to control access to the shared compute resources. This allows you to restrict who can execute queries on the cluster.
    • Service Principals: Consider using service principals with specific permissions to access the shared compute. This way, you can ensure that only authorized users and services can utilize the cluster.
    • Scheduled Access: Limit the shared compute’s availability to specific time windows when needed. This reduces the risk of contention during peak usage.
    • Monitor and Optimize: Regularly monitor the shared compute’s performance and resource utilization. Optimize queries and resource allocation to avoid bottlenecks.
  • When reading files from external locations into a DataFrame, ensure that the file paths are correctly specified.
  • Verify that the external location (e.g., Azure Blob Storage, AWS S3) is accessible from your Databricks cluster.
  • Check for any network-related issues or authentication problems.
  • Review the code and use the appropriate reader for your file format (e.g., Parquet, CSV, JSON) based on your data source; see the sketch after this list.
  • Remember that Unity Catalog offers centralized control, auditing, lineage, and data discovery, making the move well worth the effort.
  • If you encounter specific issues, consult the Databricks documentation or support for further guidance. And thank you for sharing the video demo—I’ll take a look! 🚀🔍
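
For illustration, here is a minimal sketch of such a read on a Unity Catalog cluster. The external location, group, and path names are placeholder assumptions, not taken from your setup:

# Ensure the querying principals can read files at the external location
# (placeholder location and group names; requires privileges to grant).
spark.sql("GRANT READ FILES ON EXTERNAL LOCATION `raw_landing` TO `data_engineers`")

# Read with an explicit format that matches the source data.
df = (spark.read
      .format("parquet")
      .load("abfss://raw@examplestorage.dfs.core.windows.net/landing/sales/"))
df.printSchema()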

jacovangelder
Contributor III

Have you tried creating a volume on top of the external location, and using the volume in spark.read.parquet?

i.e.

spark.read.parquet('/Volumes/<catalog>/<schema>/<volume_name>/<folder_name>/<file_name.parquet>')
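
For completeness, a sketch of creating such a volume first (the catalog, schema, volume, and location names are placeholders):

# Register an external volume over the external location, then read
# through the governed /Volumes path instead of the raw cloud URI.
spark.sql("""
    CREATE EXTERNAL VOLUME IF NOT EXISTS main.landing.raw_files
    LOCATION 'abfss://raw@examplestorage.dfs.core.windows.net/landing'
""")
df = spark.read.parquet('/Volumes/main/landing/raw_files/sales/my_file.parquet')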

Edit: also, not sure why the Databricks community manager here said Shared access mode is "in preview" and that it's "not recommended for production workloads", because this is completely false. It is not in preview and is completely safe for production workloads, and has been for almost 2 years. The only thing currently in preview for shared access mode clusters is Scala workloads.

 
