Databricks

Taha_Hussain · ‎09-08-2022

Register for Databricks Office Hours

September 14: 8:00 - 9:00 AM PT | 3:00 - 4:00 PM GMT

September 28: 11:00 AM - 12:00 PM PT | 6:00 - 7:00 PM GMT

Databricks Office Hours connects you directly with experts to answer your Databricks questions.

Join us to:

• Troubleshoot your technical questions

• Learn the best strategies to apply Databricks to your use case

• Master tips and tricks to maximize your usage of our platform

Register now!

Taha_Hussain · ‎09-08-2022

Check out some of the questions from fellow users during our last Office Hours. All these questions were answered live by a Databricks expert!

Q: What's the best way of using a UDF in a class?

A: You need to define your class and then register the function as a UDF. You can find more examples here https://docs.databricks.com/spark/latest/spark-sql/udf-python.html
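As a rough illustration of that answer, here is a minimal sketch of one way to keep UDF logic inside a class. The class name `TextCleaner`, the method `normalize`, and the registered name `normalize_text` are hypothetical and not taken from the linked docs. The logic is a static method so that Spark only has to serialize a plain function rather than a class instance, and the PySpark import is deferred so the logic can be unit-tested without a cluster.

```python
class TextCleaner:
    """Hypothetical helper class; all names here are illustrative."""

    @staticmethod
    def normalize(value):
        # Plain Python logic. Spark passes SQL NULLs as None, so handle that first.
        if value is None:
            return None
        return value.strip().lower()

    @classmethod
    def register(cls, spark, name="normalize_text"):
        # Deferred import: the class stays importable without a Spark runtime.
        from pyspark.sql.types import StringType

        # Registers the function for use in SQL and returns a UDF object
        # that can also be applied to DataFrame columns.
        return spark.udf.register(name, cls.normalize, StringType())
```

In a notebook, assuming the usual `spark` session is available, this could be used as `normalize_udf = TextCleaner.register(spark)` followed by `df.select(normalize_udf("some_column"))`, or from SQL as `SELECT normalize_text(some_column) FROM some_table` (the column and table names are placeholders).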

Q: We ran into difficulty coaxing Spark to distribute computational work for a Monte Carlo simulation. It seemed like the optimizer would try to run tasks sequentially until we disabled spark.sql.adaptive.coalescePartitions.enabled. Are there best practices for distributing MC simulation/computation work vs. data-intensive tasks?

A: If you want to keep AQE enabled but need to guarantee a minimum number of shuffle partitions, you can use the settings below.

spark.conf.set("spark.sql.adaptive.enabled", True)
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", True)
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1440")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1000")
spark.conf.set("spark.databricks.adaptive.autoOptimizeShuffle.enabled", False)

If you don't set spark.sql.adaptive.coalescePartitions.initialPartitionNum, it defaults to the value of spark.sql.shuffle.partitions. See https://spark.apache.org/docs/latest/sql-performance-tuning.html#performance-tuning

The advantage of these settings is that the number of shuffle partitions will always lie between minPartitionNum and initialPartitionNum.

Q: How can we run a code scan for notebooks and their dependencies, whether defined in a notebook or in a cluster? A code scan like Black Duck.

A: You will need to create an init script that installs these libraries and their dependencies when the cluster is created.

Q: I've been using clusters with TAC in the past. For one of the new clients I am working with, I've noticed that the new Databricks UI says High Concurrency clusters are deprecated. Does this mean we should move to UC and forget about using HC clusters with TAC?

A: This is expected behavior: we removed the option for HC clusters. They don't provide any additional behavior that can't be configured via standard clusters these days.

Q: Is %sql CLEAR CACHE and sparkcontext.clearCache() the same? Do they clear the cache in the cluster or in the notebook state?

A: It clears the DataFrames/tables cached in the session, not JVM caches (notebook state).

Q: Specific to Unity Catalog: is it correct to think that the Hive Metastore is replaced by the Unity metastore, or are they complementary?

A: That is correct. Unity is the new way of doing things. It is much more secure and capable than the Hive Metastore (HMS).

Q: As an admin, is there a way to check which table was accessed by which person via the data science and engineering environment?

A: If you are using UC, you can check the information schema for these details. Without UC, I am not sure; you may want to explore the audit logs option.
