Databricks Setting Dynamic Local Configuration Properties
11-21-2024 11:45 AM
It seems that Databricks is somehow setting local Spark configuration properties for each notebook. Can someone point me to exactly how and where this is being done?
I would like the scheduler to use a certain pool by default, but the property is being set to some dynamic integer value - and jobs end up scheduled FIFO - rather than just using the default pool. The only solution so far seems to be setting the property manually in each notebook, either with explicit code or by importing code that does it (see the sketch below), but I'd really prefer something automatic, like a configuration file.
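For reference, the explicit per-notebook workaround is a one-liner at the top of each notebook; the pool name "low" here is just an example:
# Pin this notebook's Spark jobs to a specific fair scheduler pool (pool name is an example)
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "low")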
11-22-2024 07:11 AM
You will need to leverage cluster-level Spark configurations or global init scripts. This will allow you to set the "spark.scheduler.pool" property automatically for all workloads on the cluster.
You can try navigating to "Compute", selecting the cluster you want to modify, clicking Edit, and looking under "Advanced Options." Expand the "Spark" tab and add an entry for "spark.scheduler.pool" with your pool name.
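For example, the entry in the Spark config box would look like the line below (Databricks expects one space-separated key/value pair per line; the pool name is a placeholder):
spark.scheduler.pool <your pool name>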
You can verify your setting by running the following command in a notebook that is attached to your cluster:
print(spark.sparkContext.getLocalProperty("spark.scheduler.pool"))
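If the cluster-level value is being picked up, that should print your pool name; if it prints None or an unexpected ID, something else is setting the local property.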
Cheers, Louis.
11-22-2024 08:20 AM
Morning Louis,
I do have the global configuration set to use the scheduler pool currently, but the local property is not mirroring the global property for some reason. The global configuration sets the scheduler pool to "low", but the local property shows a string of integers similar to '1944239804544138415', with the value being different for every notebook. I had initially thought of using an init script to set the value, but because init scripts run before Spark starts, I'm unable to set any local properties there. I'm not aware of anything we are doing on our side to dynamically set local values, but if you could please verify that the local settings normally mirror the global settings by default, I'll check back with our Databricks team to make sure this isn't something on our end.
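For reference, the import-based version of the workaround I mentioned earlier looks roughly like this (module and pool names are just examples):
# scheduler_pool.py - shared helper imported at the top of each notebook
def use_pool(spark, pool_name="low"):
    # Local properties are per-notebook/per-thread, so this has to run inside
    # the notebook itself, not in an init script that executes before Spark is up
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)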
11-23-2024 12:31 PM
The behavior you're observing - where the local property for spark.scheduler.pool is being set to a dynamic integer value rather than mirroring the global configuration - is not the default behavior of Spark or Databricks. Normally, global Spark configurations (e.g., set at the cluster level) should propagate to individual sessions unless explicitly overridden.
Is it possible there is a local override you are not aware of? Everything you are reporting points to something in your environment programmatically overriding your global setting. Local settings override global settings; that is the precedence order.
Test for local settings: print(spark.sparkContext.getLocalProperty("spark.scheduler.pool"))
Test for global settings: print(spark.conf.get("spark.scheduler.pool"))
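A minimal sketch combining both checks (passing a fallback so spark.conf.get doesn't raise if the key is unset):
local_pool = spark.sparkContext.getLocalProperty("spark.scheduler.pool")
global_pool = spark.conf.get("spark.scheduler.pool", "not set")
print(f"local:  {local_pool}")
print(f"global: {global_pool}")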
You may also want to look for any shared libraries, init scripts, or notebook templates that might include calls to "setLocalProperty".
You can also look at the cluster logs for any evidence of dynamic property assignments.
Cheers, Louis.