Hi there!
I am implementing a classifier that assigns documents to their respective healthcare type.
My current setup uses the regular XGBClassifier, whose hyperparameters are tuned on my dataset using Hyperopt. Given the size of my search space and feature space, each run takes quite some time to complete, which leads me to believe that parallelizing the search for the optimal set of hyperparameters is the way to go.
The problem boils down to running these training runs in parallel, unlike the usual case where distributed computing is needed because of large data volumes or large models. Before any other questions: is Databricks even the right tool for this?
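For context, a stripped-down version of my tuning loop looks roughly like this (the search space and the stand-in data below are simplified placeholders for what I actually use):

```python
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Stand-in data; in the real project X, y are the document features and
# their healthcare-type labels.
X, y = make_classification(n_samples=2000, n_features=50, n_classes=4,
                           n_informative=10, random_state=0)

# Simplified search space; my real one has more parameters.
search_space = {
    "max_depth": hp.quniform("max_depth", 3, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "n_estimators": hp.quniform("n_estimators", 100, 500, 50),
}

def objective(params):
    model = XGBClassifier(
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        n_estimators=int(params["n_estimators"]),
    )
    # Cross-validated macro F1; Hyperopt minimises, hence the sign flip.
    score = cross_val_score(model, X, y, cv=3, scoring="f1_macro").mean()
    return {"loss": -score, "status": STATUS_OK}

best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
```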
Connecting to Databricks
A Databricks workspace has already been provisioned for another team, and I am connecting to it now to test whether this is the right solution. My own project currently does not have its own Databricks resource in its Azure subscription, which also introduces dependency problems when trying to run the project as a job. However, I noticed that when using databricks-connect, the VS Code extension mentions my venv.
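For reference, this is roughly how I create the session from VS Code (assuming a recent databricks-connect that goes through Spark Connect; the profile name is a placeholder for my actual configuration):

```python
from databricks.connect import DatabricksSession

# Builds a Spark Connect session against the remote cluster configured in my
# Databricks CLI profile ("DEFAULT" here is a placeholder).
spark = DatabricksSession.builder.profile("DEFAULT").getOrCreate()

# Quick sanity check that the remote cluster responds.
print(spark.range(5).collect())
```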
Question A:
Does this mean my local venv is used to run the code? If not, what does the extension do with it behind the scenes?
Question B:
When I simplify my code just to test the connection, the imports succeed and the MLflow experiment is even created. However, where can I get insight into the run itself and into any output logging?
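The simplified test script is essentially just this (the experiment path is a placeholder for the one I actually use):

```python
import mlflow

# Log to the workspace tracking server instead of a local ./mlruns folder.
mlflow.set_tracking_uri("databricks")
# Placeholder path; my real experiment lives elsewhere in the workspace.
mlflow.set_experiment("/Shared/healthcare-doc-classifier")

with mlflow.start_run():
    mlflow.log_param("test_param", 1)
    mlflow.log_metric("test_metric", 0.5)
```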
Parallelising the hyperparameter tuning run
Once I manage to connect to the workspace and its compute resource, the remaining task is to actually have each Spark worker run a training instance with a different set of parameters. I found that Hyperopt provides a SparkTrials API, but it appears to be incompatible:
pyspark.errors.exceptions.base.PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute `sparkContext` is not supported in Spark Connect as it depends on the JVM. If you need to use this attribute, do not use Spark Connect when creating your session. Visit https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession for creating regular Spark Session in detail.
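The part that triggers this error is roughly the following; as far as I understand, SparkTrials reaches for `spark.sparkContext` internally, which Spark Connect does not expose (`objective` and `search_space` are the same as in the simplified loop above):

```python
from hyperopt import SparkTrials, fmin, tpe

# Distribute trials over the Spark workers; this is where the
# JVM_ATTRIBUTE_NOT_SUPPORTED error is raised under Spark Connect.
spark_trials = SparkTrials(parallelism=4)

best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
            max_evals=50, trials=spark_trials)
```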
I have spent some time trying to find alternatives and digging into the issue, but without success so far. Frankly, the vast number of options is quite overwhelming and difficult to wrap my head around as a non-expert... Any help or suggestions would be greatly appreciated!
Code I am trying to run: