Databricks Community

madrhr · ‎04-24-2024

I need to execute a .py file in Databricks from a notebook (with arguments, which I exclude here for simplicity). For this I am using:

%sh script.py

script.py:

from pyspark import SparkContext

def main():
    sc = SparkContext.getOrCreate()
    print(sc)

if __name__ == "__main__":
    main()

However, I need a SparkContext in the .py file. The suggested way is SparkContext.getOrCreate(), but I get an exception saying that I need to set a master URL:

pyspark.errors.exceptions.base.PySparkRuntimeError: [MASTER_URL_NOT_SET] A master URL must be set in your configuration.

But even if I set the master URL, I get another exception. What's really weird is that if I run the same .py script directly in Databricks using the little play button, it works. It also works if I open a web terminal on the cluster and execute my .py script in that bash shell. So with both approaches it works and I get the SparkContext. However, this is obviously not very useful. In both the %sh shell and the web shell, the user is root, the working directory is the same, and the Python env is not the problem either.

The cluster I am using is a single-node NC24ads_A100, so only a driver node and no additional worker nodes. I am running DBR 14.2 ML and Spark 3.5.0.

I would be very happy to know what's so special about %sh, where my problem is, or what a workaround is to execute .py files from a Databricks notebook with arguments while keeping or getting the SparkContext.

madrhr · ‎04-29-2024

I eventually got it working with a combination of:

import sys
from databricks.sdk.runtime import *
spark.sparkContext.addPyFile("/path/to/your/file")
sys.path.append("/path/to/your")
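
For anyone following along, here is a rough sketch of how the import-based approach above might look end to end. It assumes the script exposes a main(argv) function that takes its arguments as a list. The module name demo_script, the temporary directory, and the --input argument are made up for illustration, and the Spark call is only a comment so the snippet runs anywhere.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Stand-in for the directory that holds the script (hypothetical; the post
# uses a placeholder path instead).
script_dir = Path(tempfile.mkdtemp())

# Hypothetical script whose main() takes arguments as a list rather than
# reading sys.argv, so it can be called from a notebook cell.
(script_dir / "demo_script.py").write_text(
    "def main(argv):\n"
    "    # On Databricks, SparkContext.getOrCreate() could be called here.\n"
    "    return 'got args: ' + ' '.join(argv)\n"
)

# Make the directory importable, then import the script as a module.
sys.path.append(str(script_dir))
importlib.invalidate_caches()
demo_script = importlib.import_module("demo_script")

# Call the entry point directly, passing the arguments as a list.
result = demo_script.main(["--input", "data.csv"])
print(result)
```

Because the module is imported into the notebook's own Python process instead of being launched through %sh, code inside main() should see the Spark context the notebook already has. I have only reasoned about this for a classic cluster like the one described, so treat it as an assumption.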


Yeshwanth · ‎04-24-2024

@madrhr

I think this occurs because one session is initiated within the Python script (.py file), while in the Databricks notebook, we have a pre-configured Spark session. It is important to note that we cannot use more than one Spark session per notebook, and each session should be unique.

madrhr · ‎04-25-2024

Thanks for your answer. That's also how I understand it. But is there a way to inject or connect to the pre-configured Spark session from within the Python script (.py file)?
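
One possible way to "inject" the session, sketched under assumptions: let the script's main() accept the session as a parameter instead of creating one, and call it from the notebook. The --name argument and the FakeSession class are hypothetical. FakeSession is only there so the snippet runs without a cluster.

```python
import argparse


def parse_args(argv):
    # Hypothetical argument; replace with whatever the real script needs.
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", default="example")
    return parser.parse_args(argv)


def main(spark, argv=None):
    # `spark` is handed in by the caller instead of being created here.
    args = parse_args(argv)
    sc = spark.sparkContext
    return sc, args


class FakeSession:
    # Local stand-in for a Spark session so the sketch runs anywhere.
    sparkContext = "fake-context"


sc, args = main(FakeSession(), ["--name", "demo"])
print(sc, args.name)
```

In a notebook, the call would presumably be main(spark, ["--name", "demo"]), where spark is the session the notebook already provides, after importing the script as a module.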