I have the following basic script that works fine in PyCharm on my machine.
from pyspark.sql import SparkSession
print("START")
spark = SparkSession \
    .Builder() \
    .appName("myapp") \
    .master('local[*, 4]') \
    .getOrCreate()
print(spark)
data = [('James', '', 'Smith', '1991-04-01', 'M', 3000),
        ('Michael', 'Rose', '', '2000-05-19', 'M', 4000),
        ('Robert', '', 'Williams', '1978-09-05', 'M', 4000),
        ('Maria', 'Anne', 'Jones', '1967-12-01', 'F', 4000),
        ('Jen', 'Mary', 'Brown', '1980-02-17', 'F', -1)]
columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
print(df)
However, when I try to run it on a Databricks cluster, directly as a Python script, it fails with this error:
START
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/Workspace/Repos/***********/sdk_test/tests/snippets/spark_tests.py", line 13, in <module>
    class SparkTests:
  File "/Workspace/Repos/*******/sdk_test/tests/snippets/spark_tests.py", line 16, in SparkTests
    sc = SparkContext.getOrCreate()
  File "/databricks/spark/python/pyspark/context.py", line 400, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/databricks/spark/python/pyspark/context.py", line 147, in __init__
    self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
  File "/databricks/spark/python/pyspark/context.py", line 192, in _do_init
    raise RuntimeError("A master URL must be set in your configuration")
RuntimeError: A master URL must be set in your configuration
CalledProcessError: Command 'b'cd ../\n\n/databricks/python3/bin/python -m tests.snippets.spark_tests\n# python -m tests.runner --env=qa --runtime_env=databricks --upload=True --package=sdk\n'' returned non-zero exit status 1.
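For context, the traceback shows that the module actually failing on the cluster is tests/snippets/spark_tests.py, not the script above. Judging from the frame names in the traceback, that module presumably contains something like the following at class level (this is my reconstruction of the relevant lines, not the exact file):

from pyspark import SparkContext

class SparkTests:
    # This line runs at import time because it sits in the class body;
    # on the cluster it raises "A master URL must be set in your configuration"
    sc = SparkContext.getOrCreate()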
What am I missing?