We are using the Databricks Visual Studio plugin to write our Python / Spark code. We use the upload-files-to-Databricks functionality because our organisation has turned Unity Catalog off.
We are now running into a weird bug with custom modules. I have written a custom Python module that uses Spark's parallelisation functionality, but as long as I haven't built / uploaded a .whl for the module, my main code does not run properly. The workers apparently cannot find my module code in the directory. The weird thing is that the driver does see the code: I can call the static methods from my module just fine, as long as the call only runs on the driver. To reproduce the error I set up a small project (see the code below).
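The layout is a main script next to the testbug package (the name main.py is just what I use here, it is not significant):

main.py                # file 1
testbug/__init__.py    # file 2
testbug/multiply.py    # file 3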
#################
###File 1########
#################
# the main script (file 1)
from testbug import *

data = [2, 4, 6, 8, 10]

# this works fine on the driver
DBMultiplyTest.multiply_value(6)

# this gives an error: module testbug not found
rdd = spark.sparkContext.parallelize(data)
result = rdd.map(DBMultiplyTest.multiply_value).collect()
print(result)
#################
###File 2########
#################
# the package init file, located at testbug/__init__.py (file 2)
from .multiply import DBMultiplyTest

__all__ = ["DBMultiplyTest"]
#################
###File 3########
#################
# just a simple multiply class in file 3, testbug/multiply.py
class DBMultiplyTest:
    @staticmethod
    def multiply_value(x):
        return x * 2
When I upload this code to my repo and run it from the Databricks web interface, it runs just fine and I get the expected result. So there seems to be a difference between how the Visual Studio plugin runs the code and how the Databricks interface executes it.
Does anyone have an idea how to fix this, or is this just a bug in the plugin?
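The only workaround I have come up with so far is to ship the package to the executors explicitly with addPyFile, roughly like the sketch below (the paths are hypothetical and I haven't verified that this is the intended approach with the plugin's upload mode):

import shutil

# zip the local testbug package and ship it to the executors
# (the /tmp path is just an example; the package sits next to the main script)
zip_path = shutil.make_archive("/tmp/testbug", "zip", root_dir=".", base_dir="testbug")
spark.sparkContext.addPyFile(zip_path)

from testbug import DBMultiplyTest

rdd = spark.sparkContext.parallelize([2, 4, 6, 8, 10])
print(rdd.map(DBMultiplyTest.multiply_value).collect())

But that feels like a band-aid rather than a fix, so I'd still like to understand why the plugin and the web interface behave differently here.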