
How does mlflow determine if a pyfunc model uses SparkContext?

jonathan-dufaul
Valued Contributor

I've been getting this error pretty regularly while working with mlflow:

"It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063."

I have a class that extends mlflow.pyfunc.PythonModel. It has a method used for training (so not used in prediction) that takes a Spark DataFrame and applies some filters to get the training dataset. Only when I remove this method does the model save.

I was just wondering how mlflow determines whether a class accesses the SparkContext.
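My best guess at the mechanism (I haven't confirmed this against the mlflow source): save_model serializes the PythonModel instance with cloudpickle, and a class defined in a notebook gets pickled by value, so any global like `spark` that a method body references gets dragged into the pickle. Pickling the SparkSession/SparkContext is then what raises SPARK-5063. A minimal sketch of the failing pattern versus a working one (class, table, and column names are illustrative):

```python
import mlflow.pyfunc

# Failing pattern (sketch): `spark` is the global SparkSession in the
# notebook. Because the class is defined in __main__, cloudpickle
# serializes it by value, and the reference to `spark` inside
# prepare_training_data() pulls the SparkContext into the pickle.
class MyModel(mlflow.pyfunc.PythonModel):
    def prepare_training_data(self, table_name):
        df = spark.table(table_name)          # captures the global `spark`
        return df.filter("label IS NOT NULL")

    def predict(self, context, model_input):
        return model_input.sum(axis=1)

# Working pattern (sketch): keep Spark-dependent code outside the class,
# so nothing Spark-related is referenced by the object that gets pickled.
def prepare_training_data(spark_session, table_name):
    df = spark_session.table(table_name)
    return df.filter("label IS NOT NULL")

class MyPortableModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        return model_input.sum(axis=1)

mlflow.pyfunc.save_model("my_model", python_model=MyPortableModel())
```

Under that theory there is no static analysis at all; the save simply fails at the moment the pickler walks into a Spark object.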

Edit: this is really frustrating. It feels like mlflow is designed not to work with DataRobot time-aware modeling.

3 REPLIES

Anonymous
Not applicable

I checked the page, and it looks like there is no integration with DataRobot, and DataRobot doesn't contribute to mlflow. https://mlflow.org/ lists all the integrations.

Oh sorry, I meant the way DataRobot's prediction API accepts and returns data for time-aware modeling. It needs rows of data as the input and spits out different rows of data (so maybe 60 rows full of features over time, and it predicts 14 days). Every single tool in mlflow seems geared around "one row => one prediction based solely upon values in that row."

I feel like I'm working against the grain trying to reconcile the two.
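(It turns out the predict method itself isn't the hard limit: it receives a pandas DataFrame and can return a DataFrame with a different number of rows, even if the surrounding tooling assumes row-for-row output. A rough sketch of the shape I mean, ~60 rows of history in and 14 forecast rows out; the column names and the naive forecast logic are purely illustrative:)

```python
import pandas as pd
import mlflow.pyfunc

class TimeAwareModel(mlflow.pyfunc.PythonModel):
    HORIZON = 14  # days to forecast

    def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame:
        # model_input: e.g. ~60 rows of historical features with a
        # 'date' and a 'value' column. Returns HORIZON rows, one per
        # forecast day, rather than one row per input row.
        history = model_input.sort_values("date")
        last_date = pd.to_datetime(history["date"].iloc[-1])
        # Illustrative naive forecast: carry forward the trailing mean.
        level = history["value"].tail(7).mean()
        future_dates = pd.date_range(last_date + pd.Timedelta(days=1),
                                     periods=self.HORIZON, freq="D")
        return pd.DataFrame({"date": future_dates,
                             "prediction": [level] * self.HORIZON})
```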

Okay @Joseph Kambourakis, I think I have found a workaround for the above problem. The predictions themselves weren't the issue, because I can always just use the model.predict function. It would be nice to have access to the feature store for my data structure, though, and the model registry boilerplate code doesn't work structurally either.

The big problem I was having is that the data-fetching process is really precise (like five WHERE conditions and several joins). I made a single-field key for those filters. It's not great, but it does reduce the complexity a hair on the prediction side.
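For anyone curious, the key is just the filter values packed into one string and unpacked again on the prediction side. Something like this (the field names and separator are made up):

```python
# Hypothetical composite key: pack the five filter values into one
# string at training time, unpack them before building the query.
FILTER_FIELDS = ["region", "store_id", "product_id", "channel", "segment"]

def make_key(row):
    # e.g. "EMEA|042|SKU-9|web|retail"
    return "|".join(str(row[f]) for f in FILTER_FIELDS)

def key_to_filters(key):
    # Inverse of make_key: recover the per-field filter values.
    return dict(zip(FILTER_FIELDS, key.split("|")))

filters = key_to_filters("EMEA|042|SKU-9|web|retail")
# -> {"region": "EMEA", "store_id": "042", ...}, ready to turn into
#    the five WHERE conditions on the prediction side.
```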

I'm still looking for the answer to how Databricks determines whether a model accesses the SparkContext.
