
How does mlflow determine if a pyfunc model uses SparkContext?

jonathan-dufaul
Valued Contributor

I've been getting this error pretty regularly while working with mlflow:

"It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063."

I have a class that extends mlflow.pyfunc.PythonModel. It has a method used for training (so not used in prediction) that takes a Spark DataFrame and applies some filters to get the training dataset. Only when I remove this method does the model save.

I was just wondering how MLflow determines whether a class accesses the SparkContext.
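For context, the class is shaped roughly like this (a simplified sketch; the field names, columns, and the sklearn estimator are placeholders, not my actual code). My guess is that keeping the Spark DataFrame on the instance is what pulls the SparkContext into the pickle when the model is saved:

```python
import mlflow.pyfunc
from sklearn.linear_model import LinearRegression  # stand-in estimator for the sketch


class FilteredTrainingModel(mlflow.pyfunc.PythonModel):
    def __init__(self):
        self.estimator = None
        self.training_df = None  # a Spark DataFrame ends up stored here

    def fit(self, spark_df):
        # Anything hanging off `self` that points back at Spark presumably gets
        # pulled in when the instance is serialized at save time, which would
        # explain the SPARK-5063 error.
        self.training_df = spark_df
        pdf = spark_df.filter("label IS NOT NULL").toPandas()
        self.estimator = LinearRegression().fit(pdf[["x"]], pdf["label"])
        return self

    def predict(self, context, model_input):
        return self.estimator.predict(model_input)
```

If I drop the training method (or keep only pandas objects on the instance), the save goes through.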

Edit: this is really frustrating. It feels like MLflow is designed not to work with DataRobot time-aware modeling.

3 REPLIES

Anonymous
Not applicable

I checked the page and it looks like there is no integration with DataRobot, and DataRobot doesn't contribute to MLflow. https://mlflow.org/ lists all the integrations.

Oh, sorry, I meant the way DataRobot's prediction API accepts and returns data for time-aware modeling. It needs rows of data as the input and returns different rows of data (so maybe 60 rows full of features over time, and it predicts 14 days). Every single tool in MLflow seems geared around "one row => one prediction based solely upon values in that row."

I feel like I'm working against the grain trying to reconcile the two.
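For concreteness, here's a toy sketch of the input/output shape I mean (the column names and the carry-forward "forecast" are made up; a real model would replace them): roughly 60 rows of history go in, 14 forecast rows come out.

```python
import pandas as pd
import mlflow.pyfunc


class HistoryWindowForecaster(mlflow.pyfunc.PythonModel):
    # Takes ~60 rows of (date, value) history and returns 14 forecast rows,
    # i.e. rows in != rows out.
    def __init__(self, horizon_days=14):
        self.horizon_days = horizon_days

    def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame:
        history = model_input.sort_values("date")
        last_date = pd.to_datetime(history["date"]).max()
        last_value = history["value"].iloc[-1]

        # Placeholder "carry the last value forward" forecast, just to show
        # the shape mismatch between input rows and output rows.
        future_dates = pd.date_range(
            last_date, periods=self.horizon_days + 1, freq="D"
        )[1:]
        return pd.DataFrame({"date": future_dates, "forecast": last_value})
```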

Okay @Joseph Kambourakis, I think I have found a workaround for the above problem. The predictions themselves weren't bad because I can always just use the model.predict function. It would be nice to have access to the Feature Store for my data structure, and the model registry boilerplate code doesn't work structurally either.

The big problem I was having is that the data-fetching process is really precise (like five WHERE conditions and several joins). I made a single-field key for those filters, sketched below. It's not great, but it does reduce the complexity a hair on the prediction side.
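Roughly what I mean by a single-field key (the field names here are invented for illustration, not my real columns):

```python
# Pack the handful of filter values into one delimited string so the
# prediction side only has to pass a single key instead of reproducing
# every WHERE condition.
def make_key(region, product, channel, segment, horizon_days):
    return "|".join([region, product, channel, segment, str(horizon_days)])


def parse_key(key):
    region, product, channel, segment, horizon_days = key.split("|")
    return {
        "region": region,
        "product": product,
        "channel": channel,
        "segment": segment,
        "horizon_days": int(horizon_days),
    }
```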

I'm still looking for the answer to how Databricks determines if a model accesses the SparkContext.
