Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

How does mlflow determine if a pyfunc model uses SparkContext?

Valued Contributor

I've been getting this error pretty regularly while working with mlflow:

"It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063."

I have a class that extends mlflow.pyfunc.PythonModel. It has a method that is used for training (so it isn't used in prediction) that takes a Spark DataFrame and applies some filters to get the training dataset. Only when I remove this method does the model save.

I was just wondering how mlflow determines whether a class accesses the spark context.

edit: This is really frustrating. It feels like mlflow is designed not to work with DataRobot time-aware modeling.


Not applicable

I checked the page, and it looks like there is no integration with DataRobot, and DataRobot doesn't contribute to mlflow. That page has all the integrations listed.

Oh sorry, I meant the way DataRobot's prediction API accepts and returns data for time-aware modeling. It needs multiple rows of data as input and returns a different set of rows (so maybe 60 rows full of features over time go in, and it predicts 14 days). Every single tool in mlflow seems geared around "one row => one prediction based solely on the values in that row".

I feel like I'm working against the grain trying to reconcile the two.

Okay @Joseph Kambourakis, I think I have found a workaround for the above problem. The predictions themselves weren't bad, because I can always just use the model.predict function. It would be nice to have access to the feature store for my data structure, and the model registry boilerplate code doesn't work structurally either.

The big problem I was having is that the data fetching process is really precise (like five where conditions and several joins). I made a single field key for those filters. It's not great, but it does reduce the complexity a hair on the prediction side.
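The single-key idea can be sketched roughly like this. The field names and values are made up for illustration, since the actual filter columns aren't given here, and `make_filter_key` is a hypothetical helper, not part of mlflow or Spark.

```python
def make_filter_key(values, sep="|"):
    """Join several filter values into one deterministic key string.

    Names are sorted so the same filters always produce the same key,
    regardless of the order they were supplied in.
    """
    return sep.join(f"{name}={values[name]}" for name in sorted(values))


# Hypothetical filter fields; the real ones would come from the where conditions.
key = make_filter_key({"store": 12, "product": "A7", "channel": "web"})
print(key)
```

The prediction side can then match on that one key instead of repeating every condition.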

I'm still looking for the answer to how Databricks determines if a model accesses the spark context.
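A possible explanation, based on my understanding of mlflow and PySpark rather than anything confirmed in this thread: mlflow probably isn't inspecting the class at all. Saving a pyfunc model pickles the PythonModel instance with cloudpickle, and PySpark's SparkContext deliberately raises this exact error from its pickling hook (`__getnewargs__`). So the message appears whenever something reachable from the model drags a SparkContext, SparkSession or DataFrame into the pickle. That can be an attribute on `self`, or a global such as `spark` referenced inside a method of a class defined in a notebook, because cloudpickle serializes such classes by value.

Here is a minimal sketch of that mechanism using only the standard library. `FakeSparkContext`, `TimeAwareModel` and `unpicklable_attributes` are hypothetical stand-ins, not real mlflow or PySpark names, and the filter values are invented.

```python
import pickle


class FakeSparkContext:
    """Hypothetical stand-in for pyspark.SparkContext (no Spark needed to run this)."""

    def __getnewargs__(self):
        # pickle calls this hook while serializing the object; raising here
        # mimics how the real SparkContext refuses to be pickled.
        raise RuntimeError(
            "It appears that you are attempting to reference SparkContext ..."
        )


class TimeAwareModel:
    """Hypothetical model; real code would extend mlflow.pyfunc.PythonModel."""

    def __init__(self, spark):
        self.filters = {"region": "emea", "horizon_days": 14}  # plain data, picklable
        self.spark = spark  # driver-only handle, not picklable


def unpicklable_attributes(obj):
    """Return {attribute name: error message} for attributes that fail to pickle."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception as exc:  # report any serialization failure
            failures[name] = str(exc)
    return failures


model = TimeAwareModel(FakeSparkContext())
bad = unpicklable_attributes(model)
print(bad)
```

If that is what is happening, keeping the training code from touching `spark` or a DataFrame through `self` or through notebook globals (for example, passing the DataFrame in as an argument and not storing it) should let the model save.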
