
How does mlflow determine if a pyfunc model uses SparkContext?

jonathan-dufaul
Valued Contributor

I've been getting this error pretty regularly while working with mlflow:

"It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063."

I have a class that extends mlflow.pyfunc.PythonModel. It has a method used for training (so not used in prediction) that takes a Spark DataFrame and applies some filters to get the training dataset. Only when I remove this method does the model save.

I was just wondering how mlflow determines whether a class accesses the SparkContext.
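My best guess at the mechanism (I haven't confirmed this against the mlflow source): save_model serializes the PythonModel instance with cloudpickle, and a class defined in a notebook gets pickled by value, so any global like `spark` that a method body references gets dragged into the pickle. Pickling the SparkSession/SparkContext is then what raises SPARK-5063. A minimal sketch of the failing pattern versus a working one (class, table, and column names are illustrative):

```python
import mlflow.pyfunc

# Failing pattern (sketch): `spark` is the global SparkSession in the
# notebook. Because the class is defined in __main__, cloudpickle
# serializes it by value, and the reference to `spark` inside
# prepare_training_data() pulls the SparkContext into the pickle.
class MyModel(mlflow.pyfunc.PythonModel):
    def prepare_training_data(self, table_name):
        df = spark.table(table_name)          # captures the global `spark`
        return df.filter("label IS NOT NULL")

    def predict(self, context, model_input):
        return model_input.sum(axis=1)

# Working pattern (sketch): keep Spark-dependent code outside the class,
# so nothing Spark-related is referenced by the object that gets pickled.
def prepare_training_data(spark_session, table_name):
    df = spark_session.table(table_name)
    return df.filter("label IS NOT NULL")

class MyPortableModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        return model_input.sum(axis=1)

mlflow.pyfunc.save_model("my_model", python_model=MyPortableModel())
```

Under that theory there is no static analysis at all; the save simply fails at the moment the pickler walks into a Spark object.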

Edit: this is really frustrating. It feels like mlflow is designed not to work with DataRobot time-aware modeling.

3 REPLIES

Anonymous
Not applicable

I checked the page, and it looks like there is no integration with DataRobot, and DataRobot doesn't contribute to mlflow. https://mlflow.org/ lists all the integrations.

Oh sorry, I meant the way DataRobot's prediction API accepts and returns data for time-aware modeling. It needs rows of data as the input and spits out different rows of data (so maybe 60 rows full of features over time, and it predicts 14 days). Every single tool in mlflow seems geared around "one row => one prediction based solely upon values in that row."

I feel like I'm working against the grain trying to reconcile the two.
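(It turns out the predict method itself isn't the hard limit: it receives a pandas DataFrame and can return a DataFrame with a different number of rows, even if the surrounding tooling assumes row-for-row output. A rough sketch of the shape I mean, ~60 rows of history in and 14 forecast rows out; the column names and the naive forecast logic are purely illustrative:)

```python
import pandas as pd
import mlflow.pyfunc

class TimeAwareModel(mlflow.pyfunc.PythonModel):
    HORIZON = 14  # days to forecast

    def predict(self, context, model_input: pd.DataFrame) -> pd.DataFrame:
        # model_input: e.g. ~60 rows of historical features with a
        # 'date' and a 'value' column. Returns HORIZON rows, one per
        # forecast day, rather than one row per input row.
        history = model_input.sort_values("date")
        last_date = pd.to_datetime(history["date"].iloc[-1])
        # Illustrative naive forecast: carry forward the trailing mean.
        level = history["value"].tail(7).mean()
        future_dates = pd.date_range(last_date + pd.Timedelta(days=1),
                                     periods=self.HORIZON, freq="D")
        return pd.DataFrame({"date": future_dates,
                             "prediction": [level] * self.HORIZON})
```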

Okay @Joseph Kambourakis, I think I have found a workaround for the above problem. The predictions themselves weren't the issue, because I can always just use the model.predict function. It would be nice to have access to the feature store for my data structure, though, and the model registry boilerplate code doesn't work structurally either.

The big problem I was having is that the data-fetching process is really precise (like five WHERE conditions and several joins). I made a single-field key for those filters. It's not great, but it does reduce the complexity a hair on the prediction side.
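For anyone curious, the key is just the filter values packed into one string and unpacked again on the prediction side. Something like this (the field names and separator are made up):

```python
# Hypothetical composite key: pack the five filter values into one
# string at training time, unpack them before building the query.
FILTER_FIELDS = ["region", "store_id", "product_id", "channel", "segment"]

def make_key(row):
    # e.g. "EMEA|042|SKU-9|web|retail"
    return "|".join(str(row[f]) for f in FILTER_FIELDS)

def key_to_filters(key):
    # Inverse of make_key: recover the per-field filter values.
    return dict(zip(FILTER_FIELDS, key.split("|")))

filters = key_to_filters("EMEA|042|SKU-9|web|retail")
# -> {"region": "EMEA", "store_id": "042", ...}, ready to turn into
#    the five WHERE conditions on the prediction side.
```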

I'm still looking for the answer to how Databricks determines whether a model accesses the SparkContext.
