05-11-2025 04:19 PM
I am using MLflow to register my custom model with the simple code below. The DatabricksParams class extracts all the params from dbutils into a params dictionary, and dbutils is not used anywhere else in the rest of my code base. The code fails when I call mlflow.pyfunc.log_model(). Can you please help me figure out what might be causing this?
Exception:
An unexpected error occurred: You cannot use dbutils within a spark job You cannot use dbutils within a spark job or otherwise pickle it. If you need to use getArguments within a spark job, you have to get the argument before using it in the job. For example, if you have the following code: myRdd.map(lambda i: dbutils.args.getArgument("X") + str(i)) Then you should use it this way: argX = dbutils.args.getArgument("X") myRdd.map(lambda i: argX + str(i))
Code Snippet:
# COMMAND ----------
if __name__ == '__main__':
    params = DatabricksParams(dbutils)
    run_training(params)
05-11-2025 08:40 PM
When MLflow logs a PyFunc model, it needs to serialize (pickle) the model and its dependencies. The error occurs because dbutils is not serializable and cannot be pickled. Even though you're only using it in the DatabricksParams class to extract parameters, the entire instance of DatabricksParams (including any references to dbutils) might be getting captured in the closure and MLflow is trying to serialize it.
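You can reproduce the mechanism outside Databricks with plain pickle. In the sketch below, threading.Lock stands in for dbutils as an unpicklable attribute (the ParamsHolder class is purely illustrative, not your code); MLflow uses cloudpickle rather than pickle, but the same constraint applies:
import pickle
import threading

class ParamsHolder:
    """Illustrative stand-in for a class that keeps an unpicklable handle."""
    def __init__(self, handle):
        self.handle = handle            # unpicklable, like dbutils
        self.params = {"param1": "x"}   # plain, picklable data

obj = ParamsHolder(threading.Lock())
try:
    pickle.dumps(obj)                   # fails: the lock cannot be pickled
except TypeError as e:
    print(f"Pickling failed: {e}")

obj.handle = None                       # drop the unpicklable reference
print(len(pickle.dumps(obj)), "bytes")  # now succeeds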
The key fix is to extract the parameters from dbutils before passing them to your model training function, rather than passing the dbutils object itself. Here's how to modify your code:
# COMMAND ----------
if __name__ == '__main__':
    # Extract parameters as a plain dictionary BEFORE passing to your training function
    params_dict = {
        "param1": dbutils.widgets.get("param1") or "default_value1",
        "param2": dbutils.widgets.get("param2") or "default_value2",
        # Add other parameters as needed
    }

    # Pass the extracted parameters instead of dbutils
    run_training(params_dict)
Then modify your DatabricksParams class to accept a dictionary instead of dbutils:
class DatabricksParams:
    def __init__(self, params_dict):
        self.params = params_dict
        # Any other initialization

    # Your existing methods that use self.params
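Once run_training receives only plain values, the object you pass to mlflow.pyfunc.log_model can be pickled cleanly. Here is a rough sketch of that pattern (MyModelWrapper and the train_model helper are hypothetical placeholders, not your actual code):
import mlflow
import mlflow.pyfunc

class MyModelWrapper(mlflow.pyfunc.PythonModel):
    # Hypothetical wrapper: it stores only picklable values (a fitted model
    # and a plain dict of parameters), never a dbutils handle.
    def __init__(self, model, params):
        self.model = model
        self.params = params

    def predict(self, context, model_input):
        return self.model.predict(model_input)

def run_training(params_dict):
    model = train_model(params_dict)  # hypothetical training helper
    with mlflow.start_run():
        mlflow.log_params(params_dict)
        mlflow.pyfunc.log_model(
            artifact_path="model",
            python_model=MyModelWrapper(model, params_dict),
        )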
05-11-2025 08:50 PM
@lingareddy_Alva Thanks for your response.
But I am doing exactly what you explained. The dbutils widgets are all extracted inside DatabricksParams, which builds the params dictionary. This params dictionary is what gets passed to my subsequent methods.
Please see my code below:
class DatabricksParams:
    def __init__(self, dbutils):
        self.dbutils = dbutils
        self.params = {}
        self._load_params()

    def _load_params(self):
        try:
            # Load widget values
            widgets = self.dbutils.notebook.entry_point.getCurrentBindings()
            for key in widgets:
                self.params[key] = self.dbutils.widgets.get(key)

            # Load notebook context
            notebook_info = json.loads(self.dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
            self.params['job_id'] = notebook_info["tags"]["jobId"]
            self.params['job_run_id'] = notebook_info["tags"]["jobRunId"]

            logger.info(f"Loaded parameters: {self.params}")
        except Exception as e:
            raise RuntimeError(f"Error loading parameters: {e}")

    def get_param(self, key, default=None):
        return self.params.get(key, default)
05-11-2025 08:52 PM
Ah, I get it now. The params object is an instance of DatabricksParams, which still holds a reference to dbutils.
Let me try fixing this. Will let you know if this worked.
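(For reference, one likely shape of the fix, based on the class posted above: use dbutils only inside __init__ and keep only self.params on the instance, so nothing unpicklable survives.)
class DatabricksParams:
    def __init__(self, dbutils):
        # Use dbutils locally; do NOT store it as an attribute
        self.params = {}
        self._load_params(dbutils)

    def _load_params(self, dbutils):
        try:
            widgets = dbutils.notebook.entry_point.getCurrentBindings()
            for key in widgets:
                self.params[key] = dbutils.widgets.get(key)
            notebook_info = json.loads(
                dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
            )
            self.params['job_id'] = notebook_info["tags"]["jobId"]
            self.params['job_run_id'] = notebook_info["tags"]["jobRunId"]
        except Exception as e:
            raise RuntimeError(f"Error loading parameters: {e}")

    def get_param(self, key, default=None):
        return self.params.get(key, default)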
05-12-2025 08:07 AM
Thank you. I was able to resolve the error.
05-12-2025 09:56 AM
Thanks for the update, @skosaraju.