Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Getting an error "You cannot use dbutils within a spark job"

skosaraju
New Contributor III

I am using MLflow to register my custom model with the simple code below. The DatabricksParams class extracts all the parameters from dbutils and builds the params dictionary; dbutils is not used anywhere else in the rest of my code base. The code fails when I call mlflow.pyfunc.log_model(). Can you please help me figure out what might be causing this?

Exception:

An unexpected error occurred: You cannot use dbutils within a spark job. You cannot use dbutils within a spark job or otherwise pickle it. If you need to use getArguments within a spark job, you have to get the argument before using it in the job. For example, if you have the following code:

myRdd.map(lambda i: dbutils.args.getArgument("X") + str(i))

Then you should use it this way:

argX = dbutils.args.getArgument("X")
myRdd.map(lambda i: argX + str(i))

Code Snippet:

# COMMAND ----------
if __name__ == '__main__':
    params = DatabricksParams(dbutils)
    run_training(params)

5 REPLIES

lingareddy_Alva
Honored Contributor II

@skosaraju 

When MLflow logs a PyFunc model, it needs to serialize (pickle) the model and its dependencies. The error occurs because dbutils is not serializable and cannot be pickled. Even though you only use it in the DatabricksParams class to extract parameters, the entire DatabricksParams instance (including its reference to dbutils) may be captured in the closure that MLflow tries to serialize.

The key fix is to extract the parameters from dbutils before passing them to your model training function, rather than passing the dbutils object itself. Here's how to modify your code:

# COMMAND ----------
if __name__ == '__main__':
    # Extract parameters as a dictionary BEFORE passing them to your training function
    params_dict = {
        "param1": dbutils.widgets.get("param1") if dbutils.widgets.get("param1") else "default_value1",
        "param2": dbutils.widgets.get("param2") if dbutils.widgets.get("param2") else "default_value2",
        # Add other parameters as needed
    }

    # Pass the extracted parameters instead of dbutils
    run_training(params_dict)

Then modify your DatabricksParams class to accept a dictionary instead of dbutils:

class DatabricksParams:
    def __init__(self, params_dict):
        self.params = params_dict
        # Any other initialization

    # Your existing methods that use self.params
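To connect this back to the failing mlflow.pyfunc.log_model() call, here is a minimal, hypothetical sketch of how the extracted parameters can flow into logging. The SimpleWrapper class, the widget name, and run_training are placeholders for illustration, not your actual code:

import mlflow
import mlflow.pyfunc


class SimpleWrapper(mlflow.pyfunc.PythonModel):
    """Hypothetical pyfunc wrapper that only stores plain, picklable values."""

    def __init__(self, params):
        # params is a plain dict of strings/numbers, so pickling succeeds
        self.params = params

    def predict(self, context, model_input):
        # Placeholder logic: echo the input; a real model would score it here
        return model_input


# Extract widget values on the driver, outside of anything that gets pickled
params_dict = {"param1": dbutils.widgets.get("param1")}  # "param1" is a placeholder widget name

with mlflow.start_run():
    # log_model pickles SimpleWrapper; since it holds no dbutils reference,
    # serialization no longer fails
    mlflow.pyfunc.log_model(artifact_path="model", python_model=SimpleWrapper(params_dict))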

 

LR

skosaraju
New Contributor III

@lingareddy_Alva Thanks for your response.

But I am doing exactly what you explained. The dbutils.widgets values are all extracted inside DatabricksParams, which builds the params dictionary, and that dictionary is what gets passed to my subsequent methods.

Please see my code below:

class DatabricksParams:

    def __init__(self, dbutils):
        self.dbutils = dbutils
        self.params = {}
        self._load_params()

    def _load_params(self):
        try:
            # Load widget values
            widgets = self.dbutils.notebook.entry_point.getCurrentBindings()
            for key in widgets:
                self.params[key] = self.dbutils.widgets.get(key)

            # Load notebook context
            notebook_info = json.loads(self.dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
            self.params['job_id'] = notebook_info["tags"]["jobId"]
            self.params['job_run_id'] = notebook_info["tags"]["jobRunId"]
            logger.info(f"Loaded parameters: {self.params}")

        except Exception as e:
            raise RuntimeError(f"Error loading parameters: {e}")

    def get_param(self, key, default=None):
        return self.params.get(key, default)

skosaraju
New Contributor III

Ah, I get it now. The params object is an instance of DatabricksParams, which holds a reference to dbutils.

Let me try fixing this. I will let you know if it works.
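For reference, one possible shape of that fix (a sketch based on the class above, keeping the same widget and notebook-context lookups; not necessarily the exact change that was made) is to use dbutils only as a local argument in the constructor, so the instance keeps nothing but the plain params dictionary:

import json


class DatabricksParams:

    def __init__(self, dbutils):
        # Use dbutils locally only; nothing unpicklable is stored on self
        self.params = {}
        self._load_params(dbutils)

    def _load_params(self, dbutils):
        try:
            # Load widget values
            widgets = dbutils.notebook.entry_point.getCurrentBindings()
            for key in widgets:
                self.params[key] = dbutils.widgets.get(key)

            # Load notebook context
            notebook_info = json.loads(
                dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
            )
            self.params['job_id'] = notebook_info["tags"]["jobId"]
            self.params['job_run_id'] = notebook_info["tags"]["jobRunId"]
        except Exception as e:
            raise RuntimeError(f"Error loading parameters: {e}")

    def get_param(self, key, default=None):
        return self.params.get(key, default)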

skosaraju
New Contributor III

@lingareddy_Alva,

Thank you. I was able to resolve the error.

lingareddy_Alva
Honored Contributor II

Thanks for the update, @skosaraju.

LR
