Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Getting an error "You cannot use dbutils within a spark job"

skosaraju
New Contributor III

I am using MLflow to register my custom model with the simple code below. The DatabricksParams class extracts all the parameters from dbutils and builds the params dictionary; dbutils is not used anywhere else in the rest of my code base. The code fails when I call mlflow.pyfunc.log_model(). Can you please help me figure out what might be causing this?

Exception:

An unexpected error occurred: You cannot use dbutils within a spark job. You cannot use dbutils within a spark job or otherwise pickle it. If you need to use getArguments within a spark job, you have to get the argument before using it in the job. For example, if you have the following code:

myRdd.map(lambda i: dbutils.args.getArgument("X") + str(i))

Then you should use it this way:

argX = dbutils.args.getArgument("X")
myRdd.map(lambda i: argX + str(i))

Code Snippet:

# COMMAND ----------
if __name__ == '__main__':
    params = DatabricksParams(dbutils)
    run_training(params)

5 REPLIES

lingareddy_Alva
Honored Contributor II

@skosaraju 

When MLflow logs a PyFunc model, it needs to serialize (pickle) the model and its dependencies. The error occurs because dbutils is not serializable and cannot be pickled. Even though you only use it in the DatabricksParams class to extract parameters, the entire DatabricksParams instance (including its reference to dbutils) may be captured in the closure that MLflow tries to serialize.

The key fix is to extract the parameters from dbutils before passing them to your model training function, rather than passing the dbutils object itself. Here's how to modify your code:

# COMMAND ----------
if __name__ == '__main__':
    # Extract parameters as a dictionary BEFORE passing them to your training function
    params_dict = {
        "param1": dbutils.widgets.get("param1") if dbutils.widgets.get("param1") else "default_value1",
        "param2": dbutils.widgets.get("param2") if dbutils.widgets.get("param2") else "default_value2",
        # Add other parameters as needed
    }

    # Pass the extracted parameters instead of dbutils
    run_training(params_dict)

Then modify your DatabricksParams class to accept a dictionary instead of dbutils:

class DatabricksParams:
    def __init__(self, params_dict):
        self.params = params_dict
        # Any other initialization

    # Your existing methods that use self.params
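To connect this back to the failing mlflow.pyfunc.log_model() call, here is a minimal, hypothetical sketch of how the extracted parameters can flow into logging. The SimpleWrapper class, the widget name, and run_training are placeholders for illustration, not your actual code:

import mlflow
import mlflow.pyfunc


class SimpleWrapper(mlflow.pyfunc.PythonModel):
    """Hypothetical pyfunc wrapper that only stores plain, picklable values."""

    def __init__(self, params):
        # params is a plain dict of strings/numbers, so pickling succeeds
        self.params = params

    def predict(self, context, model_input):
        # Placeholder logic: echo the input; a real model would score it here
        return model_input


# Extract widget values on the driver, outside of anything that gets pickled
params_dict = {"param1": dbutils.widgets.get("param1")}  # "param1" is a placeholder widget name

with mlflow.start_run():
    # log_model pickles SimpleWrapper; since it holds no dbutils reference,
    # serialization no longer fails
    mlflow.pyfunc.log_model(artifact_path="model", python_model=SimpleWrapper(params_dict))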

 

LR

skosaraju
New Contributor III

@lingareddy_Alva Thanks for your response.

But I am doing exactly what you explained. The dbutils.widgets values are all extracted inside DatabricksParams, which builds the params dictionary, and that dictionary is what gets passed to my subsequent methods.

Please see my code below:

class DatabricksParams:

    def __init__(self, dbutils):
        self.dbutils = dbutils
        self.params = {}
        self._load_params()

    def _load_params(self):
        try:
            # Load widget values
            widgets = self.dbutils.notebook.entry_point.getCurrentBindings()
            for key in widgets:
                self.params[key] = self.dbutils.widgets.get(key)

            # Load notebook context
            notebook_info = json.loads(self.dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
            self.params['job_id'] = notebook_info["tags"]["jobId"]
            self.params['job_run_id'] = notebook_info["tags"]["jobRunId"]
            logger.info(f"Loaded parameters: {self.params}")

        except Exception as e:
            raise RuntimeError(f"Error loading parameters: {e}")

    def get_param(self, key, default=None):
        return self.params.get(key, default)

skosaraju
New Contributor III

Ah, I get it now. The params object is an instance of DatabricksParams, which holds a reference to dbutils.

Let me try fixing this. I will let you know if it works.
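For reference, one possible shape of that fix (a sketch based on the class above, keeping the same widget and notebook-context lookups; not necessarily the exact change that was made) is to use dbutils only as a local argument in the constructor, so the instance keeps nothing but the plain params dictionary:

import json


class DatabricksParams:

    def __init__(self, dbutils):
        # Use dbutils locally only; nothing unpicklable is stored on self
        self.params = {}
        self._load_params(dbutils)

    def _load_params(self, dbutils):
        try:
            # Load widget values
            widgets = dbutils.notebook.entry_point.getCurrentBindings()
            for key in widgets:
                self.params[key] = dbutils.widgets.get(key)

            # Load notebook context
            notebook_info = json.loads(
                dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson()
            )
            self.params['job_id'] = notebook_info["tags"]["jobId"]
            self.params['job_run_id'] = notebook_info["tags"]["jobRunId"]
        except Exception as e:
            raise RuntimeError(f"Error loading parameters: {e}")

    def get_param(self, key, default=None):
        return self.params.get(key, default)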

skosaraju
New Contributor III

@lingareddy_Alva,

Thank you. I was able to resolve the error.

lingareddy_Alva
Honored Contributor II

Thanks for the update, @skosaraju.

LR
