Transfer files saved in filestore to either the workspace or to a repo

MichaelO
New Contributor III

I built a machine learning model:

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

which I can save to the filestore by:

import pickle

filename = "/dbfs/FileStore/lr_model.pkl"
with open(filename, 'wb') as f:
    pickle.dump(lr, f)

Ideally, I wanted to save the model directly to the workspace or to a repo, so I tried:

import os
import pickle

filename = "/Users/user/lr_model.pkl"
os.makedirs(os.path.dirname(filename), exist_ok=True)
with open(filename, 'wb') as f:
    pickle.dump(lr, f)

but this does not work: the file never shows up in the workspace.

The only alternative I have now is to transfer the model from the FileStore to the workspace or to a repo. How do I go about that?

1 ACCEPTED SOLUTION


Anonymous
Not applicable

It's important to keep in mind that there are two file systems:

  1. The local file system on the machines that make up the cluster
  2. The distributed file system (DBFS): https://docs.databricks.com/data/databricks-file-system.html

When you use Python without Spark, such as with sklearn, your code runs only on the driver, so a "local" path means local to the driver's file system. Anything stored there goes away when the cluster does.

Try %sh ls / and %fs ls and compare the differences.
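
For a concrete comparison, here is a minimal sketch of a notebook cell (assuming a standard Databricks notebook, where dbutils and display are available):

import os

# Local (driver) file system -- what plain Python calls like open() see.
print(os.listdir("/"))            # same view as `%sh ls /`

# Distributed file system (DBFS) -- what Spark and %fs see.
display(dbutils.fs.ls("/"))       # same view as `%fs ls`

# DBFS is also FUSE-mounted on the driver under /dbfs, which is why
# open("/dbfs/FileStore/lr_model.pkl", "wb") works from plain Python.
print(os.listdir("/dbfs/FileStore"))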

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

The Workspace and Repos are not fully available via DBFS, as they have separate access rights. It is better to use MLflow for your models, since it is like Git but for ML. I think that with an MLOps workflow you can then also push your model to Git.
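
As an illustration, a minimal sketch of logging the same kind of sklearn model with MLflow (assuming mlflow is installed, as it is on the Databricks ML runtime; the training data below is a dummy stand-in for the X_train / y_train from the question):

import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.linear_model import LinearRegression

# Dummy stand-ins for the X_train / y_train from the original post.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])

lr = LinearRegression()
lr.fit(X_train, y_train)

# Log the fitted model to an MLflow run instead of pickling it to DBFS by hand.
with mlflow.start_run() as run:
    mlflow.sklearn.log_model(lr, artifact_path="model")
    mlflow.log_metric("r2_train", lr.score(X_train, y_train))

# The model is now addressable as runs:/<run_id>/model and can be reloaded anywhere.
model_uri = f"runs:/{run.info.run_id}/model"
reloaded = mlflow.sklearn.load_model(model_uri)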

Kaniz
Community Manager

Hi @Michael Okelola,

When you store a file in DBFS (/FileStore/...), it lives in your account (the data plane), while notebooks and other workspace objects live in the Databricks account (the control plane). By design, you can't import non-code objects into the workspace. However, Repos now supports arbitrary files, although only in one direction: you can read files in Repos from your cluster running in the data plane, but you can't write into Repos (at least not for now).
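
To illustrate that read-only direction, a hedged sketch of reading a file that lives in a repo from cluster code. The path below is hypothetical, and the exact location depends on your workspace layout and runtime version (repos are typically exposed under /Workspace/Repos/<user>/<repo>/):

import json

# Hypothetical repo path -- adjust to your own user and repo names.
config_path = "/Workspace/Repos/user@example.com/my-repo/config.json"

# Reading works, because Repos content is accessible from the data plane...
with open(config_path) as f:
    config = json.load(f)

# ...but writing back into the repo from the cluster is not supported,
# so open(config_path, "w") is not a way to persist the pickled model.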

But really, you should use MLflow, which is built into Azure Databricks. It will help you by logging the model file, hyperparameters, and other information, and you can then work with that model through APIs, command-line tools, etc., for example to move the model between the Staging and Production stages using the Model Registry, deploy the model to AzureML, and so on.
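
And a hedged sketch of the Model Registry part, registering a logged model and promoting it between stages (the registry name "lr_model" is hypothetical, and run_id must come from the run that logged the model):

import mlflow
from mlflow.tracking import MlflowClient

# run_id of the MLflow run that logged the model; replace with your own.
run_id = "<run_id>"
model_uri = f"runs:/{run_id}/model"

# Register the logged model under a registry name.
mv = mlflow.register_model(model_uri, "lr_model")

# Promote that version from Staging to Production in the Model Registry.
client = MlflowClient()
client.transition_model_version_stage(
    name="lr_model",
    version=mv.version,
    stage="Production",
)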
