02-21-2024 01:03 AM
I am trying to find a way to locally download the model artifacts that make up a chatbot chain registered with MLflow in Databricks, so that I can preserve the whole structure (chain -> model -> steps -> yaml & pkl files).
There is a mention in a contributed article, but it is not clear what `local_dir` actually represents (a path inside DBFS, in a volume, or on the local computer?) and what format it is supposed to have.
Maybe somebody knows the answer 🙂
Thx
02-21-2024 01:26 AM
Hi @Octavian1, When working with MLflow in Databricks, you can download model artifacts to your local storage using the client.download_artifacts method.
Let me explain how it works:
By default, MLflow saves artifacts to an artifact store URI during an experiment. The artifact store URI follows a structure like /dbfs/databricks/mlflow-tracking/<experiment-id>/<run-id>/artifacts/. However, this artifact store is managed by MLflow, and you cannot directly download artifacts from it.
To download artifacts, you must use the client.download_artifacts method, which copies artifacts from the artifact store to another storage location of your choice. You specify the local directory (local_dir) where you want to store the downloaded artifacts.
Here's an example code snippet in Python that demonstrates how to download MLflow artifacts from a specific run and store them locally:
import mlflow
import os
from mlflow.tracking import MlflowClient

# Initialize the MLflow client
client = MlflowClient()

# Specify the local directory where you want to store artifacts
local_dir = "<local-path-to-store-artifacts>"

# Create the local directory if it doesn't exist
if not os.path.exists(local_dir):
    os.mkdir(local_dir)

# Assume you log an artifact named "features.txt" during an MLflow run
features = "rooms, zipcode, median_price, school_rating, transport"
with open("features.txt", "w") as f:
    f.write(features)

# Create a sample MLflow run and log "features.txt" under the "features" path
with mlflow.start_run() as run:
    mlflow.log_artifact("features.txt", artifact_path="features")

# Download the "features" artifact directory from that run to local storage
local_path = client.download_artifacts(run.info.run_id, "features", local_dir)
print(f"Artifacts downloaded in: {local_path}")
After downloading the artifacts to your local storage, you can further copy or move them to an external filesystem or a mount point using standard tools. For example:

dbutils.fs.cp(local_dir, "<filesystem://path-to-store-artifacts>")

or, with Python's shutil:

import shutil
shutil.move(local_dir, "/dbfs/mnt/<path-to-store-artifacts>")
Remember to replace <local-path-to-store-artifacts> with your desired local directory; to download from an existing run, pass that run's ID to download_artifacts instead of run.info.run_id. This way, you can preserve the entire structure of your chatbot chain, including models, steps, and associated files. 🤖📦
For more details, you can refer to the official Databricks documentation on downloading MLflow artifacts. If you have any further questions, feel free to ask! 😊
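Since the original question is about a chain registered with MLflow, it may also be worth noting that artifacts can be fetched directly from the Model Registry by URI. A minimal sketch, assuming MLflow 2.x; the model name "chatbot_chain" and version are placeholders for illustration:

import mlflow

# Download everything logged under a registered model version
local_path = mlflow.artifacts.download_artifacts(
    artifact_uri="models:/chatbot_chain/1",
    dst_path="<local-path-to-store-artifacts>",
)
print(f"Artifacts downloaded in: {local_path}")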
02-21-2024 01:51 AM
Hi @Kaniz and thank you for your answer.
So I have run this piece of code from a Databricks notebook within my workspace.
Literally:
import os
from mlflow.tracking import MlflowClient

# Consider I have the artifacts in "/dbfs/databricks/mlflow-tracking/<id>/<run_id>/artifacts/chain"
client = MlflowClient()
local_dir = "mydir"
os.makedirs(local_dir, exist_ok=True)
run_id = "<run_id>"
local_path = client.download_artifacts(run_id, "chain", local_dir)
print("Artifacts downloaded in: {}".format(local_dir))
It runs OK, with the expected output:
Artifacts downloaded in: mydir
The question is, where was mydir created? I cannot find it anywhere (workspace, dbfs, volume...)
Thank you!
02-21-2024 01:56 AM
Hi @Octavian1, The directory "mydir" that you specified in your code is created within the Databricks workspace. However, it's important to understand that this directory is not directly accessible from your local machine or from DBFS (the Databricks File System).
Let me explain further:
1. Workspace Location: When you call os.makedirs(local_dir, exist_ok=True) in your Databricks notebook, the directory is created within the Databricks workspace.
2. Accessing Artifacts: The artifacts you download with client.download_artifacts are stored in the Databricks artifact store, which is managed by MLflow. The artifact path you pass ("chain") corresponds to the artifact path within the run identified by <run_id>.
3. Viewing Artifacts: You can browse the logged artifacts for that run in the MLflow UI.
4. Copying or Moving Artifacts: To make the downloaded files available elsewhere, copy or move them, for example:
dbutils.fs.cp(local_dir, "file:/mnt/<mount-point>/<path-to-store-artifacts>")
dbutils.fs.mv(local_dir, "/dbfs/mnt/<path-to-store-artifacts>")
Remember that the "mydir" directory is a temporary workspace location within Databricks, and you'll need to take additional steps to make the artifacts accessible in other environments. If you have specific requirements for where you want to store the artifacts, consider using an appropriate mount point or external storage location. 📁🔍🚀
For more details, you can refer to the Databricks documentation on interacting with workspace files.
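One detail that may help with the "where is mydir?" question: a relative path resolves against the driver's working directory, while dbutils.fs defaults to the dbfs:/ scheme, so the two views can disagree. A minimal sketch for checking both (dbutils is only available inside a Databricks notebook):

import os

local_dir = "mydir"

# The driver-local view: relative paths resolve against the working
# directory, typically /databricks/driver on a Databricks cluster
print(os.path.abspath(local_dir))   # e.g. /databricks/driver/mydir
print(os.listdir(local_dir))        # files on the driver's local disk

# The DBFS view: dbutils.fs paths default to the dbfs:/ scheme, so this
# inspects dbfs:/databricks/driver/mydir, NOT the driver-local directory:
# dbutils.fs.ls("/databricks/driver/mydir")
# Use the file:/ scheme to see the driver's local filesystem via dbutils:
# dbutils.fs.ls("file:/databricks/driver/mydir")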
02-21-2024 02:20 AM
Hi @Kaniz and thanks again.
So in my example the artifacts have been downloaded to the local_path, which is /databricks/driver/mydir/chain.
From point 1 of your second explanation, it turns out that this directory too is not directly visible/accessible (the directory "mydir" exists within the Databricks workspace, but it's not visible in the local filesystem or DBFS).
It seems then that the only way to get them is to apply point 4, so I proceeded with:
dbutils.fs.mv(local_dir, "/dbfs/mnt/mypath")
and also tried
dbutils.fs.mv(local_path, "/dbfs/mnt/mypath")
but in both cases there was an error saying that local_dir (/mydir) and local_path (/databricks/driver/mydir/chain) do not exist (FileNotFound).
Note that in the first error the path is shown as /mydir (mydir directly under the root), which may not be right.
In any case, I am still in the same place: I am not able to download the artifacts I am after. 🙃
02-21-2024 06:00 AM
This is really confusing.
I ran:
dbutils.fs.mkdirs("/databricks/driver/mydir")
dbutils.fs.ls("/databricks/driver")
local_path = client.download_artifacts(run_id, "chain", "mydir")
print("Artifacts downloaded in: {}".format(local_path))
dbutils.fs.ls("/databricks/driver/mydir")
with the result: []
Which means that no artifacts were actually downloaded, or am I missing something?
02-22-2024 02:21 AM
Hi @Octavian1, I apologize for the confusion you're experiencing.
Let's break down the steps and troubleshoot the issue:
1. Creating the Directory: You created the directory with dbutils.fs.mkdirs("/databricks/driver/mydir").
2. Listing Contents of "/databricks/driver": When you ran dbutils.fs.ls("/databricks/driver"), it showed that the directory "mydir" exists within "/databricks/driver"; the entry FileInfo(path='dbfs:/databricks/driver/mydir/', name='mydir/', size=0, modificationTime=17...) confirms its existence.
3. Downloading Artifacts: You called client.download_artifacts(run_id, "chain", "mydir") to download artifacts from the specified run.
4. Listing Contents Again: When you ran dbutils.fs.ls("/databricks/driver/mydir") again, it returned an empty result.
5. Possible Issue: It may be that no artifacts were logged under the "chain" path for the given <run-id>.
6. Double-Check Artifact Path: Verify in the MLflow UI that the run actually contains an artifact directory named "chain".
7. Copying Artifacts to DBFS: Try dbutils.fs.cp("file:/databricks/driver/mydir/chain", "dbfs:/mnt/mypath"), replacing "/mnt/mypath" with the actual DBFS path where you want to store the artifacts.
8. Verify in DBFS: Check the target location (/dbfs/mnt/mypath) to verify that the artifacts are accessible in DBFS.
Remember that the "mydir" directory is a temporary workspace location within Databricks. By copying the artifacts to DBFS, you'll make them available for further use. If you encounter any issues during this process, please let me know, and we'll continue troubleshooting! 🚀🔍📦
For more information, you can refer to the Databricks documentation on interacting with workspace files.
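One quick way to carry out point 6 is to list the run's artifacts programmatically before downloading. A small sketch using the standard MlflowClient API, with run_id and "chain" as in this thread:

from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = "<run_id>"

# Each entry has .path and .is_dir; top level first
for artifact in client.list_artifacts(run_id):
    print(artifact.path, "(dir)" if artifact.is_dir else "(file)")

# Then drill into the "chain" directory, if present
for artifact in client.list_artifacts(run_id, "chain"):
    print(artifact.path)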
02-22-2024 04:32 AM
Hi @Kaniz ,
Indeed the artifacts are in
"/dbfs/databricks/mlflow-tracking/<id>/<run_id>/artifacts/chain"
and I am able to navigate in the UI to the URL mentioned above, where I can see the artifacts.
So I am not sure why the download apparently succeeds (judging by the method's response), but the final result is not the expected one.
Everything else you wrote is what I had already done.
Now I am thinking of an alternative: is it possible to do the same not from the Databricks notebook, but from a local script?
I am asking because I am not sure what settings I need in place to be able to run
client.download_artifacts(run_id, "chain", "mydir")
As it is, I get an error about the run_id not being recognized.
Or can the same operation (download_artifacts) be done by calling a REST API? If yes, which one would it be?
Or using the Databricks CLI?
Thank you!
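For reference, a minimal sketch of the local-script variant asked about here, assuming a Databricks personal access token; the host, token, and paths are placeholders:

import os
import mlflow
from mlflow.tracking import MlflowClient

# Point MLflow at the Databricks workspace; the host and token can also
# be configured in ~/.databrickscfg instead of environment variables
os.environ["DATABRICKS_HOST"] = "https://<your-workspace-url>"
os.environ["DATABRICKS_TOKEN"] = "<personal-access-token>"
mlflow.set_tracking_uri("databricks")

client = MlflowClient()
local_path = client.download_artifacts("<run_id>", "chain", "<local-dir>")
print(f"Artifacts downloaded in: {local_path}")

The MLflow CLI offers an equivalent command, which should behave the same way once the environment variables above are set: mlflow artifacts download --run-id <run_id> --artifact-path chain --dst-path <local-dir>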
02-26-2024 06:03 AM
OK, eventually I found a solution. I am writing it below in case somebody needs it. Basically, if the local directory passed to the download_artifacts method is an existing and accessible one in DBFS, the process works as expected.
import os
from mlflow.tracking import MlflowClient

# Consider you have the artifacts in "/dbfs/databricks/mlflow-tracking/<id>/<run_id>/artifacts/chain"
client = MlflowClient()
local_dir = "/dbfs/FileStore/mydir1"  # existing and accessible DBFS folder
run_id = "<run_id>"
local_path = client.download_artifacts(run_id, "chain", local_dir)
print("Artifacts downloaded in: {}".format(local_path))
# expected output: Artifacts downloaded in: /dbfs/FileStore/mydir1/chain
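A likely reason this works, assuming standard Databricks behavior: paths under /dbfs/ go through the DBFS FUSE mount on the driver, so download_artifacts writes directly into DBFS rather than onto the driver's ephemeral local disk. A quick verification sketch, using the paths from the post above:

import os

# Driver-side FUSE view of the downloaded artifacts
print(os.listdir("/dbfs/FileStore/mydir1/chain"))

# Equivalent notebook check through dbutils:
# dbutils.fs.ls("dbfs:/FileStore/mydir1/chain")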