
Download model artifacts from MLflow

Octavian1
Contributor

I am trying to find a way to locally download the model artifacts that make up a chatbot chain registered with MLflow in Databricks, so that I can preserve the whole structure (chain -> model -> steps -> yaml & pkl files).

[attached screenshot: Octavian1_0-1708506098526.png]

There is a mention of the download_artifacts method in a contributed article, but it is not clear what `local_dir` really represents (a path inside DBFS, in a volume, on the local computer?) or what format it is supposed to have.

Maybe somebody knows the answer 🙂

Thx


8 REPLIES

Kaniz
Community Manager

Hi @Octavian1, when working with MLflow in Databricks, you can download model artifacts to local storage using the client.download_artifacts method.

Let me explain how it works:

  1. By default, MLflow saves artifacts to an artifact store URI during an experiment. The artifact store URI follows a structure like /dbfs/databricks/mlflow-tracking/<experiment-id>/<run-id>/artifacts/. However, this artifact store is managed by MLflow, and you cannot directly download artifacts from it.

  2. To download artifacts, you must use the client.download_artifacts method. This method allows you to copy artifacts from the artifact store to another storage location of your choice. You specify the local directory (local_dir) where you want to store the downloaded artifacts.

  3. Here's an example code snippet in Python that demonstrates how to download MLflow artifacts from a specific run and store them locally:

    import mlflow
    import os
    from mlflow.tracking import MlflowClient
    
    # Initialize the MLflow client
    client = MlflowClient()
    
    # Specify the local directory where you want to store artifacts
    local_dir = "<local-path-to-store-artifacts>"
    
    # Create the local directory if it doesn't exist
    os.makedirs(local_dir, exist_ok=True)
    
    # Assume you log an artifact named "features.txt" during an MLflow run
    features = "rooms, zipcode, median_price, school_rating, transport"
    with open("features.txt", "w") as f:
        f.write(features)
    
    # Create a sample MLflow run and log the artifact under "features"
    with mlflow.start_run() as run:
        mlflow.log_artifact("features.txt", artifact_path="features")
    
    # Download the artifact directory to local storage
    local_path = client.download_artifacts(run.info.run_id, "features", local_dir)
    print(f"Artifacts downloaded in: {local_path}")
    
  4. After downloading the artifacts to local storage, you can copy or move them to an external filesystem or a mount point using standard tools. For example:

    • To copy to an external filesystem (e.g., HDFS): dbutils.fs.cp("file:" + local_dir, "<filesystem://path-to-store-artifacts>", recurse=True) (the file: scheme tells dbutils to read from the driver's local disk).
    • To move to a mount point (e.g., Azure Blob Storage) through the FUSE mount: shutil.move(local_dir, "/dbfs/mnt/<path-to-store-artifacts>") (requires import shutil).
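Putting the copy step together, a minimal sketch (the destination path is hypothetical, and local_dir is assumed to be an absolute driver-local path):

    # Copy the downloaded artifacts from the driver's local disk into DBFS.
    # "file:" makes dbutils read the driver's local filesystem;
    # recurse=True copies the whole directory tree.
    dbutils.fs.cp("file:" + local_dir, "dbfs:/FileStore/artifacts-backup", recurse=True)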

Remember to replace <local-path-to-store-artifacts> with your desired local directory; for an existing run, pass its run ID to download_artifacts instead of run.info.run_id. This way, you can preserve the entire structure of your chatbot chain, including models, steps, and associated files. 🤖📦

For more details, you can refer to the official Databricks documentation on downloading MLflow artifacts. If you have any further questions, feel free to ask! 😊

Hi @Kaniz and thank you for your answer.

So I have run this piece of code from a Databricks notebook within my workspace.

Literally:

import os
from mlflow.tracking import MlflowClient

# Consider I have the artifacts in "/dbfs/databricks/mlflow-tracking/<id>/<run_id>/artifacts/chain"
client = MlflowClient()
local_dir = "mydir"
os.makedirs(local_dir, exist_ok=True)
run_id = "<run_id>"
local_path = client.download_artifacts(run_id, "chain", local_dir)
print("Artifacts downloaded in: {}".format(local_dir))

It runs OK, with the expected output:

Artifacts downloaded in: mydir

The question is, where was mydir created? I cannot find it anywhere (workspace, dbfs, volume...)
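One thing I can check is the working directory of the notebook's Python process, which I assume is where a relative path like "mydir" lands:

import os
print(os.getcwd())  # on a typical Databricks cluster this prints /databricks/driver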

Thank you!

Kaniz
Community Manager

Hi @Octavian1, the directory "mydir" that you specified in your code is created on the local filesystem of the cluster's driver node. However, it's important to understand that this directory is not directly accessible from your local machine or from DBFS (Databricks File System).

Let me explain further:

  1. Driver-Local Location:

    • When you create a directory using os.makedirs(local_dir, exist_ok=True) in your Databricks notebook, it is created on the local disk of the cluster's driver node.
    • The driver's filesystem is ephemeral and tied to the running cluster; it is not the same as your workspace files or DBFS.
    • The directory "mydir" therefore exists on the driver, but it's not visible in your local filesystem, the workspace browser, or DBFS.
  2. Accessing Artifacts:

    • The artifacts you downloaded using client.download_artifacts are stored in the Databricks artifact store, which is managed by MLflow.
    • The path you specified for downloading artifacts ("chain") corresponds to the artifact path within the run identified by <run_id>.
    • These artifacts are not directly accessible in your local filesystem or DBFS unless you explicitly move or copy them.
  3. Viewing Artifacts:

    • To view the downloaded artifacts, you can navigate to the Artifacts tab within the specific MLflow run in the Databricks workspace.
    • From there, you can explore the contents of the "chain" directory and access individual files.
  4. Copying or Moving Artifacts:

    • If you want to access these artifacts outside of the driver, you can use standard Databricks utilities to copy or move them to a different location. Note that the file: prefix in the examples below tells dbutils to read from the driver's local disk; without it, paths are resolved against DBFS.
    • For example:
      • To copy to a mount point:
        dbutils.fs.cp("file:" + local_dir, "dbfs:/mnt/<mount-point>/<path-to-store-artifacts>", recurse=True)
        
      • To move into DBFS:
        dbutils.fs.mv("file:" + local_dir, "dbfs:/mnt/<path-to-store-artifacts>", recurse=True)
        
        

Remember that the "mydir" directory is a temporary driver-local location inside the cluster, and you'll need to take additional steps to make the artifacts accessible in other environments. If you have specific requirements for where you want to store the artifacts, consider using an appropriate mount point or external storage location. 📁🔍🚀

For more details, you can refer to the Databricks documentation on interacting with workspace files.
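To make the path-scheme distinction concrete, here is a minimal sketch (the paths are hypothetical):

    # Three path forms you will meet on Databricks:
    #   file:/databricks/driver/mydir  -> the driver node's local disk
    #   dbfs:/FileStore/mydir          -> DBFS, as dbutils.fs and Spark see it
    #   /dbfs/FileStore/mydir          -> the same DBFS location through the local FUSE mount
    
    dbutils.fs.cp("file:/databricks/driver/mydir", "dbfs:/FileStore/mydir", recurse=True)
    dbutils.fs.ls("dbfs:/FileStore/mydir")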

Hi @Kaniz and thanks again.

So in my example the artifacts have been downloaded to the local_path, which is /databricks/driver/mydir/chain.
From your explanation at point 1., it turns out that this directory too is not directly visible/accessible (it lives on the driver's local disk, not in my workspace or DBFS).

It seems then that the only way to get them is to apply point 4., so I proceeded with:

dbutils.fs.mv(local_dir, "/dbfs/mnt/mypath")

and also tried

dbutils.fs.mv(local_path, "/dbfs/mnt/mypath")

but in both cases there was a FileNotFound error saying that local_dir (/mydir) and local_path (/databricks/driver/mydir/chain) do not exist.

Actually, you can see that in the first error case the path is shown as /mydir (mydir directly under the root), which may not be OK.

In any case, I am still in the same place: I am not able to download the artifacts I am after. 🙃

This is really confusing.

I ran:

dbutils.fs.mkdirs("/databricks/driver/mydir")

which gave me the response: True

To check it exists, I then ran:

dbutils.fs.ls("/databricks/driver")

with the response:

[FileInfo(path='dbfs:/databricks/driver/mydir/', name='mydir/', size=0, modificationTime=17...)]
 
Then I executed:

local_path = client.download_artifacts(run_id, "chain", "mydir")
print("Artifacts downloaded in: {}".format(local_path))

with the response:

Artifacts downloaded in: /databricks/driver/mydir/chain

Finally I ran:

dbutils.fs.ls("/databricks/driver/mydir")

with the result: []

Which means that no artifacts were actually downloaded, or am I missing something?

Kaniz
Community Manager

Hi @Octavian1, I apologize for the confusion you're experiencing.

Let's break down the steps and troubleshoot the issue:

  1. Creating the Directory:

    • You successfully created the directory "/databricks/driver/mydir" using dbutils.fs.mkdirs("/databricks/driver/mydir").
    • The response True indicates that the directory was created.
  2. Listing Contents of โ€œ/databricks/driverโ€:

    • When you ran dbutils.fs.ls("/databricks/driver"), it showed that the directory "mydir" exists within "/databricks/driver".
    • The response FileInfo(path='dbfs:/databricks/driver/mydir/', name='mydir/', size=0, modificationTime=17...) confirms its existence.
  3. Downloading Artifacts:

    • You used client.download_artifacts(run_id, "chain", "mydir") to download artifacts from the specified run.
    • The response "Artifacts downloaded in: /databricks/driver/mydir/chain" indicates that the artifacts were successfully downloaded to that location.
  4. Listing Contents Again:

    • However, when you ran dbutils.fs.ls("/databricks/driver/mydir") again, it returned an empty result.
    • Note that dbutils.fs.ls resolves plain paths against DBFS, so it listed dbfs:/databricks/driver/mydir, which is a different location from the driver-local /databricks/driver/mydir where download_artifacts wrote the files (see the sketch after this list).
  5. Possible Issue:

    • The issue could also be related to the artifact path specified during the download.
    • Ensure that the artifact path "chain" exists within the specific MLflow run identified by <run-id>.
  6. Double-Check Artifact Path:

    • Verify that the artifact path "chain" is correct for the specific run.
    • You can navigate to the Artifacts tab within the MLflow run in the Databricks workspace to confirm the artifact structure.
  7. Copying Artifacts to DBFS:

    • If the artifacts are indeed downloaded to "/databricks/driver/mydir/chain", you can copy them to DBFS using the following command:
      dbutils.fs.cp("file:/databricks/driver/mydir/chain", "dbfs:/mnt/mypath")
      
      Replace "/mnt/mypath" with the actual DBFS path where you want to store the artifacts.
  8. Verify in DBFS:

    • After copying, navigate to the DBFS path (e.g., /dbfs/mnt/mypath) to verify that the artifacts are accessible in DBFS.

Remember that the "mydir" directory is a temporary driver-local location inside the cluster. By copying the artifacts to DBFS, you'll make them available for further use. If you encounter any issues during this process, please let me know, and we'll continue troubleshooting! 🚀🔍📦
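A quick way to compare the two locations from the notebook, a sketch assuming the paths used in this thread:

    import os
    
    # Driver-local directory (where download_artifacts reported writing the files):
    print(os.listdir("/databricks/driver/mydir/chain"))
    
    # DBFS directory of the same name (what dbutils.fs.mkdirs and dbutils.fs.ls touched):
    print(dbutils.fs.ls("dbfs:/databricks/driver/mydir"))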

For more information, you can refer to the Databricks documentation on interacting with workspace files.

Hi @Kaniz ,

Indeed the artifacts are in

"/dbfs/databricks/mlflow-tracking/<id>/<run_id>/artifacts/chain"

and I am able to navigate in the UI to the path mentioned above, where I can see the artifacts.

So I am not sure why the download apparently succeeds (as seen in the method response), but the final result is not the expected one.

Everything else you wrote is what I had already done.

Now I am thinking of an alternative: is it possible to do the same not from the Databricks notebook, but from a local script?
I am asking because I am not sure what settings I need in place to be able to run

client.download_artifacts(run_id, "chain", "mydir")

As it is, I get an error message saying the run_id is not recognized.

Or can the same operation (download_artifacts) be done by calling a REST API? If yes, which would it be?

Or using the databricks cli?

Thank you!
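A sketch of what I think the local-script variant would look like, assuming a personal access token is available through DATABRICKS_HOST/DATABRICKS_TOKEN or a ~/.databrickscfg profile (I have not confirmed it end to end):

import mlflow
from mlflow.tracking import MlflowClient

# Point MLflow at the Databricks workspace; credentials are resolved
# from the environment or the Databricks CLI configuration.
mlflow.set_tracking_uri("databricks")

client = MlflowClient()
local_path = client.download_artifacts("<run_id>", "chain", "./mydir")
print("Artifacts downloaded in: {}".format(local_path))

The MLflow CLI seems to expose the same operation as mlflow artifacts download --run-id <run_id> --artifact-path chain --dst-path ./mydir, and the tracking server also has REST endpoints such as GET /api/2.0/mlflow/artifacts/list.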

Octavian1
Contributor

OK, eventually I found a solution. I am writing it below in case somebody needs it. Basically, if the local directory passed to the download_artifacts method is an existing and accessible DBFS folder (addressed via the /dbfs FUSE path), the process works as expected.

import os
from mlflow.tracking import MlflowClient

# Consider you have the artifacts in "/dbfs/databricks/mlflow-tracking/<id>/<run_id>/artifacts/chain"
client = MlflowClient()
local_dir = "/dbfs/FileStore/mydir1"  # existing and accessible DBFS folder (FUSE path)
run_id = "<run_id>"
local_path = client.download_artifacts(run_id, "chain", local_dir)
print("Artifacts downloaded in: {}".format(local_path))

# expected output print message: Artifacts downloaded in: /dbfs/FileStore/mydir1/chain
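Because /dbfs/FileStore/mydir1 is the DBFS FUSE mount, the downloaded files are immediately visible to DBFS tooling as well; a quick check (path taken from the example above):

dbutils.fs.ls("dbfs:/FileStore/mydir1/chain")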