Patient Risk Score based on health history: Unable to create data folder for artifacts in S3 bucket

SreeRam
New Contributor

Hi All,

We're using the git project below to build a PoC on the concept of "Patient-Level Risk Scoring Based on Condition History": https://github.com/databricks-industry-solutions/hls-patient-risk

I was able to import the solution into Databricks and run the first module "01-data-prep.py" successfully.

I'm getting the error below in "02-automl-best-model.py" at the line `input_data_path = mlflow.artifacts.download_artifacts(...)`:

"Error downloading or reading artifact: The following failures occurred while downloading one or more artifacts from dbfs:/databricks/mlflow-tracking/dc37a79d57c343e5b875fda7c812586d/72f4739209ff437eaea4c1a4041a570a/artifacts".

Upon further research, I've found that the process is not able to create the "data" folder in the S3 bucket. Can you please provide any insights on why it is not able to create the "data" folder under the "artifacts" folder at the location below:
"https://databricks-workspace-stack-c416f-bucket.s3.ap-southeast-1.amazonaws.com/singapore-prod/27426..."

1 REPLY

Louis_Frolio
Databricks Employee

Greetings @SreeRam, here are some suggestions for you.

The error you're encountering with the hls-patient-risk solution accelerator is a common one, related to MLflow artifact access and storage configuration in Databricks. It stems from how MLflow stores artifacts in managed locations and the permissions required to access them.

Root Cause

The issue occurs because MLflow stores artifacts in a managed location (`dbfs:/databricks/mlflow-tracking/`), and direct DBFS access to this location is restricted. When the second notebook tries to download artifacts from the AutoML run, it fails because the "data" folder may not have been created or the artifact download method isn't using the proper MLflow client APIs.
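
As a minimal illustration of that restriction (run in a Databricks notebook where `dbutils` is available; this is illustration only, not part of the fix):

```python
# The managed MLflow artifact root from the error message
managed_root = "dbfs:/databricks/mlflow-tracking/"

# Direct DBFS access to this location is restricted, so a plain filesystem
# listing typically fails; artifacts under it should be read via MLflow APIs.
try:
    dbutils.fs.ls(managed_root)
except Exception as e:
    print(f"Direct DBFS access blocked: {e}")
```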

Solutions

Solution 1: Verify MLflow Client Usage

Ensure the code in `02-automl-best-model.py` uses the MLflow client API instead of direct DBFS access:

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Instead of direct artifact download
# Use the MLflow client method
run_id = "<your-automl-run-id>"
local_path = client.download_artifacts(run_id, "data", dst_path="/tmp/artifacts")
```
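
Equivalently, the module-level helper that the accelerator already calls can be given the run ID and artifact path explicitly (a sketch with placeholder values, assuming a recent MLflow version):

```python
import mlflow

# Placeholder: the run ID produced by the AutoML experiment
run_id = "<your-automl-run-id>"

# Download only the "data" artifact folder for that run to a local path
input_data_path = mlflow.artifacts.download_artifacts(
    run_id=run_id,
    artifact_path="data",
    dst_path="/tmp/artifacts",
)
print(input_data_path)
```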

Solution 2: Configure Custom Artifact Location

Before running the AutoML experiment, set a custom artifact location that points directly to your S3 bucket:

```python
import mlflow

# Set custom artifact location when creating experiment
experiment_name = "/Users/<your-user>/patient-risk"
artifact_location = "s3://databricks-workspace-stack-c416f-bucket/mlflow-artifacts"

try:
    experiment_id = mlflow.create_experiment(
        experiment_name,
        artifact_location=artifact_location
    )
except Exception:
    # Experiment already exists; reuse it
    experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id

mlflow.set_experiment(experiment_name)
```

Solution 3: Verify Cluster Instance Profile

Ensure your cluster has the correct IAM instance profile with permissions to write to the S3 bucket:

Required S3 Permissions:
- `s3:PutObject`
- `s3:GetObject`
- `s3:DeleteObject`
- `s3:ListBucket`

To configure:
1. Go to your Databricks workspace
2. Navigate to Compute → Select your cluster
3. Edit configuration → AWS attributes
4. Verify the instance profile has S3 write permissions to your bucket (a quick write test is sketched below)
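
If you want a quick sanity check from a notebook, the sketch below (assuming `boto3` is available on the cluster; the bucket, prefix, and region are placeholders taken from the links above) writes and then deletes a small test object using the cluster's instance profile:

```python
import boto3

# Placeholders -- substitute your actual bucket, prefix, and region
bucket = "databricks-workspace-stack-c416f-bucket"
key = "mlflow-artifacts/_permission_check.txt"
s3 = boto3.client("s3", region_name="ap-southeast-1")

# PutObject + DeleteObject exercise the write permissions the artifact store needs
s3.put_object(Bucket=bucket, Key=key, Body=b"permission check")
s3.delete_object(Bucket=bucket, Key=key)

print(f"Instance profile can write to s3://{bucket}/{key}")
```

An `AccessDenied` error on either call points to the missing permission on the instance profile.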

Solution 4: Alternative Approach - Load Data Directly

If artifact download continues to fail, modify the code to load the training data directly from the source table rather than from MLflow artifacts:

```python
# Instead of downloading artifacts
# Load training data directly from Delta table
training_data = spark.table("catalog.schema.patient_features_table").toPandas()

# Or if using a specific path
training_data = spark.read.format("delta").load("dbfs:/path/to/training/data").toPandas()
```

Solution 5: Check AutoML Run Output

Verify that the AutoML run in notebook 01 actually created the expected artifacts:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
run_id = "<automl-run-id>"

# List all artifacts for the run
artifacts = client.list_artifacts(run_id)
for artifact in artifacts:
    print(f"Artifact: {artifact.path}")
```

Additional Troubleshooting Steps

1. Verify MLflow version: Ensure you're using MLflow 1.9.1 or above:
```python
%pip install --upgrade mlflow
```
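If MLflow was already imported in the session before the upgrade, restart the notebook's Python process so the new version is picked up (standard Databricks notebook utility):
```python
# Restart the Python process for this notebook so the upgraded
# mlflow package is the one that gets imported afterwards
dbutils.library.restartPython()
```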

2. Check experiment location: In notebook 01, after the AutoML run completes, verify where artifacts are stored:
```python
run = mlflow.get_run(run_id)
print(f"Artifact URI: {run.info.artifact_uri}")
```

3. Review workspace configuration: If you're using a cross-account S3 bucket, ensure the workspace has the proper IAM role assumptions configured; the sketch below shows one way to check which IAM identity the cluster is actually assuming.
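
A minimal sketch for that check (assumes `boto3` is available on the cluster):

```python
import boto3

# The account and role ARN returned here are what S3 evaluates
# bucket policies and cross-account trust against
identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```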

The most reliable solution is Solution 1 combined with Solution 3, as this uses the proper MLflow client APIs while ensuring your cluster has the necessary permissions to access S3.

 

Hope this helps, Louis.

 
