<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Patient Risk Score based on health history: Unable to create data folder for artifacts in S3 bucket in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/patient-risk-score-based-on-health-history-unable-to-create-data/m-p/107644#M3930</link>
<description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;We're using the Git project below to build a PoC for "Patient-Level Risk Scoring Based on Condition History":&amp;nbsp;&lt;A href="https://github.com/databricks-industry-solutions/hls-patient-risk" target="_blank" rel="noopener"&gt;https://github.com/databricks-industry-solutions/hls-patient-risk&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I was able to import the solution into Databricks and run the first module, "&lt;A class="" title="01-data-prep.py" href="https://github.com/databricks-industry-solutions/hls-patient-risk/blob/main/01-data-prep.py" target="_blank" rel="noopener"&gt;01-data-prep.py&lt;/A&gt;", successfully.&lt;/P&gt;&lt;P&gt;I'm getting the error below in "&lt;A class="" title="02-automl-best-model.py" href="https://github.com/databricks-industry-solutions/hls-patient-risk/blob/main/02-automl-best-model.py" target="_blank" rel="noopener"&gt;02-automl-best-model.py&lt;/A&gt;" at the line "input_data_path = mlflow.artifacts.download_artifacts(...)":&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"Error downloading or reading artifact: The following failures occurred while downloading one or more artifacts from dbfs:/databricks/mlflow-tracking/dc37a79d57c343e5b875fda7c812586d/72f4739209ff437eaea4c1a4041a570a/artifacts".&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;Upon further research, I've found that the process is not able to create the "data" folder in the S3 bucket.
Can you please provide any insight into why it cannot create the "data" folder under the "artifacts" folder at the link below?&lt;BR /&gt;"&lt;A href="https://databricks-workspace-stack-c416f-bucket.s3.ap-southeast-1.amazonaws.com/singapore-prod/2742639904931973.jobs/mlflow-tracking/dc37a79d57c343e5b875fda7c812586d/72f4739209ff437eaea4c1a4041a570a/artifacts/data" target="_blank" rel="noopener"&gt;https://databricks-workspace-stack-c416f-bucket.s3.ap-southeast-1.amazonaws.com/singapore-prod/2742639904931973.jobs/mlflow-tracking/dc37a79d57c343e5b875fda7c812586d/72f4739209ff437eaea4c1a4041a570a/artifacts/data&lt;/A&gt;"&lt;/P&gt;</description>
    <pubDate>Wed, 29 Jan 2025 17:26:04 GMT</pubDate>
    <dc:creator>SreeRam</dc:creator>
    <dc:date>2025-01-29T17:26:04Z</dc:date>
    <item>
      <title>Patient Risk Score based on health history: Unable to create data folder for artifacts in S3 bucket</title>
      <link>https://community.databricks.com/t5/machine-learning/patient-risk-score-based-on-health-history-unable-to-create-data/m-p/107644#M3930</link>
<description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;We're using the Git project below to build a PoC for "Patient-Level Risk Scoring Based on Condition History":&amp;nbsp;&lt;A href="https://github.com/databricks-industry-solutions/hls-patient-risk" target="_blank" rel="noopener"&gt;https://github.com/databricks-industry-solutions/hls-patient-risk&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I was able to import the solution into Databricks and run the first module, "&lt;A class="" title="01-data-prep.py" href="https://github.com/databricks-industry-solutions/hls-patient-risk/blob/main/01-data-prep.py" target="_blank" rel="noopener"&gt;01-data-prep.py&lt;/A&gt;", successfully.&lt;/P&gt;&lt;P&gt;I'm getting the error below in "&lt;A class="" title="02-automl-best-model.py" href="https://github.com/databricks-industry-solutions/hls-patient-risk/blob/main/02-automl-best-model.py" target="_blank" rel="noopener"&gt;02-automl-best-model.py&lt;/A&gt;" at the line "input_data_path = mlflow.artifacts.download_artifacts(...)":&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"Error downloading or reading artifact: The following failures occurred while downloading one or more artifacts from dbfs:/databricks/mlflow-tracking/dc37a79d57c343e5b875fda7c812586d/72f4739209ff437eaea4c1a4041a570a/artifacts".&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;Upon further research, I've found that the process is not able to create the "data" folder in the S3 bucket.
Can you please provide any insight into why it cannot create the "data" folder under the "artifacts" folder at the link below?&lt;BR /&gt;"&lt;A href="https://databricks-workspace-stack-c416f-bucket.s3.ap-southeast-1.amazonaws.com/singapore-prod/2742639904931973.jobs/mlflow-tracking/dc37a79d57c343e5b875fda7c812586d/72f4739209ff437eaea4c1a4041a570a/artifacts/data" target="_blank" rel="noopener"&gt;https://databricks-workspace-stack-c416f-bucket.s3.ap-southeast-1.amazonaws.com/singapore-prod/2742639904931973.jobs/mlflow-tracking/dc37a79d57c343e5b875fda7c812586d/72f4739209ff437eaea4c1a4041a570a/artifacts/data&lt;/A&gt;"&lt;/P&gt;</description>
      <pubDate>Wed, 29 Jan 2025 17:26:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/patient-risk-score-based-on-health-history-unable-to-create-data/m-p/107644#M3930</guid>
      <dc:creator>SreeRam</dc:creator>
      <dc:date>2025-01-29T17:26:04Z</dc:date>
    </item>
    <item>
      <title>Re: Patient Risk Score based on health history: Unable to create data folder for artifacts in S3 bucket</title>
      <link>https://community.databricks.com/t5/machine-learning/patient-risk-score-based-on-health-history-unable-to-create-data/m-p/137260#M4398</link>
      <description>&lt;P&gt;Greetings&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146729"&gt;@SreeRam&lt;/a&gt;&amp;nbsp;, here are some suggestions for you.&lt;/P&gt;
&lt;P&gt;Based on the error you're encountering with the &lt;STRONG&gt;hls-patient-risk&lt;/STRONG&gt;&amp;nbsp;solution accelerator, this is a common issue related to MLflow artifact access and storage configuration in Databricks. The problem stems from how MLflow stores artifacts in managed locations and the permissions required to access them.&lt;/P&gt;
&lt;H2&gt;Root Cause&lt;/H2&gt;
&lt;P&gt;The issue occurs because MLflow stores artifacts in a managed location (`dbfs:/databricks/mlflow-tracking/`), and direct DBFS access to this location is restricted. When the second notebook tries to download artifacts from the AutoML run, it fails because the "data" folder may not have been created or the artifact download method isn't using the proper MLflow client APIs.&lt;/P&gt;
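&lt;P&gt;As a rough sketch (the IDs are taken from the error message in the original post), the managed path encodes both the experiment ID and the run ID, and the same files can be addressed with a `runs:/` URI that the MLflow client resolves through the tracking server instead of through direct DBFS access:&lt;/P&gt;

```python
def runs_uri_from_tracking_path(dbfs_path: str, artifact_path: str = "") -> str:
    """Turn dbfs:/databricks/mlflow-tracking/EXP_ID/RUN_ID/artifacts
    into the equivalent runs:/RUN_ID/ARTIFACT_PATH URI."""
    parts = dbfs_path.rstrip("/").split("/")
    run_id = parts[-2]  # the segment just before "artifacts"
    return f"runs:/{run_id}/{artifact_path}".rstrip("/")

# Path from the error message in the original post
path = ("dbfs:/databricks/mlflow-tracking/dc37a79d57c343e5b875fda7c812586d/"
        "72f4739209ff437eaea4c1a4041a570a/artifacts")
print(runs_uri_from_tracking_path(path, "data"))
# runs:/72f4739209ff437eaea4c1a4041a570a/data
```

&lt;P&gt;A URI in this form can be passed to `mlflow.artifacts.download_artifacts(artifact_uri=...)`, which avoids hard-coding the restricted `dbfs:/databricks/mlflow-tracking/` location.&lt;/P&gt;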
&lt;H2&gt;Solutions&lt;/H2&gt;
&lt;H3&gt;Solution 1: Verify MLflow Client Usage&lt;/H3&gt;
&lt;P&gt;Ensure the code in `02-automl-best-model.py` uses the MLflow client API instead of direct DBFS access:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;import mlflow&lt;BR /&gt;from mlflow.tracking import MlflowClient&lt;/P&gt;
&lt;P&gt;client = MlflowClient()&lt;/P&gt;
&lt;P&gt;# Download the "data" artifact through the MLflow client instead of a direct DBFS path&lt;BR /&gt;run_id = "&amp;lt;your-automl-run-id&amp;gt;"&lt;BR /&gt;local_path = client.download_artifacts(run_id, "data", dst_path="/tmp/artifacts")&lt;BR /&gt;```&lt;/P&gt;
&lt;H3&gt;Solution 2: Configure Custom Artifact Location&lt;/H3&gt;
&lt;P&gt;Before running the AutoML experiment, set a custom artifact location that points directly to your S3 bucket:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;import mlflow&lt;/P&gt;
&lt;P&gt;# Set custom artifact location when creating experiment&lt;BR /&gt;experiment_name = "/Users/&amp;lt;your-user&amp;gt;/patient-risk"&lt;BR /&gt;artifact_location = "s3://databricks-workspace-stack-c416f-bucket/mlflow-artifacts"&lt;/P&gt;
&lt;P&gt;try:&lt;BR /&gt;    experiment_id = mlflow.create_experiment(&lt;BR /&gt;        experiment_name,&lt;BR /&gt;        artifact_location=artifact_location,&lt;BR /&gt;    )&lt;BR /&gt;except mlflow.exceptions.MlflowException:&lt;BR /&gt;    # The experiment already exists; reuse it&lt;BR /&gt;    experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id&lt;/P&gt;
&lt;P&gt;mlflow.set_experiment(experiment_name)&lt;BR /&gt;```&lt;/P&gt;
&lt;H3&gt;Solution 3: Verify Cluster Instance Profile&lt;/H3&gt;
&lt;P&gt;Ensure your cluster has the correct IAM instance profile with permissions to write to the S3 bucket:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Required S3 Permissions:&lt;/STRONG&gt;&lt;BR /&gt;- `s3:PutObject`&lt;BR /&gt;- `s3:GetObject`&lt;BR /&gt;- `s3:DeleteObject`&lt;BR /&gt;- `s3:ListBucket`&lt;/P&gt;
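&lt;P&gt;As a minimal sketch (the bucket name is taken from the URL in the original post; adjust the resource ARNs to your workspace), an instance-profile policy document covering those four actions could be assembled like this:&lt;/P&gt;

```python
import json

# Assumed bucket name -- replace with your workspace root bucket
BUCKET = "databricks-workspace-stack-c416f-bucket"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Object-level actions apply to keys inside the bucket
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
        },
        {   # ListBucket applies to the bucket itself, not its keys
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": f"arn:aws:s3:::{BUCKET}",
        },
    ],
}

print(json.dumps(policy, indent=2))
```

&lt;P&gt;Note that `s3:ListBucket` must target the bucket ARN itself, while the object actions target the `/*` key pattern; mixing these up is a common cause of AccessDenied errors.&lt;/P&gt;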
&lt;P&gt;&lt;STRONG&gt;To configure:&lt;/STRONG&gt;&lt;BR /&gt;1. Go to your Databricks workspace&lt;BR /&gt;2. Navigate to Compute → Select your cluster&lt;BR /&gt;3. Edit configuration → AWS attributes&lt;BR /&gt;4. Verify the instance profile has S3 write permissions to your bucket&lt;/P&gt;
&lt;H3&gt;Solution 4: Alternative Approach - Load Data Directly&lt;/H3&gt;
&lt;P&gt;If artifact download continues to fail, modify the code to load the training data directly from the source table rather than from MLflow artifacts:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;# Instead of downloading artifacts&lt;BR /&gt;# Load training data directly from Delta table&lt;BR /&gt;training_data = spark.table("catalog.schema.patient_features_table").toPandas()&lt;/P&gt;
&lt;P&gt;# Or if using a specific path&lt;BR /&gt;training_data = spark.read.format("delta").load("dbfs:/path/to/training/data").toPandas()&lt;BR /&gt;```&lt;/P&gt;
&lt;H3&gt;Solution 5: Check AutoML Run Output&lt;/H3&gt;
&lt;P&gt;Verify that the AutoML run in notebook 01 actually created the expected artifacts:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;from mlflow.tracking import MlflowClient&lt;/P&gt;
&lt;P&gt;client = MlflowClient()&lt;BR /&gt;run_id = "&amp;lt;automl-run-id&amp;gt;"&lt;/P&gt;
&lt;P&gt;# List all artifacts logged for the run&lt;BR /&gt;artifacts = client.list_artifacts(run_id)&lt;BR /&gt;for artifact in artifacts:&lt;BR /&gt;    print(f"Artifact: {artifact.path}")&lt;BR /&gt;```&lt;/P&gt;
&lt;H2&gt;Additional Troubleshooting Steps&lt;/H2&gt;
&lt;P&gt;1. &lt;STRONG&gt;Verify MLflow version&lt;/STRONG&gt;: Ensure you're on a recent MLflow release; older versions predate the `mlflow.artifacts` API that the notebook calls:&lt;BR /&gt;```python&lt;BR /&gt;%pip install --upgrade mlflow&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;2. &lt;STRONG&gt;Check experiment location&lt;/STRONG&gt;: In notebook 01, after the AutoML run completes, verify where artifacts are stored:&lt;BR /&gt;```python&lt;BR /&gt;run = mlflow.get_run(run_id)&lt;BR /&gt;print(f"Artifact URI: {run.info.artifact_uri}")&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;3. &lt;STRONG&gt;Review workspace configuration&lt;/STRONG&gt;: If you're using a cross-account S3 bucket, ensure the workspace has proper IAM role assumptions configured.&lt;/P&gt;
&lt;P&gt;The most reliable solution is &lt;STRONG&gt;Solution 1&lt;/STRONG&gt;&amp;nbsp;combined with &lt;STRONG&gt;Solution 3&lt;/STRONG&gt;, as this uses the proper MLflow client APIs while ensuring your cluster has the necessary permissions to access S3.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this helps, Louis.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 01 Nov 2025 19:48:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/patient-risk-score-based-on-health-history-unable-to-create-data/m-p/137260#M4398</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-11-01T19:48:09Z</dc:date>
    </item>
  </channel>
</rss>

