Unity Catalog - Writing to PNG Files to Cluster and then using dbutils.fs.cp to send to Azure ADLS2

aicd_de
New Contributor III

Hi All

Looking to get some help. We are on Unity Catalog in Azure. We have a requirement to use Python to write out several PNG files via Matplotlib and then drop them into an ADLS2 bucket. With Unity Catalog we can easily use dbutils.fs.cp or dbutils.fs.put to do this. However, the PNGs need to be written to the cluster first, before we use the copy to move them over to the ADLS2 bucket.

The issue: dbutils cannot access all locations on the cluster, and in the folders it can access we get an ERROR 13 Access Denied when trying to write the PNGs. So I am not sure where I can drop the files to do the copy from. Here is the code snippet:

<CODE HERE TO GENERATE CHART> ... followed by ...

img_name = f'{product_level_1}-{product_level_2}-{product_level_3}-{value_type_1}-{value_type_2}.png'

plt.savefig('/databricks-datasets/' + img_name, bbox_inches='tight', format='png')

dbutils.fs.cp('/databricks-datasets/' + img_name, storage_url + img_name)

print(img_name)
So with plt.savefig, if I just pass img_name it drops the file into the default workspace location, but then dbutils cannot locate it. And when I try the folders dbutils does have access to, it doesn't work due to permission issues.
4 REPLIES

Kaniz
Community Manager

Hi @aicd_de,

You can save the PNG files to the local file system of the Databricks cluster (for example under /tmp on the driver) and then copy them to the ADLS2 bucket with dbutils.fs.cp.

Here's some example code to do that:

# Import the required libraries
import matplotlib.pyplot as plt
from io import BytesIO
import os

# Convert a Matplotlib figure to PNG bytes in memory
def convert_plot_to_bytes(fig):
    buf = BytesIO()
    fig.savefig(buf, format='png', bbox_inches='tight')
    byte_im = buf.getvalue()
    buf.close()
    return byte_im

# Generate the plot
fig, ax = plt.subplots()
ax.plot([1, 2, 4, 2, 1])
ax.set_title("Sample Plot")

# Convert the plot to bytes (this runs on the driver, so no Spark UDF is needed)
plot_bytes = convert_plot_to_bytes(fig)

# Save the bytes to a file on the local file system of the Databricks cluster
local_file_path = "/tmp/sample_plot.png"
with open(local_file_path, "wb") as f:
    f.write(plot_bytes)

# Copy the file from the local file system to the ADLS2 bucket
# (the "file:" prefix tells dbutils the source is the driver's local disk)
adls2_upload_path = "/mnt/<mount_name>/<path>/sample_plot.png"
dbutils.fs.cp(f"file:{local_file_path}", adls2_upload_path)

# Remove the file from the local file system of the Databricks cluster
os.remove(local_file_path)

This code saves the generated plot to a local file on the Databricks cluster and then copies it to the ADLS2 bucket using the dbutils.fs.cp() command.

You can replace <mount_name> with the name of your ADLS2 mount point and <path> with the directory you want to upload to.

Alternatively, you can save the PNG files directly to DBFS (Databricks File System) by passing a /dbfs/... path to plt.savefig() instead of a purely local file path.
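
A minimal sketch of that variant, assuming the cluster exposes the /dbfs FUSE mount (this can be restricted on some Unity Catalog shared-access clusters) and reusing storage_url from your snippet as the abfss:// destination:

import matplotlib.pyplot as plt

# Generate the plot
fig, ax = plt.subplots()
ax.plot([1, 2, 4, 2, 1])

# Write straight to DBFS through the /dbfs FUSE mount (assumes the mount is available on this cluster)
fig.savefig("/dbfs/tmp/sample_plot.png", format="png", bbox_inches="tight")

# Copy from DBFS to the ADLS2 location (storage_url is assumed to be your abfss:// URL)
dbutils.fs.cp("dbfs:/tmp/sample_plot.png", storage_url + "sample_plot.png")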

aicd_de
New Contributor III

I get this error writing to that location:

"java.lang.SecurityException: Cannot use com.databricks.backend.daemon.driver.WorkspaceLocalFileSystem - local filesystem access is forbidden"

Kaniz
Community Manager

Hi, based on this error message: "java.lang.SecurityException: Cannot use com.databricks.backend.daemon.driver.WorkspaceLocalFileSystem - local filesystem access is forbidden"

Possible reasons for the error:

  • The service principal lacks write permissions on the storage location
  • The cluster has connectivity to the storage location but is not authorized to access the storage
  • The access connector does not have the correct role
  • Storage firewall issues
  • Incorrect storage credentials or Spark Azure keys being used

Resolution steps:

  • Access tables via Unity Catalog set up with a managed identity
  • Check the driver log for the complete error details and the traceId
  • Collect the storage account details and storage credential details
  • Verify the storage principal/managed identity has access to the storage account (a quick check is sketched below)
  • Assign the "Storage Blob Data Contributor" role to the SP/MI at the storage account level
  • If you cannot assign the role at the account level, set the "Storage Blob Delegator" role at the account level and grant the "Storage Blob Data Contributor" role at the container level
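
To verify access from the cluster itself (a minimal sketch, assuming storage_url is the abfss://<container>@<account>.dfs.core.windows.net/... URL from the original snippet), you can list and write a small test file at the target location; if this fails, the problem is credentials/roles/firewall rather than the PNG code:

# List the target location - fails fast if credentials, roles or firewall are the problem
display(dbutils.fs.ls(storage_url))

# Try a small text write to rule out read-only access (third argument = overwrite, creates a throwaway file)
dbutils.fs.put(storage_url + "uc_write_test.txt", "write test", True)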

aicd_de
New Contributor III

Hmm, I read something different - someone else had this error because they were using a shared cluster - apparently it does not happen on a single-user cluster. All of those settings are already done and I am a full admin.
