<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic workflow permissions errors in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/workflow-permissions-errors/m-p/86716#M37330</link>
    <description>&lt;P&gt;I have a notebook that outputs an Excel file. Through trial and error, and after consulting various forums, I discovered that the .xlsx file needed to be written to a temp file and then copied to the volume in Unity Catalog.&lt;/P&gt;&lt;P&gt;When I run the notebook by itself it works perfectly.&lt;/P&gt;&lt;P&gt;BUT, when I schedule the job in a workflow it fails with the following error:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;PermissionError: [Errno 13] Permission denied: '/local_disk0/tmp/202407-arf743-Section1.xlsx'&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The workflow uses the same cluster as when I run the notebook alone.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The workflow has the "Run as" value set to my user ID.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Any help would be appreciated.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 30 Aug 2024 01:32:24 GMT</pubDate>
    <dc:creator>BillMarshall</dc:creator>
    <dc:date>2024-08-30T01:32:24Z</dc:date>
    <item>
      <title>workflow permissions errors</title>
      <link>https://community.databricks.com/t5/data-engineering/workflow-permissions-errors/m-p/86716#M37330</link>
      <description>&lt;P&gt;I have a notebook that outputs an Excel file. Through trial and error, and after consulting various forums, I discovered that the .xlsx file needed to be written to a temp file and then copied to the volume in Unity Catalog.&lt;/P&gt;&lt;P&gt;When I run the notebook by itself it works perfectly.&lt;/P&gt;&lt;P&gt;BUT, when I schedule the job in a workflow it fails with the following error:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;PermissionError: [Errno 13] Permission denied: '/local_disk0/tmp/202407-arf743-Section1.xlsx'&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The workflow uses the same cluster as when I run the notebook alone.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The workflow has the "Run as" value set to my user ID.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Any help would be appreciated.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 30 Aug 2024 01:32:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/workflow-permissions-errors/m-p/86716#M37330</guid>
      <dc:creator>BillMarshall</dc:creator>
      <dc:date>2024-08-30T01:32:24Z</dc:date>
    </item>
    <item>
      <title>Re: workflow permissions errors</title>
      <link>https://community.databricks.com/t5/data-engineering/workflow-permissions-errors/m-p/88195#M37502</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/114530"&gt;@BillMarshall&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;&lt;P&gt;I have a notebook that outputs an Excel file. Through trial and error, and after consulting with various forums I discovered&amp;nbsp; the .xlsx file needed to be written to a temp file and then copied to the volume in Unity Catalog.&lt;/P&gt;&lt;P&gt;When I run the notebook by itself it works perfectly.&lt;/P&gt;&lt;P&gt;BUT, when I schedule the job in a workflow it fails with the following error:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;PermissionError: [Errno 13] Permission denied: '/local_disk0/tmp/202407-arf743-Section1.xlsx'&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The workflow is using the same cluster as when I run the notebook alone.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The workflow has the "Run as" value set to my user ID.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Any help would be appreciated&lt;/SPAN&gt;&lt;/P&gt;&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;It sounds like you’re encountering a permissions issue when your workflow tries to write to the temporary directory. Here are a few steps to troubleshoot and resolve it:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Check file permissions: ensure that the directory /local_disk0/tmp/ has the correct permissions set for your user. You can use os.chmod() in Python to change the file permissions if necessary.&lt;/LI&gt;&lt;LI&gt;Verify user permissions: double-check that the user ID the workflow runs as has permission to write to the specified directory. Even if the “Run as” value is set to your user ID, additional restrictions may apply.&lt;/LI&gt;&lt;LI&gt;Temporary directory path: ensure the path to the temporary directory is correctly specified and accessible; paths can differ slightly between environments.&lt;/LI&gt;&lt;LI&gt;Cluster configuration: verify that the cluster configuration is the same when running the notebook manually and when running it as part of the workflow; environment variables or permissions may differ.&lt;/LI&gt;&lt;LI&gt;Use a different temporary directory: if possible, try a directory you know has the correct permissions. This helps isolate whether the issue is specific to /local_disk0/tmp/.&lt;/LI&gt;&lt;LI&gt;Check for open files: ensure the file isn’t being accessed by another process, which might also cause the permission error.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;Hope this helps.&lt;BR /&gt;Best regards,&lt;BR /&gt;florence023&lt;/P&gt;</description>
      <pubDate>Wed, 04 Sep 2024 05:51:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/workflow-permissions-errors/m-p/88195#M37502</guid>
      <dc:creator>florence023</dc:creator>
      <dc:date>2024-09-04T05:51:14Z</dc:date>
    </item>
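The temp-file-then-copy pattern discussed above can be sketched with the standard library alone; this is a minimal illustration (the function name, arguments, and paths are hypothetical, not from the thread). Using `tempfile.mkdtemp` gives each run a fresh directory owned by the current user, which avoids an Errno 13 when a stale file of the same name in a shared tmp directory belongs to another user or an earlier run:

```python
import os
import shutil
import tempfile

def write_then_copy(write_fn, file_name, dest_dir):
    """Write a file into a fresh per-run temp directory, then copy it to dest_dir.

    write_fn is any callable that writes the file at the path it is given,
    e.g. lambda p: pdf.to_excel(p, index=False) for a pandas DataFrame.
    """
    tmp_dir = tempfile.mkdtemp()          # brand-new dir owned by the current user
    tmp_path = os.path.join(tmp_dir, file_name)
    write_fn(tmp_path)                    # produce the file on local scratch disk
    dest_path = os.path.join(dest_dir, file_name)
    shutil.copy(tmp_path, dest_path)      # copy to the final destination
    shutil.rmtree(tmp_dir)                # clean up the scratch directory
    return dest_path
```

In a Databricks notebook, `dest_dir` would be a Unity Catalog volume path (e.g. something under `/Volumes/...`; the exact path depends on your catalog and schema).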
    <item>
      <title>Re: workflow permissions errors</title>
      <link>https://community.databricks.com/t5/data-engineering/workflow-permissions-errors/m-p/88214#M37510</link>
      <description>&lt;P&gt;Hello, yes, you do need to write the Excel file to a temp folder first, but then you can move it wherever you want without a problem. In my current project we implemented this method to create the file in the temp folder and then move it to a specific folder in an Azure Blob container.&lt;/P&gt;&lt;PRE&gt;def export_to_excel(dbutils, df, file_name, result_path):
    import pandas as pd
    from pyspark.sql.functions import row_number, monotonically_increasing_id
    from pyspark.sql import Window

    # Excel's sheet row limit, minus one row for the header
    excel_limit = 1048576 - 1

    num_rows = df.count()
    num_files = num_rows / excel_limit

    if num_files &amp;lt; 1:
        # Save on the driver's local disk first, then move to the destination
        local_path = f"/databricks/driver/{file_name}.xlsx"
        with pd.ExcelWriter(local_path, engine='xlsxwriter') as writer:
            df.toPandas().to_excel(writer, sheet_name='Sheet1', index=False)
        print(f"moving {file_name}.xlsx from /databricks/driver/ to {result_path}")
        dbutils.fs.mv(f"file:{local_path}", f"{result_path}{file_name}.xlsx")
    else:
        # Number the rows so the DataFrame can be split into Excel-sized parts
        df = df.withColumn("row_idx", row_number().over(Window.orderBy(monotonically_increasing_id())))
        i = 0
        while i &amp;lt; num_files:
            df_export = df.filter((df.row_idx &amp;gt;= excel_limit * i) &amp;amp; (df.row_idx &amp;lt; excel_limit * (i + 1)))
            i += 1
            print(f"saving {df_export.count()} rows in {file_name}_part{i}.xlsx")
            df_export = df_export.drop("row_idx").toPandas()
            # Save on the cluster's local disk first to avoid permission errors
            local_path = f"/databricks/driver/{file_name}_part{i}.xlsx"
            df_export.to_excel(local_path, engine='xlsxwriter', index=False, header=True)
            # Move the file from the cluster disk to the destination
            print(f"moving {file_name}_part{i}.xlsx from /databricks/driver/ to {result_path}")
            dbutils.fs.mv(f"file:{local_path}", f"{result_path}{file_name}_part{i}.xlsx")&lt;/PRE&gt;&lt;P&gt;PS: You need to install the xlsxwriter Python library in order to execute this method and generate the Excel file.&lt;/P&gt;</description>
      <pubDate>Wed, 04 Sep 2024 06:54:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/workflow-permissions-errors/m-p/88214#M37510</guid>
      <dc:creator>emora</dc:creator>
      <dc:date>2024-09-04T06:54:04Z</dc:date>
    </item>
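The row-chunking arithmetic in the reply above can be checked in isolation; a small sketch (function name is illustrative, not from the thread) that computes the half-open row ranges each output part would cover, given Excel's limit of 1,048,576 rows per sheet with one row reserved for the header:

```python
# Excel's per-sheet row limit, minus one row reserved for the header
EXCEL_LIMIT = 1048576 - 1

def part_ranges(num_rows, limit=EXCEL_LIMIT):
    """Return the [start, end) row-index ranges, one per output .xlsx part."""
    ranges = []
    start = 0
    while start < num_rows:
        end = min(start + limit, num_rows)  # last part may be shorter
        ranges.append((start, end))
        start = end
    return ranges
```

For example, 2,000,000 rows split into two parts: one full sheet of 1,048,575 rows and a remainder of 951,425 rows.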
  </channel>
</rss>

