10-25-2024 07:45 AM - edited 10-25-2024 07:50 AM
We are encountering frequent GetPathStatus and GetBlobProperties errors when trying to access Azure Data Lake Storage (ADLS) paths through our Databricks environment. The errors consistently return a 404 PathNotFound status for paths that should be accessible.
Context:
Error Count: The errors are recurring frequently, as seen in the attached logs, which indicate multiple instances of the PathNotFound error with status code 404.
Timestamps: Errors occur across multiple timestamps (see attached logs for details).
Attached Screenshot: Logs showing details of the error, including the operation name, status codes, and paths.
Could you please assist in identifying why these PathNotFound and BlobNotFound errors are occurring despite correct configuration and permissions? Additionally, if there is any further configuration required on the Azure or Databricks side to resolve this, please advise. Thanks in advance.
10-25-2024 08:38 AM
Hi,
1) Ensure that the paths you are trying to access are correct and exist in the ADLS Gen2 storage account.
2) Verify that the Databricks cluster has the necessary permissions to access the ADLS Gen2 paths (a quick check is sketched below).
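A minimal way to test both points from a notebook; the path is a placeholder, and dbutils is only available inside a Databricks notebook, so treat this as a sketch rather than a definitive check:

# Placeholder ADLS Gen2 path; replace container, account, and folder with your own.
path = "abfss://<container>@<storageaccount>.dfs.core.windows.net/<folder>"

try:
    entries = dbutils.fs.ls(path)  # raises if the path is missing or access is denied
    print(f"Path is reachable, {len(entries)} entries found")
except Exception as e:
    print(f"Path or permission problem: {e}")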
Br
10-28-2024 12:28 AM
Hi,
I can confirm that I have checked this, and everything seems to be in order: both the paths and the permissions are definitely in place, as we are also successfully writing data to the container. I noticed that these messages come up in several situations:
1. In SQL Warehouse queries
2. During spark.read... operations
3. During spark.write... operations
We are using DBR 13.3 in this workspace. Any ideas on why so many storage-related messages are appearing? It only started happening after we enabled diagnostic settings in Azure.
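For reference, the reads and writes in question are as plain as the sketch below (the paths are placeholders); nothing in the job code references the missing paths reported in the diagnostic logs, so the extra storage calls come from Spark/Databricks internals rather than from anything explicit in the notebook.

# Placeholder paths; the code itself never mentions _delta_log or the other
# "missing" paths seen in the storage diagnostic logs.
src = "abfss://<container>@<storageaccount>.dfs.core.windows.net/data/events/"
dst = "abfss://<container>@<storageaccount>.dfs.core.windows.net/data/events_copy/"

df = spark.read.parquet(src)            # triggers GetPathStatus calls on the folder and its parents
df.write.mode("overwrite").parquet(dst)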
10-28-2024 12:46 AM
Testing this in incognito mode will help!
10-28-2024 12:57 AM
Why do you think this will help? We have Spark and connection configuration in the cluster settings, and spark.read... or write statements are executed by the notebook. Additionally, the SQL queries are coming from outside. How would Incognito help in this scenario?
Monday
Adding answer from MSFT Support Team:
Why is _delta_log being checked when the function used is parquet?
The _delta_log directory is checked because the system is designed to scan the target directory and its parent directories for a Delta log folder. This is done so that if a user writes to a Delta table using the wrong format (e.g., Parquet instead of Delta), the system can identify the mistake and fail the job to prevent data corruption.
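A hedged illustration of that behavior (placeholder path; the exact error message varies by DBR version): writing Parquet into a folder that already holds a Delta table fails rather than silently mixing formats.

# Placeholder path used purely for illustration.
base = "abfss://<container>@<storageaccount>.dfs.core.windows.net/tmp/format_check_demo"

df = spark.range(10)
df.write.format("delta").mode("overwrite").save(base)   # creates _delta_log under base

# The "wrong format" case the probe guards against: Spark finds the _delta_log
# folder and raises an error instead of appending raw Parquet files to the table.
df.write.format("parquet").mode("append").save(base)    # expected to fail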
Why are all parent folders getting _delta_log calls?
The system recursively checks all parent directories for the _delta_log folder to determine if any of the parent directories are Delta tables. This is part of the design to ensure that the correct table format is being used and to avoid potential issues with data integrity.
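A conceptual sketch of that recursion (this is not Databricks' actual implementation, only the shape of it): one probe per ancestor directory, so every ancestor that is not a Delta table produces one 404 in the diagnostic logs.

from posixpath import dirname

def probe_delta_ancestors(path: str) -> None:
    """Print the _delta_log probes that a lookup on `path` would perform."""
    current = path.rstrip("/")
    while current not in ("", "/"):
        # Each probe corresponds to one GetPathStatus request against storage.
        print(f"probe: {current}/_delta_log")
        current = dirname(current)

# Container-relative path used purely for illustration.
probe_delta_ancestors("/raw/sales/2024/10/events")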
What are the files _encryption_metadata/manifest.json and _spark_metadata being referenced for, given that they are not present in the folders?
The _encryption_metadata/manifest.json file is being checked to determine if encryption is enabled on the storage.
The _spark_metadata directory is typically created by streaming jobs to store metadata about the stream. Even though these files may not be present in the folders, the system checks for them as part of its standard operations.
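For context, a hedged sketch (placeholder paths) of the kind of streaming file sink that creates _spark_metadata in its output folder, which is why later reads probe for it:

# Placeholder paths; a streaming file sink writes a _spark_metadata folder
# inside the output directory, which subsequent reads then probe for.
out = "abfss://<container>@<storageaccount>.dfs.core.windows.net/tmp/stream_out"
chk = "abfss://<container>@<storageaccount>.dfs.core.windows.net/tmp/stream_chk"

q = (spark.readStream.format("rate").load()      # built-in test source
     .writeStream
     .format("parquet")
     .option("checkpointLocation", chk)
     .start(out))

q.awaitTermination(30)   # let a few micro-batches write, then stop the stream
q.stop()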
How to remove these requests?
Currently, there is no direct way to remove these requests as they are part of the system's design to ensure data integrity and correct table format usage.