Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Getting error hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: The specified blob d

hanifmusa
New Contributor

I am exporting parquet files (partitioned by id) in append mode. However, I encounter errors occasionally, while other times the job completes successfully.

Apache Spark Exception: Exception thrown in awaitResult: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: The specified blob does not exist.

Currently, the storage access is configured as follows: `wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<directory-name>`

This happens when exporting in append mode. Can anyone help?

 

1 REPLY

balajij8
Contributor III

It's generally due to race conditions: Spark checks for existing partition files before writing, and directory operations such as rename are not atomic over wasbs://, so a file seen during the check can be gone by the time the write is committed.

You can try the following:

1. Switch to Delta Lake - Write in Delta Lake format instead of plain Parquet in append mode. Delta handles concurrent appends reliably.

2. Use the ABFS/ABFSS protocol with Azure Data Lake Storage - Switch from wasbs:// to abfss://, which has better consistency guarantees. It requires a storage account with hierarchical namespace enabled (ADLS Gen2). Use Unity Catalog volumes if feasible.

spark.read.load("abfss://container@storageaccount.dfs.core.windows.net/data_path")

More details here

3. Use overwrite with dynamic partition overwrite mode (`spark.sql.sources.partitionOverwriteMode` set to `dynamic`) if append semantics aren't strictly required per partition.

4. Add Retry Logic - Wrap the write operation with retry logic to handle transient Azure Storage errors.

5. Check Storage Configuration - Ensure you are using the latest Hadoop Azure connector version, and review the storage account's configuration.
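
For option 4, a minimal retry wrapper might look like the sketch below. The function name `write_with_retry` and its parameters are made up for illustration, and it catches every exception for simplicity; in a real job you would likely narrow that to the storage error you are seeing. Note that retrying an append can duplicate rows if the failed attempt already committed some files, so check the output after a retry.

```python
import time


def write_with_retry(write_fn, max_attempts=3, base_delay_s=2.0, sleep=time.sleep):
    """Call write_fn(); on failure, wait with exponential backoff and try again.

    write_fn is any zero-argument callable that performs the write.
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception as exc:  # assumption: broad catch, narrow it in real code
            if attempt == max_attempts:
                raise
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            delay = base_delay_s * 2 ** (attempt - 1)
            print(f"Write attempt {attempt} failed ({exc}); retrying in {delay}s")
            sleep(delay)


# Hypothetical usage; df and output_path are placeholders for your own objects:
# write_with_retry(
#     lambda: df.write.mode("append").partitionBy("id").parquet(output_path)
# )
```

Retries only paper over transient failures; if the error comes back regularly, options 1 and 2 address the underlying cause.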

Overall, I'd go with Delta Lake: it's ACID-compliant, handles concurrent writes safely, and is well suited to production workloads on Databricks.
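
As a rough sketch of options 1 and 2 combined, assuming your DataFrame is called `df` and is partitioned by `id` as in the question: the helper names `abfss_path` and `append_as_delta` are hypothetical, and the container, account, and directory values are placeholders you would replace with your own.

```python
def abfss_path(container, storage_account, directory):
    """Build an abfss:// URI for a directory in an ADLS Gen2 account."""
    return (
        f"abfss://{container}@{storage_account}.dfs.core.windows.net/"
        f"{directory.lstrip('/')}"
    )


def append_as_delta(df, path, partition_col="id"):
    """Append a Spark DataFrame to a Delta table at path, partitioned by one column."""
    (
        df.write.format("delta")   # Delta instead of plain Parquet
        .mode("append")            # keep the append semantics from the question
        .partitionBy(partition_col)
        .save(path)
    )


# Hypothetical usage with placeholder names:
# target = abfss_path("<container-name>", "<storage-account-name>", "<directory-name>")
# append_as_delta(df, target)
```

This keeps the same append-by-partition pattern as the original job and only changes the file format and the storage protocol.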