Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Unable to read files using Auto Loader

AanchalSoni
Databricks Partner

Hi!

I'm trying to create an ETL pipeline that reads data from a UC volume, but Databricks is not allowing me to do so. The following error is generated:

AnalysisException: [RequestId=a11e017b-61db-4c30-a03a-d7cce55e5aea ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP] Input path url 's3://dbstorage-prod-6ubki/uc/670643ac-88ac-4f51-8bb0-2311c001fab6/6b491f6f-d67e-44fe-9e04-bad30ec7a8cc/__unitystorage/catalogs/5f4192b5-79f2-415f-bfe8-729b201e40b9/tables/ea03463f-90af-4941-b2a6-47782054b3c9/_dlt_metadata/_autoloader' overlaps with managed storage within 'CheckPathAccess' call. .

Is it not possible to read directly from a volume using Auto Loader? Should the raw files be read from an external location only? Please guide.

4 REPLIES

balajij8
Contributor

You can absolutely use Auto Loader with files in a Volume. The issue in your case is a path conflict: Unity Catalog blocks writes into the managed storage of a table or volume to preserve the data integrity and governance it enforces.

Point Auto Loader at the Unity Catalog Volume path instead. Here is an Auto Loader implementation using the recommended Volume paths, which avoids the conflict.


# Explicit schema (ensure this matches your JSON structure)
schema = "id INT"

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/workspace/default/sys/schema")
    .schema(schema)  # apply the schema defined above
    .load("/Volumes/workspace/dev/input/")  # UC Volume path
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/Volumes/workspace/default/sys/checkpoint")
    .option("mergeSchema", "true")
    .trigger(availableNow=True)
    .toTable("uc.default.json_files"))

lingareddy_Alva
Esteemed Contributor

Hi @AanchalSoni .

This is a well-known Unity Catalog constraint. Let me explain in detail.

The error INVALID_PARAMETER_VALUE.LOCATION_OVERLAP is thrown because Auto Loader's checkpoint/schema location overlaps with UC-managed storage. Specifically:
- UC Volumes are backed by managed S3 paths under Databricks' internal storage (dbstorage-prod-*/uc/.../).
- Auto Loader writes its _dlt_metadata/_autoloader checkpoint directory into that same managed path space.
- UC's CheckPathAccess guard explicitly blocks any process from writing into managed storage paths it doesn't own, including Auto Loader's internal bookkeeping.

This is not a permissions issue you can grant your way out of. It's a hard architectural constraint in Unity Catalog.
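To make the constraint concrete: the guard is effectively a path-prefix check. The sketch below is illustrative only, not Unity Catalog's actual implementation, and the bucket and path names are made up to mirror the shape of the path in the error above.

```python
# Simplified sketch of a LOCATION_OVERLAP-style check.
# Illustrative only -- NOT Unity Catalog's real code.

def overlaps_managed_storage(input_path: str, managed_root: str) -> bool:
    """Return True if input_path is the managed root or falls under it."""
    inp = input_path.rstrip("/")
    root = managed_root.rstrip("/")
    return inp == root or inp.startswith(root + "/")

# Hypothetical managed-table root (same shape as the path in the error)
managed_root = "s3://dbstorage-example/uc/catalog-id/tables/table-id"

# Auto Loader's checkpoint landing under the managed root -> blocked
print(overlaps_managed_storage(
    managed_root + "/_dlt_metadata/_autoloader", managed_root))  # True

# A checkpoint on a registered external location -> allowed
print(overlaps_managed_storage(
    "s3://your-external-bucket/checkpoints/pipeline_x", managed_root))  # False
```

Any path that shares the managed root as a prefix trips the guard, which is why moving only the checkpoint and schema locations out of that tree resolves the error.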

The Fix: Separate the Checkpoint Location
You don't need to move your source files to an external location. You just need to point the checkpoint and schema location somewhere outside UC-managed storage.

Option 1 — External Location (Recommended for Production)

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "parquet")           # or json, csv, etc.
    .option("cloudFiles.schemaLocation", "s3://your-external-bucket/checkpoints/schema/pipeline_x")
    .load("/Volumes/your_catalog/<schema>/<volume>/raw/")  # UC Volume path — fine here
    .writeStream
    .option("checkpointLocation", "s3://your-external-bucket/checkpoints/pipeline_x")
    .table("your_catalog.<schema>.target_table")
)

The external bucket must be registered as a UC External Location with CREATE EXTERNAL LOCATION and appropriate storage credentials.
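For reference, that one-time registration could be sketched as below. Every name here (checkpoints_loc, my_storage_cred, the bucket URL, the data_engineers group) is a placeholder; the generated statements would be run via spark.sql(...) or the SQL editor by a metastore admin or a user with the right privileges.

```python
# Sketch of the one-time setup DDL for a UC external location.
# All identifiers are placeholders -- substitute your own.

def external_location_ddl(name: str, url: str, credential: str) -> list:
    """Build DDL to register an external location and grant file access."""
    return [
        # Register the bucket path against an existing storage credential
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS {name} "
        f"URL '{url}' "
        f"WITH (STORAGE CREDENTIAL {credential})",
        # Writers need both READ FILES and WRITE FILES on the location
        f"GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION {name} "
        f"TO `data_engineers`",
    ]

for stmt in external_location_ddl(
        "checkpoints_loc",
        "s3://your-external-bucket/checkpoints/",
        "my_storage_cred"):
    print(stmt)
```

The storage credential must already exist (it wraps the IAM role that grants Databricks access to the bucket); the external location then maps that credential to the specific S3 path.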

Option 2 — Use DLT (Cleanest for UC)
DLT manages its own checkpoint state completely outside your control path, so you never hit this conflict:

import dlt

@dlt.table
def bronze_raw():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation",
                "/Volumes/your_catalog/<schema>/<volume>/autoloader_schema/")
        .load("/Volumes/your_catalog/<schema>/<volume>/raw/")
    )


Note that with DLT, the schemaLocation can live inside the Volume: it is the checkpoint that conflicts, not necessarily the schema inference directory, though keeping both external is cleaner.

Summary Recommendation
Your source files staying in the UC Volume is perfectly fine and correct. The only change needed is routing your checkpointLocation and schemaLocation to a registered UC External Location on S3. If this pipeline is already in a DLT context (given your medallion setup in your catalog), the DLT option is the cleanest path with zero checkpoint management overhead.

 

 

LR

AanchalSoni
Databricks Partner

Thanks @BalaS @lingareddy_Alva for your quick responses.

I've updated the schema location to:

option("schemaLocation", "/Volumes/workspace/capstone/schema")
 
and checkpoint location to: 
/Volumes/workspace/capstone/checkpoint/1/
 
however, I'm still getting the same error. I'm using Databricks free version to develop a test pipeline.
 

szymon_dybczak
Esteemed Contributor III

Hi,

You need to set schemaLocation in the following way (don't omit the cloudFiles prefix):

.option("cloudFiles.schemaLocation", "<path-to-schema>")
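Spark typically ignores reader options it doesn't recognize, so a bare schemaLocation fails silently rather than raising an error, and Auto Loader falls back to its default metadata path. As a defensive pattern (illustrative only, not part of any Databricks API), you could normalize option names before building the stream:

```python
# Illustrative helper, NOT a Databricks API: ensures known Auto Loader
# option names carry the required "cloudFiles." prefix. A bare
# "schemaLocation" is silently ignored by the reader, which is why
# the error persists even though the path looks correct.

AUTOLOADER_OPTIONS = {"format", "schemaLocation", "inferColumnTypes",
                      "schemaEvolutionMode", "maxFilesPerTrigger"}

def with_cloudfiles_prefix(options: dict) -> dict:
    """Add the cloudFiles. prefix to known Auto Loader option names."""
    fixed = {}
    for key, value in options.items():
        if key in AUTOLOADER_OPTIONS:
            fixed["cloudFiles." + key] = value
        else:
            # Already prefixed, or a sink option like checkpointLocation
            fixed[key] = value
    return fixed

opts = with_cloudfiles_prefix({
    "format": "json",
    "schemaLocation": "/Volumes/workspace/capstone/schema",
    "checkpointLocation": "/Volumes/workspace/capstone/checkpoint/1/",
})
print(opts)
```

Note that checkpointLocation is intentionally left unprefixed: it belongs to the writeStream side, not to Auto Loader, so it takes no cloudFiles prefix.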