Error with AutoLoader pipeline ingesting from external location: LOCATION_OVERLAP

mattstyl-ff
New Contributor

Hello,

I am trying to use pipelines in Databricks to ingest data from an external location into the data lake using AutoLoader, and I am running into the error below. I have noticed other posts with similar errors, but in those posts the error was caused by the destination table already being registered as managed.

In my case, the error appears to be related to the event log table associated with the AutoLoader instead. I tried re-creating the pipeline, but that didn't help. Any idea how to resolve this?

Error: 

AnalysisException: Traceback (most recent call last):
File "/Users/name.surname@domain.se/.bundle/Testproject_2/dev/files/src/notebook", cell 4, line 11
      2 csv_file_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/dummy.csv"
      3 schema_location = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/"
      4 df = (
      5     session.readStream
      6     .format("cloudFiles")
      7     .option("cloudFiles.format", "csv")
      8     .option("header", "true")
      9     .option("inferSchema", "true")
     10     .option("cloudFiles.schemaLocation", schema_location)
---> 11     .load(csv_file_path)
     12 )

AnalysisException: [RequestId=3ef8b745-48dc-4ae1-b2f6-9afaaf442c3b ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP] Input path url 'abfss://unity-catalog-storage@devdomaindatalakesc01.dfs.core.windows.net/dev-data-domain/__unitystorage/catalogs/cf3123b2-b661-48d9-9baa-a0b0214d5a29/tables/3775a194-3db0-48a6-8c0e-cce43c26c9e7/_dlt_metadata/_autoloader' overlaps with managed storage within 'CheckPathAccess' call.

 Relevant code:

from pyspark.sql.functions import *

# Source file and AutoLoader schema-tracking location, both on external storage
csv_file_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/dummy.csv"
schema_location = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/"

# Incremental CSV read with AutoLoader (cloudFiles)
df = (
    session.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(csv_file_path)
)

# Checkpoint for the streaming write, on a UC volume
checkpoint_path = "/Volumes/dev-data-domain/bronze/test/_checkpoint5"

# One-shot streaming write into a Delta table
query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(once=True)
    .toTable("`dev-data-domain`.bronze.delta_table_pipeline3")
)

 

6 REPLIES

Khaja_Zaffer
Contributor

Hello @mattstyl-ff 

As the error class indicates:

ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP

Databricks automatically manages the storage location under the UC catalog's storage root, which leaves two options:

either you don't need to (and shouldn't) set schemaLocation or checkpointLocation at all,

or

you must explicitly set them to an external ADLS path (outside UC-managed storage), like below:

 

schema_location = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/autoloader/schema/testproject"
checkpoint_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/autoloader/checkpoints/testproject"

df = (
    session.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(csv_file_path)
)

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(once=True)
    .toTable("`dev-data-domain`.bronze.delta_table_pipeline3")
)

 

Try updating the code accordingly and cleaning up the existing artifacts (the old schema and checkpoint directories).

 

I hope this will help you. 

mattstyl-ff
New Contributor

I tried removing the paths completely, but I still get the same error.

I also ensured that both the checkpoint and the schema path are on external storage and set them explicitly, but I still get the same error. I have tested reading from the same path without AutoLoader, and that works without any issue.

The following example with the same container name and storage account name works:

df = spark.read.format("csv").option("header", "true").load(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/")

 

Khaja_Zaffer
Contributor

Hello @mattstyl-ff 

Before doing this, try dropping the table and deleting the physical files as well.

Clean up any custom/residual paths; see the sketch after this list. The paths are:
abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/
/Volumes/dev-data-domain/bronze/test/_checkpoint5
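
A minimal cleanup sketch, assuming this runs in a Databricks notebook where spark and dbutils are available, and that discarding the old stream state is acceptable; the table name and paths are taken from the snippets above:

# Drop the target table if it was ever created
spark.sql("DROP TABLE IF EXISTS `dev-data-domain`.bronze.delta_table_pipeline3")

# Recursively delete the residual AutoLoader schema and checkpoint directories
dbutils.fs.rm("abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/", True)
dbutils.fs.rm("/Volumes/dev-data-domain/bronze/test/_checkpoint5", True)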
 
Please also monitor the pipeline event logs.
 
With that done, you don't need to set schemaLocation or checkpointLocation at all, as DLT automatically manages both under its _dlt_metadata directory:

df = (
    session.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(csv_file_path)
)

query = (
    df.writeStream
    .format("delta")
    .outputMode("append")
    .trigger(once=True)
    .toTable("`dev-data-domain`.bronze.delta_table_pipeline3")
)
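
Note that the snippet above is plain Structured Streaming. If the notebook actually runs as a DLT pipeline, which the _dlt_metadata path in the error message suggests, the idiomatic form is a declarative table definition instead of writeStream; a minimal sketch, reusing the source path from the original post, with the table name purely illustrative:

import dlt

csv_file_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/dummy.csv"

# DLT manages the schema location, checkpoint, and table storage itself,
# so no schemaLocation or checkpointLocation is set here.
@dlt.table(name="delta_table_pipeline3")
def delta_table_pipeline3():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load(csv_file_path)
    )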

Khaja_Zaffer
Contributor

I am open to solutions from other contributors on this.

mattstyl-ff
New Contributor

There is no table created yet. I tried deleting the pipeline and creating a new one with new file names; it still fails.

I noticed that the same error happens if I try to read from the event log location, using spark.read().

Example:

path = "abfss://unity-catalog-storage@devdmdatalakesc01.dfs.core.windows.net/dev-data-dm/__unitystorage/catalogs/cf3123b2-b661-48d9-9baa-a0b0214d5a29/tables/3775a194-3db0-48a6-8c0e-cce43c26c9e7/part-00000-00805a51-0fde-44e7-bdea-c6125cec5796-c000.snappy.parquet"
spark.read.format("parquet").load(path).display()

This gives me the exact same LOCATION_OVERLAP error as the one in the original post above.

Khaja_Zaffer
Contributor

If you are available, we can join a call in an hour, @mattstyl-ff.
