Friday
Hello,
I am trying to use pipelines in Databricks to ingest data from an external location into the data lake using Auto Loader, and I am running into the issue below. I have noticed other posts with similar errors, but in those posts the error was related to the destination table already being registered as managed.
In my case, the error appears to be related to the event log table associated with the Auto Loader stream. I tried re-creating the pipeline, but it didn't help. Any idea how to resolve this?
Error:
AnalysisException: Traceback (most recent call last):
File "/Users/name.surname@domain.se/.bundle/Testproject_2/dev/files/src/notebook", cell 4, line 11
2 csv_file_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/dummy.csv"
3 schema_location = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/"
4 df = (
5 session.readStream
6 .format("cloudFiles")
7 .option("cloudFiles.format", "csv")
8 .option("header", "true")
9 .option("inferSchema", "true")
10 .option("cloudFiles.schemaLocation", schema_location)
---> 11 .load(csv_file_path)
12 )
AnalysisException: [RequestId=3ef8b745-48dc-4ae1-b2f6-9afaaf442c3b ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP] Input path url 'abfss://unity-catalog-storage@devdomaindatalakesc01.dfs.core.windows.net/dev-data-domain/__unitystorage/catalogs/cf3123b2-b661-48d9-9baa-a0b0214d5a29/tables/3775a194-3db0-48a6-8c0e-cce43c26c9e7/_dlt_metadata/_autoloader' overlaps with managed storage within 'CheckPathAccess' call. .
Relevant code:
from pyspark.sql.functions import *

csv_file_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/dummy.csv"
schema_location = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/"

df = (
    session.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(csv_file_path)
)

checkpoint_path = "/Volumes/dev-data-domain/bronze/test/_checkpoint5"

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(once=True)
    .toTable("`dev-data-domain`.bronze.delta_table_pipeline3")
)
Sunday
Hello @mattstyl-ff
As you can see from the error:
ErrorClass=INVALID_PARAMETER_VALUE.LOCATION_OVERLAP
Databricks automatically manages the storage location under the UC catalog's storage root. So either you don't need to (and shouldn't) set schemaLocation or checkpointLocation at all, or you must explicitly set them to an external ADLS path (outside the UC-managed storage), like below:
schema_location = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/autoloader/schema/testproject"
checkpoint_path = "abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/autoloader/checkpoints/testproject"

df = (
    session.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(csv_file_path)
)

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(once=True)
    .toTable("`dev-data-domain`.bronze.delta_table_pipeline3")
)
Try updating the code and cleaning up the existing artifacts (old table, schema location, and checkpoint); a rough sketch of that cleanup is below.
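This is only a sketch, assuming it runs in a Databricks notebook where spark and dbutils are available; the table name is the one from your snippet and the schema/checkpoint paths are the ones from the original post, so replace them with whatever you actually used:

# Drop the (possibly half-created) target table registered in Unity Catalog
spark.sql("DROP TABLE IF EXISTS `dev-data-domain`.bronze.delta_table_pipeline3")

# Remove the old Auto Loader schema-tracking directory and the old checkpoint
# so the next run starts from a clean state
dbutils.fs.rm("abfss://storage-dm-int-container@devdomaindmdbxint01.dfs.core.windows.net/_schema8/", True)
dbutils.fs.rm("/Volumes/dev-data-domain/bronze/test/_checkpoint5", True)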
I hope this will help you.
yesterday
I tried removing the paths completely, but I still get the same error.
I also ensured that both the checkpoint and the schema path are on an external storage and set them explicitly, but I still get the same error. I have tested reading from the same path without AutoLoader, and that works without any issue.
The following example with the same container name and storage account name works:
df = spark.read.format("csv").option("header", "true").load(f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/")
yesterday
Hello @mattstyl-ff
Before going further, test by dropping the table and deleting its physical files as well, then run:
df = (
    session.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(csv_file_path)
)

query = (
    df.writeStream
    .format("delta")
    .outputMode("append")
    .trigger(once=True)
    .toTable("`dev-data-domain`.bronze.delta_table_pipeline3")
)
yesterday
I am open to solutions from other contributors on this.
yesterday
There is no table created yet. I tried deleting the pipeline and creating a new one with new file names, but it still fails.
I noticed that the same error happens if I try to read from the event log location using spark.read.
Example:
path = "abfss://unity-catalog-storage@devdmdatalakesc01.dfs.core.windows.net/dev-data-dm/__unitystorage/catalogs/cf3123b2-b661-48d9-9baa-a0b0214d5a29/tables/3775a194-3db0-48a6-8c0e-cce43c26c9e7/part-00000-00805a51-0fde-44e7-bdea-c6125cec5796-c000.snappy.parquet"
spark.read.format("parquet").load(path).display()
This gives me the same exact LOCATION OVERLAP error as the one in the original post above.
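For reference, I assume the direct path read is expected to fail because that location sits under Unity Catalog managed storage, and that the supported way to query the pipeline event log would be the event_log() table-valued function rather than reading the files by path; a rough sketch (the <pipeline-id> placeholder is mine, to be replaced with the actual pipeline ID):

# Query the pipeline event log via the event_log() TVF instead of reading the
# managed-storage files directly; <pipeline-id> is a placeholder for the real ID
spark.sql("SELECT * FROM event_log('<pipeline-id>')").display()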
yesterday
If you are available, we can get on a call in an hour, @mattstyl-ff.