08-10-2022 03:00 PM
I am using Databricks Autoloader to load JSON files incrementally from ADLS Gen2 in directory listing mode. All source filenames have a timestamp in them. The Autoloader works perfectly for a couple of days with the configuration below, then breaks the next day with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7.0 failed 4 times, most recent failure: Lost task 1.3 in stage 7.0 (TID 24) (10.150.38.137 executor 0): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-04-27T20:09:00 (the complete error message is attached)
I deleted the checkpoint and the target Delta table and reloaded from scratch with the option "cloudFiles.includeExistingFiles": "true". All files loaded successfully, and then, after a couple of incremental loads, the same error occurred.
Autoloader configuration:
{"cloudFiles.format":"json","cloudFiles.useNotifications":"false", "cloudFiles.inferColumnTypes":"true", "cloudFiles.schemaEvolutionMode":"addNewColumns", "cloudFiles.includeExistingFiles":"false"}
Paths passed as below:
raw_data_location : dbfs:/mnt/DEV-cdl-raw/data/storage-xxxxx/xxxx/
target_delta_table_location : dbfs:/mnt/DEV-cdl-bronze/data/storage-xxxxx/xxxx/
checkpoint_location : dbfs:/mnt/DEV-cdl-bronze/configuration/autoloader/storage-xxxxx/xxxx/checkpoint/
schema_location : dbfs:/mnt/DEV-cdl-bronze/metadata/storage-xxxxx/xxxx/
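For completeness, here is a minimal sketch of the read side that produces StreamDF from the options and paths above; the dict name autoloader_config is an assumption, the rest mirrors this post:
# Sketch of the read side, assuming the options above are held in a dict.
autoloader_config = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "false",
    "cloudFiles.inferColumnTypes": "true",
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
    "cloudFiles.includeExistingFiles": "false",
}
StreamDF = spark.readStream.format("cloudFiles") \
    .options(**autoloader_config) \
    .option("cloudFiles.schemaLocation", schema_location) \
    .load(raw_data_location)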
StreamingQuery = StreamDF.writeStream \
    .option("checkpointLocation", checkpoint_location) \
    .option("mergeSchema", "true") \
    .queryName(f"AutoLoad_RawtoBronze_{sourceFolderName}_{sourceEntityName}") \
    .trigger(availableNow=True) \
    .partitionBy(targetPartitionByCol) \
    .start(target_delta_table_location)
Can someone help me here?
Thanks in advance.
08-17-2022 05:06 AM
I think you are running into the same problem as I am right now: Autoloader (or something even deeper) doesn't like ":" (colon) in file names.
Loading the same files with a plain spark.read works fine.
08-17-2022 08:44 PM
I don't understand why the failure isn't consistent. It runs fine for a few runs and then stops with this error. It is a strange situation.
08-18-2022 12:37 AM
For me it breaks directly on the first file with a colon in the name.
11-21-2022 05:53 AM
Yes, for us it runs several times before the error occurs.
09-08-2022 04:23 AM
Hi there @PREM KUMAR KUMMAN RAMESH
Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
09-09-2022 06:39 AM
Hi @Vidula Khanna
I can report from my side: I wasn't able to solve the issue with Autoloader. For my daily job, I first run
os.walk(...)
in Python and check whether there are any files with a colon in the name (among other criteria). If everything is fine, I use Autoloader for the incremental load; otherwise I reload the data every time.
Luckily for us, the biggest chunk of data is fine, but I hope this issue gets fixed some day.
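Roughly, the pre-check looks like this; a minimal sketch, assuming the container is visible locally under /dbfs (the root path and function name are placeholders):
import os

raw_data_root = "/dbfs/mnt/DEV-cdl-raw/data/storage-xxxxx/xxxx/"  # placeholder mount path

def find_colon_filenames(root):
    # Walk the tree and collect files whose names contain ':'
    offenders = []
    for dirpath, _dirs, files in os.walk(root):
        offenders.extend(os.path.join(dirpath, f) for f in files if ":" in f)
    return offenders

bad_files = find_colon_filenames(raw_data_root)
if bad_files:
    print(f"{len(bad_files)} file(s) with ':' in the name; fall back to a full reload")
else:
    print("No problematic names found; safe to run Autoloader incrementally")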
11-21-2022 05:56 AM
@Alexey Egorov, can you tell us more precisely what you mean by "I reload the data"?
11-21-2022 06:33 AM
By reloading I mean loading all the existing data in that folder. As mentioned above:
```
# Incremental load with Autoloader
autoloader = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", data_format) \
    .option("header", "true") \
    .option("cloudFiles.schemaLocation", schema_location) \
    .option("cloudFiles.allowOverwrites", "true") \
    .load(path)
```
```
# Full reload with a plain batch read
df = spark.read.format(data_format) \
    .option("header", "true") \
    .load(path)
```
That is why I mentioned that, luckily for us, this data folder is not that huge, so the full reload runs fast.
11-21-2022 07:20 AM
For me, the simple read also throws a "java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI" exception when it encounters files with ':' in the name:
df1= spark.read.format("binaryFile").load("s3://bucket_name/*/*/*/*.bag")
So the problem is not solved.
11-21-2022 07:26 AM
Wait, I think this is a different problem. We are mounting an S3 bucket into DBFS, and my path then looks something like this:
S3_BUCKET_PATH = "dbfs:/mnt/mounted_bucket_name/"
df = spark.read.format(format).load(S3_BUCKET_PATH)
11-21-2022 07:52 AM
Here is the stack trace:
11-21-2022 02:14 PM
For me, it's the same error when reading through a mount point:
from pyspark.sql.types import StructType, StringType, IntegerType, BinaryType

schema = StructType() \
    .add("path", StringType(), False) \
    .add("modificationTime", StringType(), False) \
    .add("length", IntegerType(), False) \
    .add("content", BinaryType(), True)

df = spark.read.format("binaryFile").schema(schema) \
    .load("dbfs:/mnt/bucket_name/[...]/*/*.bag")
IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-11-16T22:01:49+00:00
Sure enough, I have some files in there with a ':' character in the name. Incidentally, the Databricks architect who advised us a while back said that mount points are obsolete and don't play well with the Unity Catalog permission model, so I've tried to refrain from using them.
11-21-2022 06:05 AM
We have previously run Autoloader many times on very similar folder names without failure. Now we get:
StreamingQueryException: Job aborted due to stage failure: Task 1 in stage 1657.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1657.0 (TID 5451) (10.38.20.138 executor 17): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-03-07T20:47:0
DESCRIBE HISTORY, operation column (most recent at top):
MERGE
MERGE
ADD COLUMNS --started having problems after this
RESTORE
RESTORE
RESTORE
RESTORE
MERGE
MERGE
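If it helps to compare notes, the same history can be pulled programmatically; a sketch, with the table path as a placeholder:
# Pull the Delta table history; the path below is a placeholder.
history = spark.sql(
    "DESCRIBE HISTORY delta.`dbfs:/mnt/DEV-cdl-bronze/data/storage-xxxxx/xxxx/`"
)
history.select("version", "timestamp", "operation").show(truncate=False)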
11-21-2022 08:10 AM
Has either of you changed the schema? We started experiencing the problem after we ran ADD COLUMNS.