08-10-2022 03:00 PM
I am using Databricks Autoloader to load JSON files from ADLS Gen2 incrementally in directory listing mode. All source filenames have a timestamp in them. The autoloader works perfectly for a couple of days with the below configuration, then breaks the next day with the following error.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7.0 failed 4 times, most recent failure: Lost task 1.3 in stage 7.0 (TID 24) (10.150.38.137 executor 0): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-04-27T20:09:00 (Attached the complete error message)
I deleted the checkpoint and the target Delta table, then reloaded from scratch with the option "cloudFiles.includeExistingFiles": "true". All files loaded successfully, but after a couple of incremental loads the same error occurred.
Autoloader configurations
{"cloudFiles.format":"json","cloudFiles.useNotifications":"false", "cloudFiles.inferColumnTypes":"true", "cloudFiles.schemaEvolutionMode":"addNewColumns", "cloudFiles.includeExistingFiles":"false"}
Path locations are passed as below:
raw_data_location : dbfs:/mnt/DEV-cdl-raw/data/storage-xxxxx/xxxx/
target_delta_table_location : dbfs:/mnt/DEV-cdl-bronze/data/storage-xxxxx/xxxx/
checkpoint_location : dbfs:/mnt/DEV-cdl-bronze/configuration/autoloader/storage-xxxxx/xxxx/checkpoint/
schema_location : dbfs:/mnt/DEV-cdl-bronze/metadata/storage-xxxxx/xxxx/
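For context, here is a minimal sketch of how StreamDF is presumably constructed from the options and paths above; the read side isn't shown in the original post, so treat the exact calls as an assumption:

StreamDF = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "false")
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
    .option("cloudFiles.includeExistingFiles", "false")
    # Autoloader persists the inferred/evolved schema here
    .option("cloudFiles.schemaLocation", schema_location)
    .load(raw_data_location)
)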
StreamingQuery = (
    StreamDF.writeStream
    # Checkpoint tracks which files have already been ingested across runs
    .option("checkpointLocation", checkpoint_location)
    # Merge columns added by schema evolution into the target Delta table
    .option("mergeSchema", "true")
    .queryName(f"AutoLoad_RawtoBronze_{sourceFolderName}_{sourceEntityName}")
    # Process all files available now, then stop (incremental batch run)
    .trigger(availableNow=True)
    .partitionBy(targetPartitionByCol)
    .start(target_delta_table_location)
)
Can someone help me here?
Thanks in advance.
11-21-2022 10:03 AM
I did not change the schema. The schema is fixed in my case.
11-21-2022 09:04 AM
We did not get an error until we ran ADD COLUMNS. Has anyone else done similar DDL changes? BTW, I get the same error in Azure.
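For reference, the kind of DDL change being described looks something like this (the table and column names here are hypothetical; the actual names weren't shared in the thread):

# Hypothetical example of the schema change that preceded the error
spark.sql("ALTER TABLE bronze.events ADD COLUMNS (new_attribute STRING)")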
01-03-2023 07:17 AM
Hello everyone. We filed a support ticket with Databricks. Below is the response I received, along with an interim solution to the problem. I hope it is useful to those who read it.
Problem Statement:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7.0 failed 4 times, most recent failure: Lost task 1.3 in stage 7.0 (TID 24) (10.150.38.137 executor 0): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-04-27T20:09:00 (Attached the complete error message)
Root Cause Analysis:
We have an incremental listing mode that speeds up listing by not re-scanning prefixes we have already seen, but incremental listing does not handle file names containing certain special characters, for example the colon (:) found in ISO-8601 timestamps such as 2022-04-27T20:09:00.
Note that if you upload a file with such a special character into DBFS through the UI, the file gets renamed and the character is automatically replaced by _.
A fix is currently on our roadmap, but we don't have an exact ETA.
Solution:
Use the below config to mitigate the issue: set cloudFiles.useIncrementalListing to "false".
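To show where that option lands, here is a minimal sketch of the read side with the mitigation applied; variable names reuse the original post, and this is an illustration rather than the verbatim fix from support:

StreamDF = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Mitigation: disable incremental listing so Autoloader falls back to
    # full directory listing, which tolerates colons in file names
    .option("cloudFiles.useIncrementalListing", "false")
    .option("cloudFiles.schemaLocation", schema_location)
    .load(raw_data_location)
)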
01-16-2023 05:01 AM
Thanks for raising this issue and sharing this suggestion. I will give it a try and report back whether it worked for me. Is it possible to forward this thread to the Databricks team so they can comment when this bug is fixed?
01-04-2023 06:56 AM
Hi Everyone,
I'm seeing this issue as well, with the same configuration as the previous posts: autoloader with incremental file listing turned on. The strange part is that it mostly works, despite almost all of the files we're loading having colons as part of the timestamp.
It seems to be happening more frequently now, which is becoming an issue. Having to clear a checkpoint rarely is very different from needing to clear checkpoints every day, and I'm not comfortable clearing the checkpoint programmatically. Luckily, in our case I have some control over how the files get named, so removing colons from the timestamps is a possibility (see the sketch at the end of this post).
A couple of bullet points for anyone else struggling through this:
- I haven't tried setting cloudFiles.useIncrementalListing to false; it feels like a less-than-ideal fix for my purposes.
- I'll be following updates in this thread closely. Thanks to everyone who has already shared info.
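If you can also control file naming, here is a minimal sketch of the renaming idea, assuming the producer stages files in a local directory before uploading them to ADLS (the "staging" directory and file pattern are hypothetical):

import os

staging_dir = "staging"  # hypothetical local directory at the producer

for name in os.listdir(staging_dir):
    if ":" in name:
        # e.g. "events_2022-04-27T20:09:00.json" -> "events_2022-04-27T20-09-00.json"
        os.rename(os.path.join(staging_dir, name),
                  os.path.join(staging_dir, name.replace(":", "-")))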
01-16-2023 04:59 AM
Thanks for sharing your experience. Will the useIncrementalListing setting somehow change your processing strategy? I think it is only a performance optimization, isn't it? I'm really not quite sure, but this seems more like a bug that will be fixed one day. It would be great if someone from Databricks would comment here when it is fixed.
01-17-2023 02:01 PM
It wouldn't necessarily change how I'm processing these files, but if I understood the documentation correctly, it may increase costs over time because of how the API requests to the storage layer have to be batched to check for new files.
I'm not sure how long it would take before those increased costs became noticeable (maybe never at our volume).