08-10-2022 03:00 PM
I am using Databricks Autoloader to load JSON files incrementally from ADLS Gen2 in directory listing mode. All source filenames have a timestamp in them. The Autoloader works perfectly for a couple of days with the configuration below, then breaks the next day with the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7.0 failed 4 times, most recent failure: Lost task 1.3 in stage 7.0 (TID 24) (10.150.38.137 executor 0): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-04-27T20:09:00 (the complete error message is attached)
I deleted the checkpoint and the target Delta table and reloaded from scratch with the option "cloudFiles.includeExistingFiles": "true". All files loaded successfully, and then, after a couple of incremental loads, the same error occurred.
Autoloader configuration:
{"cloudFiles.format":"json","cloudFiles.useNotifications":"false", "cloudFiles.inferColumnTypes":"true", "cloudFiles.schemaEvolutionMode":"addNewColumns", "cloudFiles.includeExistingFiles":"false"}
Paths passed as below:
raw_data_location : dbfs:/mnt/DEV-cdl-raw/data/storage-xxxxx/xxxx/
target_delta_table_location : dbfs:/mnt/DEV-cdl-bronze/data/storage-xxxxx/xxxx/
checkpoint_location : dbfs:/mnt/DEV-cdl-bronze/configuration/autoloader/storage-xxxxx/xxxx/checkpoint/
schema_location : dbfs:/mnt/DEV-cdl-bronze/metadata/storage-xxxxx/xxxx/
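For completeness, here is a minimal sketch of the read side that produces StreamDF from the options and paths above; the dict name autoloader_config is an assumption, the rest mirrors this post:
# Sketch of the read side, assuming the options above are held in a dict.
autoloader_config = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "false",
    "cloudFiles.inferColumnTypes": "true",
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
    "cloudFiles.includeExistingFiles": "false",
}
StreamDF = spark.readStream.format("cloudFiles") \
    .options(**autoloader_config) \
    .option("cloudFiles.schemaLocation", schema_location) \
    .load(raw_data_location)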
StreamingQuery = StreamDF.writeStream \
    .option("checkpointLocation", checkpoint_location) \
    .option("mergeSchema", "true") \
    .queryName(f"AutoLoad_RawtoBronze_{sourceFolderName}_{sourceEntityName}") \
    .trigger(availableNow=True) \
    .partitionBy(targetPartitionByCol) \
    .start(target_delta_table_location)
Can someone help me here?
Thanks in advance.
08-17-2022 05:06 AM
I think you are running into the same problem as I am right now: Autoloader (or something even deeper) doesn't like ":" (colon) in file names.
Loading the same files with a plain spark.read works fine.
08-17-2022 08:44 PM
I don't understand why the failure isn't consistent. It runs fine for a few runs and then stops with this error. It is a strange situation.
08-18-2022 12:37 AM
For me it breaks directly on the first file with a colon in the name.
11-21-2022 05:53 AM
Yes, for us it runs several times before the error occurs.
09-08-2022 04:23 AM
Hi there @PREM KUMAR KUMMAN RAMESH
Hope all is well! Just wanted to check in: were you able to resolve your issue? If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!
09-09-2022 06:39 AM
Hi @Vidula Khanna
I can report from my side: I wasn't able to solve the issue with Autoloader. For my daily job, I first run
os.walk(...)
in Python and check whether there are any files with a colon in the name (among other criteria). If everything is fine, I use Autoloader for the incremental load; otherwise I reload the data every time.
Luckily for us, the biggest chunk of data is fine, but I hope this issue gets fixed some day.
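Roughly, the pre-check looks like this; a minimal sketch, assuming the container is visible locally under /dbfs (the root path and function name are placeholders):
import os

raw_data_root = "/dbfs/mnt/DEV-cdl-raw/data/storage-xxxxx/xxxx/"  # placeholder mount path

def find_colon_filenames(root):
    # Walk the tree and collect files whose names contain ':'
    offenders = []
    for dirpath, _dirs, files in os.walk(root):
        offenders.extend(os.path.join(dirpath, f) for f in files if ":" in f)
    return offenders

bad_files = find_colon_filenames(raw_data_root)
if bad_files:
    print(f"{len(bad_files)} file(s) with ':' in the name; fall back to a full reload")
else:
    print("No problematic names found; safe to run Autoloader incrementally")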
11-21-2022 05:56 AM
@Alexey Egorov, can you tell us more precisely what you mean by "I reload the data"?
11-21-2022 06:33 AM
By reloading I mean loading all the existing data in that folder. As mentioned above:
```
# Incremental load with Autoloader
autoloader = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", data_format) \
    .option("header", "true") \
    .option("cloudFiles.schemaLocation", schema_location) \
    .option("cloudFiles.allowOverwrites", "true") \
    .load(path)
```
```
# Full reload with a plain batch read
df = spark.read.format(data_format) \
    .option("header", "true") \
    .load(path)
```
That is why I mentioned that, luckily for us, this data folder is not that huge, so the full reload runs fast.
11-21-2022 07:20 AM
For me, the simple read also throws a "java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI" exception when it encounters files with ':' in the name:
df1= spark.read.format("binaryFile").load("s3://bucket_name/*/*/*/*.bag")
So the problem is not solved.
11-21-2022 07:26 AM
Wait, I think this is a different problem. We are mounting an S3 bucket into DBFS, and my path then looks something like this:
S3_BUCKET_PATH = "dbfs:/mnt/mounted_bucket_name/"
df = spark.read.format(format).load(S3_BUCKET_PATH)
11-21-2022 07:52 AM
Here is the stack trace:
11-21-2022 02:14 PM
For me, it's the same error when reading through a mount point:
from pyspark.sql.types import StructType, StringType, IntegerType, BinaryType

schema = StructType() \
    .add("path", StringType(), False) \
    .add("modificationTime", StringType(), False) \
    .add("length", IntegerType(), False) \
    .add("content", BinaryType(), True)

df = spark.read.format("binaryFile").schema(schema) \
    .load("dbfs:/mnt/bucket_name/[...]/*/*.bag")
IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-11-16T22:01:49+00:00
Sure enough, I have some files in there with a ':' character in the name. Incidentally, the Databricks architect who advised us a while back said that mount points are obsolete and don't play well with the Unity Catalog permission model, so I've tried to refrain from using them.
11-21-2022 06:05 AM
We have previously run Autoloader many times on very similar folder names without failure. Now we get:
StreamingQueryException: Job aborted due to stage failure: Task 1 in stage 1657.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1657.0 (TID 5451) (10.38.20.138 executor 17): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-03-07T20:47:0
DESCRIBE HISTORY, operation column (most recent at top):
MERGE
MERGE
ADD COLUMNS --started having problems after this
RESTORE
RESTORE
RESTORE
RESTORE
MERGE
MERGE
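If it helps to compare notes, the same history can be pulled programmatically; a sketch, with the table path as a placeholder:
# Pull the Delta table history; the path below is a placeholder.
history = spark.sql(
    "DESCRIBE HISTORY delta.`dbfs:/mnt/DEV-cdl-bronze/data/storage-xxxxx/xxxx/`"
)
history.select("version", "timestamp", "operation").show(truncate=False)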
11-21-2022 08:10 AM
Has either of you changed the schema? We started experiencing the problem after we ran ADD COLUMNS.