
java.lang.IllegalArgumentException: java.net.URISyntaxException

Prem1
New Contributor III

I am using Databricks Autoloader to load JSON files from ADLS Gen2 incrementally in directory listing mode. Every source filename contains a timestamp. The Autoloader works perfectly for a couple of days with the configuration below, then breaks the next day with the following error.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 7.0 failed 4 times, most recent failure: Lost task 1.3 in stage 7.0 (TID 24) (10.150.38.137 executor 0): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-04-27T20:09:00 (Attached the complete error message)

I deleted the checkpoint and the target Delta table and reloaded from scratch with the option "cloudFiles.includeExistingFiles": "true". All files loaded successfully, but after a couple of incremental loads the same error occurred again.

Autoloader configurations

{"cloudFiles.format":"json","cloudFiles.useNotifications":"false", "cloudFiles.inferColumnTypes":"true", "cloudFiles.schemaEvolutionMode":"addNewColumns", "cloudFiles.includeExistingFiles":"false"}

Path location passed as below

raw_data_location : dbfs:/mnt/DEV-cdl-raw/data/storage-xxxxx/xxxx/

target_delta_table_location : dbfs:/mnt/DEV-cdl-bronze/data/storage-xxxxx/xxxx/

checkpoint_location : dbfs:/mnt/DEV-cdl-bronze/configuration/autoloader/storage-xxxxx/xxxx/checkpoint/

schema_location : dbfs:/mnt/DEV-cdl-bronze/metadata/storage-xxxxx/xxxx/
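For context, the read side that produces StreamDF is not shown above; roughly, it would look like this (a sketch reconstructed from the configuration and locations listed, not the exact original code):

autoloader_config = {
    "cloudFiles.format": "json",
    "cloudFiles.useNotifications": "false",
    "cloudFiles.inferColumnTypes": "true",
    "cloudFiles.schemaEvolutionMode": "addNewColumns",
    "cloudFiles.includeExistingFiles": "false",
}

# Sketch of the read side (assumed, not from the original post):
StreamDF = (
    spark.readStream.format("cloudFiles")
    .options(**autoloader_config)
    .option("cloudFiles.schemaLocation", schema_location)
    .load(raw_data_location)
)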

StreamingQuery = StreamDF.writeStream \
    .option("checkpointLocation", checkpoint_location) \
    .option("mergeSchema", "true") \
    .queryName(f"AutoLoad_RawtoBronze_{sourceFolderName}_{sourceEntityName}") \
    .trigger(availableNow=True) \
    .partitionBy(targetPartitionByCol) \
    .start(target_delta_table_location)

Can someone help me here?

Thanks in advance.


Alexey
Contributor

I think you are running into the same problem I'm hitting right now: Autoloader (or something even deeper) doesn't like ":" (colon) in the file names. 😕

Loading the files with a plain Spark read works fine.
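As far as I understand, Hadoop's Path class parses path strings as URIs, so a file name whose first segment contains a colon looks like a malformed scheme-qualified URI. A quick way to reproduce this outside Autoloader (spark._jvm is an internal PySpark handle, used here purely as a diagnostic sketch):

# Everything before the first ':' in the leading path segment is taken as
# a URI scheme, so a bare timestamped file name cannot be parsed as a
# relative path. This reproduces the exception from the stack trace above.
try:
    spark._jvm.org.apache.hadoop.fs.Path("2022-04-27T20:09:00.json")
except Exception as e:
    print(e)  # ... Relative path in absolute URI: 2022-04-27T20:09:00.json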

Prem1
New Contributor III

What I don't understand is why the failure isn't consistent. It runs fine for a few runs and then stops with this error. It's a strange situation.

For me, it breaks immediately on the first file with a colon in the name.

B_Seibert
New Contributor III

Yes, for us it runs several times before the error.

Vidula
Honored Contributor

Hi there @PREM KUMAR KUMMAN RAMESH

Hope all is well! Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. 

We'd love to hear from you.

Thanks!

Hi @Vidula Khanna

I can report from my side: I wasn't able to solve the issue with the AutoLoader. For my daily job, I first run

os.walk(...)

in Python and check whether there are any files with a colon in the name (and some other criteria). If everything is fine, I use AutoLoader for the incremental load; otherwise I reload all the data every time (a sketch of this pre-check follows below).

Luckily for us, the biggest chunk of data is fine, but I hope this issue will be fixed some day.
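A minimal sketch of that pre-check, assuming the storage is mounted under /dbfs so os.walk can see it (the mount path below is illustrative, not from my actual job):

import os

# Collect files whose names contain ':' anywhere under the mount
# (os.walk needs the /dbfs FUSE path, not a dbfs:/ URI).
bad_files = [
    os.path.join(dirpath, name)
    for dirpath, _dirs, names in os.walk("/dbfs/mnt/mounted_bucket_name/")
    for name in names
    if ":" in name
]

if bad_files:
    # Fall back to a full batch reload (see the spark.read example below).
    print(f"{len(bad_files)} file(s) with ':' in the name; skipping AutoLoader")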

B_Seibert
New Contributor III

@Alexey Egorov, can you tell us more precisely what you mean by "I reload the data"?

By reloading I mean loading all the existing data in that folder. As mentioned above:

  • if there are no special characters that make AutoLoader fail, we can do:

autoloader = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", data_format) \
    .option("header", "true") \
    .option("cloudFiles.schemaLocation", schema_location) \
    .option("cloudFiles.allowOverwrites", "true") \
    .load(path)

  • in the second case, where AutoLoader will fail (at least we know from experience that it does with a colon in the file names), we use a simple batch read:

df = spark.read.format(data_format) \
    .option("header", "true") \
    .load(path)

That is why I mentioned that, luckily for us, this data folder is not that huge and the full load is fast.

For me, the simple read also throws a "java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI" exception when it encounters files with ':' in the name:

df1 = spark.read.format("binaryFile").load("s3://bucket_name/*/*/*/*.bag")

So the problem is not solved.

Wait, I think this is a different problem. We are mounting an S3 bucket into DBFS, and my path then looks something like this:

S3_BUCKET_PATH = "dbfs:/mnt/mounted_bucket_name/"
df = spark.read.format(format).load(S3_BUCKET_PATH)

Here is the stacktrace:

For me, it's the same error when reading through a mount point:

schema = StructType() \
    .add("path", StringType(), False) \
    .add("modificationTime", StringType(), False) \
    .add("length", IntegerType(), False) \
    .add("content", BinaryType(), True)

df = spark.read.format("binaryFile").schema(schema) \
    .load("dbfs:/mnt/bucket_name/[...]/*/*.bag")

IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-11-16T22:01:49+00:00

Sure enough, I have some files in there with a ':' character in the name. Incidentally, the Databricks architect who advised us a while back said that mount points are obsolete and don't play well with the Unity Catalog permission scheme, so I've tried to refrain from using mount points.
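For reference, a direct-URI read instead of a mount would look like this (bucket and path illustrative); note it doesn't avoid the colon issue itself, since the same URI parsing applies to the file names:

# Read straight from the cloud URI, skipping the DBFS mount
# (illustrative bucket/path, not from the original post).
df = spark.read.format("binaryFile").load("s3://bucket_name/path/to/files/")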

B_Seibert
New Contributor III

Previously we ran AutoLoader many times on very similar folder names without a failure. Now we get:

StreamingQueryException: Job aborted due to stage failure: Task 1 in stage 1657.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1657.0 (TID 5451) (10.38.20.138 executor 17): java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2022-03-07T20:47:0

  • "2022-03-07T20.47.04.000Z" = fails now in November
  • "2022-03-07T20.47.04.000Z" = succeeded from July through October

DESCRIBE HISTORY shows the following operations (most recent at top):

MERGE
MERGE
ADD COLUMNS  -- started having problems after this
RESTORE
RESTORE
RESTORE
RESTORE
MERGE
MERGE

B_Seibert
New Contributor III

Have either of you changed the schema? We experienced the problem after we ran ADD COLUMNS.
