
Broken S3 file paths in File Notifications for Auto Loader

aonurdemir
Contributor

Suddenly, at "2025-10-23T14:12:48.409+00:00", file paths coming from the file notification queue started to arrive URL-encoded, so our pipeline gets a file-not-found exception. I think something changed suddenly and broke the notification system. Here are the details:

Broken file path coming from the notification queue:

my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz

The path discovered by directory listing:

my-sink/prod/app_daily/year=2025/month=10/day=23/app_daily+5+35499048368.json.gz

I found these by investigating the output of this query:

select * from cloud_files_state(TABLE(my_catalog.my_schema.app_daily_stream_v2))
order by create_time asc;
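To isolate just the affected entries, the same state function can be filtered on the encoded characters. A hypothetical refinement (the path, create_time, and commit_time column names are assumptions based on the output we inspected):

-- list only entries whose paths contain the URL-encoded '=' (%3D)
select path, create_time, commit_time
from cloud_files_state(TABLE(my_catalog.my_schema.app_daily_stream_v2))
where instr(path, '%3D') > 0
order by create_time asc;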


 
Since the (=) character is encoded as (%3D), our Declarative Pipeline fires this error:
 

org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 76ce493e-ed1e-48ce-bbda-1a9bb85cc9f7, runId = bee6661b-4319-469c-8a72-040dc517e9ff] terminated with exception: Exception thrown in awaitResult: [FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file s3://my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz. [CLOUD_FILE_SOURCE_FILE_NOT_FOUND] A file notification was received for file:s3://my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz. but it does not exist anymore. Please ensure that files are not deleted before they are processed. To continue your stream, you can set the Spark SQL configuration spark.sql.files.ignoreMissingFiles to true.


 
I checked the S3 bucket and saw that the file is there. Since Auto Loader tries to read from the encoded path, it fires this error.
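For reference, the decoding can be checked directly in SQL; a quick sketch using url_decode (available in recent Databricks runtimes):

-- shows that %3D decodes to '=' and %2B decodes to '+'
select url_decode('year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz');
-- year=2025/month=10/day=23/app_daily+5+35499048368.json.gz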
 
My problem is this: currently, I cannot run the pipeline in either file notification mode or directory listing mode without the ignoreMissingFiles=true option, since the Auto Loader state is dirty with these uncommitted wrong file paths. I don't want to use ignoreMissingFiles since it will skip all that data. I also do not want to run a full refresh since the source is too big. I need to clear those broken URLs from Auto Loader's state.

1 ACCEPTED SOLUTION

K_Anudeep
Databricks Employee

Hello @aonurdemir,

Could you please re-run your pipeline now and check? This issue should now be mitigated; it was caused by a recent internal bug that led to unexpected handling of file paths with special characters.

You can set ignoreMissingFiles to true to get past this error, and remove the flag once the stream has moved past the affected entries.
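If you are working in SQL, a minimal sketch of the temporary workaround; this is the configuration named in the error message, and it should be removed once the broken entries are past:

-- temporary: skip the URL-encoded entries stuck in the Auto Loader state
set spark.sql.files.ignoreMissingFiles = true;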

Anudeep


3 REPLIES


RevanthV
New Contributor III

Hey @K_Anudeep,

Thanks for letting us know that this was a bug and has been mitigated. I tested it again today, since I was getting the same error last week, and it no longer occurs.

aonurdemir
Contributor

Hello @K_Anudeep,

As I mentioned, we realized that the special-character encoding and decoding was broken, so we worked around the issue by changing the path string in Auto Loader, as follows:

old path: s3://my-sink/prod/app_daily/*/*/*/*.json.gz

new path: s3://my-sink/prod/app_daily/year=*/month=*/day=*/*.json.gz
 

After changing the path string, a single run with the option ignoreMissingFiles=true cleared the state. We then removed the option, and the pipeline has continued to run successfully.
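For context, a rough sketch of what the updated source definition looks like in SQL with the new glob; this is simplified, with read_files standing in for our exact pipeline code, and the table name matches the query above:

create or refresh streaming table my_catalog.my_schema.app_daily_stream_v2
as select * from stream read_files(
  's3://my-sink/prod/app_daily/year=*/month=*/day=*/*.json.gz',
  format => 'json'
);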

Regardless, thanks for your clear answer.

PS: I don't know whether our path string followed best practice, but it was working anyway. If you can suggest better, more performant formats, I'd appreciate any help. Thanks.