Suddenly, at "2025-10-23T14:12:48.409+00:00", the file paths coming from the file notification queue started to arrive URL-encoded. As a result, our pipeline fails with a file-not-found exception. I think something changed abruptly and broke the notification system. Here are the details:
Broken file path that started to come from the notification queue:
my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz
The path discovered by directory listing:
my-sink/prod/app_daily/year=2025/month=10/day=23/app_daily+5+35499048368.json.gz
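As far as I can tell, the two paths differ only by percent-encoding. Here is a minimal Python sketch (standard library only, run outside the pipeline; the variable names are mine and purely illustrative) that decodes the queued path and compares it with the listed one:

```python
from urllib.parse import unquote

# Path as delivered by the file notification queue (percent-encoded).
queued_path = (
    "my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/"
    "app_daily%2B5%2B35499048368.json.gz"
)

# Path as discovered by directory listing (matches the object in the bucket).
listed_path = (
    "my-sink/prod/app_daily/year=2025/month=10/day=23/"
    "app_daily+5+35499048368.json.gz"
)

# unquote() is used instead of unquote_plus() so that a literal "+"
# in a file name is never turned into a space.
decoded_path = unquote(queued_path)

print(decoded_path == listed_path)
```

This only illustrates the mismatch (%3D for = and %2B for +); it does not touch the Auto Loader state.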
I found these by inspecting the output of this query:
select * from cloud_files_state(TABLE(my_catalog.my_schema.app_daily_stream_v2))
order by create_time asc;
Because the (=) character is encoded as (%3D), and (+) as (%2B), our Declarative Pipeline fails with this error:
org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 76ce493e-ed1e-48ce-bbda-1a9bb85cc9f7, runId = bee6661b-4319-469c-8a72-040dc517e9ff] terminated with exception: Exception thrown in awaitResult: [FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file s3://my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz. [CLOUD_FILE_SOURCE_FILE_NOT_FOUND] A file notification was received for file:s3://my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz. but it does not exist anymore. Please ensure that files are not deleted before they are processed. To continue your stream, you can set the Spark SQL configuration spark.sql.files.ignoreMissingFiles to true.
I checked the S3 bucket and confirmed that the file is there. Because Auto Loader tries to read the encoded path, it raises this error.
My problem is this: currently, I cannot run the pipeline in either file notification mode or directory listing mode without the skipMissingFiles=true option, because the Auto Loader state is dirty with these uncommitted, incorrect file paths. I don't want to use skipMissingFiles, since it will skip all the data. I also don't want to run a full refresh, since the source is too big. I need to clear those broken URLs from Auto Loader's state.