3 weeks ago - last edited 3 weeks ago
Suddenly at "2025-10-23T14:12:48.409+00:00", coming file paths from file notification queue started to be urlencoded. Hence, our pipeline gets file not found exception. I think something has changed suddenly and broke notification system. Here are the details:
Broken file path that started to come from the notification queue:
my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz
The path discovered by directory listing:
my-sink/prod/app_daily/year=2025/month=10/day=23/app_daily+5+35499048368.json.gz
I found these by investigating the output of this query:
select * from cloud_files_state(TABLE(my_catalog.my_schema.app_daily_stream_v2))
order by create_time asc;
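A variant of the same check that narrows the output to just the suspect entries, as a sketch assuming PySpark on Databricks (where spark is the ambient session):

# List only the state entries whose path still contains the encoded '=' (%3D).
# instr() does a plain substring search, so the '%' here is not a wildcard.
suspect = spark.sql("""
    select path, create_time
    from cloud_files_state(TABLE(my_catalog.my_schema.app_daily_stream_v2))
    where instr(path, '%3D') > 0
    order by create_time asc
""")
suspect.show(truncate=False)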
Since the (=) character is encoded as (%3D), our Declarative Pipeline fires this error:
org.apache.spark.sql.streaming.StreamingQueryException: [STREAM_FAILED] Query [id = 76ce493e-ed1e-48ce-bbda-1a9bb85cc9f7, runId = bee6661b-4319-469c-8a72-040dc517e9ff] terminated with exception: Exception thrown in awaitResult: [FAILED_READ_FILE.DBR_FILE_NOT_EXIST] Error while reading file s3://my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz. [CLOUD_FILE_SOURCE_FILE_NOT_FOUND] A file notification was received for file:s3://my-sink/prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz. but it does not exist anymore. Please ensure that files are not deleted before they are processed. To continue your stream, you can set the Spark SQL configuration spark.sql.files.ignoreMissingFiles to true.
I checked the S3 bucket and saw that the file is there. Since Auto Loader tries to read the encoded path, it fires this error.
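For anyone who wants to confirm the same mismatch outside the pipeline, here is a minimal sketch: it decodes the notification path with Python's urllib and checks both keys against S3 via boto3. The bucket and key come from the example above; configured AWS credentials are assumed.

from urllib.parse import unquote

import boto3
from botocore.exceptions import ClientError

# Key as delivered by the notification queue (URL-encoded).
encoded_key = "prod/app_daily/year%3D2025/month%3D10/day%3D23/app_daily%2B5%2B35499048368.json.gz"
# Decoding restores the real S3 key: %3D -> '=' and %2B -> '+'.
decoded_key = unquote(encoded_key)

s3 = boto3.client("s3")
for label, key in [("encoded", encoded_key), ("decoded", decoded_key)]:
    try:
        s3.head_object(Bucket="my-sink", Key=key)
        print(f"{label}: exists -> {key}")
    except ClientError:
        print(f"{label}: not found -> {key}")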
My problem is this: currently, I cannot run the pipeline in either file notification mode or directory listing mode without the ignoreMissingFiles=true option, since the Auto Loader state is dirty with these uncommitted wrong file paths. I don't want to use ignoreMissingFiles because it will skip all that data. I also do not want to run a full refresh, since the source is too big. I need to clear those broken URLs from Auto Loader's state.
3 weeks ago
Hello @aonurdemir,
Could you please re-run your pipeline and check? This issue should be mitigated now; it was due to a recent internal bug that led to unexpected handling of file paths with special characters.
You should set ignoreMissingFiles to true to get past this error, and you can remove the flag once the stream has moved past the stale entries.
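For reference, a minimal sketch of applying that flag from a Python notebook; in a Declarative (DLT) pipeline, the equivalent is adding the same key to the pipeline's configuration settings. spark is assumed to be the ambient Databricks session.

# Temporarily tolerate the stale, URL-encoded entries in the Auto Loader state.
# Remove this once the stream has moved past them, so genuinely missing
# files are not silently skipped later.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")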
2 weeks ago
Hey @K_Anudeep,
Thanks for letting us know that this is a bug and that it has been mitigated. I tested again today, since I was getting the same error last week, and it no longer occurs.
2 weeks ago
Hello @K_Anudeep,
As I mentioned, we realized that the special-character encoding and decoding was broken, so we solved the issue as follows, by changing the path string in Auto Loader:
old path: s3://my-sink/prod/app_daily/*/*/*/*.json.gz
After changing the path string, a run with the option ignoreMissingFiles=true cleared the state. Afterwards, we removed the option and the pipeline continued to run successfully.
Regardless, thanks for your clear answer.
PS: I do not know whether our path string was the best with respect to conventions, but it was working anyway. If you can suggest better and more performant path formats, I would appreciate any help. Thanks.
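For future readers wondering about the path-format question: one commonly used alternative is to spell the Hive-style partition keys out in the glob. This is a hedged sketch, not a confirmed recommendation from this thread; whether it outperforms /*/*/*/ depends on your layout and listing mode.

# Hypothetical alternative pattern with explicit partition keys, which also
# makes the partition structure visible in the directory names.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .load("s3://my-sink/prod/app_daily/year=*/month=*/day=*/*.json.gz")
)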