07-16-2025 08:12 AM - edited 07-16-2025 08:13 AM
I'm encountering an issue with Autoloader where it fails to process certain files due to specific characters in their names. For example, files that begin with an underscore (e.g., _data_etc.json) are ignored and not processed. After some investigation, I found that Spark ignores files starting with a leading _ or . by default. However, I need to include these files in my processing pipeline. Is there a way to configure Autoloader to include such files?
Additionally, I'm facing another issue with certain file paths, such as s3://abc/https://some_folder/xyz. Autoloader throws a "file not found" error in this case. Is there a way to either process such paths, or configure Autoloader to completely ignore folders with malformed or nested paths like these?
07-17-2025 02:11 AM
Hello @databricks_use2 ,
I don't think there is an easy way to do this. The hiddenFileFilter property is always active, and this behavior is not specific to Autoloader. Bypassing it could also break very basic functionality, like reading Delta tables (since you would start reading hidden files). I suggest you run a rename job first and then read the files.
Hope that helps,
Best, Ilir
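A minimal sketch of the rename approach, assuming that simply stripping the leading underscores/dots yields an acceptable new filename (the helper name and the dbutils call in the comment are hypothetical, not from Autoloader itself):

```python
import posixpath

def visible_name(key: str) -> str:
    """Return a key whose filename no longer starts with '_' or '.',
    so Spark/Autoloader will not treat it as a hidden file."""
    folder, name = posixpath.split(key)
    cleaned = name.lstrip("_.")
    # Guard against names that consisted only of underscores/dots.
    if not cleaned:
        cleaned = "renamed" + name
    return posixpath.join(folder, cleaned) if folder else cleaned

# A rename job would list the source location, copy each hidden file
# to its visible name, and then point Autoloader at the renamed copies,
# e.g. in a Databricks notebook (hypothetical paths):
#   dbutils.fs.cp(f"s3://bucket/{key}", f"s3://bucket/{visible_name(key)}")
```

Running the rename as a separate job before ingestion keeps the original files untouched while giving Autoloader names it will pick up.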
07-17-2025 02:56 AM
I agree with @ilir_nuredini. It's better to change the source file naming convention than to try to bypass the hidden file filter, especially when working with Delta Lake, since internal metadata and transaction logs are also stored in hidden files and folders.
07-17-2025 03:06 AM
I am just offering a suggestion.
By default, Spark and Autoloader skip hidden files (those starting with _ or .). To include these in the Autoloader pipeline, use the following option: option("cloudFiles.includeHiddenFiles", "true")
07-17-2025 03:21 AM
Hi @Renjithrk ,
There is no such option in Autoloader. Is it undocumented, or is this something suggested by ChatGPT? 😄
07-17-2025 03:29 AM
That's right @szymon_dybczak 😄
07-17-2025 03:28 AM
Hello @Renjithrk ,
I can't find this option in any documentation, so it is not available among the cloudFiles options.
You can check this link to see all available cloudFiles options: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options
Best, Ilir
07-17-2025 03:53 AM
@databricks_use2 I'm merely echoing the responses above, but it sounds like you should rename those files before doing anything else.
This post also supports that idea: https://community.databricks.com/t5/data-engineering/how-do-i-read-the-contents-of-a-hidden-file-in-...
All the best,
BS