Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Autoloader and files with invalid paths

databricks_use2
New Contributor II

I'm encountering an issue where Autoloader fails to process certain files because of specific characters in their names. For example, files that begin with an underscore (e.g., _data_etc.json) are ignored and not processed. After some investigation, I found that Spark ignores files starting with a leading _ or . by default. However, I need to include these files in my processing pipeline. Is there a way to configure Autoloader to include them?

Additionally, I'm facing another issue with certain file paths, such as s3://abc/https://some_folder/xyz. Autoloader throws an error in this case saying the file was not found. Is there a way to either process such paths, or configure Autoloader to completely ignore folders with malformed or nested paths like these?
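(For the second issue, one workable approach is to screen object keys before ingestion and quarantine the malformed ones. The helper below is a hypothetical sketch, not a Databricks API: it flags keys whose path embeds a second URI scheme, as in the example above.)

```python
def is_malformed_key(key: str) -> bool:
    """Return True if any path segment of an object key looks like a
    nested URI scheme (e.g. 'abc/https://some_folder/xyz'), which
    Auto Loader cannot resolve. Hypothetical helper for a pre-ingestion
    cleanup job; names are illustrative."""
    return any(
        seg.endswith(":") and seg[:-1].isalpha()
        for seg in key.split("/")
    )

# Example: keys containing an embedded scheme get quarantined,
# ordinary keys pass through untouched.
print(is_malformed_key("abc/https://some_folder/xyz"))  # True
print(is_malformed_key("abc/data_etc.json"))            # False
```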

7 REPLIES

ilir_nuredini
Honored Contributor

Hello @databricks_use2 ,

I don't think there is an easy way to do this. The hiddenFileFilter is always active, and this is not specific to Autoloader. Bypassing it could also break very basic functionality, such as reading Delta tables (you would start reading hidden metadata files). I suggest you run a rename job first and then read the files.
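A minimal sketch of such a rename-then-read step, assuming the files live in S3 and are renamed via boto3 (the bucket, prefix, and driver loop are hypothetical; S3 has no rename, so it is a copy followed by a delete):

```python
def visible_key(key: str) -> str:
    """Strip leading '_' or '.' characters from the final path segment
    so Spark/Auto Loader no longer treat the file as hidden.
    e.g. 'landing/_data_etc.json' -> 'landing/data_etc.json'"""
    parts = key.split("/")
    parts[-1] = parts[-1].lstrip("_.")
    return "/".join(parts)

# Hypothetical boto3 driver (assumption: plain S3 access, bucket/prefix
# names are placeholders). Run this as a job before starting the stream:
#
# import boto3
# s3 = boto3.client("s3")
# resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="landing/")
# for obj in resp.get("Contents", []):
#     old, new = obj["Key"], visible_key(obj["Key"])
#     if new != old:
#         s3.copy_object(Bucket="my-bucket",
#                        CopySource={"Bucket": "my-bucket", "Key": old},
#                        Key=new)
#         s3.delete_object(Bucket="my-bucket", Key=old)
```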

Hope that helps,

Best, Ilir

szymon_dybczak
Esteemed Contributor III

I agree with @ilir_nuredini . It's better to change the source file naming convention than to try to bypass the hidden file filter, especially when working with Delta Lake, since internal metadata and transaction logs are also stored in hidden files and folders.

Renjithrk
New Contributor II

I'm just offering a suggestion.
By default, Spark and Autoloader skip hidden files (those starting with _ or .). To include these in the Autoloader pipeline, use the following option: option("cloudFiles.includeHiddenFiles", "true") 

Renjith Kumar | Azure Data Engineer | Databricks | Fabric | PySpark | SQL

szymon_dybczak
Esteemed Contributor III

Hi @Renjithrk ,

There is no such option in Autoloader. Is it undocumented, or is this something suggested by ChatGPT? 😄

Auto Loader options - Azure Databricks | Microsoft Learn

That's right @szymon_dybczak 😄

Hello @Renjithrk ,

I can't find this option in any documentation, so it is not available in cloudFiles.
You can check this link to see all available cloudFiles options: https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options

Best, Ilir

BS_THE_ANALYST
Esteemed Contributor

@databricks_use2 I'm merely echoing the responses above, but it sounds like you should rename those files before doing anything else.

This post also supports the idea: https://community.databricks.com/t5/data-engineering/how-do-i-read-the-contents-of-a-hidden-file-in-...

All the best,
BS