
Autoloader exclude one directory

Nathant93
New Contributor III

Hi,

I have a bunch of CSV files in directories within an Azure Blob Storage container, and I am using Auto Loader to ingest them into a raw (bronze) table. All of the CSVs apart from one have the same schema. Is there a way to get Auto Loader to ignore the directory containing the one CSV whose schema differs from the rest?

Thanks

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Nathant93,

  • You can use the pathGlobFilter option to filter files based on a glob pattern. For instance, to read only files named A1.csv, A2.csv, …, A9.csv, you can specify the filter as follows:

    schema = "col1 STRING, col2 STRING, col3 STRING"
    df = spark.read.load("/file/load/location", format="csv", schema=schema, pathGlobFilter="A[0-9].csv")

  • Adjust the glob pattern according to your specific use case; an Auto Loader version is sketched below.
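If the goal is specifically to exclude one directory when using Auto Loader, one approach is to enumerate only the directories you do want with Hadoop-style brace alternation in the load path. A minimal sketch, assuming placeholder container, directory, table, and checkpoint/schema names (none of these come from the original post):

    # Minimal Auto Loader sketch; all paths and directory names are placeholders.
    df = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        # Required when the schema is inferred; a _schemas directory is created here.
        .option("cloudFiles.schemaLocation", "abfss://raw@myaccount.dfs.core.windows.net/_schemas/bronze_table")
        # Brace alternation lists only the directories to ingest, which effectively
        # excludes the one directory whose CSV has a different schema.
        .load("abfss://raw@myaccount.dfs.core.windows.net/{dir1,dir2,dir3}/*.csv")
    )

    (
        df.writeStream
        .option("checkpointLocation", "abfss://raw@myaccount.dfs.core.windows.net/_checkpoints/bronze_table")
        .trigger(availableNow=True)
        .toTable("bronze_table")
    )

Note that pathGlobFilter matches file names rather than directory paths, so when the odd file's name is not distinctive, listing the wanted directories in the load path as above is usually the simpler way to skip a whole directory.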
  • If you provide a schema to Auto Loader, it reads the data against exactly the columns specified in that schema.
  • To ignore specific columns that exist in some CSV files but not others, omit them from the schema: in the default PERMISSIVE parser mode, records with more columns than the schema have the extra values dropped, and records with fewer columns get nulls for the missing fields. For example:

    schema = "col1 STRING, col2 STRING, col3 STRING"
    df = spark.read.load("/file/load/location", format="csv", schema=schema)

  • In this case, any column beyond the three listed (such as a col_to_ignore column that appears in only some files) is dropped when reading the CSV files.
  • Auto Loader can infer schemas from the data files. If your CSV files do not contain a header row, provide the option ("header", "false"), for example:
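A small sketch of that combination, with a placeholder path and made-up column names:

    # Headerless CSVs: supply the schema yourself and disable header parsing.
    df = (
        spark.read
        .format("csv")
        .schema("col1 STRING, col2 STRING, col3 STRING")
        .option("header", "false")
        .load("/file/load/location")
    )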
  • Auto Loader stores schema information in a directory called _schemas at the configured cloudFiles.schemaLocation. This allows tracking schema changes over time.
  • To adjust the sample size used for schema inference, set the SQL configuration spark.databricks.clou...
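Assuming the truncated setting refers to the documented Auto Loader sampling configurations spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes and spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles, adjusting them looks like this (the values are only illustrative):

    # Assumed full configuration names; the values below are examples, not recommendations.
    spark.conf.set("spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes", "10gb")
    spark.conf.set("spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles", "500")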
