Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Will auto loader read files if it doesn't need to?

charliemerrell
New Contributor

I want to run Auto Loader on some very large JSON files. I don't actually care about the data inside the files, just the file paths of the blobs. If I do something like

```
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", source_operations_checkpoint_path)
        .load(source_operations_path)
        .select("_metadata")
)
```
will Databricks know not to read all the files, or will it read them anyway and then discard the data?
2 REPLIES

Renu_
Contributor

Hi @charliemerrell, even if you’re just selecting _metadata, Auto Loader still needs to read parts of the files, mainly to gather schema info and essential metadata. It won’t fully read the contents, but it doesn’t completely skip them either.

If you're only interested in things like file paths and not the actual data, switching to the "binaryFile" format is a better and more efficient option.
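As a rough illustration of that suggestion, here is a minimal sketch reusing the paths from the question; the target table name, the trigger, and the reuse of the checkpoint path as a checkpointLocation are assumptions, and leaving out the content column relies on Spark pruning it so the file bytes should not need to be fetched:

```
# binaryFile has a fixed schema (path, modificationTime, length, content),
# so no schema inference or cloudFiles.schemaLocation is required.
files_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load(source_operations_path)
        # Keep only the metadata columns; omitting "content" should let Spark
        # prune that column and avoid reading the file bytes.
        .select("path", "modificationTime", "length")
)

query = (
    files_df.writeStream
        .option("checkpointLocation", source_operations_checkpoint_path)  # assumed checkpoint path
        .trigger(availableNow=True)       # process the existing backlog once, then stop
        .toTable("ingested_file_paths")   # hypothetical target table
)
```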

LRALVA
Honored Contributor

Hi @charliemerrell 

Yes, Databricks will still open and parse the JSON files, even if you're only selecting _metadata.
Auto Loader must infer the schema and perform at least basic parsing unless you explicitly avoid it (for example, by supplying an explicit schema, as in the sketch below).
So even if you do:
.select("_metadata")

it doesn't skip reading the file contents: the files are still downloaded, parsed, and cached, and the unused data is only discarded afterwards.
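A minimal sketch of supplying an explicit schema, assuming a placeholder schema and the paths from the question; this skips the upfront inference pass, though the JSON files are still parsed when the stream actually runs:

```
from pyspark.sql.types import StructField, StructType, StringType

# Placeholder schema: with an explicit schema, Auto Loader does not need to
# sample files for inference, so cloudFiles.schemaLocation can be dropped.
explicit_schema = StructType([StructField("id", StringType(), True)])

paths_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(explicit_schema)
        .load(source_operations_path)
        # _metadata.file_path carries the source blob path for each record
        .select("_metadata.file_path")
)
```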

LR
