
Will auto loader read files if it doesn't need to?

charliemerrell
New Contributor

I want to run Auto Loader on some very large JSON files. I don't actually care about the data inside the files, just the file paths of the blobs. If I do something like

```
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", source_operations_checkpoint_path)
    .load(source_operations_path)
    .select("_metadata"))
```
will Databricks know not to read all the files, or will it read them anyway and then discard the data?
2 REPLIES

Renu_
Valued Contributor II

Hi @charliemerrell, even if you're just selecting _metadata, Auto Loader still needs to read parts of the files, mainly to gather schema info and essential metadata. It won't fully read the contents, but it doesn't completely skip them either.

If you're only interested in things like file paths and not the actual data, switching to the "binaryFile" format is a better and more efficient option.
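For example, here is a rough sketch of that approach, reusing the variables from your snippet. Only the fixed binaryFile columns are selected and the content column is left out, so the raw payload isn't carried through the stream; whether schemaLocation is strictly required for binaryFile is worth double checking against the docs.

```
df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .option("cloudFiles.schemaLocation", source_operations_checkpoint_path)
    .load(source_operations_path)
    # keep only the file metadata columns; dropping "content"
    # avoids carrying the raw bytes downstream
    .select("path", "length", "modificationTime"))
```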

lingareddy_Alva
Honored Contributor III

Hi @charliemerrell 

Yes, Databricks will still open and parse the JSON files, even if you're only selecting _metadata.
It must infer the schema and perform basic parsing unless you explicitly avoid it.
So even if you do:
.select("_metadata")

it doesn't skip reading the file contents; it still downloads and parses the files to process the data.
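If you stay with the JSON format, one way to explicitly avoid the inference step is to pass a schema up front, so Auto Loader doesn't have to sample files before the stream starts. This only removes the schema-inference pass; the JSON files are still parsed when the stream runs. A rough sketch, reusing the source path from your snippet (the "id" field name is just a placeholder, fields that don't exist in the data simply come back as null):

```
from pyspark.sql.types import StructType, StructField, StringType

# Placeholder one-column schema so Auto Loader skips schema inference.
# The files are still downloaded and parsed at read time.
minimal_schema = StructType([StructField("id", StringType(), True)])

df = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .schema(minimal_schema)
    .load(source_operations_path)
    .select("_metadata"))
```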

LR
