Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Will auto loader read files if it doesn't need to?

charliemerrell
New Contributor

I want to run Auto Loader on some very large JSON files. I don't actually care about the data inside the files, just the file paths of the blobs. If I do something like

```
df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", source_operations_checkpoint_path)
        .load(source_operations_path)
        .select("_metadata")
)
```
will Databricks know not to read all the files, or will it read them anyway and then discard the data?
2 REPLIES

Renu_
Contributor

Hi @charliemerrell, even if you’re just selecting _metadata, Auto Loader still needs to read parts of the files, mainly to gather schema info and essential metadata. It won’t fully read the contents, but it doesn’t completely skip them either.

If you're only interested in things like file paths and not the actual data, switching to the "binaryFile" format is a better and more efficient option.
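As a rough illustration of that suggestion, here is a minimal sketch reusing the paths from the question; the target table name, the trigger, and the reuse of the checkpoint path as a checkpointLocation are assumptions, and leaving out the content column relies on Spark pruning it so the file bytes should not need to be fetched:

```
# binaryFile has a fixed schema (path, modificationTime, length, content),
# so no schema inference or cloudFiles.schemaLocation is required.
files_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load(source_operations_path)
        # Keep only the metadata columns; omitting "content" should let Spark
        # prune that column and avoid reading the file bytes.
        .select("path", "modificationTime", "length")
)

query = (
    files_df.writeStream
        .option("checkpointLocation", source_operations_checkpoint_path)  # assumed checkpoint path
        .trigger(availableNow=True)       # process the existing backlog once, then stop
        .toTable("ingested_file_paths")   # hypothetical target table
)
```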

LRALVA
Honored Contributor

Hi @charliemerrell 

Yes, Databricks will still open and parse the JSON files, even if you're only selecting _metadata.
Auto Loader must infer the schema and perform at least basic parsing unless you explicitly avoid it (for example, by supplying an explicit schema, as in the sketch below).
So even if you do:
.select("_metadata")

it doesn't skip reading the file contents: the files are still downloaded, parsed, and cached, and the unused data is only discarded afterwards.
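A minimal sketch of supplying an explicit schema, assuming a placeholder schema and the paths from the question; this skips the upfront inference pass, though the JSON files are still parsed when the stream actually runs:

```
from pyspark.sql.types import StructField, StructType, StringType

# Placeholder schema: with an explicit schema, Auto Loader does not need to
# sample files for inference, so cloudFiles.schemaLocation can be dropped.
explicit_schema = StructType([StructField("id", StringType(), True)])

paths_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(explicit_schema)
        .load(source_operations_path)
        # _metadata.file_path carries the source blob path for each record
        .select("_metadata.file_path")
)
```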

LR
