Reading different file structures for json files in blob stores

turagittech — Sun, 30 Mar 2025 22:17:21 GMT

Hi All,

We are planning to store JSON files with mixed structures in a blob store and read them into Databricks. I am wondering whether we should have a container for each structure, or whether the various tools in Databricks can successfully read the different types from a single container. I have my doubts, because there is no real way to separate them: blob storage is a flat namespace, regardless of how the file paths look to us humans.

I can filter the files in a Python script, but that seems to rule out tools like Auto Loader. Or am I missing something about how to use Auto Loader in this scenario?

How have others approached this?

Re: Reading different file structures for json files in blob stores

holly — Tue, 01 Apr 2025 15:25:57 GMT

If they're all JSON but have different structures, you can use the variant type:

https://docs.databricks.com/aws/en/sql/language-manual/data-types/variant-type

There are a few examples in this blog too: https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark
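
As a rough illustration of the variant approach, the sketch below loads every JSON file under a path into a single VARIANT column, so files with different structures can land in the same table and be unpacked later. It assumes an existing `spark` session on a runtime that supports VARIANT (for example Apache Spark 4.0 or a recent Databricks Runtime). The column name `payload`, the field name `id`, and both helper functions are hypothetical.

```python
def read_json_as_variant(spark, source_path, column_name="payload"):
    """Read all JSON files under source_path into one VARIANT column.

    `spark` is an existing SparkSession; `column_name` is a hypothetical
    name for the single column that holds each parsed JSON document.
    """
    return (
        spark.read.format("json")
        # Ask the JSON reader to skip schema inference and keep each
        # record whole as a VARIANT value.
        .option("singleVariantColumn", column_name)
        .load(source_path)
    )


def extract_field(column_name, field, sql_type="string"):
    """Build a SQL expression that pulls one top-level field out of the variant."""
    return f"variant_get({column_name}, '$.{field}', '{sql_type}') AS {field}"
```

Usage would look something like `read_json_as_variant(spark, path).selectExpr(extract_field("payload", "id"))`, where `path` points at the container or folder holding the mixed files.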

Re: Reading different file structures for json files in blob stores

turagittech — Mon, 07 Apr 2025 05:02:48 GMT

This doesn't hit the mark, as I am referring to each JSON file representing a different table of data. I think multiple structures in one blob container confuse a lot of tools, which means you have to load file by file, and that is going to be the least efficient approach.

Re: Reading different file structures for json files in blob stores

holly — Fri, 02 May 2025 13:36:08 GMT

Variant should work in this scenario too. There have also been some performance improvements with variant, so much more of the metadata now has stats for efficient processing.

You also don't have to go file by file. You can use things like Auto Loader, which checkpoints all the reads, or if you want you can use "*" in a path to match everything in that location.
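
To make the Auto Loader suggestion concrete, here is a minimal sketch of an incremental read over a folder of JSON files, with an optional glob so that one stream picks up only one kind of file. It assumes a Databricks runtime with an existing `spark` session. The function names, the paths passed in, and the file-name patterns are hypothetical; `cloudFiles.format`, `cloudFiles.schemaLocation` and `pathGlobFilter` are the reader options being set.

```python
def autoloader_options(schema_location, file_glob="*.json"):
    """Collect the reader options for one Auto Loader stream.

    schema_location: path where Auto Loader tracks the inferred schema.
    file_glob: file-name pattern used to select files (hypothetical default).
    """
    return {
        "cloudFiles.format": "json",
        "cloudFiles.schemaLocation": schema_location,
        "pathGlobFilter": file_glob,
    }


def stream_json(spark, source_path, schema_location, file_glob="*.json"):
    """Return a streaming DataFrame over the matching files in source_path."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(schema_location, file_glob))
        .load(source_path)
    )
```

If the file names carry the table name (say `orders_*.json` and `customers_*.json`, both made up here), one call per pattern gives one stream per structure. Each stream should get its own schema location, and its own checkpoint when written out.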

Re: Reading different file structures for json files in blob stores

turagittech — Fri, 16 May 2025 00:39:01 GMT

I'll look at this once we go to production with the source files. I have split the logs by file type to simplify this, but I'll go back and look again in the test space with mixed files.

Re: Reading different file structures for json files in blob stores

sandeepmankikar — Fri, 16 May 2025 03:44:48 GMT

Organize files by schema into subfolders (e.g., /schema_type_a/, /schema_type_b/) in the same container. Avoid putting all JSON types in one folder.
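
A small sketch of how that folder layout could drive ingestion: one source path, one checkpoint and one target table per schema subfolder. Everything here is an assumption for illustration, including the `plan_streams` helper, the `bronze_` table prefix and the example root paths; only the subfolder names come from the suggestion above.

```python
def plan_streams(container_root, schema_types, checkpoint_root):
    """Derive per-schema source, checkpoint and table names.

    container_root: root path of the landing container (hypothetical).
    schema_types: subfolder names, one per JSON structure.
    checkpoint_root: root path for per-stream checkpoints (hypothetical).
    """
    source_root = container_root.rstrip("/")
    chk_root = checkpoint_root.rstrip("/")
    plans = []
    for name in schema_types:
        plans.append(
            {
                "source": f"{source_root}/{name}/",
                "checkpoint": f"{chk_root}/{name}",
                "table": f"bronze_{name}",  # hypothetical naming scheme
            }
        )
    return plans


if __name__ == "__main__":
    for plan in plan_streams("/landing", ["schema_type_a", "schema_type_b"], "/checkpoints"):
        print(plan)
```

Each entry can then feed its own reader or stream, so no single job has to cope with more than one structure.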