Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading different file structures for json files in blob stores

turagittech
New Contributor III

Hi All,

We are planning to store JSON files with several different structures in blob storage and read them into Databricks. I am questioning whether we should have a container for each structure, or whether the various tools in Databricks can successfully read the different types from a single container. I have my doubts, since blob storage is a flat namespace: there is no real way to separate the files, regardless of how we make the paths look to us humans.

I can filter the files in a Python script, but that seems to rule them out of things like Auto Loader. Or am I missing something about how to use Auto Loader in this scenario?

How have others approached this?

5 REPLIES

holly
Databricks Employee

If they're all JSON but have different structures, you can use the VARIANT type:

https://docs.databricks.com/aws/en/sql/language-manual/data-types/variant-type

There are a few examples in this blog, too: https://www.databricks.com/blog/introducing-open-variant-data-type-delta-lake-and-apache-spark
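
As a minimal sketch (assuming Databricks Runtime 15.3+ for VARIANT support; the path and column names below are hypothetical), each JSON document can be landed in a single VARIANT column and queried with the colon path syntax:

# Read mixed-structure JSON files into one VARIANT column.
# The landing path below is hypothetical.
df = (
    spark.read.format("json")
    .option("singleVariantColumn", "payload")  # parse each document into one VARIANT column
    .load("abfss://landing@mystorage.dfs.core.windows.net/raw_json/")
)

# Extract typed fields downstream with the colon path syntax.
df.createOrReplaceTempView("raw_events")
spark.sql("SELECT payload:customer_id::string AS customer_id FROM raw_events").show()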

turagittech
New Contributor III

This doesn't quite hit the mark, as each JSON file represents a different table of data. I think multiple structures in one blob container confuse a lot of tools, which forces file-by-file loading, and that is going to be the least efficient approach.

holly
Databricks Employee

Variant should work in this scenario too. There have also been performance improvements with variant, so much more of the metadata now carries statistics for efficient processing.

You also don't have to go file by file. You can use something like Auto Loader, which checkpoints everything it reads, or you can use "*" in a location to match everything under that path.
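
A minimal sketch of that pattern (all locations and names below are hypothetical, and landing everything as VARIANT via singleVariantColumn is my assumption about what you want):

# Auto Loader over everything under the landing path.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schemas/events")  # hypothetical
    .option("singleVariantColumn", "payload")  # land each document as one VARIANT column
    .load("abfss://landing@mystorage.dfs.core.windows.net/*/")  # "*" matches every subpath
)

(stream.writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/events")  # hypothetical
    .trigger(availableNow=True)  # process the backlog, then stop
    .toTable("main.raw.events"))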

turagittech
New Contributor III

I'll look at this once we go to production with the source files. I have split the logs by file type to simplify things for now, but I'll go back and test the mixed-file case again in the test space.

sandeepmankikar
Contributor

Organize files by schema into subfolders (e.g., /schema_type_a/, /schema_type_b/) in the same container. Avoid putting all JSON types in one folder.
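
As a minimal sketch of that layout (the folder, schema-location, checkpoint, and table names below are hypothetical), you can then run one Auto Loader stream per schema subfolder:

# One Auto Loader stream per schema subfolder, each with its own
# schema location, checkpoint, and target table (all names hypothetical).
for schema_type in ["schema_type_a", "schema_type_b"]:
    (spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", f"/Volumes/main/raw/_schemas/{schema_type}")
        .load(f"abfss://landing@mystorage.dfs.core.windows.net/{schema_type}/")
        .writeStream
        .option("checkpointLocation", f"/Volumes/main/raw/_checkpoints/{schema_type}")
        .trigger(availableNow=True)
        .toTable(f"main.raw.{schema_type}"))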