
Preferred compression format for ingesting large amounts of JSON files with Autoloader

Volker
New Contributor III

Hello Databricks Community,

In an IoT context, we plan to ingest a large number of JSON files (~2 million per day). The files are in JSON Lines format and need to be compressed on the IoT devices. We are able to recommend which compression type the devices should use, so we are looking for the one that is optimal for ingesting these files.
The resources we found online suggest different compression formats, each with its own pros and cons. So far we have looked at gzip and bzip2, and it looks like bzip2 could be more performant than gzip.

Does anyone have experience with such a use case and could provide arguments in favor of a particular compression format, or recommend other formats?

Thanks in advance!  

3 REPLIES

jose_gonzalez
Moderator

Could you provide more details? For example, your source will be the JSON files, but is your sink a Delta table (assuming you will use Auto Loader to ingest your data)? If that is the case, your Delta tables will already be compressed.

Kaniz
Community Manager

Hi @Volker, we haven't heard from you since the last response from @jose_gonzalez, and I was checking back to see if his suggestions helped you.

Otherwise, if you have found a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click on the "Accept As Solution" button whenever the information provided helps resolve your question.

Volker
New Contributor III

Hi, 

Sorry, I guess my response wasn't sent. The source is JSON files that are uploaded to an S3 bucket. The sink will be a Delta table, and we are using Auto Loader.
The question was about the compression format of the incoming JSON files, e.g. whether it would be better to compress them using gzip, bzip2, or some other format. Compression ratio is not a concern; it is purely a matter of performance.
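
To make the performance question concrete, here is a minimal sketch of how the two candidates could be compared locally before deciding. It is only an illustration using Python's standard `gzip` and `bz2` modules on synthetic data; the record fields (`device_id`, `ts`, `temp`) and the helper names `make_json_lines` and `benchmark` are made up for this example, and single-machine decompression timings are not a substitute for measuring an actual Auto Loader run on a cluster.

```python
import bz2
import gzip
import json
import time


def make_json_lines(n_records: int) -> bytes:
    """Build a synthetic JSON Lines payload (hypothetical IoT readings)."""
    lines = [
        json.dumps({"device_id": i % 50, "ts": 1_700_000_000 + i, "temp": 20 + (i % 17) * 0.5})
        for i in range(n_records)
    ]
    return ("\n".join(lines) + "\n").encode("utf-8")


def benchmark(payload: bytes, repeats: int = 5) -> dict:
    """Return compressed size and best-of-N decompression time per codec."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        blob = compress(payload)
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            restored = decompress(blob)
            best = min(best, time.perf_counter() - start)
        assert restored == payload  # round-trip sanity check
        results[name] = {"compressed_bytes": len(blob), "decompress_seconds": best}
    return results


if __name__ == "__main__":
    data = make_json_lines(20_000)
    for codec, stats in benchmark(data).items():
        print(codec, stats)
```

Running this against a sample of real device files instead of the synthetic payload would give more representative numbers.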

Thank you! 

@jose_gonzalez 
