Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I process 3.5 GB GZ (~90 GB uncompressed) nested JSON files and convert them to a tabular format with less processing time and optimized cost in Azure Databricks?

sanchit_popli
New Contributor II

I have a total of 5,000 files (nested JSON, ~3.5 GB each as GZ). I have written code that converts the JSON to a table in minutes for JSON files up to 1 GB, but when I try to process a 3.5 GB GZ JSON file it usually fails because of garbage collection. I have tried multiple clusters as well; it still takes 18 minutes just to read the file, and since it is a nested JSON it is read as a single record.
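For reference, the read is of this general shape (a minimal sketch; the path and options are placeholders, not the exact code from the screenshots below). Gzip is not a splittable codec, so Spark has to decompress the whole 3.5 GB file in a single task, and with multiLine enabled a file holding one top-level JSON document comes back as a single record:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A .json.gz file is not splittable, so one task decompresses all of it,
# and multiLine=True parses the whole file as one JSON document.
df = (
    spark.read
    .option("multiLine", "true")                # needed: the file is a single JSON document
    .json("/mnt/raw/events/file0001.json.gz")   # hypothetical path
)
df.printSchema()
```

That single decompress-and-parse task would explain both the 18-minute read and the garbage-collection pressure on the driver/executor handling it.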

Please find the sample JSON structure below.

[Image: sample JSON / data frame structure]

Code snippet:

[Image: conversion code]
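Since the screenshot did not carry over, here is a minimal sketch of this kind of conversion (the column names `id`, `meta.source`, and `payload.items`, and the table name `events_flat`, are hypothetical; the real code follows the nested schema in the image above). Nested struct fields are pulled out with dot paths and arrays are expanded into rows with explode():

```python
from pyspark.sql.functions import col, explode

# df is the nested DataFrame produced by the read shown earlier.
flat = (
    df.select(
        col("id"),
        col("meta.source").alias("source"),           # nested struct field -> flat column
        explode(col("payload.items")).alias("item"),  # array of structs -> one row per element
    )
    .select("id", "source", col("item.name"), col("item.value"))
)

# Writing to Delta yields a tabular, queryable result ("events_flat" is a placeholder name).
flat.write.mode("overwrite").format("delta").saveAsTable("events_flat")
```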

Reading code:

[Image: reading code]

I am looking for a way to first process one 3.5 GB GZ file; after that, my focus will be on the 5,000 similar files. I am searching for an approach that is more optimized and cost-effective. Currently I am using Azure Databricks, but I am open to using any other alternative technology as well.

0 REPLIES