<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How can I process a 3.5 GB GZ (~90 GB uncompressed) nested JSON file and convert it to tabular format with less processing time and optimized cost in Azure Databricks? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-can-process-3-5gb-gz-90gb-nested-json-and-convert-them-to/m-p/34054#M24848</link>
    <description>&lt;P&gt;I have a total of 5,000 files (nested JSON, ~3.5 GB). I have written code that converts the JSON to a table in minutes for files up to 1 GB, but when I try to process a 3.5 GB GZ JSON file it usually fails because of garbage collection. I have tried multiple clusters as well; it still takes 18 minutes just to read the file, and because it is nested JSON it is read as a single record.&lt;/P&gt;&lt;P&gt;Please find the sample JSON structure below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Data frame structure"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1582i96FBF6A4B4D8BB8C/image-size/large?v=v2&amp;amp;px=999" role="button" title="Data frame structure" alt="Data frame structure" /&gt;&lt;/span&gt;Code snippet:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Code"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1576i9468D0CE9962A6F0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Code" alt="Code" /&gt;&lt;/span&gt;Reading code:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Reading Code"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1584iDEC8AC05168273BB/image-size/large?v=v2&amp;amp;px=999" role="button" title="Reading Code" alt="Reading Code" /&gt;&lt;/span&gt;I am looking for a way to process one 3.5 GB GZ file first; after that, my focus will be on the 5,000 similar files. I am searching for an approach that is more optimized and cost effective. Currently I am using Azure Databricks, but I am open to using any other technology as well.&lt;/P&gt;</description>
    <pubDate>Mon, 22 Aug 2022 21:43:15 GMT</pubDate>
    <dc:creator>sanchit_popli</dc:creator>
    <dc:date>2022-08-22T21:43:15Z</dc:date>
    <item>
      <title>How can I process a 3.5 GB GZ (~90 GB uncompressed) nested JSON file and convert it to tabular format with less processing time and optimized cost in Azure Databricks?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-can-process-3-5gb-gz-90gb-nested-json-and-convert-them-to/m-p/34054#M24848</link>
      <description>&lt;P&gt;I have a total of 5,000 files (nested JSON, ~3.5 GB). I have written code that converts the JSON to a table in minutes for files up to 1 GB, but when I try to process a 3.5 GB GZ JSON file it usually fails because of garbage collection. I have tried multiple clusters as well; it still takes 18 minutes just to read the file, and because it is nested JSON it is read as a single record.&lt;/P&gt;&lt;P&gt;Please find the sample JSON structure below.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Data frame structure"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1582i96FBF6A4B4D8BB8C/image-size/large?v=v2&amp;amp;px=999" role="button" title="Data frame structure" alt="Data frame structure" /&gt;&lt;/span&gt;Code snippet:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Code"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1576i9468D0CE9962A6F0/image-size/large?v=v2&amp;amp;px=999" role="button" title="Code" alt="Code" /&gt;&lt;/span&gt;Reading code:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="Reading Code"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1584iDEC8AC05168273BB/image-size/large?v=v2&amp;amp;px=999" role="button" title="Reading Code" alt="Reading Code" /&gt;&lt;/span&gt;I am looking for a way to process one 3.5 GB GZ file first; after that, my focus will be on the 5,000 similar files. I am searching for an approach that is more optimized and cost effective. Currently I am using Azure Databricks, but I am open to using any other technology as well.&lt;/P&gt;</description>
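      <!--
        A minimal PySpark sketch of one possible approach to the question above, assuming a
        Databricks notebook where the `spark` session already exists. The paths, the schema,
        and the nested column name ("records") are hypothetical placeholders, not taken from
        the original post. The idea: a .gz file is not splittable, so the single 3.5 GB file
        is read by one task no matter the cluster size; supplying an explicit schema avoids a
        separate schema-inference pass over the file, and flattening plus rewriting to a
        splittable columnar format lets the subsequent 5,000-file run parallelize.

        from pyspark.sql import functions as F
        from pyspark.sql.types import ArrayType, StringType, StructField, StructType

        # Hypothetical input/output locations.
        src_path = "abfss://raw@mystorage.dfs.core.windows.net/input/file_0001.json.gz"
        out_path = "abfss://curated@mystorage.dfs.core.windows.net/output/flattened"

        # Hypothetical schema for the nested document; defining it up front skips
        # Spark's schema-inference read of the whole file.
        schema = StructType([
            StructField("records", ArrayType(StructType([
                StructField("id", StringType()),
                StructField("value", StringType()),
            ])))
        ])

        # A gzipped, multiLine JSON file is read by a single core regardless of cluster
        # size, so this step stays slow; everything after it can scale out.
        df = spark.read.schema(schema).option("multiLine", "true").json(src_path)

        # Flatten: explode the array of records and promote struct fields to columns.
        flat = df.select(F.explode("records").alias("r")).select("r.*")

        # Write to a splittable, columnar format for downstream queries.
        flat.write.format("delta").mode("overwrite").save(out_path)
      -->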
      <pubDate>Mon, 22 Aug 2022 21:43:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-can-process-3-5gb-gz-90gb-nested-json-and-convert-them-to/m-p/34054#M24848</guid>
      <dc:creator>sanchit_popli</dc:creator>
      <dc:date>2022-08-22T21:43:15Z</dc:date>
    </item>
  </channel>
</rss>

