
Analyzing 23 GB JSON file

jayallenmn
New Contributor III

Hey all,

We're trying to analyze the data in a 23 GB JSON file. We're using the basic starter cluster - one node, 2 CPUs x 8 GB.

We can read the JSON file into a Spark DataFrame and print out the schema, but any operation we try, even ones that shouldn't collect everything to the driver (take, filter), makes the driver fail with "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

The JSON file is multiline, and it sounds like the entire thing has to be read into memory on a single node, so we need a cluster of larger nodes. What size cluster would you guys recommend? We were looking at a cluster of three 8 CPU x 32 GB nodes. Do you think that would work?

Jay
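
One workaround for the multiline issue described above, besides a bigger cluster, is to convert the file to line-delimited JSON (one record per line) before loading it, since Spark can split line-delimited input across tasks but reads a multiline JSON file as a whole. Below is a minimal sketch using only the Python standard library. It assumes the file is a single top-level JSON array of records, which the post does not confirm; the function name `json_array_to_jsonl` and the `chunk_size` parameter are made up for illustration.

```python
import io
import json

# Whitespace plus the commas that separate array elements.
_SKIP = " \t\r\n,"


def json_array_to_jsonl(src, dst, chunk_size=1 << 20):
    """Stream a top-level JSON array from `src` to `dst`, one element per line.

    `src` and `dst` are text-mode file objects. Only the current chunk and the
    element being parsed are held in memory. Returns the number of records
    written. Hypothetical helper: lenient about stray commas, and on malformed
    input it keeps reading until end of file before raising ValueError.
    """
    decoder = json.JSONDecoder()
    buf, pos, count = "", 0, 0
    started = False  # becomes True once the opening '[' is consumed

    def refill():
        # Drop the consumed prefix and append the next chunk.
        nonlocal buf, pos
        more = src.read(chunk_size)
        if not more:
            raise ValueError("input ended before the array was closed")
        buf, pos = buf[pos:] + more, 0

    while True:
        while pos < len(buf) and buf[pos] in _SKIP:
            pos += 1
        if pos == len(buf):
            refill()
            continue
        if not started:
            if buf[pos] != "[":
                raise ValueError("expected a top-level JSON array")
            started, pos = True, pos + 1
            continue
        if buf[pos] == "]":
            return count
        try:
            record, end = decoder.raw_decode(buf, pos)
        except json.JSONDecodeError:
            refill()  # the element is probably split across chunks
            continue
        if end == len(buf):
            # The value touches the end of the buffer, so it may be cut off
            # (for example a number split in two). Read more and re-parse.
            refill()
            continue
        dst.write(json.dumps(record) + "\n")
        pos, count = end, count + 1


if __name__ == "__main__":
    sample = io.StringIO('[\n  {"id": 1, "name": "a"},\n  {"id": 2, "name": "b"}\n]')
    out = io.StringIO()
    print(json_array_to_jsonl(sample, out, chunk_size=16), "records")
    print(out.getvalue(), end="")
```

For a real file you would pass two handles from `open(...)` instead of the in-memory strings. The converted file can then be read with `spark.read.json(path)` without the multiLine option. Whether this is enough for the 2 CPU x 8 GB node depends on the data, so treat it as something to try rather than a guaranteed fix.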

2 REPLIES

Prabakar
Esteemed Contributor III

Hi @Jay Allen, you can refer to the cluster sizing doc.

jayallenmn
New Contributor III

Thanks Prabakar! We have 12 days left in our trial. We'd have to pay for the AWS VMs, but would the Databricks piece be free during the trial with the new, bigger cluster?
