<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: Analyzing 23 GB JSON file in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/analyzing-23-gb-json-file/m-p/12946#M7694</link>
    <description>&lt;P&gt;Thanks Prabakar!  We have 12 days left in our trial - we'd have to pay for the AWS VMs, but would the Databricks piece be free during the trial with the new, bigger cluster?&lt;/P&gt;</description>
    <pubDate>Thu, 21 Jul 2022 19:50:41 GMT</pubDate>
    <dc:creator>jayallenmn</dc:creator>
    <dc:date>2022-07-21T19:50:41Z</dc:date>
    <item>
      <title>Analyzing 23 GB JSON file</title>
      <link>https://community.databricks.com/t5/data-engineering/analyzing-23-gb-json-file/m-p/12944#M7692</link>
      <description>&lt;P&gt;Hey all,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We're trying to analyze the data in a 23 GB JSON file.  We're using the basic starter cluster - one node, 2 CPUs x 8 GB.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We can read the JSON file into a Spark DataFrame and print out the schema, but if we try any operation - even ones that shouldn't collect the full dataset (take, filter) - the driver fails with "The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The JSON file is multiline, and it sounds like the entire thing will have to be read into memory on a single node - so we need a cluster of larger nodes.  What size cluster would you recommend?  We were looking at a cluster of three 8 x 32 nodes - do you think that would work?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Jay&lt;/P&gt;</description>
      <pubDate>Thu, 21 Jul 2022 04:30:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/analyzing-23-gb-json-file/m-p/12944#M7692</guid>
      <dc:creator>jayallenmn</dc:creator>
      <dc:date>2022-07-21T04:30:13Z</dc:date>
    </item>
    <item>
      <title>Re: Analyzing 23 GB JSON file</title>
      <link>https://community.databricks.com/t5/data-engineering/analyzing-23-gb-json-file/m-p/12945#M7693</link>
      <description>&lt;P&gt;Hi @Jay Allen,&amp;nbsp;you can refer to the &lt;A href="https://docs.databricks.com/clusters/cluster-config-best-practices.html#cluster-sizing-considerations" alt="https://docs.databricks.com/clusters/cluster-config-best-practices.html#cluster-sizing-considerations" target="_blank"&gt;cluster sizing doc&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 21 Jul 2022 13:45:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/analyzing-23-gb-json-file/m-p/12945#M7693</guid>
      <dc:creator>Prabakar</dc:creator>
      <dc:date>2022-07-21T13:45:14Z</dc:date>
    </item>
    <item>
      <title>Re: Analyzing 23 GB JSON file</title>
      <link>https://community.databricks.com/t5/data-engineering/analyzing-23-gb-json-file/m-p/12946#M7694</link>
      <description>&lt;P&gt;Thanks Prabakar!  We have 12 days left in our trial - we'd have to pay for the AWS VMs, but would the Databricks piece be free during the trial with the new, bigger cluster?&lt;/P&gt;</description>
      <pubDate>Thu, 21 Jul 2022 19:50:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/analyzing-23-gb-json-file/m-p/12946#M7694</guid>
      <dc:creator>jayallenmn</dc:creator>
      <dc:date>2022-07-21T19:50:41Z</dc:date>
    </item>
  </channel>
</rss>

