<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Parsing 5 GB json file is running long on cluster in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28099#M19937</link>
    <description>&lt;P&gt;With multiline = true, the JSON file is read as a whole and processed as such.&lt;/P&gt;&lt;P&gt;I'd try a beefier cluster.&lt;/P&gt;</description>
    <pubDate>Tue, 01 Mar 2022 08:48:29 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2022-03-01T08:48:29Z</dc:date>
    <item>
      <title>Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28095#M19933</link>
      <description>&lt;P&gt;I was creating a Delta table from a JSON input file in ADLS, but the job ran for a long time. Below is my cluster configuration. Is the issue related to the cluster config? Do I need to upgrade the cluster config?&lt;/P&gt;&lt;P&gt;The cluster was created for a non-prod environment, and we run complex batch ETL, i.e. joins and aggregations. Shall I create a small cluster with 400 GB of memory and 50 cores? Please advise.&lt;/P&gt;&lt;P&gt;Input JSON file size: 5 GB&lt;/P&gt;&lt;P&gt;standard_D3_V2&lt;/P&gt;&lt;P&gt;14 GB memory and 4 cores&lt;/P&gt;&lt;P&gt;Worker nodes: min 2, max 8&lt;/P&gt;&lt;P&gt;Executor type: standard_D3_V2&lt;/P&gt;&lt;P&gt;14 GB memory and 4 cores&lt;/P&gt;&lt;P&gt;Note: the cluster was an all-purpose cluster.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Feb 2022 17:26:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28095#M19933</guid>
      <dc:creator>Jana</dc:creator>
      <dc:date>2022-02-15T17:26:54Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28096#M19934</link>
      <description>&lt;P&gt;Hello, @Jana A! It's nice to meet you! My name is Piper, and I'm a moderator for Databricks. Welcome to the community. Thanks for your question. We'll give your peers a chance to respond and then we'll circle back if we need to.&lt;/P&gt;&lt;P&gt;Thanks in advance for your patience. 🙂&lt;/P&gt;</description>
      <pubDate>Wed, 16 Feb 2022 16:29:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28096#M19934</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-02-16T16:29:48Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28097#M19935</link>
      <description>&lt;P&gt;Have you checked &lt;A href="https://community.databricks.com/s/question/0D53f00001Q0Rq9CAF/bufferholder-exceeded-on-json-flattening" alt="https://community.databricks.com/s/question/0D53f00001Q0Rq9CAF/bufferholder-exceeded-on-json-flattening" target="_blank"&gt;this topic&lt;/A&gt;?  There might be some ideas there.&lt;/P&gt;</description>
      <pubDate>Thu, 17 Feb 2022 07:23:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28097#M19935</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-02-17T07:23:56Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28098#M19936</link>
      <description>&lt;P&gt;Note: the DataFrame was created with multiline set to true. The job ran for a long time and slowed down the cluster. Can you please help me with this issue?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Tue, 01 Mar 2022 08:33:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28098#M19936</guid>
      <dc:creator>Jana</dc:creator>
      <dc:date>2022-03-01T08:33:27Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28099#M19937</link>
      <description>&lt;P&gt;With multiline = true, the JSON file is read as a whole and processed as such.&lt;/P&gt;&lt;P&gt;I'd try a beefier cluster.&lt;/P&gt;</description>
      <pubDate>Tue, 01 Mar 2022 08:48:29 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28099#M19937</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-01T08:48:29Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28100#M19938</link>
      <description>&lt;P&gt;Yes, the issue was the multiline = true property: Spark reads the file as a whole. How can I resolve this?&lt;/P&gt;</description>
      <pubDate>Thu, 03 Mar 2022 17:55:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28100#M19938</guid>
      <dc:creator>Jana</dc:creator>
      <dc:date>2022-03-03T17:55:17Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28101#M19939</link>
      <description>&lt;P&gt;The Databricks docs state the following:&lt;/P&gt;&lt;P&gt;&lt;I&gt;You can read JSON files in &lt;/I&gt;&lt;A href="https://docs.databricks.com/data/data-sources/read-json.html#single-line-mode" alt="https://docs.databricks.com/data/data-sources/read-json.html#single-line-mode" target="_blank"&gt;&lt;I&gt;single-line&lt;/I&gt;&lt;/A&gt;&lt;I&gt; or &lt;/I&gt;&lt;A href="https://docs.databricks.com/data/data-sources/read-json.html#multi-line-mode" alt="https://docs.databricks.com/data/data-sources/read-json.html#multi-line-mode" target="_blank"&gt;&lt;I&gt;multi-line&lt;/I&gt;&lt;/A&gt;&lt;I&gt; mode. In single-line mode, a file can be split into many parts and read in parallel. In multi-line mode, a file is loaded as a whole entity and cannot be split.&lt;/I&gt;&lt;/P&gt;&lt;P&gt;This means that you get no parallelism while reading the JSON file.&lt;/P&gt;&lt;P&gt;So you have a few options:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Do not use multiline. This is only possible if your JSON file contains one JSON object per line. You can try it to see if it works.&lt;/LI&gt;&lt;LI&gt;Use a larger cluster. Because the file cannot be split, a single task reads the whole JSON file, so the executor running that task needs enough memory. The number of cores is less important.&lt;/LI&gt;&lt;LI&gt;If you can, split up the file.&lt;/LI&gt;&lt;/OL&gt;</description>
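<!--
The first option in the reply above only works when every record sits on its own line. As a rough sketch, assuming the multi-line file holds a single top-level JSON array and is small enough to load in memory on the machine doing the conversion, it could be rewritten as JSON Lines with the Python standard library. The function name and file paths are hypothetical and do not come from the thread.

```python
import json


def json_array_to_jsonl(src_path: str, dst_path: str) -> int:
    """Rewrite a top-level JSON array as one compact JSON object per line.

    Assumption: the whole file fits in memory, because json.load parses it
    in one go. Returns the number of records written.
    """
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            # json.dumps without indent escapes newlines inside strings,
            # so each record stays on a single physical line.
            dst.write(json.dumps(record, ensure_ascii=False))
            dst.write("\n")
    return len(records)
```

A file produced this way has one JSON object per line, which is the precondition the reply gives for reading without multiline.
-->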
      <pubDate>Thu, 03 Mar 2022 18:28:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28101#M19939</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-03T18:28:58Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28102#M19940</link>
      <description>&lt;P&gt;Should I increase driver memory or executor memory? I changed my cluster's executor configuration from 14 GB to 28 GB. With that change, we were able to complete the job without any issues.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Mar 2022 17:40:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28102#M19940</guid>
      <dc:creator>Jana</dc:creator>
      <dc:date>2022-03-04T17:40:07Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28103#M19941</link>
      <description>&lt;P&gt;Hi @Jana A,&lt;/P&gt;&lt;P&gt;Did @Werner Stinckens's reply help you resolve your issue? If yes, could you please mark his response as the "best response"?&lt;/P&gt;</description>
      <pubDate>Mon, 07 Mar 2022 23:14:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/28103#M19941</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2022-03-07T23:14:40Z</dc:date>
    </item>
    <item>
      <title>Re: Parsing 5 GB json file is running long on cluster</title>
      <link>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/96560#M39292</link>
      <description>&lt;P&gt;Splitting the file was the easiest solution for me. I was trying to load a 3 GB JSON file into a Delta table. I'm working on a cluster with 128 GB of memory. The error message I got did not help identify the issue. I split the file into three 1 GB files. It worked like a charm.&lt;/P&gt;</description>
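<!--
The reply above does not say how the file was split. One possible sketch, assuming the source is a single top-level JSON array that fits in memory on the machine doing the split, writes the records out as several smaller JSON array files. The function name, the output file naming, and the equal-sized chunking are illustrative assumptions, not the poster's method.

```python
import json
import os


def split_json_array(src_path: str, out_dir: str, parts: int) -> list[str]:
    """Split a top-level JSON array into at most `parts` smaller array files.

    Assumption: the source file fits in memory, since json.load reads it whole.
    Returns the paths of the files written, in record order.
    """
    if parts < 1:
        raise ValueError("parts must be at least 1")
    with open(src_path, "r", encoding="utf-8") as src:
        records = json.load(src)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")

    os.makedirs(out_dir, exist_ok=True)
    # Ceiling division so that no more than `parts` files are produced.
    chunk = max(1, (len(records) + parts - 1) // parts)
    out_paths = []
    for n, start in enumerate(range(0, len(records), chunk)):
        path = os.path.join(out_dir, f"part-{n}.json")
        with open(path, "w", encoding="utf-8") as dst:
            json.dump(records[start:start + chunk], dst, ensure_ascii=False)
        out_paths.append(path)
    return out_paths
```

Each output file is itself a valid JSON array, so the pieces can be loaded separately instead of as one large unsplittable file.
-->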
      <pubDate>Mon, 28 Oct 2024 17:47:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/parsing-5-gb-json-file-is-running-long-on-cluster/m-p/96560#M39292</guid>
      <dc:creator>AlexG</dc:creator>
      <dc:date>2024-10-28T17:47:03Z</dc:date>
    </item>
  </channel>
</rss>

