<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic java.lang.OutOfMemoryError: GC overhead limit exceeded in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-gc-overhead-limit-exceeded/m-p/30048#M21726</link>
    <description>
&lt;P&gt;I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying a count action on a file.&lt;/P&gt;
&lt;P&gt;The file is a CSV file, 217 GB in size.&lt;/P&gt;
&lt;P&gt;I'm using 10 r3.8xlarge (Ubuntu) machines with CDH 5.3.6 and Spark 1.2.0.&lt;/P&gt;
&lt;P&gt;Configuration:&lt;/P&gt;
&lt;P&gt;spark.app.id:local-1443956477103&lt;/P&gt;
&lt;P&gt;spark.app.name:Spark shell&lt;/P&gt;
&lt;P&gt;spark.cores.max:100&lt;/P&gt;
&lt;P&gt;spark.driver.cores:24&lt;/P&gt;
&lt;P&gt;spark.driver.extraLibraryPath:/opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native&lt;/P&gt;
&lt;P&gt;spark.driver.host:ip-172-31-34-242.us-west-2.compute.internal&lt;/P&gt;
&lt;P&gt;spark.driver.maxResultSize:300g&lt;/P&gt;
&lt;P&gt;spark.driver.port:55123&lt;/P&gt;
&lt;P&gt;spark.eventLog.dir:hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/spark/applicationHistory&lt;/P&gt;
&lt;P&gt;spark.eventLog.enabled:true&lt;/P&gt;
&lt;P&gt;spark.executor.extraLibraryPath:/opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native&lt;/P&gt;
&lt;P&gt;spark.executor.id:driver&lt;/P&gt;
&lt;P&gt;spark.executor.memory:200g&lt;/P&gt;
&lt;P&gt;spark.fileserver.uri:http://172.31.34.242:51424&lt;/P&gt;
&lt;P&gt;spark.jars:&lt;/P&gt;
&lt;P&gt;spark.master:local[*]&lt;/P&gt;
&lt;P&gt;spark.repl.class.uri:http://172.31.34.242:58244&lt;/P&gt;
&lt;P&gt;spark.scheduler.mode:FIFO&lt;/P&gt;
&lt;P&gt;spark.serializer:org.apache.spark.serializer.KryoSerializer&lt;/P&gt;
&lt;P&gt;spark.storage.memoryFraction:0.9&lt;/P&gt;
&lt;P&gt;spark.tachyonStore.folderName:spark-88bd9c44-d626-4ad2-8df3-f89df4cb30de&lt;/P&gt;
&lt;P&gt;spark.yarn.historyServer.address:http://ip-172-31-34-242.us-west-2.compute.internal:18088&lt;/P&gt;
&lt;P&gt;Here is what I ran:&lt;/P&gt;
&lt;P&gt;val testrdd = sc.textFile("&lt;/P&gt;</description>
    <pubDate>Sun, 04 Oct 2015 11:16:34 GMT</pubDate>
    <dc:creator>t_ras</dc:creator>
    <dc:date>2015-10-04T11:16:34Z</dc:date>
    <item>
      <title>java.lang.OutOfMemoryError: GC overhead limit exceeded</title>
      <link>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-gc-overhead-limit-exceeded/m-p/30048#M21726</link>
      <description>
&lt;P&gt;I get java.lang.OutOfMemoryError: GC overhead limit exceeded when trying a count action on a file.&lt;/P&gt;
&lt;P&gt;The file is a CSV file, 217 GB in size.&lt;/P&gt;
&lt;P&gt;I'm using 10 r3.8xlarge (Ubuntu) machines with CDH 5.3.6 and Spark 1.2.0.&lt;/P&gt;
&lt;P&gt;Configuration:&lt;/P&gt;
&lt;P&gt;spark.app.id:local-1443956477103&lt;/P&gt;
&lt;P&gt;spark.app.name:Spark shell&lt;/P&gt;
&lt;P&gt;spark.cores.max:100&lt;/P&gt;
&lt;P&gt;spark.driver.cores:24&lt;/P&gt;
&lt;P&gt;spark.driver.extraLibraryPath:/opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native&lt;/P&gt;
&lt;P&gt;spark.driver.host:ip-172-31-34-242.us-west-2.compute.internal&lt;/P&gt;
&lt;P&gt;spark.driver.maxResultSize:300g&lt;/P&gt;
&lt;P&gt;spark.driver.port:55123&lt;/P&gt;
&lt;P&gt;spark.eventLog.dir:hdfs://ip-172-31-34-242.us-west-2.compute.internal:8020/user/spark/applicationHistory&lt;/P&gt;
&lt;P&gt;spark.eventLog.enabled:true&lt;/P&gt;
&lt;P&gt;spark.executor.extraLibraryPath:/opt/cloudera/parcels/CDH-5.3.6-1.cdh5.3.6.p0.11/lib/hadoop/lib/native&lt;/P&gt;
&lt;P&gt;spark.executor.id:driver&lt;/P&gt;
&lt;P&gt;spark.executor.memory:200g&lt;/P&gt;
&lt;P&gt;spark.fileserver.uri:http://172.31.34.242:51424&lt;/P&gt;
&lt;P&gt;spark.jars:&lt;/P&gt;
&lt;P&gt;spark.master:local[*]&lt;/P&gt;
&lt;P&gt;spark.repl.class.uri:http://172.31.34.242:58244&lt;/P&gt;
&lt;P&gt;spark.scheduler.mode:FIFO&lt;/P&gt;
&lt;P&gt;spark.serializer:org.apache.spark.serializer.KryoSerializer&lt;/P&gt;
&lt;P&gt;spark.storage.memoryFraction:0.9&lt;/P&gt;
&lt;P&gt;spark.tachyonStore.folderName:spark-88bd9c44-d626-4ad2-8df3-f89df4cb30de&lt;/P&gt;
&lt;P&gt;spark.yarn.historyServer.address:http://ip-172-31-34-242.us-west-2.compute.internal:18088&lt;/P&gt;
&lt;P&gt;Here is what I ran:&lt;/P&gt;
&lt;P&gt;val testrdd = sc.textFile("&lt;/P&gt;</description>
      <pubDate>Sun, 04 Oct 2015 11:16:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-gc-overhead-limit-exceeded/m-p/30048#M21726</guid>
      <dc:creator>t_ras</dc:creator>
      <dc:date>2015-10-04T11:16:34Z</dc:date>
    </item>
    <item>
      <title>Re: java.lang.OutOfMemoryError: GC overhead limit exceeded</title>
      <link>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-gc-overhead-limit-exceeded/m-p/30049#M21727</link>
      <description>
&lt;P&gt;Looks like the following property is pretty high, which consumes a lot of memory on your executors when you cache the dataset. &lt;/P&gt;
&lt;P&gt;"spark.storage.memoryFraction:0.9"&lt;/P&gt;
&lt;P&gt;This could likely be solved by changing the configuration. Take a look at the upstream tuning docs:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://spark.apache.org/docs/latest/tuning.html" target="_blank"&gt;http://spark.apache.org/docs/latest/tuning.html&lt;/A&gt;&lt;/P&gt;
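&lt;P&gt;For example, here is a minimal sketch of lowering that fraction by building the context yourself; the 0.3 value and the application name are illustrative assumptions, not settings taken from this thread:&lt;/P&gt;
&lt;PRE&gt;// sketch only: illustrative values, adjust for your own workload
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("csv-count")                      // hypothetical app name
  .set("spark.storage.memoryFraction", "0.3")   // leave more heap for execution and GC
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

val sc = new SparkContext(conf)&lt;/PRE&gt;
&lt;P&gt;In spark-shell the same setting can also be passed at launch, e.g. --conf spark.storage.memoryFraction=0.3.&lt;/P&gt;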
</description>
      <pubDate>Fri, 09 Oct 2015 16:38:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-gc-overhead-limit-exceeded/m-p/30049#M21727</guid>
      <dc:creator>miklos</dc:creator>
      <dc:date>2015-10-09T16:38:23Z</dc:date>
    </item>
  </channel>
</rss>

