<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Optimizing a batch load process, reading with the JDBC driver in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimizing-a-batch-load-process-reading-with-the-jdbc-driver/m-p/21086#M14324</link>
    <description>&lt;P&gt;I am running a batch load that reads a database table through the JDBC driver. In the Spark UI, I see both memory and disk spill, but only on one executor. I have also noticed that the JDBC parallel read seems to run slower than leaving the read at its default settings.&lt;/P&gt;&lt;P&gt;Some details:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;I have 4 workers, 8 GB&lt;/LI&gt;&lt;LI&gt;The source table has around 80 million rows&lt;/LI&gt;&lt;LI&gt;I am using "dateloaded" as the partition column.&lt;/LI&gt;&lt;LI&gt;sqlContext.setConf("spark.sql.shuffle.partitions","4") sets the number of shuffle partitions. Is it correct to set the shuffle partition count to the executor count?&lt;/LI&gt;&lt;LI&gt;numPartitions=12. Is it correct that 3-4 tasks per executor is ideal?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The "dateloaded" column is not a primary key, but it is indexed. Is the spill a result of data skew, or have I set too few or too many partitions for the shuffle or the read?&lt;/P&gt;</description>
    <pubDate>Tue, 22 Nov 2022 22:47:12 GMT</pubDate>
    <dc:creator>huyd</dc:creator>
    <dc:date>2022-11-22T22:47:12Z</dc:date>
    <item>
      <title>Optimizing a batch load process, reading with the JDBC driver</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-a-batch-load-process-reading-with-the-jdbc-driver/m-p/21086#M14324</link>
      <description>&lt;P&gt;I am running a batch load that reads a database table through the JDBC driver. In the Spark UI, I see both memory and disk spill, but only on one executor. I have also noticed that the JDBC parallel read seems to run slower than leaving the read at its default settings.&lt;/P&gt;&lt;P&gt;Some details:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;I have 4 workers, 8 GB&lt;/LI&gt;&lt;LI&gt;The source table has around 80 million rows&lt;/LI&gt;&lt;LI&gt;I am using "dateloaded" as the partition column.&lt;/LI&gt;&lt;LI&gt;sqlContext.setConf("spark.sql.shuffle.partitions","4") sets the number of shuffle partitions. Is it correct to set the shuffle partition count to the executor count?&lt;/LI&gt;&lt;LI&gt;numPartitions=12. Is it correct that 3-4 tasks per executor is ideal?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;The "dateloaded" column is not a primary key, but it is indexed. Is the spill a result of data skew, or have I set too few or too many partitions for the shuffle or the read?&lt;/P&gt;</description>
      <pubDate>Tue, 22 Nov 2022 22:47:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-a-batch-load-process-reading-with-the-jdbc-driver/m-p/21086#M14324</guid>
      <dc:creator>huyd</dc:creator>
      <dc:date>2022-11-22T22:47:12Z</dc:date>
    </item>
  </channel>
</rss>

