<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Slow performance loading checkpoint file? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/slow-performance-loading-checkpoint-file/m-p/13655#M8289</link>
    <description>&lt;P&gt;Also related to the above, does each microbatch always have to reload&amp;nbsp;&lt;I&gt;and&lt;/I&gt;&amp;nbsp;recompute the state? Is the last checkpoint file not cached/persisted between micro batches?&lt;/P&gt;</description>
    <pubDate>Tue, 12 Oct 2021 20:35:45 GMT</pubDate>
    <dc:creator>Matt_L</dc:creator>
    <dc:date>2021-10-12T20:35:45Z</dc:date>
    <item>
      <title>Slow performance loading checkpoint file?</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-performance-loading-checkpoint-file/m-p/13654#M8288</link>
      <description>&lt;P&gt;Using OSS Delta, hopefully this is the right forum for this question:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hey all, I could use some help as I feel like I’m doing something wrong here.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I’m streaming from Kafka -&amp;gt; Delta on EMR/S3FS, and am seeing increasingly slow batches. When looking at the stages, it looks like reading in the last delta-snapshot file is taking upwards of 15 seconds for only a 30 MB file, which pushes my batch times into the 20+ second range.&lt;/P&gt;&lt;P&gt;It is also constantly writing the results of that stage to shuffle. All this work seems to be picked up by only 1 executor as well, which I find interesting. Is this a known limitation of Delta, or is there some config I can tune to reduce the impact or parallelize reading the log file? Or is there something obvious I'm missing about this?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let me know if there’s more info I can provide. I’m relatively new to Delta so I’m hoping I’m just missing something obvious. Spark config as follows:&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;&lt;P&gt;SparkConf().setAppName(NAME) \&lt;/P&gt;&lt;P&gt;                    .set('spark.scheduler.mode', 'FAIR') \&lt;/P&gt;&lt;P&gt;                    .set('spark.executor.cores', exec_cores) \&lt;/P&gt;&lt;P&gt;                    .set('spark.dynamicAllocation.enabled', 'true') \&lt;/P&gt;&lt;P&gt;                    .set('spark.sql.files.maxPartitionBytes', '1073741824') \&lt;/P&gt;&lt;P&gt;                    .set('spark.dynamicAllocation.minExecutors', '3') \&lt;/P&gt;&lt;P&gt;                    .set('spark.driver.maxResultSize', 0) \&lt;/P&gt;&lt;P&gt;                    .set('spark.executor.heartbeatInterval', '25000') \&lt;/P&gt;&lt;P&gt;                    .set('spark.databricks.delta.vacuum.parallelDelete.enabled', 'true') \&lt;/P&gt;&lt;P&gt;                    .set('spark.databricks.delta.retentionDurationCheck.enabled', 'false') \&lt;/P&gt;&lt;P&gt;                    .set('spark.databricks.delta.checkpoint.partSize', '1000000') \&lt;/P&gt;&lt;P&gt;                    .set('spark.databricks.delta.snapshotPartitions', '150')&lt;/P&gt;&lt;P&gt;```&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 17:19:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-performance-loading-checkpoint-file/m-p/13654#M8288</guid>
      <dc:creator>Matt_L</dc:creator>
      <dc:date>2021-10-12T17:19:36Z</dc:date>
    </item>
    <item>
      <title>Re: Slow performance loading checkpoint file?</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-performance-loading-checkpoint-file/m-p/13655#M8289</link>
      <description>&lt;P&gt;Also related to the above, does each microbatch always have to reload&amp;nbsp;&lt;I&gt;and&lt;/I&gt;&amp;nbsp;recompute the state? Is the last checkpoint file not cached/persisted between micro batches?&lt;/P&gt;</description>
      <pubDate>Tue, 12 Oct 2021 20:35:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-performance-loading-checkpoint-file/m-p/13655#M8289</guid>
      <dc:creator>Matt_L</dc:creator>
      <dc:date>2021-10-12T20:35:45Z</dc:date>
    </item>
    <item>
      <title>Re: Slow performance loading checkpoint file?</title>
      <link>https://community.databricks.com/t5/data-engineering/slow-performance-loading-checkpoint-file/m-p/13656#M8290</link>
      <description>&lt;P&gt;Found the answer through the Slack user group, courtesy of Adam Binford.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I had set `delta.logRetentionDuration='24 HOURS'` but did not set `delta.deletedFileRetentionDuration`, so the checkpoint file still held all the tombstones accumulated since the table was created.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Since I was running a compactor every 15 minutes, the table itself (and thus the checkpoint file) never consisted of too many files. However, the tombstones from every microbatch streamed in still existed, which allowed the checkpoint file to balloon in size. After setting `delta.deletedFileRetentionDuration` to a shorter interval, my batch time decreased from 20+ seconds down to about 5.&lt;/P&gt;</description>
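      <content:encoded><![CDATA[
A minimal sketch of why the tombstones pile up, and of one way the two retention properties could be set together. The batch interval, files per batch, table age, table path and the one-hour retention value are all illustrative assumptions; the post does not state them. The model is deliberately rough: it assumes every streamed file is later compacted away and that its tombstone stays in the checkpoint until it is older than `delta.deletedFileRetentionDuration`.

```python
# Rough model only: all numbers below are illustrative assumptions, not measurements.


def retained_tombstones(batch_interval_s, files_per_batch, table_age_h, retention_h):
    """Estimate how many remove actions (tombstones) a checkpoint still carries.

    Assumes every file written by a microbatch is later compacted away, and
    that its tombstone is kept until it is older than the retention window.
    """
    batches_per_hour = 3600 // batch_interval_s
    # Tombstones older than the retention window (or the table itself) do not count.
    window_h = min(table_age_h, retention_h)
    return batches_per_hour * files_per_batch * window_h


def retention_ddl(table_path, log_retention, deleted_file_retention):
    """Build an ALTER TABLE statement that sets both retention properties."""
    return (
        f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ("
        f"'delta.logRetentionDuration' = '{log_retention}', "
        f"'delta.deletedFileRetentionDuration' = '{deleted_file_retention}')"
    )


if __name__ == "__main__":
    week_h = 7 * 24  # Delta's default deletedFileRetentionDuration is one week
    month_h = 30 * 24  # hypothetical table age

    # One file every 20 seconds (assumed), default retention vs. one hour.
    print(retained_tombstones(20, 1, table_age_h=month_h, retention_h=week_h))
    print(retained_tombstones(20, 1, table_age_h=month_h, retention_h=1))

    # Hypothetical path and values; the string would be passed to spark.sql(...).
    print(retention_ddl("s3://my-bucket/my-table", "interval 24 hours", "interval 1 hour"))
```

Under these assumptions the checkpoint carries 30240 tombstones with the default one-week retention and 180 with a one-hour retention, which is the kind of gap that would explain the drop in batch time. A shorter `delta.deletedFileRetentionDuration` also shortens the window in which removed files are protected from `VACUUM`, so the value is a trade-off and not a free win.
]]></content:encoded>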
      <pubDate>Wed, 13 Oct 2021 16:22:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/slow-performance-loading-checkpoint-file/m-p/13656#M8290</guid>
      <dc:creator>Matt_L</dc:creator>
      <dc:date>2021-10-13T16:22:28Z</dc:date>
    </item>
  </channel>
</rss>

