<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Trigger.AvailableNow does not support maxOffsetsPerTrigger in Databricks runtime 10.3 in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24568#M17098</link>
    <description>&lt;P&gt;@Karli Watsica, thanks for the help. This issue has been fixed in Databricks Runtime 10.4 and Spark 3.3.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/SPARK-36649" alt="https://issues.apache.org/jira/browse/SPARK-36649" target="_blank"&gt;[SPARK-36649]&lt;/A&gt;&amp;nbsp;[SQL] Support&amp;nbsp;Trigger.AvailableNow&amp;nbsp;on Kafka data source&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/release-notes/runtime/10.4.html" target="test_blank"&gt;https://docs.databricks.com/release-notes/runtime/10.4.html&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 29 Mar 2022 03:57:11 GMT</pubDate>
    <dc:creator>SimonY</dc:creator>
    <dc:date>2022-03-29T03:57:11Z</dc:date>
    <item>
      <title>Trigger.AvailableNow does not support maxOffsetsPerTrigger in Databricks runtime 10.3</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24566#M17096</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I ran a Spark Structured Streaming job that ingests data from Kafka in order to test Trigger.AvailableNow.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What environment does the job run in?&lt;/P&gt;&lt;P&gt;1: Databricks Runtime 10.3&lt;/P&gt;&lt;P&gt;2: Azure cloud&lt;/P&gt;&lt;P&gt;3: 1 driver node + 3 worker nodes (14 GB, 4 cores)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;import org.apache.spark.sql.streaming.Trigger&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;// Upper bound on the number of Kafka offsets read per micro-batch&lt;/P&gt;&lt;P&gt;val maxOffsetsPerTrigger = "500"&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;spark.conf.set("spark.databricks.delta.autoCompact.enabled", "auto")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;...&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;// Kafka source: start from the earliest offsets, rate-limited per micro-batch&lt;/P&gt;&lt;P&gt;val rdf = spark&lt;/P&gt;&lt;P&gt;&amp;nbsp;.readStream&lt;/P&gt;&lt;P&gt;&amp;nbsp;.format("kafka")&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("kafka.security.protocol", "SASL_PLAINTEXT")&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("kafka.sasl.mechanism", "SCRAM-SHA-512")&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("kafka.sasl.jaas.config", "&amp;lt;&amp;gt;")&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("kafka.bootstrap.servers", servers)&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("subscribe", topic)&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("startingOffsets", "earliest")&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)&lt;/P&gt;&lt;P&gt;&amp;nbsp;.load()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;// Delta sink: process everything currently available, then stop&lt;/P&gt;&lt;P&gt;rdf.writeStream&lt;/P&gt;&lt;P&gt;&amp;nbsp;.format("delta")&lt;/P&gt;&lt;P&gt;&amp;nbsp;.outputMode("append")&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("mergeSchema", "true")&lt;/P&gt;&lt;P&gt;&amp;nbsp;.option("checkpointLocation", ckpPath)&lt;/P&gt;&lt;P&gt;&amp;nbsp;.trigger(Trigger.AvailableNow)&lt;/P&gt;&lt;P&gt;&amp;nbsp;.start(tabPath)&lt;/P&gt;&lt;P&gt;&amp;nbsp;.awaitTermination()&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I expected to see:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;1: The streaming job reads all available data from Kafka and then quits.&lt;/P&gt;&lt;P&gt;2: The streaming job applies maxOffsetsPerTrigger to each micro-batch.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What I see:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The Kafka topic has four partitions. The job took 5 hours and generated 4 huge data files:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;part-00000-89afacf1-f2e6-4904-b313-080d48034859-c000.snappy.parquet | 3/25/2022, 9:50:48 PM | Hot (Inferred) | Block blob | 14.39 GiB | Available&lt;/P&gt;&lt;P&gt;part-00001-cf932ee2-8535-4dd6-9dab-e94b9292a438-c000.snappy.parquet | 3/25/2022, 6:15:36 PM | Hot (Inferred) | Block blob | 14.38 GiB | Available&lt;/P&gt;&lt;P&gt;part-00002-7d481793-10dc-4739-8c20-972cb6f18fd6-c000.snappy.parquet | 3/25/2022, 6:15:22 PM | Hot (Inferred) | Block blob | 14.41 GiB | Available&lt;/P&gt;&lt;P&gt;part-00003-17c88f26-f152-4b27-80cf-5ae372662950-c000.snappy.parquet | 3/25/2022, 9:48:14 PM | Hot (Inferred) | Block blob | 14.43 GiB | Available&lt;/P&gt;</description>
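      <!--
Editor's sketch for expectation 2 in the post above. It assumes that maxOffsetsPerTrigger caps the total number of offsets read per micro-batch (shared across the topic's partitions), so draining a backlog should take roughly backlog / limit micro-batches rather than a single large one. The object name, the function name, and the backlog size are hypothetical; the post does not state how many records the topic held.

```scala
// Sketch: number of micro-batches to expect when maxOffsetsPerTrigger is honored.
// The backlog size used in main is a made-up value, not taken from the post.
object BatchEstimate {

  /** Ceiling division: batches needed to drain `totalOffsets` at `maxOffsetsPerTrigger` per batch. */
  def expectedBatches(totalOffsets: Long, maxOffsetsPerTrigger: Long): Long = {
    require(totalOffsets >= 0, "totalOffsets must be non-negative")
    require(maxOffsetsPerTrigger > 0, "maxOffsetsPerTrigger must be positive")
    (totalOffsets + maxOffsetsPerTrigger - 1) / maxOffsetsPerTrigger
  }

  def main(args: Array[String]): Unit = {
    val hypotheticalBacklog = 2000000L
    println(expectedBatches(hypotheticalBacklog, 500L)) // prints 4000
  }
}
```

Four files of about 14 GiB each, one per Kafka partition, would be consistent with the whole backlog having been read in a single micro-batch, which is the opposite of what this estimate predicts for a limit of 500.
      -->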
      <pubDate>Sat, 26 Mar 2022 05:41:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24566#M17096</guid>
      <dc:creator>SimonY</dc:creator>
      <dc:date>2022-03-26T05:41:37Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger.AvailableNow does not support maxOffsetsPerTrigger in Databricks runtime 10.3</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24567#M17097</link>
      <description>&lt;P&gt;We’re constantly working to improve our features based on feedback like this, so I’ll be sure to share your request with the API product team.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Mar 2022 05:19:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24567#M17097</guid>
      <dc:creator>Eulaliasw</dc:creator>
      <dc:date>2022-03-28T05:19:47Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger.AvailableNow does not support maxOffsetsPerTrigger in Databricks runtime 10.3</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24568#M17098</link>
      <description>&lt;P&gt;@Karli Watsica, thanks for the help. This issue has been fixed in Databricks Runtime 10.4 and Spark 3.3.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://issues.apache.org/jira/browse/SPARK-36649" alt="https://issues.apache.org/jira/browse/SPARK-36649" target="_blank"&gt;[SPARK-36649]&lt;/A&gt;&amp;nbsp;[SQL] Support&amp;nbsp;Trigger.AvailableNow&amp;nbsp;on Kafka data source&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/release-notes/runtime/10.4.html" target="test_blank"&gt;https://docs.databricks.com/release-notes/runtime/10.4.html&lt;/A&gt;&lt;/P&gt;</description>
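      <!--
Editor's sketch of one way to confirm, after moving to a runtime that includes the fix, that every micro-batch stays within the configured limit. It works on plain (batchId, numInputRows) pairs so that it runs without Spark. In a live session those pairs could be taken from query.recentProgress.map(p => (p.batchId, p.numInputRows)), assuming the handle returned by start(tabPath) is kept in a variable named query. The object name, the function name, and the sample numbers are hypothetical, and the check assumes one Kafka offset corresponds to one input row.

```scala
// Sketch: find micro-batches that read more rows than the configured limit.
// `limit` is assumed to equal the maxOffsetsPerTrigger value given to the Kafka source.
object ProgressCheck {

  /** Returns the (batchId, numInputRows) pairs whose row count exceeds `limit`. */
  def batchesOverLimit(progress: Seq[(Long, Long)], limit: Long): Seq[(Long, Long)] =
    progress.filter { case (_, rows) => rows > limit }

  def main(args: Array[String]): Unit = {
    // Made-up progress samples: batch 2 exceeds a limit of 500.
    val sample = Seq((0L, 500L), (1L, 500L), (2L, 1200L))
    println(batchesOverLimit(sample, 500L)) // prints List((2,1200))
  }
}
```

An empty result means no retained batch exceeded the limit. Note that recentProgress only keeps a bounded number of recent progress updates, so a long run may need a StreamingQueryListener to see every batch.
      -->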
      <pubDate>Tue, 29 Mar 2022 03:57:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24568#M17098</guid>
      <dc:creator>SimonY</dc:creator>
      <dc:date>2022-03-29T03:57:11Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger.AvailableNow does not support maxOffsetsPerTrigger in Databricks runtime 10.3</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24569#M17099</link>
      <description>&lt;P&gt;You'd be better off with 1 node with 12 cores than 3 nodes with 4 cores each. Your shuffles are going to be much better on 1 machine.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Mar 2022 12:01:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-availablenow-does-not-support-maxoffsetspertrigger-in/m-p/24569#M17099</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2022-03-29T12:01:43Z</dc:date>
    </item>
  </channel>
</rss>

