<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance issue when using structured streaming in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-issue-when-using-structured-streaming/m-p/117305#M45474</link>
    <description>&lt;P&gt;Checking.&lt;/P&gt;</description>
    <pubDate>Thu, 01 May 2025 06:59:37 GMT</pubDate>
    <dc:creator>NandiniN</dc:creator>
    <dc:date>2025-05-01T06:59:37Z</dc:date>
    <item>
      <title>Performance issue when using structured streaming</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-when-using-structured-streaming/m-p/108735#M43136</link>
      <description>&lt;P&gt;Hi databricks community! Let me first apology for the long post.&lt;BR /&gt;&lt;BR /&gt;I'm implementing a system in databricks to read from a kafka stream into the bronze layer of a delta table. The idea is to do some operations on the data that is coming from kafka, mainly filtering and parsing its content to a delta table in unity catalog. To do that I'm using spark structured streaming.&lt;BR /&gt;&lt;BR /&gt;The problem is that I fell that I'm doing something wrong because the number of message that I can process per second seems to low to me. Let me get into the details.&lt;BR /&gt;&lt;BR /&gt;I have a kafka topic that receives a baseline of 300k messages per minute (~ 6MB ) with peaks up to 10M messages per minute. This topic has 8 partitions.&lt;BR /&gt;&lt;BR /&gt;Then I have a job compute cluster with the following configurations:&lt;BR /&gt;- Databricks runtime 15.4 LTS&lt;BR /&gt;- Worker type Standard_F8 min workers 1 max workers 4&lt;BR /&gt;- Driver type Standard_F8&lt;BR /&gt;&lt;BR /&gt;In the cluster I only run a task which takes the data from the kafka cluster, does some filtering operations, including one from_json operation, and stores the data to a unity table. The structured stream is set to be triggered every minute and has the following configurations:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"spark.sql.streaming.noDataMicroBatches.enabled"&lt;/SPAN&gt;&lt;SPAN&gt;: &lt;/SPAN&gt;&lt;SPAN&gt;"false"&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"spark.sql.shuffle.partitions"&lt;/SPAN&gt;&lt;SPAN&gt;: &lt;/SPAN&gt;&lt;SPAN&gt;"64"&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"spark.databricks.streaming.statefulOperator.asyncCheckpoint.enabled"&lt;/SPAN&gt;&lt;SPAN&gt;: &lt;/SPAN&gt;&lt;SPAN&gt;"true"&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;"spark.sql.streaming.stateStore.providerClass"&lt;/SPAN&gt;&lt;SPAN&gt;: &lt;/SPAN&gt;&lt;SPAN&gt;"com.databricks.sql.streaming.state.RocksDBStateStoreProvider"&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;"spark.sql.streaming.statefulOperator.stateRebalancing.enabled"&lt;/SPAN&gt;&lt;SPAN&gt;: &lt;/SPAN&gt;&lt;SPAN&gt;"true"&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;"spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled"&lt;/SPAN&gt;&lt;SPAN&gt;: &lt;/SPAN&gt;&lt;SPAN&gt;"true"&lt;BR /&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;maxOffsetsPerTrigger":&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;10000000&lt;BR /&gt;&lt;/SPAN&gt;&lt;SPAN&gt;All the other properties are the default values.&lt;BR /&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;I have set a the maxOffsetsPerTrigger in order to prevent out of memory issues in the cluster.&lt;BR /&gt;&lt;BR /&gt;Right now, with those configurations, I can at maximum process about 4M messages per minute. This means that the stream that should run every minute takes more than two minutes to complete. What is strange is that only two nodes of the job compute are active (32GB, 16 cores) with CPU on 10%.&lt;BR /&gt;&lt;BR /&gt;Although this is enough during normal operations I have a lot of unprocessed messages in the back log that I would like to process faster. Does this throughput seems reasonable to you? It feels like I just need to process so little data and even this is not working. Is there anything I can do to improve the performance of this kafka consumer. Thank you for your help.&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Tue, 04 Feb 2025 10:13:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-when-using-structured-streaming/m-p/108735#M43136</guid>
      <dc:creator>pvaz</dc:creator>
      <dc:date>2025-02-04T10:13:36Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue when using structured streaming</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-when-using-structured-streaming/m-p/117305#M45474</link>
      <description>&lt;P&gt;Checking.&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 06:59:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-when-using-structured-streaming/m-p/117305#M45474</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-05-01T06:59:37Z</dc:date>
    </item>
    <item>
      <title>Re: Performance issue when using structured streaming</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-issue-when-using-structured-streaming/m-p/117366#M45484</link>
      <description>&lt;P&gt;Have you tried using&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/connect/streaming/kafka" target="_self"&gt;&lt;SPAN&gt;minPartitions&lt;/SPAN&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Minimum number of partitions to read from Kafka. You can configure Spark to use an arbitrary minimum of partitions to read from Kafka using the&amp;nbsp;&lt;CODE&gt;minPartitions&lt;/CODE&gt;&amp;nbsp;option. Normally Spark has a 1-1 mapping of Kafka topicPartitions to Spark partitions consuming from Kafka. If you set the&amp;nbsp;&lt;CODE&gt;minPartitions&lt;/CODE&gt;&amp;nbsp;option to a value greater than your Kafka topicPartitions, Spark will divvy up large Kafka partitions to smaller pieces. This option can be set at times of peak loads, data skew, and as your stream is falling behind to increase processing rate. It comes at a cost of initializing Kafka consumers at each trigger, which may impact performance if you use SSL when connecting to Kafka.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 May 2025 11:46:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-issue-when-using-structured-streaming/m-p/117366#M45484</guid>
      <dc:creator>NandiniN</dc:creator>
      <dc:date>2025-05-01T11:46:38Z</dc:date>
    </item>
  </channel>
</rss>

