<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Event hub streaming improve processing rate in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12711#M7476</link>
    <description>&lt;P&gt;Hi @Jhonatan Reyes​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;How many Event Hubs partitions are you reading from? Your micro-batch takes a few milliseconds to complete, which I think is a good time, but I would like to understand better what you are trying to improve here.&lt;/P&gt;&lt;P&gt;Also, in this case you are using a memory sink (display); I would highly recommend testing it with another type of sink.&lt;/P&gt;</description>
    <pubDate>Fri, 22 Oct 2021 22:24:24 GMT</pubDate>
    <dc:creator>jose_gonzalez</dc:creator>
    <dc:date>2021-10-22T22:24:24Z</dc:date>
    <item>
      <title>Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12697#M7462</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I'm working with Event Hubs and Databricks to process and enrich data in real-time.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Doing a "simple" test, I'm getting some weird values (input rate vs processing rate) and I think I'm losing data:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2373i20512598A08A4298/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;As you can see, there is a peak with 5k records, but it is never processed in the 5 minutes after.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The script that I'm using is:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import col, concat, from_json, sha2

conf = {}
conf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connectionString_bb_stream)
conf['eventhubs.consumerGroup'] = 'adb_jr_tesst'
conf['maxEventsPerTrigger'] = '350000'
conf['maxRatePerPartition'] = '350000'
conf['setStartingPosition'] = sc._jvm.org.apache.spark.eventhubs.EventPosition.fromEndOfStream

df = (spark.readStream
           .format("eventhubs")
           .options(**conf)
           .load()
     )

json_df = df.withColumn("body", from_json(col("body").cast('string'), jsonSchema))
Final_df = json_df.select("sequenceNumber", "offset", "enqueuedTime", col("body.*"))
Final_df = Final_df.withColumn("Key", sha2(concat(col('EquipmentId'), col('TagId'), col('Timestamp')), 256))
Final_df.display()&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Can you help me understand why I'm "losing" data, or how I can improve the process?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The cluster that I'm using is:&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2382i55F1295C08D783DA/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I think it is a cluster configuration issue, but I'm not sure how to tackle that.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks for the help, guys!&lt;/P&gt;</description>
      <pubDate>Thu, 21 Oct 2021 21:50:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12697#M7462</guid>
      <dc:creator>Jreco</dc:creator>
      <dc:date>2021-10-21T21:50:15Z</dc:date>
    </item>
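The `Key` column in the post above can be recomputed off-cluster to spot-check whether records are actually missing or just arriving late. A minimal plain-Python sketch (not the poster's pipeline): it mirrors `sha2(concat(EquipmentId, TagId, Timestamp), 256)` with `hashlib`, assuming the three fields arrive as strings; the sample records are hypothetical.

```python
import hashlib

def record_key(equipment_id: str, tag_id: str, timestamp: str) -> str:
    """Plain-Python mirror of Spark's sha2(concat(EquipmentId, TagId, Timestamp), 256)."""
    return hashlib.sha256((equipment_id + tag_id + timestamp).encode("utf-8")).hexdigest()

def missing_keys(input_records, processed_records):
    """Keys seen at the source but absent from the sink output."""
    sent = {record_key(*r) for r in input_records}
    seen = {record_key(*r) for r in processed_records}
    return sent - seen

# Hypothetical sample: three events sent, one never shows up in the sink.
sent = [("E1", "T1", "2021-10-21T21:00:00Z"),
        ("E1", "T2", "2021-10-21T21:00:01Z"),
        ("E2", "T1", "2021-10-21T21:00:02Z")]
processed = sent[:2]
print(len(missing_keys(sent, processed)))  # → 1
```

Comparing these keys between the Event Hubs capture and the sink output distinguishes "lost" records from records that were merely processed late.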
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12699#M7464</link>
      <description>Hi Kaniz,&lt;BR /&gt;Thanks for your reply.</description>
      <pubDate>Fri, 22 Oct 2021 04:11:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12699#M7464</guid>
      <dc:creator>Jreco</dc:creator>
      <dc:date>2021-10-22T04:11:44Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12700#M7465</link>
      <description>&lt;P&gt;OK, the only thing I notice is that you have set a termination time, which is not necessary for streaming (if you are doing real-time).&lt;/P&gt;&lt;P&gt;I also notice you do not set a checkpoint location, something you might consider.&lt;/P&gt;&lt;P&gt;You can also try removing the maxEventsPerTrigger and maxRatePerPartition config.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;A snippet from the docs:&lt;/P&gt;&lt;P&gt;Here are the details of the recommended job configuration.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;B&gt;Cluster&lt;/B&gt;: Set this always to use a new cluster and use the latest Spark version (or at least version 2.1). Queries started in Spark 2.1 and above are recoverable after query and Spark version upgrades.&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Alerts&lt;/B&gt;: Set this if you want email notification on failures.&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Schedule&lt;/B&gt;: &lt;I&gt;Do not set a schedule&lt;/I&gt;.&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Timeout&lt;/B&gt;: &lt;I&gt;Do not set a timeout.&lt;/I&gt; Streaming queries run for an indefinitely long time.&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Maximum concurrent runs&lt;/B&gt;: Set to &lt;B&gt;1&lt;/B&gt;. There must be only one instance of each query concurrently active.&lt;/LI&gt;&lt;LI&gt;&lt;B&gt;Retries&lt;/B&gt;: Set to &lt;B&gt;Unlimited&lt;/B&gt;.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/spark/latest/structured-streaming/production.html" alt="https://docs.databricks.com/spark/latest/structured-streaming/production.html" target="_blank"&gt;https://docs.databricks.com/spark/latest/structured-streaming/production.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" alt="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" target="_blank"&gt;https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 06:12:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12700#M7465</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-22T06:12:19Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12701#M7466</link>
      <description>&lt;P&gt;Thanks for the answer, werners.&lt;/P&gt;&lt;P&gt;​&lt;/P&gt;&lt;P&gt;What do you mean when you say 'you have set a termination time'? In which part of the script?&lt;/P&gt;&lt;P&gt;​&lt;/P&gt;&lt;P&gt;I'm not using a checkpoint because I just wanted to see the behavior of the process at the beginning and try to figure out why I'm losing information.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 11:27:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12701#M7466</guid>
      <dc:creator>Jreco</dc:creator>
      <dc:date>2021-10-22T11:27:13Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12702#M7467</link>
      <description>&lt;P&gt;The termination time in the cluster settings&lt;/P&gt;&lt;P&gt;(Terminate after 60 minutes of inactivity)&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 11:28:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12702#M7467</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-22T11:28:23Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12703#M7468</link>
      <description>&lt;P&gt;Ah OK, I have that parameter only for the dev cluster.&lt;/P&gt;&lt;P&gt;​&lt;/P&gt;&lt;P&gt;Is it possible that the issue is related to this?&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 11:33:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12703#M7468</guid>
      <dc:creator>Jreco</dc:creator>
      <dc:date>2021-10-22T11:33:11Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12704#M7469</link>
      <description>&lt;P&gt;Maybe; if your cluster shuts down, your streaming will be interrupted.&lt;/P&gt;&lt;P&gt;But in your case that is probably not the issue, as it seems you are not running a long-running streaming query.&lt;/P&gt;&lt;P&gt;But what makes you think you have missing records?&lt;/P&gt;&lt;P&gt;Did you count the number of records incoming and outgoing?&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 11:36:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12704#M7469</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-22T11:36:45Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12705#M7470</link>
      <description>&lt;P&gt;Yes, I've counted the records for a specific range of time (5 min) and there are about 4k records missing... and that is aligned with the streaming graph of processing vs input rate. So, if I'm not losing data, I'm not processing the records in near real-time.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 11:44:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12705#M7470</guid>
      <dc:creator>Jreco</dc:creator>
      <dc:date>2021-10-22T11:44:01Z</dc:date>
    </item>
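The 5-minute comparison described above can be sketched in plain Python (a hypothetical helper, not part of the thread's pipeline): floor each event's enqueued time to a 5-minute window and report any window where the input count exceeds the output count.

```python
from collections import Counter
from datetime import datetime

WINDOW_SECONDS = 5 * 60  # 5-minute buckets, as in the comparison above

def window_of(ts: str) -> int:
    """Floor an ISO-8601 UTC timestamp to the start of its 5-minute window (epoch secs)."""
    t = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    epoch = int(t.timestamp())
    return epoch - (epoch % WINDOW_SECONDS)

def count_gap(input_ts, output_ts):
    """Per-window shortfall: windows where fewer records came out than went in."""
    made = Counter(window_of(t) for t in input_ts)
    seen = Counter(window_of(t) for t in output_ts)
    return {w: made[w] - seen.get(w, 0) for w in made if made[w] > seen.get(w, 0)}

# Hypothetical sample: three events enqueued in one window, only one reached the sink.
incoming = ["2021-10-22T11:40:01Z", "2021-10-22T11:41:30Z", "2021-10-22T11:44:59Z"]
outgoing = ["2021-10-22T11:40:01Z"]
print(count_gap(incoming, outgoing))  # one window short by 2 events
```

A persistent shortfall in every window suggests records are genuinely lost; a shortfall that drains to zero in later windows suggests the pipeline is merely lagging behind the input rate.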
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12706#M7471</link>
      <description>&lt;P&gt;Hm, odd. You don't use spot instances, do you?&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 12:20:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12706#M7471</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-22T12:20:36Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12707#M7472</link>
      <description>&lt;P&gt;Sorry Werners, I'm not sure what you mean by "spot instances".&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 12:54:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12707#M7472</guid>
      <dc:creator>Jreco</dc:creator>
      <dc:date>2021-10-22T12:54:39Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12708#M7473</link>
      <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2384i0D10A206A79DD131/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;These are so-called 'spot' instances that you can borrow from other customers at a much cheaper price.&lt;/P&gt;&lt;P&gt;But when those customers need them, they will be evicted from your account. In streaming that could be an issue, but I have never tested that.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 13:28:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12708#M7473</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-10-22T13:28:41Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12709#M7474</link>
      <description>&lt;P&gt;Thanks for the explanation.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I don't have that option checked.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 13:50:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12709#M7474</guid>
      <dc:creator>Jreco</dc:creator>
      <dc:date>2021-10-22T13:50:03Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12710#M7475</link>
      <description>&lt;P&gt;Hi @Kaniz Fatma​&amp;nbsp;, sorry for bothering you,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;could you please take a look at this? &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks for your help!&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 18:42:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12710#M7475</guid>
      <dc:creator>Jreco</dc:creator>
      <dc:date>2021-10-22T18:42:30Z</dc:date>
    </item>
    <item>
      <title>Re: Event hub streaming improve processing rate</title>
      <link>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12711#M7476</link>
      <description>&lt;P&gt;Hi @Jhonatan Reyes​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;How many Event Hubs partitions are you reading from? Your micro-batch takes a few milliseconds to complete, which I think is a good time, but I would like to understand better what you are trying to improve here.&lt;/P&gt;&lt;P&gt;Also, in this case you are using a memory sink (display); I would highly recommend testing it with another type of sink.&lt;/P&gt;</description>
      <pubDate>Fri, 22 Oct 2021 22:24:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/event-hub-streaming-improve-processing-rate/m-p/12711#M7476</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-10-22T22:24:24Z</dc:date>
    </item>
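The "another type of sink" suggestion above could look like the following configuration sketch: replace `display()` (a memory sink) with a durable file sink plus a checkpoint. This assumes a live Databricks cluster and the `Final_df` DataFrame from the original post; the format choice and both paths are hypothetical, so it is a sketch rather than a drop-in fix.

```python
# Sketch only: write the stream to a durable sink with a checkpoint, instead of
# display(), so progress survives restarts and throughput is measured at the sink.
# Assumes `Final_df` from the original post; the paths below are hypothetical.
query = (Final_df.writeStream
         .format("delta")                                          # or "parquet"
         .option("checkpointLocation", "/mnt/checkpoints/eventhub_stream")
         .outputMode("append")
         .start("/mnt/delta/eventhub_stream"))
```

With a checkpoint in place, counting rows in the output table per 5-minute window gives a reliable processed-rate figure to compare against the Event Hubs incoming-message metrics.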
  </channel>
</rss>

