04-11-2024 01:53 PM - edited 04-11-2024 02:00 PM
We are doing a first-time implementation of data streaming, reading from partitioned Pulsar topics into a Delta table managed by UC. We are unable to scale the job beyond about 40k msgs/sec; above that rate, the job fails. I'd expect Databricks to be able to ingest and process much more than 40k msgs/sec. I did find this article, but it does not provide any benchmarking details.
Is there any benchmarking information available for Pulsar streaming?
Attached is the compute we are using for the job. Partial exception stack trace below. The timeout occurs only when the rate of message production to Pulsar is above 40-45k msgs/sec, at which point the broker backlog grows rapidly.
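For readers following along, the kind of job described in this post (Pulsar topics streamed into a Unity Catalog Delta table) might look roughly like the sketch below. It is not the poster's actual code. The function names, the broker URL, the topic names, the table name, and the checkpoint path are all hypothetical, and the `service.url` and `topics` option names are assumed from the Pulsar Spark connector documentation.

```python
def pulsar_source_options(service_url, topics):
    """Build reader options for the Pulsar source.

    Assumption: the connector accepts `service.url` and a comma-separated
    `topics` option, as described in the Pulsar Spark connector docs.
    """
    return {"service.url": service_url, "topics": ",".join(topics)}


def start_pulsar_to_delta(spark, service_url, topics, target_table, checkpoint_path):
    """Start a streaming query from Pulsar into a Delta table.

    `spark` is an existing SparkSession (for example, the one a Databricks
    notebook provides). All argument values are supplied by the caller.
    """
    stream = (
        spark.readStream.format("pulsar")
        .options(**pulsar_source_options(service_url, topics))
        .load()
    )
    return (
        stream.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )


# Hypothetical usage inside a notebook:
# query = start_pulsar_to_delta(
#     spark,
#     "pulsar://broker.example.com:6650",
#     ["persistent://tenant/ns/events"],
#     "main.ingest.events",
#     "/Volumes/main/ingest/checkpoints/events",
# )
```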
04-11-2024 02:14 PM
04-12-2024 10:21 AM
@shan_chandra any suggestions?
04-12-2024 10:28 AM
@surband - can you please add this library to the cluster, try again, and let us know the result: io.streamnative.connectors:pulsar-spark-connector_2.12:3.4.0.3
Reference: https://github.com/streamnative/pulsar-spark
04-12-2024 10:35 AM
But that jar already seems to be available on the cluster classpath; I checked in the Spark UI --> Environment --> Classpath Entries. Do you still suggest I go ahead and install it? @shan_chandra
04-12-2024 10:42 AM
Yes. Install it from the Maven library and see if it works. Per the open source Pulsar Spark connector documentation, writing to a Pulsar sink is supported. However, within DBR only reading from a Pulsar source is supported as of now (as the feature is in public preview).
04-12-2024 11:00 AM
@shan_chandra I was able to get the jar from https://mvnrepository.com/artifact/io.streamnative.connectors/pulsar-spark-connector_2.12/3.4.0.3
But I am unable to install it, as the wizard does not allow me to select the file (see attached). Is there an alternative way of installing it? Please let me know. Thank you!
04-12-2024 12:45 PM
@surband - can you please use the Maven option and install the connector? (I was able to attach it successfully in my local environment.)
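As an alternative to the UI wizard, a Maven library can also be attached through the Databricks Libraries REST API (`POST /api/2.0/libraries/install`). The sketch below is only an illustration of that route: the helper names, the workspace host, the token, and the cluster ID are hypothetical placeholders, and only the Maven coordinate comes from this thread.

```python
import json
import urllib.request

# Maven coordinate quoted earlier in this thread.
CONNECTOR = "io.streamnative.connectors:pulsar-spark-connector_2.12:3.4.0.3"


def maven_install_payload(cluster_id, coordinates=CONNECTOR):
    """Build the request body for the Libraries API install call."""
    return {
        "cluster_id": cluster_id,
        "libraries": [{"maven": {"coordinates": coordinates}}],
    }


def install_maven_library(host, token, cluster_id, coordinates=CONNECTOR):
    """Send the install request to the workspace.

    `host` (e.g. "https://<workspace-url>"), `token`, and `cluster_id`
    are placeholders that the caller must supply.
    """
    request = urllib.request.Request(
        url=f"{host}/api/2.0/libraries/install",
        data=json.dumps(maven_install_payload(cluster_id, coordinates)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

The library installs asynchronously, so the cluster's Libraries tab is still the place to confirm that the connector reached the installed state.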
04-12-2024 02:51 PM
@shan_chandra I was able to install the connector and rerun the tests. I now see an average processing rate of 70k msgs/sec, and the job no longer crashes. My goal is to see if I can achieve 500k msgs/sec. I will continue testing next week.
Will the Databricks Runtime update its package to include the library we needed to install manually? If so, what is the ETA?
Thanks so much for your help.
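For anyone reproducing this kind of measurement, one way to get an average processing rate is to read `processedRowsPerSecond` from the streaming query's progress events. This is a minimal sketch and not necessarily how the figure above was obtained; `average_processed_rate` is a hypothetical helper, and it assumes the events are dict-like, as PySpark's `StreamingQuery.recentProgress` returns them.

```python
def average_processed_rate(progress_events):
    """Average `processedRowsPerSecond` over a list of progress events.

    Events without the field (or with a null value) are skipped.
    Returns 0.0 when no usable events are present.
    """
    rates = [
        event["processedRowsPerSecond"]
        for event in progress_events
        if event.get("processedRowsPerSecond") is not None
    ]
    return sum(rates) / len(rates) if rates else 0.0


# Hypothetical usage with a running query:
# print(average_processed_rate(query.recentProgress))
```

Note that `recentProgress` only keeps a limited window of recent micro-batches, so this reflects recent throughput rather than the whole run.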
05-14-2024 07:56 AM
@shan_chandra Is it Databricks' official recommendation that customers manually install the following library to achieve higher throughput?
https://mvnrepository.com/artifact/io.streamnative.connectors/pulsar-spark-connector_2.12/3.4.0.3