The job (written in PySpark) uses Azure Event Hubs as the source and a Databricks Delta table as the sink. The job is hosted on Azure Databricks.
The transformation part is simple: the message body is converted from bytes to a JSON string, and that string is added as a new column, something like:
from pyspark.sql.functions import col

df = spark.readStream.format("eventhubs").load()
df = df.withColumn("stringBody", col("Body").cast("string"))
query = df.writeStream.format("delta").outputMode("append").option("checkpointLocation", "someLocation").toTable("somedeltatable")
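For completeness, the full job looks roughly like the sketch below. The connection details are simplified placeholders (secret scope/key names, consumer group), and depending on the connector version the connection string may need to be passed through EventHubsUtils.encrypt, so treat this as an illustration rather than the exact config I run:

from pyspark.sql.functions import col

# Connection config (names are placeholders); newer versions of the
# azure-event-hubs-spark connector expect the connection string to be encrypted
connection_string = dbutils.secrets.get(scope="my-scope", key="eventhub-connection-string")
eh_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(connection_string),
    "eventhubs.consumerGroup": "$Default",
}

df = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Cast the raw bytes of the message body to a string column
df = df.withColumn("stringBody", col("Body").cast("string"))

query = (
    df.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "someLocation")
    .toTable("somedeltatable")
)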
The job ran fine for the first few days; looking at the Event Hubs metrics, inbound and outbound traffic were roughly equal. Then the outbound traffic appears to be capped, something like in the figure:
At first I thought the Spark cluster might be stressed, but when I checked the metrics, both CPU and memory usage were well under 50%, so the cluster is not overloaded at all.
I then thought Event Hubs might be capping the outbound traffic for some reason, so I increased the throughput units of the Event Hubs namespace, but it didn't make any difference.
My thinking is: if the job ran fine for the first few days, there shouldn't be any major issue code-wise, and if the cluster isn't stressed, it's not a scaling issue either.
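One thing I can do from the Spark side is compare the Event Hubs outbound metric with what the streaming query itself reports. A rough sketch of what I have in mind, using the standard Structured Streaming progress fields (query is the handle returned by writeStream above):

# Inspect the streaming query's own throughput numbers for recent micro-batches.
# If inputRowsPerSecond is flat at some fixed value, the source is only handing
# the query that many rows per trigger; if processedRowsPerSecond lags behind,
# the micro-batches themselves are slow.
for progress in query.recentProgress[-10:]:
    print(
        progress["timestamp"],
        "inputRows:", progress["numInputRows"],
        "input/s:", progress.get("inputRowsPerSecond"),
        "processed/s:", progress.get("processedRowsPerSecond"),
        "batchDuration(ms):", progress["durationMs"].get("triggerExecution"),
    )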
How should I approach this problem? Any pointer or direction would be appreciated.