Hi there,
I read data from Azure Event Hubs and, after some transformations, write the dataframe back to Event Hubs (I use this connector for that):
# read data
df = (spark.readStream
    .format("eventhubs")
    .options(**ehConf)
    .load()
)
# some data manipulation

# write data
ds = (df
    .select("body", "partitionKey")
    .writeStream
    .format("eventhubs")
    .options(**output_ehConf)
    .option("checkpointLocation", "/checkpoin/eventhub-to-eventhub/savestate.txt")
    .trigger(processingTime='1 seconds')
    .start()
)
With this setup I get high storage costs that far exceed my compute costs (roughly four times higher). The cost comes from the large number of transactions against the storage account:
I tried to reduce the number of transactions by using a processingTime trigger, but it didn't bring any significant improvement (and low latency is critical for me, so I can't raise the interval much).
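To show why the trigger interval dominates the transaction count, here is a rough back-of-the-envelope sketch. The operations-per-batch figure is an assumption for illustration (each micro-batch writes several small files to the checkpoint location: offset log, commit log, etc.), not a measured value:

```python
# Rough estimate of checkpoint-related storage transactions per day.
# OPS_PER_BATCH is an assumed placeholder: each micro-batch performs
# several write/list operations against the checkpoint directory.
OPS_PER_BATCH = 10  # assumption; varies by query and storage driver

def transactions_per_day(trigger_seconds: float, ops_per_batch: int = OPS_PER_BATCH) -> int:
    """Micro-batches per day multiplied by storage operations per batch."""
    batches_per_day = 86_400 / trigger_seconds
    return int(batches_per_day * ops_per_batch)

print(transactions_per_day(1))   # 1 s trigger  -> 864000 transactions/day
print(transactions_per_day(60))  # 60 s trigger -> 14400 transactions/day
```

Under these assumptions, a 1-second trigger generates about 60x more checkpoint transactions than a 60-second one, which is why the trigger interval and storage cost are in direct tension with latency here.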
Question: am I using Structured Streaming correctly here, and if so, how can I reduce the storage costs?
Thank you for your time!