Spark is not reading Kinesis Data as fast as specified

938452 — Fri, 29 Sep 2023 15:53:25 GMT

Hi Databricks community team,

I have the code below:

"""
df = spark.readStream \
    .format("kinesis") \
    .option("endpointUrl", endpoint_url) \
    .option("streamName", stream_name) \
    .option("initialPosition", "latest") \
    .option("consumerMode", "efo") \
    .option("maxFetchDuration", "500ms") \
    .load()
"""

With maxFetchDuration set to 500ms, I expected data to be fetched quickly, but it still seemed to be read in batches several seconds long. So I added a timestamp to record when each record starts being processed, alongside approximateArrivalTimestamp from Kinesis:

"""
from pyspark.sql import functions as F

# Keep the Kinesis arrival time, stamp the processing time, and parse the JSON payload
df = df \
    .selectExpr("approximateArrivalTimestamp", "cast(data as STRING) as data") \
    .withColumn("processed_timestamp", F.current_timestamp()) \
    .select(F.col("approximateArrivalTimestamp"), F.col("processed_timestamp"), F.from_json("data", SOME_SCHEMA).alias("data_fields")) \
    .select("approximateArrivalTimestamp", "processed_timestamp", "data_fields.*")
"""

The cluster satisfies the sizing rule (# cores in cluster) >= 2 * (# Kinesis shards) / shardsPerTask: (8 cores * 4 workers) >= 2 * 64 / 5, i.e. 32 >= 25.6. I'm using the latest Databricks Runtime, 14.0 (Spark 3.5.0). This is the only Kinesis consumer on the stream, so no other consumer is competing for resources, and EFO is enabled.
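For anyone who wants to double-check my arithmetic, here is the sizing check as a small Python sketch. The function name `required_cores` is just something I made up for illustration; the numbers (8 cores per worker, 4 workers, 64 shards, shardsPerTask of 5) are the ones from my setup above.

"""
def required_cores(num_shards: int, shards_per_task: int) -> float:
    # Hypothetical helper: right-hand side of the rule
    # cores >= 2 * (# Kinesis shards) / shardsPerTask
    return 2 * num_shards / shards_per_task


cores_per_worker = 8
num_workers = 4
total_cores = cores_per_worker * num_workers  # 32 cores in the cluster

needed = required_cores(num_shards=64, shards_per_task=5)  # 2 * 64 / 5
enough_cores = total_cores >= needed

print(f"total cores = {total_cores}, required = {needed}, satisfied = {enough_cores}")
"""

So as far as I can tell, core count should not be the bottleneck here.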

There is consistently a gap of roughly 30 seconds between approximateArrivalTimestamp and processed_timestamp. What can I do to reduce it?

I'm attaching evidence that Spark processes records in the same chunk even though they arrive in Kinesis a few seconds apart.
