Spark is not reading Kinesis Data as fast as specified

938452
New Contributor III

Hi Databricks community team,

I have the following code:

"""

df = spark.readStream \
.format("kinesis") \
.option("endpointUrl", endpoint_url) \
.option("streamName", stream_name) \
.option("initialPosition", "latest") \
.option("consumerMode", "efo") \
.option("maxFetchDuration", "500ms") \
.load()
"""
 
With maxFetchDuration set, I expected data to be fetched quickly, but it still looked like batched reads spanning multiple seconds. So I added a timestamp to track when each record starts to get processed, alongside the approximateArrivalTimestamp from Kinesis:
"""
from pyspark.sql import functions as F  # needed for current_timestamp, col, from_json

df = df \
.selectExpr("approximateArrivalTimestamp", "cast(data as STRING) data") \
.withColumn("processed_timestamp", F.current_timestamp()) \
.select(F.col("approximateArrivalTimestamp"), F.col("processed_timestamp"), F.from_json("data", SOME_SCHEMA).alias("data_fields")) \
.select("approximateArrivalTimestamp", "processed_timestamp", "data_fields.*")
"""
 
I do satisfy the sizing guideline # cores in cluster >= 2 * (# Kinesis shards) / shardsPerTask: (8 cores * 4 workers) = 32 >= 2 * 64 / 5 = 25.6. I'm using the latest Databricks Runtime 14.0 (Spark 3.5.0). This is the only consumer of the stream, so no other consumer is competing for throughput, and EFO is enabled.
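For reference, the sizing check above is plain arithmetic; a quick sketch using my cluster's figures (8 cores/worker, 4 workers, 64 shards, shardsPerTask = 5 — adjust for your own setup):

```python
# Sizing guideline quoted above:
#   cores_in_cluster >= 2 * num_shards / shards_per_task
cores_per_worker = 8
num_workers = 4
num_shards = 64
shards_per_task = 5

total_cores = cores_per_worker * num_workers       # 32
required_cores = 2 * num_shards / shards_per_task  # 25.6

assert total_cores >= required_cores, "cluster is undersized for the shard count"
```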
 
There is consistently a gap of roughly 30 seconds between approximateArrivalTimestamp and processed_timestamp. What can I do to lower the gap?
 
Attaching evidence of Spark processing records in the same chunk even though they arrived in Kinesis a few seconds apart.
1 ACCEPTED SOLUTION

Kaniz
Community Manager

Hi @938452, I can suggest a few things that might help you:

1. **Check your network latency:** The latency between your Spark cluster and the Kinesis stream can add to the delay. Ensure your Spark cluster and Kinesis stream are in the same region to minimize network latency.

2. **Adjust the batch interval:** The batch interval of your Spark Streaming job can affect the processing time. If your batch interval is too large, it might cause delays. Try reducing the batch interval to process the data more frequently.

3. **Tune your Spark job:** You can tune your Spark job to process the data faster. This includes adjusting the number of cores, the amount of memory allocated to each executor, and the number of executors.

4. **Check your data processing code:** The code you use to process the data can also affect the processing time. Ensure your code is optimized and contains no operations that can slow down the processing.

5. **Use Kinesis Client Library (KCL):** KCL provides a high-level API for processing data from Kinesis. It also handles complex tasks associated with distributed computing, such as load balancing across multiple instances, responding to instance failures, and coordinating and checkpointing record processing. 
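For point 2, the micro-batch cadence in Structured Streaming is controlled by the query trigger. A minimal sketch, assuming a running SparkSession and using a placeholder sink (the `memory` format and query name here are illustrative, not from the original post):

```python
# Hypothetical sketch: tighten the micro-batch trigger so new Kinesis records
# are picked up more often. Requires an active SparkSession and the streaming
# DataFrame `df` from above; not runnable standalone.
query = (df.writeStream
         .format("memory")                     # placeholder sink for illustration
         .queryName("kinesis_latency_check")
         .trigger(processingTime="1 second")   # smaller interval = more frequent batches
         .start())
```

Note that without an explicit trigger, Spark already starts each micro-batch as soon as the previous one finishes, so an explicit short interval mainly helps if a larger one was configured elsewhere.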

Unfortunately, it's hard to provide a more precise answer without more specific information about your setup and your data.

I recommend contacting Databricks support by filing a support ticket for more tailored assistance.


