
Spark is not reading Kinesis Data as fast as specified

938452
New Contributor III

Hi Databricks community team,

I have the following code:

"""
df = spark.readStream \
    .format("kinesis") \
    .option("endpointUrl", endpoint_url) \
    .option("streamName", stream_name) \
    .option("initialPosition", "latest") \
    .option("consumerMode", "efo") \
    .option("maxFetchDuration", "500ms") \
    .load()
"""
 
With maxFetchDuration set to 500ms, I expected data to be fetched quickly, but it still seemed to be read in batches spanning multiple seconds. So I added a timestamp to record when processing starts, alongside approximateArrivalTimestamp from Kinesis:
"""
from pyspark.sql import functions as F

# SOME_SCHEMA is the schema of the JSON payload (defined elsewhere).
df = df \
    .selectExpr("approximateArrivalTimestamp", "CAST(data AS STRING) AS data") \
    .withColumn("processed_timestamp", F.current_timestamp()) \
    .select(F.col("approximateArrivalTimestamp"), F.col("processed_timestamp"), F.from_json("data", SOME_SCHEMA).alias("data_fields")) \
    .select("approximateArrivalTimestamp", "processed_timestamp", "data_fields.*")
"""
 
The cluster satisfies the sizing rule # cores in cluster >= 2 * (# Kinesis shards) / shardsPerTask: (8 cores * 4 workers) >= 2 * 64 / 5, i.e. 32 >= 25.6. I'm using the latest Databricks Runtime, 14.0 (Spark 3.5.0). This is the only consumer of the Kinesis stream, so no other consumer is competing for resources, and EFO is enabled.
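The sizing arithmetic above can be double-checked with a few lines of plain Python. This is only a sketch of the rule as quoted in this post; the function name `min_cores_for_kinesis` is made up for illustration and is not part of any Databricks API.

```python
def min_cores_for_kinesis(num_shards: int, shards_per_task: int) -> float:
    """Minimum core count per the rule: cores >= 2 * shards / shardsPerTask."""
    if num_shards <= 0 or shards_per_task <= 0:
        raise ValueError("num_shards and shards_per_task must be positive")
    return 2 * num_shards / shards_per_task


# Values taken from the post: 64 shards, shardsPerTask = 5, 4 workers x 8 cores.
required_cores = min_cores_for_kinesis(num_shards=64, shards_per_task=5)
available_cores = 8 * 4

print(f"required >= {required_cores}, available = {available_cores}")
print("rule satisfied:", available_cores >= required_cores)
```

With these numbers the cluster clears the threshold with some headroom, so core count alone does not obviously explain the delay.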
 
There is consistently a gap of roughly 30 seconds between approximateArrivalTimestamp and processed_timestamp. What can I do to lower the gap?
 
Attaching evidence that Spark processes records in the same chunk even though they arrived in Kinesis a few seconds apart.
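To put a number on the gap rather than eyeballing individual rows, the two timestamps can be reduced to a lag summary. The helper below is a minimal plain-Python sketch; `summarize_lag` is a hypothetical name, and it assumes the (approximateArrivalTimestamp, processed_timestamp) pairs have already been collected from the stream as Python `datetime` objects.

```python
from datetime import datetime
from statistics import mean
from typing import Dict, Iterable, Tuple


def summarize_lag(pairs: Iterable[Tuple[datetime, datetime]]) -> Dict[str, float]:
    """Return min/mean/max lag in seconds for (arrival, processed) pairs."""
    # Lag is how long a record waited between arriving in Kinesis
    # and being picked up for processing.
    lags = [(processed - arrival).total_seconds() for arrival, processed in pairs]
    if not lags:
        raise ValueError("no timestamp pairs given")
    return {"min": min(lags), "mean": mean(lags), "max": max(lags)}
```

If the minimum and maximum lag within one chunk differ by about the spread of the arrival times, that would suggest the records are being grouped into one batch and stamped together, rather than each record being delayed individually.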