Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Driver: Out of Memory

tramtran
Contributor

Hi everyone,

I have a streaming job with 29 notebooks that runs continuously. Initially, I allocated 28 GB of memory to the driver, but the job failed with a "Driver Out of Memory" error after 4 hours of execution.

To address this, I increased the driver's memory to 56 GB, but the job still failed after running for 60 hours.

Here's the code I’m using to ingest data from Kafka:

(
    raw_kafka_events
    .writeStream
    .format('delta')
    .option('key.converter', 'org.apache.kafka.connect.json.ByteArrayConverter')
    .option('value.converter', 'org.apache.kafka.connect.json.ByteArrayConverter')
    .option('checkpointLocation', checkpoint_location)
    .outputMode('append')
    .toTable(table_path)
)
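
For reference, raw_kafka_events is not defined in the snippet above; a minimal sketch of how a Kafka source like this is typically set up (the broker address and topic name below are placeholders, not values from the actual job):

# Sketch only: hypothetical definition of raw_kafka_events.
# 'broker:9092' and 'customer_cdc' are placeholder values.
raw_kafka_events = (
    spark
    .readStream
    .format('kafka')
    .option('kafka.bootstrap.servers', 'broker:9092')
    .option('subscribe', 'customer_cdc')
    .option('startingOffsets', 'earliest')
    .load()
)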

And here's the code for upserting data from Kafka to the target table:

(
    spark
    .readStream
    .format("delta")
    .table(source_table)
    .withColumn("key", F.from_json(F.col("key"), key_schema))
    .withColumn("value", F.from_json(F.col("value"), value_schema))
    .withColumn("landing_time", F.col("_ingested_time"))
    .writeStream
    .foreachBatch(merge_stream)
    .option("checkpointLocation", checkpoint_location)
    .start()
)

 

Despite these configurations, the driver continues to run out of memory. Any suggestions on how to further address this issue?


4 REPLIES

xorbix_rshiva
Contributor

Does your merge_stream function contain any stateful operations, such as aggregation or deduplication logic? If so, your streaming job may be accumulating state in memory over time, which will eventually result in an OOM error. If this is the case, you will see memory usage increase steadily until it reaches the amount allocated to your cluster (you can view memory utilization in the "Metrics" tab for your cluster).
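
One way to confirm this is to watch the state-store metrics that Structured Streaming reports in the query progress. A minimal sketch, assuming query is the StreamingQuery handle returned by .start() (the variable name is just an example):

# Sketch: inspect state-store metrics from the most recent micro-batch.
# `query` is assumed to be the StreamingQuery returned by .start().
progress = query.lastProgress
if progress:
    for op in progress.get("stateOperators", []):
        # numRowsTotal / memoryUsedBytes growing without bound points to unbounded state
        print(op.get("numRowsTotal"), op.get("memoryUsedBytes"))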

To resolve the issue, you could try adding a watermark to your upsert streaming query:

(
    spark
    .readStream
    .format("delta")
    .table(source_table)
    .withColumn("key", F.from_json(F.col("key"), key_schema))
    .withColumn("value", F.from_json(F.col("value"), value_schema))
    .withColumn("landing_time", F.col("_ingested_time"))
    .withWatermark("event_time", "1 hour")
    .writeStream
    .foreachBatch(merge_stream)
    .option("checkpointLocation", checkpoint_location)
    .start()
)

Applying a watermark bounds the amount of state (and therefore memory) used by your streaming job, and it can also improve performance by keeping the stream state at a manageable size. If you apply a watermark, set the lateness threshold according to your requirements. For example, if your stream contains deduplication logic and you expect duplicate records to arrive no more than 10 minutes apart, you could set the watermark delay to 10 minutes.
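
For illustration, here is a minimal sketch of deduplication whose state is bounded by a watermark; the event_time column and the 10-minute delay are example values, not taken from your pipeline:

# Illustrative only: bounded-state deduplication using a watermark.
# "event_time" and the 10-minute delay are example values.
deduped = (
    spark
    .readStream
    .format("delta")
    .table(source_table)
    .withWatermark("event_time", "10 minutes")
    .dropDuplicates(["key", "event_time"])  # rows older than the watermark are evicted from state
)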

Check out this documentation for more info on watermarks: https://docs.databricks.com/en/structured-streaming/watermarks.html

tramtran
Contributor

Thanks @xorbix_rshiva for your reply.

There is deduplication logic in my code. Could I use _source_cdc_time for the watermark in this case?

def merge_stream(microBatchDF, i):
    microBatchDF.createOrReplaceTempView("vw_delta")
    microBatchDF._jdf.sparkSession().sql("""
        MERGE INTO customer as tg
        USING (
            SELECT * FROM (
                SELECT
                    key.customerid,
                    key.customerkey,
                    value.after.customername,
                    -- metadata
                    value.op,
                    current_timestamp() as _ingested_time,
                    to_timestamp(value.source.ts_ms/1000) as _source_cdc_time,
                    to_timestamp(value.ts_ms/1000) as _kafka_created_time,
                    landing_time as _landing_time,
                    row_number() over(PARTITION BY key.customerid, key.customerkey ORDER BY value.source.ts_ms desc) as rank
                FROM vw_delta
                WHERE value.op IS NOT NULL
            ) as t
            WHERE rank = 1
        ) as src
        ON tg.customerid = src.customerid
        AND tg.customerkey = src.customerkey
        WHEN MATCHED AND src.op = 'd' THEN DELETE
        WHEN MATCHED AND src.op != 'd' THEN UPDATE SET *
        WHEN NOT MATCHED AND src.op != 'd' THEN INSERT *
    """)

xorbix_rshiva
Contributor

It looks like _source_cdc_time is the timestamp of when the CDC transaction occurred in your source system. That would be a good choice of timestamp column for your watermark, since you would be deduplicating records according to the time the transactions actually occurred, not the time they were ingested and processed in Databricks.
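
For example, here is a rough sketch of how the upsert stream could materialize _source_cdc_time before the writeStream so the watermark can reference it; the expression mirrors the one in the MERGE above, and the 10-minute delay is just an example:

# Sketch: derive _source_cdc_time from the parsed value and use it for the watermark.
# The column path mirrors the MERGE statement above; adjust to the actual schema if needed.
(
    spark
    .readStream
    .format("delta")
    .table(source_table)
    .withColumn("key", F.from_json(F.col("key"), key_schema))
    .withColumn("value", F.from_json(F.col("value"), value_schema))
    .withColumn("landing_time", F.col("_ingested_time"))
    .withColumn("_source_cdc_time", F.to_timestamp(F.col("value.source.ts_ms") / 1000))
    .withWatermark("_source_cdc_time", "10 minutes")
    .writeStream
    .foreachBatch(merge_stream)
    .option("checkpointLocation", checkpoint_location)
    .start()
)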

Kaniz_Fatma
Community Manager

Hi @tramtran, Thank you for reaching out to our community! We're here to help you.

To ensure we provide you with the best support, could you please take a moment to review the responses and choose the one that best answers your question? Your feedback not only helps us assist you better but also benefits other community members who may have similar questions in the future.

If you found the answer helpful, consider giving it a kudo. If the response fully addresses your question, please mark it as the accepted solution. This will help us close the thread and ensure your question is resolved.

We appreciate your participation and are here to assist you further if you need it!
