<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic java.lang.OutOfMemoryError on Data Ingestion and Storage Pipeline in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-on-data-ingestion-and-storage/m-p/56410#M30546</link>
    <description>&lt;DIV&gt;&lt;P&gt;I have around 25 GB of data in Azure Storage and am ingesting it with Auto Loader in Databricks. These are the steps I am performing:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Setting &lt;STRONG&gt;enableChangeDataFeed&lt;/STRONG&gt; to true.&lt;/LI&gt;&lt;LI&gt;Reading the complete raw data with &lt;STRONG&gt;readStream&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Writing it as a Delta table to Azure Blob Storage with &lt;STRONG&gt;writeStream&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Reading the change feed of this Delta table with &lt;STRONG&gt;spark.read.format("delta").option("readChangeFeed", "true")...&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Transforming the change feed table with &lt;STRONG&gt;withColumn&lt;/STRONG&gt;, including operations on the &lt;STRONG&gt;content&lt;/STRONG&gt; column, which may be computationally expensive.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;When I try to save the resulting PySpark DataFrame to my catalog, I get &lt;STRONG&gt;java.lang.OutOfMemoryError&lt;/STRONG&gt;. My Databricks cluster has 1 driver (16 GB memory, 4 cores) and up to 10 workers (16 GB memory, 4 cores each).&lt;/P&gt;&lt;P&gt;Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?&lt;/P&gt;&lt;/DIV&gt;</description>
    <pubDate>Thu, 04 Jan 2024 08:17:35 GMT</pubDate>
    <dc:creator>hv129</dc:creator>
    <dc:date>2024-01-04T08:17:35Z</dc:date>
    <item>
      <title>java.lang.OutOfMemoryError on Data Ingestion and Storage Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-on-data-ingestion-and-storage/m-p/56410#M30546</link>
      <description>&lt;DIV&gt;&lt;P&gt;I have around 25 GB of data in Azure Storage and am ingesting it with Auto Loader in Databricks. These are the steps I am performing:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Setting &lt;STRONG&gt;enableChangeDataFeed&lt;/STRONG&gt; to true.&lt;/LI&gt;&lt;LI&gt;Reading the complete raw data with &lt;STRONG&gt;readStream&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Writing it as a Delta table to Azure Blob Storage with &lt;STRONG&gt;writeStream&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Reading the change feed of this Delta table with &lt;STRONG&gt;spark.read.format("delta").option("readChangeFeed", "true")...&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Transforming the change feed table with &lt;STRONG&gt;withColumn&lt;/STRONG&gt;, including operations on the &lt;STRONG&gt;content&lt;/STRONG&gt; column, which may be computationally expensive.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;When I try to save the resulting PySpark DataFrame to my catalog, I get &lt;STRONG&gt;java.lang.OutOfMemoryError&lt;/STRONG&gt;. My Databricks cluster has 1 driver (16 GB memory, 4 cores) and up to 10 workers (16 GB memory, 4 cores each).&lt;/P&gt;&lt;P&gt;Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 04 Jan 2024 08:17:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-on-data-ingestion-and-storage/m-p/56410#M30546</guid>
      <dc:creator>hv129</dc:creator>
      <dc:date>2024-01-04T08:17:35Z</dc:date>
    </item>
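    <!-- Editor's note: the five steps described in the post can be sketched in PySpark roughly as follows. This is a minimal, hypothetical outline, not the poster's actual code: the source path, input format (json), checkpoint location, the content_length transform, and the target table name are all placeholder assumptions.

```python
# Hypothetical sketch of the pipeline described in the post (steps 1 to 5).
# All paths, formats, and table names below are placeholders.

def enable_cdf_by_default(spark):
    # Step 1: make newly created Delta tables record a change data feed
    # by default (Databricks Delta default table property).
    spark.conf.set(
        "spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true"
    )

def start_ingest_stream(spark, source_path, table_path, checkpoint_path):
    # Steps 2 and 3: read the raw files with Auto Loader (cloudFiles) and
    # write them out as a Delta table in Azure Blob Storage.
    return (
        spark.readStream.format("cloudFiles")          # Auto Loader source
        .option("cloudFiles.format", "json")           # assumed input format
        .load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                    # drain the backlog, then stop
        .start(table_path)
    )

def read_change_feed(spark, table_path, starting_version=0):
    # Step 4: batch-read the change data feed of the Delta table.
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .load(table_path)
    )

def transform_and_save(cdf_df, target_table):
    # Step 5: derive columns with withColumn, then save to the catalog.
    from pyspark.sql import functions as F
    out = cdf_df.withColumn("content_length", F.length("content"))
    out.write.mode("overwrite").saveAsTable(target_table)
```

A sketch like this runs the heavy transformation and write entirely on the executors; an OutOfMemoryError on save often points instead at driver-side operations (collect, toPandas, very wide shuffles), which this outline deliberately avoids. -->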
  </channel>
</rss>

