I have around 25 GB of data in my Azure storage. I am performing data ingestion using Auto Loader in Databricks. Below are the steps I am performing:
Setting enableChangeDataFeed to true.
Reading the complete raw data using readStream.
Writing it as a Delta table using writeStream to Azure Blob Storage.
Reading the change data feed of this Delta table using spark.read.format("delta").option("readChangeFeed", "true")...
Performing operations on the change feed DataFrame using withColumn (including operations on the content column, which might be computationally expensive). A sketch of these steps is shown after this list.
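For reference, here is roughly what those steps look like in code. This is a minimal sketch, not my exact pipeline: the storage paths, the JSON source format, and the transformation on the content column are all placeholders.

```python
# Minimal sketch of the pipeline, assuming JSON source files; all paths and the
# transformation on the "content" column are placeholders, not the real code.
# `spark` is the SparkSession provided by the Databricks runtime.
from pyspark.sql import functions as F

raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/input/"           # hypothetical
table_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/my_table/"   # hypothetical
checkpoint_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/_chk/"  # hypothetical

# 1) Enable the change data feed by default for newly created Delta tables
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")

# 2-3) Ingest the raw files with Auto Loader and write them out as a Delta table
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                   # assumed source format
    .option("cloudFiles.schemaLocation", checkpoint_path)  # location for inferred schema
    .load(raw_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                            # process available data, then stop
    .start(table_path))

# 4) Read the table's change data feed as a batch DataFrame
cdf_df = (spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .load(table_path))

# 5) Apply transformations with withColumn, including the heavy work on "content"
transformed_df = cdf_df.withColumn("content_length", F.length("content"))  # placeholder transform
```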
Now I am trying to save this computed PySpark DataFrame to my catalog but am getting the error: java.lang.OutOfMemoryError. My Databricks cluster has 1 driver with 16 GB of memory and 4 cores, and a maximum of 10 workers, each with 16 GB of memory and 4 cores.
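The final save step looks roughly like the following; the catalog, schema, and table names are placeholders.

```python
# Sketch of the write that raises java.lang.OutOfMemoryError;
# "my_catalog.my_schema.computed_table" is a placeholder three-level name.
(transformed_df.write
    .format("delta")
    .mode("overwrite")  # assumed write mode
    .saveAsTable("my_catalog.my_schema.computed_table"))
```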
Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?