Data Engineering

java.lang.OutOfMemoryError on Data Ingestion and Storage Pipeline

hv129
New Contributor
I have around 25 GB of data in my Azure storage, and I am ingesting it with Auto Loader in Databricks. Below are the steps I am performing (a rough sketch of the code follows the list):
  1. Setting enableChangeDataFeed to true.
  2. Reading the complete raw data using readStream.
  3. Writing it as a Delta table to Azure Blob Storage using writeStream.
  4. Reading the change feed of this Delta table using spark.read.format("delta").option("readChangeFeed", "true")...
  5. Performing transformations on the change feed table using withColumn, including operations on the content column, which may be computationally expensive.
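
For reference, here is a rough sketch of the pipeline. The paths, the source file format, and the content transformation are placeholders, not my exact code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Step 1: enable the change data feed by default for newly created Delta tables
spark.conf.set(
    "spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true"
)

# Placeholder paths in Azure storage
raw_path = "abfss://container@account.dfs.core.windows.net/raw/"
delta_path = "abfss://container@account.dfs.core.windows.net/bronze/"
checkpoint_path = "abfss://container@account.dfs.core.windows.net/_checkpoints/bronze/"

# Step 2: read the complete raw data with Auto Loader
raw_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")  # assumed source format
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .load(raw_path)
)

# Step 3: write it out as a Delta table in Azure Blob Storage
(
    raw_df.writeStream.format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)
    .start(delta_path)
)

# Step 4: read the change feed of the Delta table as a batch DataFrame
cdf_df = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .load(delta_path)
)

# Step 5: transformations with withColumn, including the content column
# (placeholder transformation; the real one is heavier)
processed_df = cdf_df.withColumn("content_length", F.length(F.col("content")))
```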

Now, when I try to save this computed PySpark DataFrame to my catalog, I get the error java.lang.OutOfMemoryError. My Databricks cluster has 1 driver with 16 GB of memory and 4 cores, and up to 10 workers, each with 16 GB of memory and 4 cores.
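
The failing step is essentially this (the table name is a placeholder):

```python
# Save the computed DataFrame as a managed table in the catalog;
# this is the write that fails with java.lang.OutOfMemoryError
(
    processed_df.write.format("delta")
    .mode("overwrite")
    .saveAsTable("main.default.processed_changes")  # placeholder catalog.schema.table
)
```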

Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?

