java.lang.OutOfMemoryError on Data Ingestion and Storage Pipeline
01-04-2024 12:17 AM
I have around 25 GB of data in my Azure storage account. I am performing data ingestion using Auto Loader in Databricks. Below are the steps I am performing (a rough code sketch follows the list):
- Setting enableChangeDataFeed to true.
- Reading the complete raw data using readStream.
- Writing it as a Delta table to Azure Blob Storage using writeStream.
- Reading the change feed of this Delta table using spark.read.format("delta").option("readChangeFeed", "true")...
- Performing operations on the change feed table using withColumn (including operations on the content column, which might require a lot of computation).
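For context, here is a minimal sketch of what the pipeline above looks like on my side. The paths, the JSON source format, the schema/checkpoint locations, and the content_length transformation are placeholders for illustration, not my exact code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

raw_path = "abfss://<container>@<account>.dfs.core.windows.net/raw"        # placeholder
bronze_path = "abfss://<container>@<account>.dfs.core.windows.net/bronze"  # placeholder

# Step 1: enable Change Data Feed for newly created Delta tables
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")

# Steps 2-3: ingest the raw files with Auto Loader and write them out as a Delta table
ingest = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                           # assumed source format
    .option("cloudFiles.schemaLocation", bronze_path + "/_schema")
    .load(raw_path)
    .writeStream.format("delta")
    .option("checkpointLocation", bronze_path + "/_checkpoint")
    .trigger(availableNow=True)
    .start(bronze_path)
)
ingest.awaitTermination()

# Step 4: read the change feed of the Delta table
cdf = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .load(bronze_path)
)

# Step 5: transformations on the change feed, including the content column
computed = cdf.withColumn("content_length", F.length("content"))  # hypothetical transformation
```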
Now I am trying to save this computed PySpark DataFrame to my catalog, but I am getting the error java.lang.OutOfMemoryError. My Databricks cluster has 1 driver with 16 GB of memory and 4 cores, and a maximum of 10 workers with 16 GB of memory and 4 cores each.
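The failing step looks roughly like the following; the three-level table name is a placeholder, not my actual catalog:

```python
# Write the transformed change-feed DataFrame to a catalog table
(
    computed.write.format("delta")
    .mode("overwrite")
    .saveAsTable("main.default.cdf_computed")  # hypothetical catalog.schema.table
)
```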
Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?
Labels:
- Spark
0 REPLIES

