I have around 25 GB of data in my Azure storage. I am performing data ingestion using Auto Loader in Databricks. Below are the steps I am performing:
Setting enableChangeDataFeed to true.
Reading the complete raw data using readStream.
Writing it as a Delta table using writeStream to Azure Blob Storage.
Reading the change data feed of this Delta table using spark.read.format("delta").option("readChangeFeed", "true")...
Performing operations on the change feed DataFrame using withColumn (including operations on the content column, which might be computationally expensive). A sketch of these steps is shown after this list.
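For reference, here is roughly what those steps look like in code. This is a minimal sketch, not my exact pipeline: the storage paths, the JSON source format, and the transformation on the content column are all placeholders.

```python
# Minimal sketch of the pipeline, assuming JSON source files; all paths and the
# transformation on the "content" column are placeholders, not the real code.
# `spark` is the SparkSession provided by the Databricks runtime.
from pyspark.sql import functions as F

raw_path = "abfss://raw@<storage-account>.dfs.core.windows.net/input/"           # hypothetical
table_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/my_table/"   # hypothetical
checkpoint_path = "abfss://bronze@<storage-account>.dfs.core.windows.net/_chk/"  # hypothetical

# 1) Enable the change data feed by default for newly created Delta tables
spark.conf.set("spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true")

# 2-3) Ingest the raw files with Auto Loader and write them out as a Delta table
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")                   # assumed source format
    .option("cloudFiles.schemaLocation", checkpoint_path)  # location for inferred schema
    .load(raw_path)
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                            # process available data, then stop
    .start(table_path))

# 4) Read the table's change data feed as a batch DataFrame
cdf_df = (spark.read
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .load(table_path))

# 5) Apply transformations with withColumn, including the heavy work on "content"
transformed_df = cdf_df.withColumn("content_length", F.length("content"))  # placeholder transform
```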
Now I am trying to save this computed PySpark DataFrame to my catalog but am getting the error: java.lang.OutOfMemoryError. My Databricks cluster has 1 driver with 16 GB of memory and 4 cores, and a maximum of 10 workers, each with 16 GB of memory and 4 cores.
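The final save step looks roughly like the following; the catalog, schema, and table names are placeholders.

```python
# Sketch of the write that raises java.lang.OutOfMemoryError;
# "my_catalog.my_schema.computed_table" is a placeholder three-level name.
(transformed_df.write
    .format("delta")
    .mode("overwrite")  # assumed write mode
    .saveAsTable("my_catalog.my_schema.computed_table"))
```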
Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?