<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic java.lang.OutOfMemoryError on Data Ingestion and Storage Pipeline in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-on-data-ingestion-and-storage/m-p/56410#M30546</link>
    <description>&lt;DIV&gt;&lt;P&gt;I have around 25 GB of data in Azure Storage and am ingesting it with Auto Loader in Databricks. These are the steps I am performing:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Setting &lt;STRONG&gt;enableChangeDataFeed&lt;/STRONG&gt; to true.&lt;/LI&gt;&lt;LI&gt;Reading the complete raw data with &lt;STRONG&gt;readStream&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Writing it as a Delta table to Azure Blob Storage with &lt;STRONG&gt;writeStream&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Reading the change feed of this Delta table with &lt;STRONG&gt;spark.read.format("delta").option("readChangeFeed", "true")...&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Transforming the change feed table with &lt;STRONG&gt;withColumn&lt;/STRONG&gt;, including operations on the &lt;STRONG&gt;content&lt;/STRONG&gt; column, which may be computationally expensive.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;When I try to save the resulting PySpark DataFrame to my catalog, I get &lt;STRONG&gt;java.lang.OutOfMemoryError&lt;/STRONG&gt;. My Databricks cluster has 1 driver (16 GB memory, 4 cores) and up to 10 workers (16 GB memory, 4 cores each).&lt;/P&gt;&lt;P&gt;Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?&lt;/P&gt;&lt;/DIV&gt;</description>
    <pubDate>Thu, 04 Jan 2024 08:17:35 GMT</pubDate>
    <dc:creator>hv129</dc:creator>
    <dc:date>2024-01-04T08:17:35Z</dc:date>
    <item>
      <title>java.lang.OutOfMemoryError on Data Ingestion and Storage Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-on-data-ingestion-and-storage/m-p/56410#M30546</link>
      <description>&lt;DIV&gt;&lt;P&gt;I have around 25 GB of data in Azure Storage and am ingesting it with Auto Loader in Databricks. These are the steps I am performing:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Setting &lt;STRONG&gt;enableChangeDataFeed&lt;/STRONG&gt; to true.&lt;/LI&gt;&lt;LI&gt;Reading the complete raw data with &lt;STRONG&gt;readStream&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Writing it as a Delta table to Azure Blob Storage with &lt;STRONG&gt;writeStream&lt;/STRONG&gt;.&lt;/LI&gt;&lt;LI&gt;Reading the change feed of this Delta table with &lt;STRONG&gt;spark.read.format("delta").option("readChangeFeed", "true")...&lt;/STRONG&gt;&lt;/LI&gt;&lt;LI&gt;Transforming the change feed table with &lt;STRONG&gt;withColumn&lt;/STRONG&gt;, including operations on the &lt;STRONG&gt;content&lt;/STRONG&gt; column, which may be computationally expensive.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;When I try to save the resulting PySpark DataFrame to my catalog, I get &lt;STRONG&gt;java.lang.OutOfMemoryError&lt;/STRONG&gt;. My Databricks cluster has 1 driver (16 GB memory, 4 cores) and up to 10 workers (16 GB memory, 4 cores each).&lt;/P&gt;&lt;P&gt;Do I need to add more resources to the cluster, or is there a way to optimize or restructure the current pipeline?&lt;/P&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 04 Jan 2024 08:17:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/java-lang-outofmemoryerror-on-data-ingestion-and-storage/m-p/56410#M30546</guid>
      <dc:creator>hv129</dc:creator>
      <dc:date>2024-01-04T08:17:35Z</dc:date>
    </item>
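    <!-- Editor's note: the five steps described in the post can be sketched in PySpark roughly as follows. This is a minimal, hypothetical outline, not the poster's actual code: the source path, input format (json), checkpoint location, the content_length transform, and the target table name are all placeholder assumptions.

```python
# Hypothetical sketch of the pipeline described in the post (steps 1 to 5).
# All paths, formats, and table names below are placeholders.

def enable_cdf_by_default(spark):
    # Step 1: make newly created Delta tables record a change data feed
    # by default (Databricks Delta default table property).
    spark.conf.set(
        "spark.databricks.delta.properties.defaults.enableChangeDataFeed", "true"
    )

def start_ingest_stream(spark, source_path, table_path, checkpoint_path):
    # Steps 2 and 3: read the raw files with Auto Loader (cloudFiles) and
    # write them out as a Delta table in Azure Blob Storage.
    return (
        spark.readStream.format("cloudFiles")          # Auto Loader source
        .option("cloudFiles.format", "json")           # assumed input format
        .load(source_path)
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                    # drain the backlog, then stop
        .start(table_path)
    )

def read_change_feed(spark, table_path, starting_version=0):
    # Step 4: batch-read the change data feed of the Delta table.
    return (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", starting_version)
        .load(table_path)
    )

def transform_and_save(cdf_df, target_table):
    # Step 5: derive columns with withColumn, then save to the catalog.
    from pyspark.sql import functions as F
    out = cdf_df.withColumn("content_length", F.length("content"))
    out.write.mode("overwrite").saveAsTable(target_table)
```

A sketch like this runs the heavy transformation and write entirely on the executors; an OutOfMemoryError on save often points instead at driver-side operations (collect, toPandas, very wide shuffles), which this outline deliberately avoids. -->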
  </channel>
</rss>

