<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Getting OOM error while processing xml data in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105827#M42275</link>
    <description>&lt;P&gt;Hi Avinash,&lt;/P&gt;&lt;P&gt;Already tried that.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="EktaPuri_0-1736999629659.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14159i1961294DFA5693DF/image-size/medium?v=v2&amp;amp;px=400" role="button" title="EktaPuri_0-1736999629659.png" alt="EktaPuri_0-1736999629659.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;As you can see below, memory utilization is low.&lt;/P&gt;</description>
    <pubDate>Thu, 16 Jan 2025 03:54:33 GMT</pubDate>
    <dc:creator>EktaPuri</dc:creator>
    <dc:date>2025-01-16T03:54:33Z</dc:date>
    <item>
      <title>Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105819#M42269</link>
      <description>&lt;P&gt;I have a table in which one of the columns contains raw XML data; each row is approximately 3 MB. The data volume is very large, so I have chunked processing into 1-hour windows. The memory-utilization metrics all look fine, but I receive the error below:&amp;nbsp;&lt;SPAN&gt;org.apache.spark.SparkException: Job aborted due to stage failure: org.apache.spark.memory.SparkOutOfMemoryError: Photon ran out of memory while executing this query. Photon failed to reserve 6.7 MiB for BufferPool, in Current Column Batch, in FileScanNode(id=2513, output_schema=[string, string, string, bool, timestamp, date]), in task.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;Solutions tried:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Allocate more memory - doesn't work; most of the memory is free&lt;/LI&gt;&lt;LI&gt;Increase overhead memory - doesn't work&lt;/LI&gt;&lt;LI&gt;Disable autoscaling&lt;/LI&gt;&lt;LI&gt;Photon is already disabled&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Compute configuration:&amp;nbsp;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;DBR:&amp;nbsp;&lt;SPAN&gt;15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;Photon Acceleration: Disabled&lt;/LI&gt;&lt;LI&gt;Worker Type:&amp;nbsp;&lt;SPAN&gt;Standard_E32_v3 (driver type is the same)&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Autoscaling: 1-8 workers&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
      <pubDate>Thu, 16 Jan 2025 01:56:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105819#M42269</guid>
      <dc:creator>EktaPuri</dc:creator>
      <dc:date>2025-01-16T01:56:40Z</dc:date>
    </item>
    <item>
      <title>Re: Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105820#M42270</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/143632"&gt;@EktaPuri&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Has this failure been observed before? Can you share more context on what you are doing?&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 02:22:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105820#M42270</guid>
      <dc:creator>Alberto_Umana</dc:creator>
      <dc:date>2025-01-16T02:22:31Z</dc:date>
    </item>
    <item>
      <title>Re: Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105821#M42271</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/106294"&gt;@Alberto_Umana&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Here, from the xml_raw data we extract the tags and their respective hex-string values, decode them, and build a JSON object over them using rdd.map. This used to work when the data load was lighter. Now we are running a history load (not the full history, only files that were missed or are new) in 1-hour processing intervals. I join the new records against the previously processed files, since I don't want to reprocess files that were already handled, and I broadcast that frame; it contains only one column, so it is roughly 400 MB. A bigger issue is that the bronze-layer data appears to contain a high number of duplicates, so we had to dropDuplicates on logfile_nm - that's one pain point. I want to understand: is BufferPool memory part of executor memory? On investigation, executor memory utilization looks fine, so where exactly is the memory problem arising?&lt;/P&gt;&lt;P&gt;More detail from the error:&lt;/P&gt;&lt;PRE&gt;Total task memory (including non-Photon): 1772.5 MiB
task: allocated 1647.0 MiB, tracked 1772.5 MiB, untracked allocated 0.0 B, peak 1772.5 MiB
  BufferPool: allocated 2.5 MiB, tracked 128.0 MiB, untracked allocated 0.0 B, peak 128.0 MiB
  DataWriter: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
  FileScanNode(id=2161, output_schema=[string, string, string, bool, timestamp, date]): allocated 1644.5 MiB, tracked 1644.5 MiB, untracked allocated 0.0 B, peak 1644.5 MiB
    Current Column Batch: allocated 1472.9 MiB, tracked 1473.0 MiB, untracked allocated 0.0 B, peak 1473.0 MiB
      BufferPool: allocated 1472.9 MiB, tracked 1473.0 MiB, untracked allocated 0.0 B, peak 1473.0 MiB
        dictionary values: allocated 1024.0 B, tracked 1024.0 B, untracked allocated 0.0 B, peak 1024.0 B
        dictionary values: allocated 4.0 KiB, tracked 4.0 KiB, untracked allocated 0.0 B, peak 4.0 KiB
        dictionary values: allocated 1024.0 B, tracked 1024.0 B, untracked allocated 0.0 B, peak 1024.0 B
        dictionary values: allocated 8.0 KiB, tracked 8.0 KiB, untracked allocated 0.0 B, peak 8.0 KiB
        dictionary values: allocated 1024.0 B, tracked 1024.0 B, untracked allocated 0.0 B, peak 1024.0 B&lt;/PRE&gt;</description>
      <pubDate>Thu, 16 Jan 2025 02:44:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105821#M42271</guid>
      <dc:creator>EktaPuri</dc:creator>
      <dc:date>2025-01-16T02:44:35Z</dc:date>
    </item>
    <item>
      <title>Re: Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105822#M42272</link>
      <description>&lt;P&gt;Note: Photon is not enabled.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 02:45:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105822#M42272</guid>
      <dc:creator>EktaPuri</dc:creator>
      <dc:date>2025-01-16T02:45:33Z</dc:date>
    </item>
    <item>
      <title>Re: Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105825#M42274</link>
      <description>&lt;P&gt;Try using a memory-intensive cluster, with more driver and worker memory than the current configuration.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 03:48:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105825#M42274</guid>
      <dc:creator>Avinash_Narala</dc:creator>
      <dc:date>2025-01-16T03:48:47Z</dc:date>
    </item>
    <item>
      <title>Re: Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105827#M42275</link>
      <description>&lt;P&gt;Hi Avinash,&lt;/P&gt;&lt;P&gt;Already tried that.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="EktaPuri_0-1736999629659.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14159i1961294DFA5693DF/image-size/medium?v=v2&amp;amp;px=400" role="button" title="EktaPuri_0-1736999629659.png" alt="EktaPuri_0-1736999629659.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;As you can see below, memory utilization is low.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 03:54:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105827#M42275</guid>
      <dc:creator>EktaPuri</dc:creator>
      <dc:date>2025-01-16T03:54:33Z</dc:date>
    </item>
    <item>
      <title>Re: Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105830#M42278</link>
      <description>&lt;P&gt;Are you sure the logic is being executed on the workers and not entirely on the driver? There are cases where all the logic has to run on the driver, leaving worker memory under-utilised. Likewise for spark.sql statements: the Spark session cannot be shipped to multiple workers, so the whole logic runs in driver memory, which leads to an OOM on the driver while worker memory stays under-utilised.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 04:04:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105830#M42278</guid>
      <dc:creator>Avinash_Narala</dc:creator>
      <dc:date>2025-01-16T04:04:25Z</dc:date>
    </item>
    <item>
      <title>Re: Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105833#M42279</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am sure the logic is not running on the driver.&lt;/P&gt;&lt;P&gt;Below is the driver utilization. That is exactly the question: based on the error and the logs, I am not sure where the memory problem is actually occurring.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="EktaPuri_0-1737001087654.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/14161i8E4A3C85D9CBA65E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="EktaPuri_0-1737001087654.png" alt="EktaPuri_0-1737001087654.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 04:19:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105833#M42279</guid>
      <dc:creator>EktaPuri</dc:creator>
      <dc:date>2025-01-16T04:19:02Z</dc:date>
    </item>
    <item>
      <title>Re: Getting OOM error while processing xml data</title>
      <link>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105834#M42280</link>
      <description>&lt;PRE&gt;filteredDataframe = (
    spark.table(f'{sourceConfig["srcDatabaseName"]}.{sourceConfig["srcTableName"]}')
    .filter(f.col("load_dt") == current_start_time.date())
    .filter(f.col("load_ts") &amp;gt;= current_start_time)
    .filter(f.col("load_ts") &amp;lt; current_end_time)
    .filter("col1 == 'value'")
    .filter(f.col("col2") == "true")
    .select("col3", "col4", "col5", "col6", "col7", "col8")
    .dropDuplicates(["col4"])
)

Dataframe = filteredDataframe.join(f.broadcast(metadata), "col4", "leftanti")&lt;/PRE&gt;&lt;P&gt;The join and the deduplication happen here. Note that I have increased the autobroadcast threshold to 1g, even though the broadcast frame is only about 400 MB.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jan 2025 04:24:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/getting-oom-error-while-processing-xml-data/m-p/105834#M42280</guid>
      <dc:creator>EktaPuri</dc:creator>
      <dc:date>2025-01-16T04:24:19Z</dc:date>
    </item>
  </channel>
</rss>

