<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: "Photon ran out of memory" while trying to get the unique Id from sql query in Machine Learning</title>
    <link>https://community.databricks.com/t5/machine-learning/quot-photon-ran-out-of-memory-quot-while-when-trying-to-get-the/m-p/3354#M118</link>
    <description>&lt;P&gt;That collect statement moves all data to the driver, so you lose all parallelism and the driver has to do all the processing. If you beef up your driver, it might work.&lt;/P&gt;</description>
    <pubDate>Fri, 09 Jun 2023 09:59:47 GMT</pubDate>
    <dc:creator>-werners-</dc:creator>
    <dc:date>2023-06-09T09:59:47Z</dc:date>
    <item>
      <title>"Photon ran out of memory" while trying to get the unique Id from sql query</title>
      <link>https://community.databricks.com/t5/machine-learning/quot-photon-ran-out-of-memory-quot-while-when-trying-to-get-the/m-p/3353#M117</link>
      <description>&lt;P&gt;I am trying to get all unique ids from a SQL query and I always run out of memory.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;select concat_ws(';',view.MATNR,view.WERKS) from hive_metastore.dqaas.temp_view as view join hive_metastore.dqaas.t_dqaas_marc as marc on marc.MATNR = view.MATNR where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X','CN9S', 'XM93', 'ZA90', 'ZA93') and marc.HERKL = view.HERKL and marc.LVORM  = ' '&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;with the following code:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val df = spark.sql("select concat_ws(';',view.MATNR,view.WERKS) from hive_metastore.dqaas.temp_view as view join hive_metastore.dqaas.t_dqaas_marc as marc on marc.MATNR = view.MATNR where view.WERKS NOT IN ('BR91', 'BR92', 'BR94', 'BR97', 'BR98', 'BR9A', 'BR9B', 'BR9C', 'BR9D', 'BR9L', 'BR9X','CN9S', 'XM93', 'ZA90', 'ZA93') and marc.HERKL = view.HERKL and marc.LVORM  = ' '")
&amp;nbsp;
val distinctValue: Set[String] = df.rdd.mapPartitions(data =&amp;gt; {
    val uniqueIdSet = data.map(row =&amp;gt; row.getAs[String](0)).toSet
    Iterator(uniqueIdSet)
}).collect.flatten.toSet&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;data in temp_view = 5000&lt;/P&gt;&lt;P&gt;data in t_dqaas_marc = 22354457&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The output of the query gives me 4 lakh-plus (400,000+) records.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The exception I am getting:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;Job aborted due to stage failure: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
Memory usage:
BroadcastHashedRelation(spark_plan_id=59815): allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
  BuildHashedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
    PartitionedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
      partition 0: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
        rows: allocated 890.0 MiB, tracked 890.0 MiB, untracked allocated 0.0 B, peak 890.0 MiB
        var-len data: allocated 420.0 MiB, tracked 420.0 MiB, untracked allocated 0.0 B, peak 420.0 MiB
    SparseHashedRelation: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table var-len key data: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table payloads: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table buckets: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
&amp;nbsp;
Caused by: SparkOutOfMemoryError: Photon ran out of memory while executing this query.
Photon failed to reserve 512.0 MiB for hash table buckets, in SparseHashedRelation, in BuildHashedRelation, in BroadcastHashedRelation(spark_plan_id=59815).
Memory usage:
BroadcastHashedRelation(spark_plan_id=59815): allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
  BuildHashedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
    PartitionedRelation: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
      partition 0: allocated 1310.0 MiB, tracked 1310.0 MiB, untracked allocated 0.0 B, peak 1310.0 MiB
        rows: allocated 890.0 MiB, tracked 890.0 MiB, untracked allocated 0.0 B, peak 890.0 MiB
        var-len data: allocated 420.0 MiB, tracked 420.0 MiB, untracked allocated 0.0 B, peak 420.0 MiB
    SparseHashedRelation: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table var-len key data: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table payloads: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B
      hash table buckets: allocated 0.0 B, tracked 0.0 B, untracked allocated 0.0 B, peak 0.0 B&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;My cluster configuration:&lt;/P&gt;&lt;P&gt;Workers: 2–16 (32–256&amp;nbsp;GB memory, 8–64 cores)&lt;/P&gt;&lt;P&gt;Driver: 1 (16&amp;nbsp;GB memory, 4 cores)&lt;/P&gt;&lt;P&gt;Runtime: 11.3.x-scala2.12, Photon&lt;/P&gt;&lt;P&gt;Node type: Standard_D4as_v5&lt;/P&gt;&lt;P&gt;6–34&amp;nbsp;DBU/h&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I have tried saving the output to another table and ran into the same issue:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable(tempTableName)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;No matter what operation I do on the dataframe from the SQL query above, I always end up with an out of memory exception.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Jun 2023 05:34:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/quot-photon-ran-out-of-memory-quot-while-when-trying-to-get-the/m-p/3353#M117</guid>
      <dc:creator>rusty</dc:creator>
      <dc:date>2023-06-09T05:34:52Z</dc:date>
    </item>
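The stack trace above fails while building a BroadcastHashedRelation, i.e. while Spark broadcasts one side of the join into memory. One possible workaround, a sketch of my own rather than something confirmed in the thread, is to stop the planner from broadcasting automatically so it falls back to a shuffle-based join, which can spill to disk:

```scala
// Sketch: to be run in a Databricks Scala notebook with the active `spark`
// session. The OOM occurs inside BroadcastHashedRelation, so disable the
// automatic broadcast threshold; -1 turns broadcast-by-size off entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Alternatively, hint a shuffle strategy for just this query (Spark 3.0+),
// leaving the global setting untouched:
//   SELECT /*+ SHUFFLE_HASH(marc) */ concat_ws(';', view.MATNR, view.WERKS) ...
```

Whether this resolves the error depends on whether the 1310 MiB build side genuinely exceeds the Photon memory available per node; a shuffle join trades that memory pressure for network and disk I/O.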
    <item>
      <title>Re: "Photon ran out of memory" while trying to get the unique Id from sql query</title>
      <link>https://community.databricks.com/t5/machine-learning/quot-photon-ran-out-of-memory-quot-while-when-trying-to-get-the/m-p/3354#M118</link>
      <description>&lt;P&gt;that collect statement moves all data to the driver.  So you lose all parallelism and the driver has to do all the processing.  If you beef up your driver, it might work.&lt;/P&gt;</description>
      <pubDate>Fri, 09 Jun 2023 09:59:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/quot-photon-ran-out-of-memory-quot-while-when-trying-to-get-the/m-p/3354#M118</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2023-06-09T09:59:47Z</dc:date>
    </item>
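The advice above can be sketched in code: do the deduplication on the executors and avoid a single big collect(). This is a sketch assuming `df` is the DataFrame from the question; the table name is illustrative, not from the thread:

```scala
// Deduplicate in parallel on the executors instead of building per-partition
// Sets and flattening them on the driver.
val distinctDf = df.distinct()

// Preferred: keep the result distributed and persist it as a table
// (illustrative table name).
distinctDf.write.format("delta").mode("overwrite").saveAsTable("dqaas.unique_ids")

// If a local Set is truly required, toLocalIterator fetches one partition
// at a time, so the driver only needs to hold the largest partition plus
// the growing Set, rather than every row at once as collect() does.
val uniqueIds: Set[String] =
  distinctDf.rdd.map(_.getString(0)).toLocalIterator.toSet
```

Note that a 400,000-row Set of short strings should comfortably fit in a 16 GB driver; the larger risk in the original snippet is the per-partition `toSet` materializing alongside the broadcast join state.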
    <item>
      <title>Re: "Photon ran out of memory" while trying to get the unique Id from sql query</title>
      <link>https://community.databricks.com/t5/machine-learning/quot-photon-ran-out-of-memory-quot-while-when-trying-to-get-the/m-p/3355#M119</link>
      <description>&lt;P&gt;Hi @Anil Kumar Chauhan&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We haven't heard from you since the last response from @Werner Stinckens. Kindly share the requested information with us so we can provide you with a solution.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks and Regards&lt;/P&gt;</description>
      <pubDate>Wed, 14 Jun 2023 06:15:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/machine-learning/quot-photon-ran-out-of-memory-quot-while-when-trying-to-get-the/m-p/3355#M119</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-06-14T06:15:30Z</dc:date>
    </item>
  </channel>
</rss>

