<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Requested array size exceeds VM limit when saving to feature table in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/requested-array-size-exceeds-vm-limit-when-saving-to-feature/m-p/3437#M482</link>
    <pubDate>Wed, 07 Jun 2023 21:09:05 GMT</pubDate>
    <dc:creator>pcriado</dc:creator>
    <dc:date>2023-06-07T21:09:05Z</dc:date>
    <item>
      <title>Requested array size exceeds VM limit when saving to feature table</title>
      <link>https://community.databricks.com/t5/data-engineering/requested-array-size-exceeds-vm-limit-when-saving-to-feature/m-p/3437#M482</link>
      <description>&lt;P&gt;Hi, I'm trying to process a small dataset (less than 300 MB) built from five queries that run on Spark. The results of those queries are parsed using Python and merged into a DataFrame. Then I try to write this to a Delta Lake feature table:&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;parsedData.write.format('delta').mode('overwrite').option("mergeSchema", "true").save('/mnt/features/dev_customer_account_info') &lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This single line of code always causes a memory spike (60 GB) leading to a crash, regardless of the size of parsedData.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The cluster is configured as:&lt;/P&gt;&lt;P&gt;  1 Driver 61 GB Memory, 8 Cores&lt;/P&gt;&lt;P&gt;  Runtime 11.3.x-cpu-ml-scala2.12&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The error looks like this:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.lang.StringCoding.encode(StringCoding.java:350)
	at java.lang.String.getBytes(String.java:941)
	at org.apache.spark.unsafe.types.UTF8String.fromString(UTF8String.java:163)
	at org.apache.spark.sql.catalyst.expressions.StructsToJson.getAndReset$1(jsonExpressions.scala:893)
	at org.apache.spark.sql.catalyst.expressions.StructsToJson.$anonfun$converter$5(jsonExpressions.scala:904)
	at org.apache.spark.sql.catalyst.expressions.StructsToJson$$Lambda$12421/1187286213.apply(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.StructsToJson.nullSafeEval(jsonExpressions.scala:947)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:671)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown Source)
	at org.apache.spark.sql.execution.python.EvalPythonExec.$anonfun$doExecute$10(EvalPythonExec.scala:127)
	at org.apache.spark.sql.execution.python.EvalPythonExec$$Lambda$12407/1574333163.apply(Unknown Source)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at org.apache.spark.util.GroupedAsArrayIterator.next(GroupedAsArrayIterator.scala:45)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:464)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.writeIteratorToStream(PythonUDFRunner.scala:55)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:573)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread$$Lambda$11996/626269711.apply(Unknown Source)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2340)
	at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:365)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;I've tried running the queries with paging to reduce the amount of data saved to the table, probably to as little as 100 MB, but this step still consumes all available RAM and crashes.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The contents of the DataFrame are pretty standard. I'm at a loss as to what could be done. I'd really appreciate any comments, thoughts, or ideas.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thank you very much.&lt;/P&gt;</description>
      <pubDate>Wed, 07 Jun 2023 21:09:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/requested-array-size-exceeds-vm-limit-when-saving-to-feature/m-p/3437#M482</guid>
      <dc:creator>pcriado</dc:creator>
      <dc:date>2023-06-07T21:09:05Z</dc:date>
    </item>
    <item>
      <title>Re: Requested array size exceeds VM limit when saving to feature table</title>
      <link>https://community.databricks.com/t5/data-engineering/requested-array-size-exceeds-vm-limit-when-saving-to-feature/m-p/3438#M483</link>
      <description>&lt;P&gt;Hi @Pablo Criado&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Great to meet you, and thanks for your question!&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Let's see if your peers in the community have an answer to your question. Thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Jun 2023 06:03:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/requested-array-size-exceeds-vm-limit-when-saving-to-feature/m-p/3438#M483</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2023-06-16T06:03:56Z</dc:date>
    </item>
    <item>
      <title>Re: Requested array size exceeds VM limit when saving to feature table</title>
      <link>https://community.databricks.com/t5/data-engineering/requested-array-size-exceeds-vm-limit-when-saving-to-feature/m-p/3439#M484</link>
      <description>&lt;P&gt;Hello, we have recently found that it's my user in particular that causes the memory issue. Two other users in my organization can run the same notebook without problems, but my user consistently consumes all available RAM and crashes the cluster, and I have absolutely no idea how something like that could happen.&lt;/P&gt;</description>
      <pubDate>Fri, 16 Jun 2023 12:30:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/requested-array-size-exceeds-vm-limit-when-saving-to-feature/m-p/3439#M484</guid>
      <dc:creator>pcriado</dc:creator>
      <dc:date>2023-06-16T12:30:58Z</dc:date>
    </item>
  </channel>
</rss>

