<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: OversizedAllocationException with transformWithStateInPandas in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/oversizedallocationexception-with-transformwithstateinpandas/m-p/152774#M53875</link>
    <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/24053"&gt;@lingareddy_Alva&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you very much for your answer. I believe the &lt;SPAN&gt;spark.sql.execution.arrow.useLargeVarTypes config not being respected by the transformWithStateInPandas serializer path is a bug. Is there somewhere I could report it to the Spark development team? I am unfortunately not a Scala or Java dev and cannot send a patch myself. If I could, I would...&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Regarding your suggestions to work around the issue, we are already using the binary type for the file contents. Splitting the files into chunks is unfortunately not feasible for us. However, I don't even use the file's content in the stateful processor; I only use the columns around it. The file's content is used in a later stage of our pipeline, so I really only need it to pass through the stateful processor so it is available on the other side. I was thinking that it might be possible to split my stream into two branches: one sends the relevant columns to the stateful processor while the file content goes "around" it, and the two branches are stitched back together with a join. I am, however, worried about joining two streaming datasets together, especially when there is statefulness involved.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you again for your help. I will wait a few days before marking your answer as accepted in case you have time to write back, because it really did answer my original question regarding the OversizedAllocationException.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Best regards&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;TWBDE&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 31 Mar 2026 17:15:28 GMT</pubDate>
    <dc:creator>twbde</dc:creator>
    <dc:date>2026-03-31T17:15:28Z</dc:date>
    <item>
      <title>OversizedAllocationException with transformWithStateInPandas</title>
      <link>https://community.databricks.com/t5/data-engineering/oversizedallocationexception-with-transformwithstateinpandas/m-p/152591#M53850</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I have a process that uses transformWithStateInPandas on a dataframe that has the content of entire files in one of its columns. Recently, the exception OversizedAllocationException has started occurring. I have tried setting these configs in the job cluster's definition, with no success.&lt;BR /&gt;&lt;BR /&gt;spark.sql.execution.arrow.pyspark.enabled true&lt;BR /&gt;spark.sql.execution.arrow.transformWithStateInPandas.maxRecordsPerBatch 10&lt;BR /&gt;spark.sql.execution.arrow.useLargeVarTypes true&lt;BR /&gt;spark.sql.execution.pythonUDF.arrow.enabled true&lt;/P&gt;&lt;P&gt;I have also tried maxRecordsPerBatch at 1 and still got the issue, and the message always says it tried to allocate over 2GB. It's almost as if useLargeVarTypes is ignored for transformWithStateInPandas. I have also tried disabling Arrow with:&lt;/P&gt;&lt;P&gt;spark.sql.execution.arrow.pyspark.enabled false&lt;BR /&gt;spark.sql.execution.pythonUDF.arrow.enabled true&lt;/P&gt;&lt;P&gt;I am using Databricks Runtime 17.3 LTS, so Spark 4.0.0. My code is Python, using PySpark.&lt;/P&gt;&lt;P&gt;Thank you for any help you can give me.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Mar 2026 19:43:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/oversizedallocationexception-with-transformwithstateinpandas/m-p/152591#M53850</guid>
      <dc:creator>twbde</dc:creator>
      <dc:date>2026-03-30T19:43:51Z</dc:date>
    </item>
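The 2GB figure in the exception matches the cap of Arrow's standard variable-width vectors, which address their data buffer with 32-bit offsets. A minimal plain-Python sketch (the file sizes are hypothetical illustration values, not from the thread) of why even maxRecordsPerBatch=1 cannot help once a single payload exceeds that cap:

```python
# Arrow's standard VarBinaryVector/VarCharVector use 32-bit offsets, so one
# column's data buffer in a single record batch cannot exceed 2 GiB - 1 byte.
ARROW_VARBINARY_CAP = 2**31 - 1

def batch_fits(file_sizes_bytes):
    """True if the cumulative payload of one batch stays under the cap."""
    return sum(file_sizes_bytes) <= ARROW_VARBINARY_CAP

# Even a batch of exactly one record overflows if that record is > 2 GiB:
single_large_file = [3 * 1024**3]      # one hypothetical 3 GiB file
small_batch = [400 * 1024**2] * 4      # four hypothetical 400 MiB files

print(batch_fits(single_large_file))   # False
print(batch_fits(small_batch))         # True
```

This is why lowering maxRecordsPerBatch only helps when the overflow comes from many medium-sized rows accumulating in one batch, not from one oversized row.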
    <item>
      <title>Re: OversizedAllocationException with transformWithStateInPandas</title>
      <link>https://community.databricks.com/t5/data-engineering/oversizedallocationexception-with-transformwithstateinpandas/m-p/152753#M53869</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/224552"&gt;@twbde&lt;/a&gt;&lt;/P&gt;&lt;P&gt;This is a genuinely tricky problem. Here's the diagnosis and the best available workarounds.&lt;BR /&gt;&lt;STRONG&gt;Root cause: useLargeVarTypes is not wired into transformWithStateInPandas&lt;/STRONG&gt;&lt;BR /&gt;Your instinct is correct. The spark.sql.execution.arrow.useLargeVarTypes config is not respected by the transformWithStateInPandas serializer path. Looking at how Spark's Arrow infrastructure is built, useLargeVarTypes is plumbed through toPandas() / createDataFrame() and the general Pandas UDF serializers, but transformWithStateInPandas has its own dedicated serializer (TransformWithStateInPandasSerializer), and the largeVarTypes flag is simply not passed through to its ArrowWriter.create(...) call.&lt;/P&gt;&lt;P&gt;The result: Arrow still uses the standard VarCharVector / VarBinaryVector, which have a hard 2GB cap per column per batch. Even maxRecordsPerBatch=1 won't help if a single file's content already exceeds 2GB, or if the cumulative buffer for the column in a single batch hits that limit through Arrow's internal resizing logic.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Workarounds (in order of preference)&lt;/STRONG&gt;&lt;BR /&gt;&lt;STRONG&gt;1. Pre-chunk large file content in your input stream (most reliable)&lt;/STRONG&gt;&lt;BR /&gt;Before the data reaches transformWithStateInPandas, break the file-content column into chunks smaller than, say, 500MB each, tagging each chunk with a sequence number and the grouping key. Then reassemble in your StatefulProcessor's handleInputRows using a ListState or MapState keyed by chunk_seq, and only emit once all chunks have arrived.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;2. Use BinaryType instead of StringType for the file content column&lt;/STRONG&gt;&lt;BR /&gt;If your column is typed as StringType, Arrow uses VarCharVector. If you cast it to BinaryType, Arrow uses VarBinaryVector: the same 2GB wall, but the buffer growth patterns sometimes differ and you can get slightly further. More importantly, it makes chunking and reassembly more natural for binary file content.&lt;/P&gt;</description>
      <pubDate>Tue, 31 Mar 2026 15:33:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/oversizedallocationexception-with-transformwithstateinpandas/m-p/152753#M53869</guid>
      <dc:creator>lingareddy_Alva</dc:creator>
      <dc:date>2026-03-31T15:33:37Z</dc:date>
    </item>
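The chunk-and-reassemble workaround described above can be sketched in plain Python. This is a stand-in for illustration only: in the real job, each yielded tuple would be a row flowing into transformWithStateInPandas, and the `reassemble` dictionary would be a ListState or MapState inside the StatefulProcessor; `chunk_payload`, `reassemble`, and `CHUNK_SIZE` are hypothetical names, not Spark APIs.

```python
# Workaround 1 sketch: split a large binary payload into sequence-numbered
# chunks before the stateful step, then rebuild it on the other side.
CHUNK_SIZE = 500 * 1024 * 1024  # 500 MB, safely under the 2 GB Arrow cap

def chunk_payload(key, payload: bytes, chunk_size=CHUNK_SIZE):
    """Yield (key, chunk_seq, total_chunks, chunk_bytes) tuples."""
    total = max(1, -(-len(payload) // chunk_size))  # ceiling division
    for seq in range(total):
        yield (key, seq, total, payload[seq * chunk_size:(seq + 1) * chunk_size])

def reassemble(chunks):
    """Emit (key, payload) once every chunk for a key has arrived.

    The `state` dict plays the role of a per-key MapState keyed by chunk_seq.
    """
    state = {}
    for key, seq, total, data in chunks:
        buf = state.setdefault(key, {})
        buf[seq] = data
        if len(buf) == total:  # all chunks present: emit and clear state
            yield key, b"".join(buf[i] for i in range(total))
            del state[key]

# Round-trip with a tiny chunk size to show the mechanics:
pieces = list(chunk_payload("file-1", b"abcdefghij", chunk_size=4))
restored = dict(reassemble(pieces))
print(restored["file-1"])  # b'abcdefghij'
```

Emitting only when `len(buf) == total` is what makes the approach safe against chunks arriving across different micro-batches, since the partial buffer persists in state between invocations.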
    <item>
      <title>Re: OversizedAllocationException with transformWithStateInPandas</title>
      <link>https://community.databricks.com/t5/data-engineering/oversizedallocationexception-with-transformwithstateinpandas/m-p/152774#M53875</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/24053"&gt;@lingareddy_Alva&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you very much for your answer. I believe the &lt;SPAN&gt;spark.sql.execution.arrow.useLargeVarTypes config not being respected by the transformWithStateInPandas serializer path is a bug. Is there somewhere I could report it to the Spark development team? I am unfortunately not a Scala or Java dev and cannot send a patch myself. If I could, I would...&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Regarding your suggestions to work around the issue, we are already using the binary type for the file contents. Splitting the files into chunks is unfortunately not feasible for us. However, I don't even use the file's content in the stateful processor; I only use the columns around it. The file's content is used in a later stage of our pipeline, so I really only need it to pass through the stateful processor so it is available on the other side. I was thinking that it might be possible to split my stream into two branches: one sends the relevant columns to the stateful processor while the file content goes "around" it, and the two branches are stitched back together with a join. I am, however, worried about joining two streaming datasets together, especially when there is statefulness involved.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Thank you again for your help. I will wait a few days before marking your answer as accepted in case you have time to write back, because it really did answer my original question regarding the OversizedAllocationException.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Best regards&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;TWBDE&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 31 Mar 2026 17:15:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/oversizedallocationexception-with-transformwithstateinpandas/m-p/152774#M53875</guid>
      <dc:creator>twbde</dc:creator>
      <dc:date>2026-03-31T17:15:28Z</dc:date>
    </item>
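The branch-and-rejoin topology proposed in the last post can be sketched in plain Python to show the data flow. This is an illustration only: `process_stateful` is a hypothetical placeholder for the transformWithStateInPandas step, and the dictionary lookup stands in for what a real Spark job would express as a stream-stream join, which Structured Streaming supports but which requires watermarks on both sides and an equality join key to bound the join state.

```python
# Sketch: route the heavy file content around the stateful step, send only
# the light columns through it, then stitch the branches back on the key.

def process_stateful(light_rows):
    """Placeholder for the transformWithStateInPandas step (identity here)."""
    for row in light_rows:
        yield {**row, "processed": True}

def branch_and_rejoin(records):
    # Branch 1: only the columns the stateful processor actually needs.
    light = [{"key": r["key"], "meta": r["meta"]} for r in records]
    # Branch 2: key -> file content, bypassing the stateful step entirely.
    heavy = {r["key"]: r["content"] for r in records}
    # Stitch the branches back together on the key (a join in the real job).
    for row in process_stateful(light):
        yield {**row, "content": heavy[row["key"]]}

records = [{"key": 1, "meta": "a", "content": b"big-file-bytes"}]
out = list(branch_and_rejoin(records))
print(out[0]["processed"], len(out[0]["content"]))  # True 14
```

The attraction of this shape is that the oversized column never enters the Arrow batches feeding the stateful processor, so the 2GB vector cap is never hit on that path; the cost is the added complexity of keeping the two branches' watermarks aligned so the join does not drop late rows.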
  </channel>
</rss>

