<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Union of tiny dataframes exhausts resource, memory error in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134878#M2285</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34815"&gt;@Louis_Frolio&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;1. I've suggested this approach already in my first reply - unfortunately it didn't help&amp;nbsp;&lt;/P&gt;&lt;P&gt;3, 5, 6 - won't work here because &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/189678"&gt;@CEH&lt;/a&gt;&amp;nbsp; is using serverless compute&lt;/P&gt;&lt;P&gt;2,4 - worth a try &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 14 Oct 2025 15:20:42 GMT</pubDate>
    <dc:creator>szymon_dybczak</dc:creator>
    <dc:date>2025-10-14T15:20:42Z</dc:date>
    <item>
      <title>Union of tiny dataframes exhausts resource, memory error</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134199#M2280</link>
      <description>&lt;P&gt;As part of a function I create df1 and df2 and aim to stack them and output the results.&amp;nbsp; But the results do not display within the function, nor if I output the results and display after.&lt;/P&gt;&lt;P&gt;results = df1.unionByName(df2, allowMissingColumns=False)&lt;/P&gt;&lt;P&gt;display(results)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is the error:&lt;/P&gt;&lt;P&gt;SparkConnectGrpcException: &amp;lt;_InactiveRpcError of RPC that terminated with:&lt;/P&gt;&lt;P&gt;status = StatusCode.RESOURCE_EXHAUSTED&lt;/P&gt;&lt;P&gt;details = "CLIENT: Sent message larger than max (200529144 vs. 134217728)"&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;df1 is 350 rows and df2 is 1500 rows.&amp;nbsp; They share the same seven ordered columns, and I have checked they share the same schema.&amp;nbsp; Though df1 does have pure nulls for c3 and c4.&lt;/P&gt;&lt;P&gt;&amp;nbsp;|-- c1: long (nullable = true)&lt;/P&gt;&lt;P&gt;&amp;nbsp;|-- c2: string (nullable = true)&lt;/P&gt;&lt;P&gt;&amp;nbsp;|-- c3: string (nullable = true)&lt;/P&gt;&lt;P&gt;&amp;nbsp;|-- c4: string (nullable = true)&lt;/P&gt;&lt;P&gt;&amp;nbsp;|-- c5: string (nullable = true)&lt;/P&gt;&lt;P&gt;&amp;nbsp;|-- c6: string (nullable = true)&lt;/P&gt;&lt;P&gt;&amp;nbsp;|-- c7: double (nullable = true)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The input df into the function is 3800 rows and 8 columns, the function is not complicated, and yet resource is exhausted trying to union two tiny outputs.&amp;nbsp; I can display df1 and df2 before the union, it is the union that crashes Databricks.&lt;/P&gt;&lt;P&gt;I tried manually inputting the data and creating two dataframes, and they union in less than 1s.&amp;nbsp; In the same project I have used unionByName to union larger dataframes, as part of more complicated functions.&amp;nbsp; Plus, this function used to work when I used test data smaller than 350 and 1500 rows.&lt;/P&gt;&lt;P&gt;What solutions could I try to 
fix this? Repartitioning doesn't help.&amp;nbsp; Thank you.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Oct 2025 10:36:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134199#M2280</guid>
      <dc:creator>CEH</dc:creator>
      <dc:date>2025-10-08T10:36:21Z</dc:date>
    </item>
    <item>
      <title>Re: Union of tiny dataframes exhausts resource, memory error</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134200#M2281</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/189678"&gt;@CEH&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;Increase the message size limit from the default of 64 MB by changing the value of the&amp;nbsp;spark.sql.session.localRelationCacheThreshold&amp;nbsp;configuration.&amp;nbsp;You can start by trying twice the default, or 128MB. Experiment with increasing further if the issue persists.&lt;/P&gt;&lt;P&gt;Alternatively, take advantage of temporary (temp) views. Using temp views in the intermediary steps caches the table and uses a&amp;nbsp;cached_relation&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;instead of a&amp;nbsp;local_relation.&amp;nbsp;Cached relations do not have a maximum message size, allowing you to avoid a message size limit.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;The following code demonstrates how to use temp views. Write&amp;nbsp;&lt;/SPAN&gt;df_union&lt;SPAN&gt;&amp;nbsp;as a temp view and then read it on every step of the loop to ensure the message being sent uses a cached relation.&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;# dfs is a list of dataframes to aggregate with union commands. df_union stores the aggregation, and starts with the first dataframe in the list.
df_union = dfs[0]

#loop through all other dataframes in the list performing unions
for df in dfs[1:]:
	df_union = df_union.union(df)
	#create the temp view
	df_union.createOrReplaceTempView("df_union")
	# make it so df_union now has its data coming from the temp view and can take advantage of the cached_relation
	df_union = spark.sql("SELECT * FROM df_union")&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Source:&amp;nbsp;&lt;A href="https://kb.databricks.com/python/resources_exhausted-error-message-when-trying-to-perform-self-joins-with-spark-connect" target="_blank" rel="noopener"&gt;RESOURCES_EXHAUSTED error message when trying to perform self-joins with Spark Connect - Databricks&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 08 Oct 2025 10:57:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134200#M2281</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-10-08T10:57:40Z</dc:date>
    </item>
    <item>
      <title>Re: Union of tiny dataframes exhausts resource, memory error</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134204#M2283</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/110502"&gt;@szymon_dybczak&lt;/a&gt;,&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you for your reply, but your advice doesn't work for me.&amp;nbsp; I am unable to change the message size limit because I have to use Serverless.&amp;nbsp; I also tried your code using temp views, but that doesn't work either.&amp;nbsp; It gives this error:&amp;nbsp;&lt;/P&gt;&lt;P&gt;SparkConnectGrpcException: &amp;lt;_MultiThreadedRendezvous of RPC that terminated with:&lt;/P&gt;&lt;P&gt;status = StatusCode.RESOURCE_EXHAUSTED&lt;/P&gt;&lt;P&gt;details = "CLIENT: Sent message larger than max (200529416 vs. 134217728)"&lt;/P&gt;</description>
      <pubDate>Wed, 08 Oct 2025 11:22:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134204#M2283</guid>
      <dc:creator>CEH</dc:creator>
      <dc:date>2025-10-08T11:22:34Z</dc:date>
    </item>
    <item>
      <title>Re: Union of tiny dataframes exhausts resource, memory error</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134869#M2284</link>
      <description>&lt;P class="p3"&gt;&lt;STRONG&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/189678"&gt;@CEH&lt;/a&gt;,&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;What you’re running into looks like a &lt;SPAN class="s2"&gt;&lt;STRONG&gt;Spark Connect gRPC message-size limit&lt;/STRONG&gt;&lt;/SPAN&gt;, not a computational failure with the union itself. Even with smallish row counts, the serialized payload (either the inlined query plan or Arrow batch results) can blow past the default &lt;SPAN class="s2"&gt;&lt;STRONG&gt;128 MB gRPC cap&lt;/STRONG&gt;&lt;/SPAN&gt; and trigger a &lt;SPAN class="s3"&gt;RESOURCE_EXHAUSTED&lt;/SPAN&gt; error — the classic “Sent message larger than max (… vs. 134217728)”&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;Why this happens&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Local-relation inlining explodes plan size.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p2"&gt;If &lt;SPAN class="s1"&gt;df1&lt;/SPAN&gt;/&lt;SPAN class="s1"&gt;df2&lt;/SPAN&gt; are created from local Python data, Spark inlines that data as a &lt;SPAN class="s1"&gt;local_relation&lt;/SPAN&gt;. When you union (or self-join) them, that data is duplicated in the plan, and the serialized plan can easily exceed 128 MB even if the row count looks tiny.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Arrow batches can exceed the limit.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p2"&gt;&lt;SPAN class="s1"&gt;display()&lt;/SPAN&gt; or &lt;SPAN class="s1"&gt;collect()&lt;/SPAN&gt; sends Arrow batches back to the client. A single batch with wide string columns or many rows can tip the scale.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;Limit is hard-coded at 128 MB.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p2"&gt;Some configs exist to raise it, but Databricks environments may not honor them yet.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;&lt;STRONG&gt;Practical fixes (pick one – or layer a few)&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P class="p3"&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":keycap_1:"&gt;1️⃣&lt;/span&gt; Materialize first, then union via SQL.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p1"&gt;This swaps &lt;SPAN class="s3"&gt;local_relation&lt;/SPAN&gt; for a cached relation.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

results = spark.sql("""
  SELECT c1, c2, c3, c4, c5, c6, c7 FROM df1
  UNION ALL
  SELECT c1, c2, c3, c4, c5, c6, c7 FROM df2
""")

display(results.limit(1000))&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":keycap_2:"&gt;2️⃣&lt;/span&gt; Persist and read back (Delta or managed tables).&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p2"&gt;Catalog-backed relations avoid the message-size constraint.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;df1.write.mode("overwrite").saveAsTable("tmp.df1")
df2.write.mode("overwrite").saveAsTable("tmp.df2")

results = spark.sql("""
  SELECT * FROM tmp.df1
  UNION ALL
  SELECT * FROM tmp.df2
""")

display(results.limit(1000))&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":keycap_3:"&gt;3️⃣&lt;/span&gt; Reduce Arrow batch size.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p2"&gt;Smaller batches → smaller gRPC messages.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "2000")
display(results)&lt;/CODE&gt;&lt;/PRE&gt;
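One caveat worth checking (a rough sketch, using the per-row payload implied by the error message earlier in this thread): at roughly 106 KiB per row, all 1,850 rows still fit in a single batch when the cap is 2000 records, so a noticeably smaller value may be needed before a batch actually drops under the 128 MiB limit.

```python
# Rough arithmetic for picking maxRecordsPerBatch.
# Assumption: per-row payload inferred from the error message in this thread.
per_row = 200_529_144 / 1850       # implied bytes per row
cap = 134_217_728                  # 128 MiB gRPC message cap

print(int(cap // per_row))        # max rows that fit in one batch under the cap
```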
&lt;P class="p1"&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":keycap_4:"&gt;4️⃣&lt;/span&gt; Truncate wide string columns before display.&lt;/STRONG&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import col, substring

small = results.select(
    col("c1"),
    *[substring(col(c), 1, 2000).alias(c) for c in ["c2","c3","c4","c5","c6"]],
    col("c7")
).limit(1000)

display(small)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":keycap_5:"&gt;5️⃣&lt;/span&gt; Avoid big local Python objects.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P class="p2"&gt;Parallelize before creating DataFrames:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;rdd = spark.sparkContext.parallelize(local_rows)
df1 = spark.createDataFrame(rdd, schema=my_schema)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":keycap_6:"&gt;6️⃣&lt;/span&gt; (Advanced)&lt;/STRONG&gt;&lt;/SPAN&gt; Try raising the limit — if allowed:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("spark.connect.grpc.maxInboundMessageSize", 268435456)  # 256 MB&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P class="p1"&gt;May or may not be honored depending on your workspace setup.&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":light_bulb:"&gt;💡&lt;/span&gt; Why &lt;/STRONG&gt;&lt;STRONG&gt;repartition()&amp;nbsp;&lt;/STRONG&gt;&lt;STRONG style="color: inherit;"&gt;didn’t help&lt;/STRONG&gt;&lt;/H3&gt;
&lt;P class="p1"&gt;Repartitioning changes distribution, not the size of the serialized Arrow batch or query plan. The failure happens during serialization when the message crosses the 128 MB threshold — not during computation.&lt;/P&gt;
&lt;P class="p2"&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;&lt;STRONG&gt;&lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; Quick checklist to unblock you&lt;/STRONG&gt;&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;Convert &lt;SPAN class="s1"&gt;df1&lt;/SPAN&gt;/&lt;SPAN class="s1"&gt;df2&lt;/SPAN&gt; to temp views and &lt;SPAN class="s1"&gt;UNION ALL&lt;/SPAN&gt; via SQL.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;&lt;SPAN class="s1"&gt;Lower &lt;/SPAN&gt;spark.sql.execution.arrow.maxRecordsPerBatch&lt;SPAN class="s1"&gt;.&lt;/SPAN&gt;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;Truncate long string columns before display.&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;
&lt;P class="p1"&gt;If data comes from local Python, use &lt;SPAN class="s1"&gt;parallelize()&lt;/SPAN&gt; or persist &amp;amp; read back.&lt;/P&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P class="p1"&gt;Hope this helps, Lou.&lt;/P&gt;</description>
      <pubDate>Tue, 14 Oct 2025 14:31:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134869#M2284</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-10-14T14:31:54Z</dc:date>
    </item>
    <item>
      <title>Re: Union of tiny dataframes exhausts resource, memory error</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134878#M2285</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/34815"&gt;@Louis_Frolio&lt;/a&gt;&amp;nbsp;,&lt;/P&gt;&lt;P&gt;1. I've suggested this approach already in my first reply - unfortunately it didn't help&amp;nbsp;&lt;/P&gt;&lt;P&gt;3, 5, 6 - won't work here because &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/189678"&gt;@CEH&lt;/a&gt;&amp;nbsp; is using serverless compute&lt;/P&gt;&lt;P&gt;2,4 - worth a try &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 14 Oct 2025 15:20:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/134878#M2285</guid>
      <dc:creator>szymon_dybczak</dc:creator>
      <dc:date>2025-10-14T15:20:42Z</dc:date>
    </item>
    <item>
      <title>Re: Union of tiny dataframes exhausts resource, memory error</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/136736#M2305</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/189678"&gt;@CEH&lt;/a&gt;!&lt;/P&gt;
&lt;P&gt;Did any of the suggestions above help resolve the issue?&lt;BR /&gt;If so, please mark the most helpful reply as the accepted solution. Or, if you found another fix,&amp;nbsp;please share it with the community so others can benefit as well.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Oct 2025 11:32:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/union-of-tiny-dataframes-exhausts-resource-memory-error/m-p/136736#M2305</guid>
      <dc:creator>Advika</dc:creator>
      <dc:date>2025-10-30T11:32:50Z</dc:date>
    </item>
  </channel>
</rss>

