<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: foreachPartition in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158741#M54760</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/229230"&gt;@yanchr&lt;/a&gt;,&lt;/P&gt;
&lt;P data-pm-slice="1 1 []"&gt;I think the safest way to explain this is that the DataFrame.foreachPartition docstring is directionally true for classic Spark, but it is not a reliable way to predict behaviour on Databricks shared or standard compute.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;On shared clusters, &lt;A href="https://docs.databricks.com/aws/en/compute/standard-limitations" rel="noopener noreferrer nofollow" target="_blank"&gt;RDD APIs are not supported&lt;/A&gt;, and newer runtimes use &lt;A href="https://docs.databricks.com/aws/en/spark/connect-vs-classic" rel="noopener noreferrer nofollow" target="_blank"&gt;Spark Connect&lt;/A&gt;, which changes how some DataFrame APIs are implemented and executed. So even though the PySpark docs describe DataFrame.foreachPartition as shorthand for df.rdd.foreachPartition(), that does not mean Databricks must expose or route it through the user-visible .rdd API in shared mode.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;That is why what you observed is possible... df.rdd.foreachPartition(...) can be blocked because it is explicitly an RDD API, while df.foreachPartition(...) can still succeed because it is exposed as a supported DataFrame-level API on that runtime.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;So if your code is already using pyspark.sql.DataFrame.foreachPartition and it works on the target shared cluster/runtime, I would not treat that as something you must migrate away from purely because the docstring mentions .rdd. The thing I would avoid is relying on df.rdd itself on shared compute.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;Whether you should move to mapInPandas or mapInArrow depends more on the kind of work you are doing. If this is a side-effecting per-partition action, keeping DataFrame.foreachPartition is reasonable. If you are really doing an RDD-style partition transformation that returns rows, then it is better to move toward DataFrame-native approaches such as mapInPandas or other Arrow-based patterns. Databricks' guidance for shared-cluster migrations points users away from RDD mapPartitions patterns and toward DataFrame APIs and native Arrow UDFs, and the standard compute docs also call out support for applyInPandas and mapInPandas on newer runtimes. See &lt;A href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/jobs-update" rel="noopener noreferrer nofollow" target="_blank"&gt;Update jobs when you upgrade legacy workspaces to Unity Catalog&lt;/A&gt; and the &lt;A href="https://docs.databricks.com/aws/en/compute/standard-limitations" rel="noopener noreferrer nofollow" target="_blank"&gt;standard compute limitations&lt;/A&gt; page.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;So the short answer is...yes, DataFrame.foreachPartition can behave differently from df.rdd.foreachPartition(...) on shared clusters, and if the former works on your target DBR, it is generally fine to use it. I would only recommend migrating to mapInPandas or mapInArrow if you need a partition-wise transformation rather than a per-partition action.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;FONT size="2" color="#FF6600"&gt;&lt;STRONG&gt;&lt;I&gt;If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.&lt;/I&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;I&gt;&lt;/I&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 10 Jun 2026 17:27:38 GMT</pubDate>
    <dc:creator>Ashwin_DSA</dc:creator>
    <dc:date>2026-06-10T17:27:38Z</dc:date>
    <item>
      <title>foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158736#M54758</link>
      <description>&lt;P&gt;Is there any difference between&amp;nbsp;pyspark.RDD.foreachPartition and&amp;nbsp;pyspark.sql.DataFrame.foreachPartition under the hood?&amp;nbsp;The PySpark documentation describes pyspark.sql.DataFrame.foreachPartition as "a shorthand for df.rdd.foreachPartition()".&lt;/P&gt;&lt;P&gt;If DataFrame.foreachPartition internally touches .rdd, then on a shared cluster where RDD APIs and sparkContext access are restricted I'd expect it to be blocked, just like calling df.rdd.foreachPartition(...) directly. But it ran successfully.&lt;/P&gt;&lt;P&gt;So is it okay to use pyspark.sql.DataFrame.foreachPartition on a shared cluster, or should I migrate to mapInPandas / mapInArrow?&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2026 15:50:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158736#M54758</guid>
      <dc:creator>yanchr</dc:creator>
      <dc:date>2026-06-10T15:50:11Z</dc:date>
    </item>
    <item>
      <title>Re: foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158741#M54760</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/229230"&gt;@yanchr&lt;/a&gt;,&lt;/P&gt;
&lt;P data-pm-slice="1 1 []"&gt;I think the safest way to explain this is that the DataFrame.foreachPartition docstring is directionally true for classic Spark, but it is not a reliable way to predict behaviour on Databricks shared or standard compute.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;On shared clusters, &lt;A href="https://docs.databricks.com/aws/en/compute/standard-limitations" rel="noopener noreferrer nofollow" target="_blank"&gt;RDD APIs are not supported&lt;/A&gt;, and newer runtimes use &lt;A href="https://docs.databricks.com/aws/en/spark/connect-vs-classic" rel="noopener noreferrer nofollow" target="_blank"&gt;Spark Connect&lt;/A&gt;, which changes how some DataFrame APIs are implemented and executed. So even though the PySpark docs describe DataFrame.foreachPartition as shorthand for df.rdd.foreachPartition(), that does not mean Databricks must expose or route it through the user-visible .rdd API in shared mode.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;That is why what you observed is possible... df.rdd.foreachPartition(...) can be blocked because it is explicitly an RDD API, while df.foreachPartition(...) can still succeed because it is exposed as a supported DataFrame-level API on that runtime.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;So if your code is already using pyspark.sql.DataFrame.foreachPartition and it works on the target shared cluster/runtime, I would not treat that as something you must migrate away from purely because the docstring mentions .rdd. The thing I would avoid is relying on df.rdd itself on shared compute.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;Whether you should move to mapInPandas or mapInArrow depends more on the kind of work you are doing. If this is a side-effecting per-partition action, keeping DataFrame.foreachPartition is reasonable. If you are really doing an RDD-style partition transformation that returns rows, then it is better to move toward DataFrame-native approaches such as mapInPandas or other Arrow-based patterns. Databricks' guidance for shared-cluster migrations points users away from RDD mapPartitions patterns and toward DataFrame APIs and native Arrow UDFs, and the standard compute docs also call out support for applyInPandas and mapInPandas on newer runtimes. See &lt;A href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/jobs-update" rel="noopener noreferrer nofollow" target="_blank"&gt;Update jobs when you upgrade legacy workspaces to Unity Catalog&lt;/A&gt; and the &lt;A href="https://docs.databricks.com/aws/en/compute/standard-limitations" rel="noopener noreferrer nofollow" target="_blank"&gt;standard compute limitations&lt;/A&gt; page.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;So the short answer is...yes, DataFrame.foreachPartition can behave differently from df.rdd.foreachPartition(...) on shared clusters, and if the former works on your target DBR, it is generally fine to use it. I would only recommend migrating to mapInPandas or mapInArrow if you need a partition-wise transformation rather than a per-partition action.&lt;/P&gt;
&lt;P class="p1"&gt;&lt;FONT size="2" color="#FF6600"&gt;&lt;STRONG&gt;&lt;I&gt;If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.&lt;/I&gt;&lt;/STRONG&gt;&lt;/FONT&gt;&lt;I&gt;&lt;/I&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2026 17:27:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158741#M54760</guid>
      <dc:creator>Ashwin_DSA</dc:creator>
      <dc:date>2026-06-10T17:27:38Z</dc:date>
    </item>
    <item>
      <title>Re: foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158751#M54763</link>
      <description>&lt;P&gt;Keep what's working; migrate only if your use case grows beyond side effects.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2026 23:56:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158751#M54763</guid>
      <dc:creator>naveen0808</dc:creator>
      <dc:date>2026-06-10T23:56:47Z</dc:date>
    </item>
  </channel>
</rss>

