<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: foreachPartition in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158800#M54775</link>
    <description>&lt;P class=""&gt;&lt;SPAN&gt;Although the PySpark documentation states that &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; is a shorthand for &lt;/SPAN&gt;&lt;SPAN&gt;df.rdd.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt;, there is an important difference when running on Databricks shared clusters (especially with Unity Catalog and Spark Connect).&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Direct access to &lt;/SPAN&gt;&lt;SPAN&gt;df.rdd&lt;/SPAN&gt;&lt;SPAN&gt; is restricted on shared clusters, so &lt;/SPAN&gt;&lt;SPAN&gt;df.rdd.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; will typically fail.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;However, &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; is a supported DataFrame API. Spark Connect handles the execution on the server side, so the underlying RDD operations are not exposed to your notebook code.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;This is why &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; can work even when direct RDD access is blocked.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN&gt;Can you continue using &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;?&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN&gt;Yes. 
If it is working in your shared cluster environment, it is a supported DataFrame API and there is no immediate need to replace it solely because it may use RDDs internally.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN&gt;Should you migrate to &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;mapInPandas()&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt; or &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;mapInArrow()&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;?&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN&gt;Consider migrating only if:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;You need to transform and return data.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;You want to leverage Arrow-based optimizations.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Your use case aligns better with Pandas or Arrow processing.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;SPAN&gt;For simple side-effect operations (such as writing data to an external system per partition), &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; remains a valid and commonly used approach.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In short, the fact that &lt;/SPAN&gt;&lt;SPAN&gt;df.rdd&lt;/SPAN&gt;&lt;SPAN&gt; is blocked does not automatically mean &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; is unsupported. If your workload only needs partition-level processing without returning a DataFrame, continuing to use &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; is perfectly reasonable.&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 11 Jun 2026 13:05:52 GMT</pubDate>
    <dc:creator>ashukasma</dc:creator>
    <dc:date>2026-06-11T13:05:52Z</dc:date>
    <item>
      <title>foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158736#M54758</link>
      <description>&lt;P&gt;Is there any difference between pyspark.RDD.foreachPartition and pyspark.sql.DataFrame.foreachPartition under the hood? The PySpark documentation describes pyspark.sql.DataFrame.foreachPartition as "a shorthand for df.rdd.foreachPartition()".&lt;/P&gt;&lt;P&gt;If DataFrame.foreachPartition internally touches .rdd, then on a shared cluster where RDD APIs and sparkContext access are restricted I'd expect it to be blocked, just like calling df.rdd.foreachPartition(...) directly. But it ran successfully.&lt;/P&gt;&lt;P&gt;So is it okay to use pyspark.sql.DataFrame.foreachPartition on a shared cluster, or should I migrate to mapInPandas / mapInArrow?&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2026 15:50:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158736#M54758</guid>
      <dc:creator>yanchr</dc:creator>
      <dc:date>2026-06-10T15:50:11Z</dc:date>
    </item>
    <item>
      <title>Re: foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158741#M54760</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/229230"&gt;@yanchr&lt;/a&gt;,&lt;/P&gt;
&lt;P data-pm-slice="1 1 []"&gt;I think the safest way to explain this is that the DataFrame.foreachPartition docstring is directionally true for classic Spark, but it is not a reliable way to predict behaviour on Databricks shared or standard compute.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;On shared clusters, &lt;A href="https://docs.databricks.com/aws/en/compute/standard-limitations" rel="noopener noreferrer nofollow" target="_blank"&gt;RDD APIs are not supported&lt;/A&gt;, and newer runtimes use &lt;A href="https://docs.databricks.com/aws/en/spark/connect-vs-classic" rel="noopener noreferrer nofollow" target="_blank"&gt;Spark Connect&lt;/A&gt;, which changes how some DataFrame APIs are implemented and executed. So even though the PySpark docs describe DataFrame.foreachPartition as shorthand for df.rdd.foreachPartition(), that does not mean Databricks must expose or route it through the user-visible .rdd API in shared mode.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;That is why what you observed is possible: df.rdd.foreachPartition(...) can be blocked because it is explicitly an RDD API, while df.foreachPartition(...) can still succeed because it is exposed as a supported DataFrame-level API on that runtime.&lt;/P&gt;
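&lt;P&gt;If you want to confirm this on your own compute, a small probe along these lines may help. It is only a sketch: probe_foreach_partition and _noop are hypothetical helper names, and it catches a broad Exception because the exact error raised for blocked RDD access can differ between runtimes.&lt;/P&gt;
&lt;PRE&gt;
```python
def _noop(rows):
    # Consume the partition iterator without doing anything with the rows.
    for _ in rows:
        pass


def probe_foreach_partition(df):
    """Report which of the two call styles works on the current compute.

    `df` is assumed to be a (small) PySpark DataFrame.
    """
    results = {}

    # DataFrame-level API: expected to work where it is supported.
    try:
        df.foreachPartition(_noop)
        results["DataFrame.foreachPartition"] = "ok"
    except Exception as exc:  # exception type varies by runtime (assumption)
        results["DataFrame.foreachPartition"] = "failed: " + type(exc).__name__

    # Explicit RDD access: expected to be rejected on shared compute.
    try:
        df.rdd.foreachPartition(_noop)
        results["df.rdd.foreachPartition"] = "ok"
    except Exception as exc:
        results["df.rdd.foreachPartition"] = "failed: " + type(exc).__name__

    return results
```
&lt;/PRE&gt;
&lt;P&gt;Calling probe_foreach_partition(df) on a small DataFrame returns a dict with one entry per call style, which makes it easy to compare behaviour across cluster access modes and DBR versions.&lt;/P&gt;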
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;So if your code is already using pyspark.sql.DataFrame.foreachPartition and it works on the target shared cluster/runtime, I would not treat that as something you must migrate away from purely because the docstring mentions .rdd. The thing I would avoid is relying on df.rdd itself on shared compute.&lt;/P&gt;
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;Whether you should move to mapInPandas or mapInArrow depends more on the kind of work you are doing. If this is a side-effecting per-partition action, keeping DataFrame.foreachPartition is reasonable. If you are really doing an RDD-style partition transformation that returns rows, then it is better to move toward DataFrame-native approaches such as mapInPandas or other Arrow-based patterns. Databricks' guidance for shared-cluster migrations points users away from RDD mapPartitions patterns and toward DataFrame APIs and native Arrow UDFs, and the standard compute docs also call out support for applyInPandas and mapInPandas on newer runtimes. See &lt;A href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/jobs-update" rel="noopener noreferrer nofollow" target="_blank"&gt;Update jobs when you upgrade legacy workspaces to Unity Catalog&lt;/A&gt; and the &lt;A href="https://docs.databricks.com/aws/en/compute/standard-limitations" rel="noopener noreferrer nofollow" target="_blank"&gt;standard compute limitations&lt;/A&gt; page.&lt;/P&gt;
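&lt;P&gt;To make the action-versus-transformation distinction concrete, here is a rough sketch. Everything named here is an assumption for illustration: send_partition, add_total and run are hypothetical helpers, sink stands for any picklable callable that pushes a batch of rows to your external system, and the DataFrame is assumed to have exactly two columns, price (double) and qty (long).&lt;/P&gt;
&lt;PRE&gt;
```python
def send_partition(rows, sink, batch_size=500):
    """Per-partition side effect: push rows to an external sink in batches."""
    batch = []
    sent = 0
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            sink(batch)
            sent += len(batch)
            batch = []
    if batch:
        # Flush the final, partially filled batch.
        sink(batch)
        sent += len(batch)
    return sent


def add_total(batches):
    """Partition-wise transformation for mapInPandas: yields new DataFrames."""
    for pdf in batches:
        yield pdf.assign(total=pdf["price"] * pdf["qty"])


def run(df, sink):
    # Action: runs for its side effects; any return value from the
    # per-partition function is ignored and no DataFrame comes back.
    # `sink` must be picklable because it is shipped to the executors.
    df.foreachPartition(lambda rows: send_partition(rows, sink))

    # Transformation: returns a new DataFrame with the declared schema.
    return df.mapInPandas(add_total, schema="price double, qty long, total double")
```
&lt;/PRE&gt;
&lt;P&gt;The first call fits the "side-effecting per-partition action" case; the second is the kind of row-returning logic that would otherwise end up in an RDD mapPartitions call.&lt;/P&gt;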
&lt;P class="wnfdntt _1ibi0s3f5 _1ibi0s3ce _1ibi0s3ea"&gt;So the short answer is yes: DataFrame.foreachPartition can behave differently from df.rdd.foreachPartition(...) on shared clusters, and if the former works on your target DBR, it is generally fine to use it. I would only recommend migrating to mapInPandas or mapInArrow if you need a partition-wise transformation rather than a per-partition action.&lt;/P&gt;
</description>
      <pubDate>Wed, 10 Jun 2026 17:27:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158741#M54760</guid>
      <dc:creator>Ashwin_DSA</dc:creator>
      <dc:date>2026-06-10T17:27:38Z</dc:date>
    </item>
    <item>
      <title>Re: foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158751#M54763</link>
      <description>&lt;P&gt;Keep what's working; migrate only if your use case grows beyond side effects.&lt;/P&gt;</description>
      <pubDate>Wed, 10 Jun 2026 23:56:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158751#M54763</guid>
      <dc:creator>naveen0808</dc:creator>
      <dc:date>2026-06-10T23:56:47Z</dc:date>
    </item>
    <item>
      <title>Re: foreachPartition</title>
      <link>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158800#M54775</link>
      <description>&lt;P class=""&gt;&lt;SPAN&gt;Although the PySpark documentation states that &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; is a shorthand for &lt;/SPAN&gt;&lt;SPAN&gt;df.rdd.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt;, there is an important difference when running on Databricks shared clusters (especially with Unity Catalog and Spark Connect).&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;Direct access to &lt;/SPAN&gt;&lt;SPAN&gt;df.rdd&lt;/SPAN&gt;&lt;SPAN&gt; is restricted on shared clusters, so &lt;/SPAN&gt;&lt;SPAN&gt;df.rdd.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; will typically fail.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;However, &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; is a supported DataFrame API. Spark Connect handles the execution on the server side, so the underlying RDD operations are not exposed to your notebook code.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;This is why &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; can work even when direct RDD access is blocked.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN&gt;Can you continue using &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;?&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN&gt;Yes. 
If it is working in your shared cluster environment, it is a supported DataFrame API and there is no immediate need to replace it solely because it may use RDDs internally.&lt;/SPAN&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;STRONG&gt;&lt;SPAN&gt;Should you migrate to &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;mapInPandas()&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt; or &lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;mapInArrow()&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;SPAN&gt;?&lt;/SPAN&gt;&lt;/STRONG&gt;&lt;/P&gt;&lt;P class=""&gt;&lt;SPAN&gt;Consider migrating only if:&lt;/SPAN&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN&gt;You need to transform and return data.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;You want to leverage Arrow-based optimizations.&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN&gt;Your use case aligns better with Pandas or Arrow processing.&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P class=""&gt;&lt;SPAN&gt;For simple side-effect operations (such as writing data to an external system per partition), &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; remains a valid and commonly used approach.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;In short, the fact that &lt;/SPAN&gt;&lt;SPAN&gt;df.rdd&lt;/SPAN&gt;&lt;SPAN&gt; is blocked does not automatically mean &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; is unsupported. If your workload only needs partition-level processing without returning a DataFrame, continuing to use &lt;/SPAN&gt;&lt;SPAN&gt;DataFrame.foreachPartition()&lt;/SPAN&gt;&lt;SPAN&gt; is perfectly reasonable.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Jun 2026 13:05:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/foreachpartition/m-p/158800#M54775</guid>
      <dc:creator>ashukasma</dc:creator>
      <dc:date>2026-06-11T13:05:52Z</dc:date>
    </item>
  </channel>
</rss>

