
foreachPartition

yanchr — Wed, 10 Jun 2026 15:50:11 GMT

Is there any difference between pyspark.RDD.foreachPartition and pyspark.sql.DataFrame.foreachPartition under the hood? The PySpark documentation describes pyspark.sql.DataFrame.foreachPartition as "a shorthand for df.rdd.foreachPartition()".

If DataFrame.foreachPartition internally touches .rdd, then on a shared cluster where RDD APIs and sparkContext access are restricted I'd expect it to be blocked, just like calling df.rdd.foreachPartition(...) directly. But when I tried it, the call succeeded.

So is it okay to use pyspark.sql.DataFrame.foreachPartition on a shared cluster, or should I migrate to mapInPandas / mapInArrow?

Re: foreachPartition

Ashwin_DSA — Wed, 10 Jun 2026 17:27:38 GMT

Hi @yanchr,

I think the safest way to explain this is that the DataFrame.foreachPartition docstring is directionally true for classic Spark, but it is not a reliable way to predict behaviour on Databricks shared or standard compute.

On shared clusters, RDD APIs are not supported, and newer runtimes use Spark Connect, which changes how some DataFrame APIs are implemented and executed. So even though the PySpark docs describe DataFrame.foreachPartition as shorthand for df.rdd.foreachPartition(), that does not mean Databricks must expose or route it through the user-visible .rdd API in shared mode.

That is why what you observed is possible: df.rdd.foreachPartition(...) can be blocked because it is explicitly an RDD API, while df.foreachPartition(...) can still succeed because it is exposed as a supported DataFrame-level API on that runtime.

So if your code is already using pyspark.sql.DataFrame.foreachPartition and it works on the target shared cluster/runtime, I would not treat that as something you must migrate away from purely because the docstring mentions .rdd. The thing I would avoid is relying on df.rdd itself on shared compute.

Whether you should move to mapInPandas or mapInArrow depends more on the kind of work you are doing. If this is a side-effecting per-partition action, keeping DataFrame.foreachPartition is reasonable. If you are really doing an RDD-style partition transformation that returns rows, it is better to move toward DataFrame-native approaches such as mapInPandas or other Arrow-based patterns. Databricks' guidance for shared-cluster migrations points users away from RDD mapPartitions patterns and toward DataFrame APIs and native Arrow UDFs, and the standard compute docs also call out support for applyInPandas and mapInPandas on newer runtimes. See "Update jobs when you upgrade legacy workspaces to Unity Catalog" and the standard compute limitations page.
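To make the action-versus-transformation distinction concrete, here is a minimal sketch. It assumes a DataFrame `df` with a long column named `id`; the names `log_partition`, `add_doubled`, and `run` are made up for illustration, and whether mapInPandas is available still depends on your runtime.

```python
from typing import Any, Iterable, Iterator


def log_partition(rows: Iterable[Any]) -> None:
    """Per-partition action: consumes the rows for a side effect, returns nothing."""
    count = sum(1 for _ in rows)
    # On a cluster this print runs on the executor, so it lands in executor logs.
    print(f"processed {count} rows in this partition")


def add_doubled(batches: Iterator[Any]) -> Iterator[Any]:
    """Partition-wise transformation: receives pandas DataFrames, yields pandas DataFrames."""
    for pdf in batches:
        yield pdf.assign(doubled=pdf["id"] * 2)


def run(df: Any) -> Any:
    """Wire both functions to a Spark DataFrame (assumed to have a long column `id`)."""
    # Action: nothing comes back to the driver.
    df.foreachPartition(log_partition)
    # Transformation: returns a new DataFrame with the declared schema.
    return df.mapInPandas(add_doubled, schema="id long, doubled long")
```

The first call fits foreachPartition because nothing is returned. The second produces rows, which is the case where mapInPandas (or mapInArrow) is the better fit.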

So the short answer is yes: DataFrame.foreachPartition can behave differently from df.rdd.foreachPartition(...) on shared clusters, and if the former works on your target DBR, it is generally fine to use it. I would only recommend migrating to mapInPandas or mapInArrow if you need a partition-wise transformation rather than a per-partition action.

If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.

Re: foreachPartition

naveen0808 — Wed, 10 Jun 2026 23:56:47 GMT

Keep what's working; migrate only if your use case grows beyond side effects.

Re: foreachPartition

ashukasma — Thu, 11 Jun 2026 13:05:52 GMT

Although the PySpark documentation states that DataFrame.foreachPartition() is a shorthand for df.rdd.foreachPartition(), there is an important difference when running on Databricks shared clusters (especially with Unity Catalog and Spark Connect).

Direct access to df.rdd is restricted on shared clusters, so df.rdd.foreachPartition() will typically fail.
However, DataFrame.foreachPartition() is a supported DataFrame API. Spark Connect handles the execution on the server side, so the underlying RDD operations are not exposed to your notebook code.
This is why DataFrame.foreachPartition() can work even when direct RDD access is blocked.
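If you want to confirm this on your own cluster rather than rely on the docstring, a small probe like the one below may help. It is only a sketch: `probe`, `drain`, and `check_foreachpartition` are hypothetical helper names, and the exact exception raised for a blocked RDD call varies by runtime, so the code catches `Exception` broadly and just reports what it saw.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def probe(call: Callable[[], Any]) -> Tuple[bool, str]:
    """Run `call` and report whether it succeeded, without raising."""
    try:
        call()
        return True, "ok"
    except Exception as exc:  # the exception type on shared compute is not assumed
        return False, f"{type(exc).__name__}: {exc}"


def drain(rows: Iterable[Any]) -> None:
    """No-op partition handler that simply consumes the iterator."""
    for _ in rows:
        pass


def check_foreachpartition(df: Any) -> Dict[str, Tuple[bool, str]]:
    """Try both entry points against a Spark DataFrame and collect the outcomes."""
    return {
        "df.foreachPartition": probe(lambda: df.foreachPartition(drain)),
        # The lambda defers the `.rdd` access so a blocked attribute is caught too.
        "df.rdd.foreachPartition": probe(lambda: df.rdd.foreachPartition(drain)),
    }
```

Running `check_foreachpartition(spark.range(10))` in a notebook on the target cluster shows which of the two paths that runtime actually allows.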

Can you continue using DataFrame.foreachPartition()?

Yes. If it is working in your shared cluster environment, it is a supported DataFrame API and there is no immediate need to replace it solely because it may use RDDs internally.

Should you migrate to mapInPandas() or mapInArrow()?

Consider migrating only if:

You need to transform and return data.
You want to leverage Arrow-based optimizations.
Your use case aligns better with Pandas or Arrow processing.

For simple side-effect operations (such as writing data to an external system per partition), DataFrame.foreachPartition() remains a valid and commonly used approach.
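As an illustration of that kind of side-effect workload, the sketch below opens one connection per partition and sends rows in batches. The sink is hypothetical: `open_sink` stands for whatever factory returns your client, and it is assumed to expose `send(list_of_rows)` and `close()`. The `batch_size` default is arbitrary.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def batched(rows: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield lists of at most `size` rows without materializing the partition."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def make_partition_writer(
    open_sink: Callable[[], Any], batch_size: int = 500
) -> Callable[[Iterable[Any]], None]:
    """Build a handler suitable for df.foreachPartition(...)."""

    def write_partition(rows: Iterable[Any]) -> None:
        sink = open_sink()  # created on the executor, once per partition
        try:
            for chunk in batched(rows, batch_size):
                sink.send(chunk)
        finally:
            sink.close()  # always release the connection, even on failure

    return write_partition
```

Usage would look like `df.foreachPartition(make_partition_writer(open_sink))`. Because the handler runs on executors, it should report results through the external system itself rather than by mutating driver-side variables.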

In short, the fact that df.rdd is blocked does not automatically mean DataFrame.foreachPartition() is unsupported. If your workload only needs partition-level processing without returning a DataFrame, continuing to use DataFrame.foreachPartition() is perfectly reasonable.