cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

foreachPartition

yanchr
New Contributor II

Is there any difference between pyspark.RDD.foreachPartition vs pyspark.sql.DataFrame.foreachPartition under the hood? The PySpark documentation describes pyspark.sql.DataFrame.foreachPartition as "a shorthand for df.rdd.foreachPartition()"

If DataFrame.foreachPartition internally touches .rdd, then on a shared cluster where RDD APIs and sparkContext access are restricted I'd expect it to be blocked, just like calling df.rdd.foreachPartition(...) directly. But I've got successful result.

So is it okay to usepyspark.sql.DataFrame.foreachPartition on shared cluster or should I migrate to mapInPandas / mapInArrow?

1 REPLY 1

Ashwin_DSA
Databricks Employee
Databricks Employee

Hi @yanchr,

I think the safest way to explain this is that the DataFrame.foreachPartition docstring is directionally true for classic Spark, but it is not a reliable way to predict behaviour on Databricks shared or standard compute.

On shared clusters, RDD APIs are not supported, and newer runtimes use Spark Connect, which changes how some DataFrame APIs are implemented and executed. So even though the PySpark docs describe DataFrame.foreachPartition as shorthand for df.rdd.foreachPartition(), that does not mean Databricks must expose or route it through the user-visible .rdd API in shared mode.

That is why what you observed is possible... df.rdd.foreachPartition(...) can be blocked because it is explicitly an RDD API, while df.foreachPartition(...) can still succeed because it is exposed as a supported DataFrame-level API on that runtime.

So if your code is already using pyspark.sql.DataFrame.foreachPartition and it works on the target shared cluster/runtime, I would not treat that as something you must migrate away from purely because the docstring mentions .rdd. The thing I would avoid is relying on df.rdd itself on shared compute.

Whether you should move to mapInPandas or mapInArrow depends more on the kind of work you are doing. If this is a side-effecting per-partition action, keeping DataFrame.foreachPartition is reasonable. If you are really doing an RDD-style partition transformation that returns rows, then it is better to move toward DataFrame-native approaches such as mapInPandas or other Arrow-based patterns. Databricks' guidance for shared-cluster migrations points users away from RDD mapPartitions patterns and toward DataFrame APIs and native Arrow UDFs, and the standard compute docs also call out support for applyInPandas and mapInPandas on newer runtimes. See Update jobs when you upgrade legacy workspaces to Unity Catalog and the standard compute limitations page.

So the short answer is...yes, DataFrame.foreachPartition can behave differently from df.rdd.foreachPartition(...) on shared clusters, and if the former works on your target DBR, it is generally fine to use it. I would only recommend migrating to mapInPandas or mapInArrow if you need a partition-wise transformation rather than a per-partition action.

If this answer resolves your question, could you mark it as โ€œAccept as Solutionโ€? That helps other users quickly find the correct fix.

Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***