foreachPartition

yanchr
New Contributor II

Is there any difference under the hood between pyspark.RDD.foreachPartition and pyspark.sql.DataFrame.foreachPartition? The PySpark documentation describes pyspark.sql.DataFrame.foreachPartition as "a shorthand for df.rdd.foreachPartition()".

If DataFrame.foreachPartition internally touches .rdd, then on a shared cluster where RDD APIs and sparkContext access are restricted I'd expect it to be blocked, just like calling df.rdd.foreachPartition(...) directly. But when I ran it, the call succeeded.

So is it okay to use pyspark.sql.DataFrame.foreachPartition on a shared cluster, or should I migrate to mapInPandas / mapInArrow?
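
To make the comparison concrete, here is a minimal sketch of the two forms I mean. It is an illustration, not my real job: send_rows is a hypothetical stand-in for the per-partition side effect, the column name rows_sent is made up, and df is assumed to be an existing pyspark.sql.DataFrame.

```python
from typing import Iterable, Iterator


def send_rows(rows: Iterable) -> int:
    """Hypothetical sink: stands in for a per-partition side effect
    (for example a batched write). Here it only counts the rows."""
    count = 0
    for _ in rows:
        count += 1
    return count


def foreach_partition_handler(rows: Iterator) -> None:
    # Shape used with DataFrame.foreachPartition:
    # an iterator of Rows comes in, nothing is returned.
    send_rows(rows)


def map_in_pandas_handler(batches: Iterator) -> Iterator:
    # Shape used with DataFrame.mapInPandas:
    # an iterator of pandas DataFrames comes in,
    # an iterator of pandas DataFrames goes out.
    import pandas as pd  # imported here so the helpers above need no pandas

    for batch in batches:
        sent = send_rows(batch.itertuples(index=False))
        yield pd.DataFrame({"rows_sent": [sent]})


def run(df):
    # df is assumed to be an existing pyspark.sql.DataFrame.
    # Current approach: an action that runs the handler on each partition.
    df.foreachPartition(foreach_partition_handler)

    # Possible replacement: a transformation, so an action (collect) is
    # needed to actually execute it. The schema string describes the output.
    return df.mapInPandas(map_in_pandas_handler, schema="rows_sent long").collect()
```

One difference I'm aware of: foreachPartition is itself an action, while mapInPandas is lazy and only runs once an action such as collect() is called on its result.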