Re: Pandas API on Spark, Does it run on a multi-no...

Debayan · 10-18-2022

Hi @Mohammad Saber ,

A pandas dataset lives on a single machine and is naturally iterable locally on that machine. A pandas-on-Spark dataset, however, is spread across multiple machines and is computed in a distributed manner. It is difficult to iterate over locally, and users can easily end up collecting the entire dataset onto the client side without realizing it. Therefore, it is best to stick to the pandas-on-Spark APIs.
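As a rough illustration of the point above, here is a minimal sketch. The column names `key` and `value` and all three function names are made up for this example. The first function uses only DataFrame-level operations, so on a pandas-on-Spark DataFrame the work stays distributed. The second one iterates row by row, which pulls rows through the client and is the pattern to avoid.

```python
def total_by_key(df):
    """Sum `value` per `key` using only DataFrame-level operations.

    The same call works on a plain pandas DataFrame and on a
    pandas-on-Spark DataFrame; on the latter the aggregation is
    executed by Spark rather than on the client.
    """
    return df.groupby("key")["value"].sum()


def total_by_key_local_loop(df):
    """Anti-pattern on pandas-on-Spark: row-by-row iteration on the client."""
    totals = {}
    for _, row in df.iterrows():
        totals[row["key"]] = totals.get(row["key"], 0) + row["value"]
    return totals


def demo_on_spark():
    """Hypothetical demo; assumes PySpark 3.2+ is installed and a session can start."""
    import pyspark.pandas as ps

    psdf = ps.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
    result = total_by_key(psdf)  # still a distributed pandas-on-Spark Series
    # Only the small aggregated result is brought back to the client here.
    print(result.sort_index().to_pandas())
```

Calling `demo_on_spark()` on a cluster (or a local Spark session) should run the aggregation through Spark. Converting with `to_pandas()` is reasonable only once the result is small enough to fit on the client.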

Please refer to:

https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#use-p...

https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html

https://docs.databricks.com/languages/pandas-spark.html

Please let us know if you need further clarification. We are happy to assist.