How to apply Pandas functions on PySpark DataFrame?
10-22-2022 03:38 AM
Hi,
I want to apply Pandas functions (like isna, concat, append, etc.) to a PySpark DataFrame in such a way that the computation runs on the multi-node cluster.
I don't want to convert the PySpark DataFrame into a Pandas DataFrame because, as I understand it, that would leave only one node doing the computation.
What is the best way to use Pandas functions on a PySpark DataFrame while keeping all processing on the multi-node cluster?
- Labels:
- PySpark DataFrames
10-23-2022 02:00 PM
The best option is to use the pandas API on Spark (pyspark.pandas). It is virtually interchangeable with Pandas: it is just a different API over the same Spark DataFrame, so the computation stays distributed across the cluster.
import pyspark.pandas as ps

# Create a pandas-on-Spark DataFrame with a single "id" column (0..9)
psdf = ps.range(10)

# Convert it to a regular Spark DataFrame and use the Spark API on it
sdf = psdf.to_spark().filter("id > 5")
sdf.show()
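The Pandas-style methods from the question (isna, concat, etc.) also work directly on the pandas-on-Spark DataFrame, so there is no need to drop down to the Spark API at all. A minimal sketch, reusing the same psdf as above; every call here still executes distributed on the cluster:

import pyspark.pandas as ps

psdf = ps.range(10)

# Pandas-style null check, evaluated by Spark across the cluster
print(psdf.isna().head())

# Pandas-style concatenation of two pandas-on-Spark DataFrames
# (ps.concat also covers the DataFrame.append use case, which Pandas
# itself has deprecated)
combined = ps.concat([psdf, ps.range(5)])
print(len(combined))  # 15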
10-23-2022 03:06 PM
Thanks for your reply.
I want to apply Pandas functions to a PySpark DataFrame (the way I use Pandas on DataFrames on my local laptop). But I think the above example uses the PySpark function "filter", not a Pandas function.
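For reference, the same filter can be written in Pandas style without touching the Spark API. A minimal sketch, assuming the psdf from the earlier reply; boolean indexing replaces Spark's filter while execution stays distributed:

import pyspark.pandas as ps

psdf = ps.range(10)

# Pandas-style boolean indexing instead of the Spark filter();
# Spark still runs this across the cluster
filtered = psdf[psdf["id"] > 5]
print(filtered.head())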

