Data Engineering

How to apply Pandas functions on PySpark DataFrame?

Mado
Valued Contributor II

Hi,

I want to apply pandas functions (such as isna, concat, and append) to a PySpark DataFrame in such a way that the computation runs on a multi-node cluster.

I don't want to convert the PySpark DataFrame into a pandas DataFrame because, I think, only one node would then be used for the computation.

What is the best way to use pandas functions on a PySpark DataFrame while keeping all processing on the multi-node cluster?

2 REPLIES

Hubert-Dudek
Esteemed Contributor III

The best option is the pandas API on Spark (pyspark.pandas). It is virtually interchangeable with pandas, so it is just a different API for a Spark DataFrame.

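import pyspark.pandas as ps

# Create a pandas-on-Spark DataFrame with a single "id" column (0 to 9)
psdf = ps.range(10)
# Convert it to a regular Spark DataFrame and filter with a SQL expression
sdf = psdf.to_spark().filter("id > 5")
sdf.show()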

Mado
Valued Contributor II

Thanks for your reply.

I want to apply pandas functions to a PySpark DataFrame in the same way I use pandas on DataFrames on a local laptop. However, I think the example above uses the PySpark function "filter".
