Hi,
I want to apply Pandas functions (like isna, concat, append, etc.) to a PySpark DataFrame in such a way that the computation is distributed across a multi-node cluster.
I don't want to convert the PySpark DataFrame into a Pandas DataFrame with toPandas(), since, as I understand it, that collects all the data onto the driver so only one node does the computation.
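To make the concern concrete, here is a minimal sketch of the pattern I'm trying to avoid (assuming a hypothetical Spark DataFrame named `sdf`):

```python
# Sketch of the single-node pattern I want to avoid:
pdf = sdf.toPandas()        # collects all rows onto the driver node
missing = pdf.isna().sum()  # pandas then runs on that one node only
```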
What is the best way to use Pandas functions on a PySpark DataFrame while keeping all processing distributed across the multi-node cluster?
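I've come across the pandas API on Spark (pyspark.pandas), which looks like it might do this. Is something along these lines the recommended approach, or is there a better option? (A rough sketch, assuming Spark 3.2+ and the same hypothetical `sdf`.)

```python
import pyspark.pandas as ps

# Hypothetical sketch: pandas-style calls backed by Spark, so the
# work should stay distributed across the executors.
psdf = sdf.pandas_api()             # Spark DataFrame -> pandas-on-Spark (Spark 3.2+)
missing = psdf.isna().sum()         # evaluated as a distributed Spark job
combined = ps.concat([psdf, psdf])  # pandas-style concat, still distributed
```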