
Pyspark vs Pandas

pjp94
Contributor

I'd like to better understand the advantage of writing a Python notebook in PySpark vs. pandas. Does the entire notebook need to be written in PySpark to realize the performance benefits? I currently have a script using pandas for all my transformations - can I just replace the 'inefficient' blocks with PySpark and keep the smaller/less costly transformations in pandas? Thanks!

13 REPLIES

cconnell
Contributor II

How do you know the pandas sections are less efficient?

I don't. This is more of a hypothetical question. I do have a script where I'm using pandas, but I want to account for my dataset growing larger over time. If my runtime ends up taking minutes... then would I benefit from using PySpark instead? And if yes, can I just replace the most workload-intensive tasks and keep everything else in pandas? Note: this is my first time in Databricks, so some of my assumptions may be off here. Please feel free to correct me if I'm wrong!

-werners-
Esteemed Contributor III

Ok, basically it comes down to this:

Pandas/Python is pretty good at data processing, as long as it can run on a single node.

If you do not have issues with processing your data on a single node, pandas is fine.

However, when you start getting OOM (out-of-memory) errors etc., it can be a good idea to look at pyspark.pandas.

Spark will use multiple nodes to process the data.

Of course this means you will have to rewrite code, but with the latest additions to Databricks this is not a daunting task. Here is an interesting article:

https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

Besides that, you can still use 'ordinary' pandas or Python, but beware that this code will be executed on the driver only (so in single-node mode).

You can mix pandas and pyspark.pandas, but it is not guaranteed to be faster than doing everything in pyspark.pandas, because it breaks Spark's processing logic into multiple parts.

But check out the article and see where it gets you.
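To make the mixing concrete, here is a rough sketch (the path and column names are just placeholders, and it assumes a runtime where pyspark.pandas is available):

import pyspark.pandas as ps

# heavy transformations: pandas-on-Spark distributes the work across the cluster
psdf = ps.read_parquet("/data/large_dataset")          # placeholder path
psdf = psdf[psdf["amount"] > 0]                        # placeholder column
agg = psdf.groupby("customer_id")["amount"].sum()      # placeholder columns

# small result: pull it back to plain pandas on the driver for the cheap steps
pdf = agg.to_pandas()
top = pdf.sort_values(ascending=False).head(100)

# and hand it back to Spark again if needed
psdf_top = ps.from_pandas(top.to_frame())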

Thanks for the clarification - exactly what I was looking for!

Hubert-Dudek
Esteemed Contributor III

As @Werner Stinckens said, "Spark will use multiple nodes to process the data".

If you like to use pandas code, there is the pandas API on Spark (since Spark 3.2). All you need to do is import from a different library:

# instead of: from pandas import read_csv
from pyspark.pandas import read_csv
pdf = read_csv("data.csv")

Here is more info https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
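After that swap, most pandas-style calls just keep working, only distributed (the column name below is made up for illustration):

pdf.head()
pdf.describe()
pdf.groupby("some_column").size()   # placeholder column name

# convert to a regular Spark DataFrame when needed
sdf = pdf.to_spark()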

When I run the following code:

import pyspark.pandas as ps

I receive an error: No module named 'pyspark.pandas'

Do you know how I can correct this?

-werners-
Esteemed Contributor III

You have to run one of the latest versions of Databricks, with Spark 3.2 (available from Runtime 10.0, I think).

Before that, pyspark.pandas was called Koalas. So if you are on a lower version, you should use Koalas instead, but it is the same thing.
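For example, on an older runtime the equivalent of the import above would look something like this (assuming the koalas package is available on the cluster; if not, it can usually be added with %pip install koalas):

# Koalas is the pre-Spark-3.2 name for the same pandas API
import databricks.koalas as ks

kdf = ks.read_csv("data.csv")
kdf.head()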

Yes, Databricks Runtime 10.0 or later.

Hubert-Dudek
Esteemed Contributor III

You need to have Runtime 10.0 or 10.1.

cconnell
Contributor II

It is important to understand the difference between: 1) pandas on a single computer, which is what most people mean when they talk about pandas; and 2) pandas on Spark, using the new pandas API in PySpark 3.2. It is not clear which one @Paras Patel is asking about.

For a detailed discussion about pandas on PySpark, see my article https://medium.com/@chuck.connell.3/pandas-on-spark-current-issues-and-workarounds-dc9ed30840ce

Hubert-Dudek
Esteemed Contributor III

Thank you for the article. Reading it now 🙂

pjp94
Contributor

Thanks to all of you for the clarification... helps a lot. Unfortunately, I'm on an organization cluster, so I can't upgrade and don't have permission to create a new cluster, so I will look into Koalas as an alternative to pyspark.pandas.

cconnell
Contributor II

You can use the free Community Edition of Databricks, which includes the 10.0 runtime.
