PySpark vs Pandas
11-30-2021 04:15 PM
Would like to better understand the advantage of writing a Python notebook in PySpark vs pandas. Does the entire notebook need to be written in PySpark to realize the performance benefits? I currently have a script using pandas for all my transformations - can I just replace the 'inefficient' blocks with PySpark and keep the smaller/less costly transformations in pandas? Thanks!
- Labels: Pandas, Pyspark, Python, Python notebook
11-30-2021 04:17 PM
How do you know the pandas sections are less efficient?
11-30-2021 04:23 PM
I don't. This is more of a hypothetical question. I do have a script where I'm using pandas, but I want to account for my dataset growing larger over time. If my runtime ends up taking minutes... then would I benefit from using PySpark instead? And if yes, can I just replace the most workload-intensive tasks and keep everything else in pandas? Note: this is my first time in Databricks, so some of my assumptions may be off here. Please feel free to correct me if I'm wrong!
12-01-2021 02:31 AM
OK, basically it comes down to this:
Pandas/Python is pretty good at data processing, as long as it can run on a single node.
If you do not have issues with processing your data on a single node, pandas is fine.
However, when you start getting OOM (out-of-memory) errors etc., it can be a good idea to look at pyspark.pandas.
Spark will use multiple nodes to process the data.
Of course this means you will have to rewrite code, but with the latest additions to Databricks this is not a daunting task. Here is an interesting article:
https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
Besides that you can still use 'ordinary' pandas or Python, but beware that this code will be executed on the driver only (so in single-node mode).
You can mix pandas and pyspark.pandas, but it is not guaranteed that this will be faster than doing everything in pyspark.pandas, because it breaks Spark's processing logic into multiple parts (see the sketch below).
But check out the article and see where it gets you.
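A minimal sketch of what mixing the two could look like (the path and column names here are made up for illustration):
import pyspark.pandas as ps
# heavy transformations with pyspark.pandas: executed distributed across the cluster
psdf = ps.read_parquet("/mnt/data/transactions")            # hypothetical dataset
summary = psdf.groupby("customer_id")["amount"].sum()
# small result converted to plain pandas: executed on the driver only
pdf = summary.to_pandas()
top10 = pdf.sort_values(ascending=False).head(10)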
12-01-2021 06:43 AM
Thanks for the clarification - exactly what I was looking for!
12-01-2021 03:04 AM
As @Werner Stinckens said, "Spark will use multiple nodes to process the data".
If you like to use pandas code, there is the pandas API on Spark (since Spark 3.2). All you need to do is import a different library:
# instead of: from pandas import read_csv
from pyspark.pandas import read_csv
pdf = read_csv("data.csv")
Here is more info: https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
12-01-2021 06:56 AM
When I run the following code:
import pyspark.pandas as ps
I receive an error: No module named 'pyspark.pandas'
Do you know how I can correct this?
12-01-2021 06:58 AM
You have to run one of the latest versions of the Databricks Runtime, with Spark 3.2 (from 10.0 onward, I think).
Before that, pyspark.pandas was called Koalas. So if you are on a lower version, you should use Koalas instead, but it is the same thing.
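For example, a rough equivalent on an older runtime would look something like this (if Koalas is not already installed on the cluster, it can be added as the koalas PyPI package):
# on Databricks Runtime below 10.0 (Spark < 3.2), the same API is available as Koalas
import databricks.koalas as ks
kdf = ks.read_csv("data.csv")   # distributed, pandas-like DataFrame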
12-01-2021 07:01 AM
Yes, Databricks Runtime 10.0 or later.
12-01-2021 07:09 AM
You need to have Runtime 10.0 or 10.1.
12-01-2021 06:10 AM
It is important to understand the difference between 1) pandas on a single computer, which is what most people mean when they talk about pandas, and 2) pandas on Spark, using the new pandas API in PySpark 3.2. It is not clear which one @Paras Patel is asking about.
For a detailed discussion about pandas on PySpark, see my article https://medium.com/@chuck.connell.3/pandas-on-spark-current-issues-and-workarounds-dc9ed30840ce
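As a rough illustration of the difference between the two (and how to move between them); the tiny example DataFrame is just made up:
import pandas as pd
import pyspark.pandas as ps
pdf = pd.DataFrame({"x": range(5)})   # 1) plain pandas: single machine, in driver memory
psdf = ps.from_pandas(pdf)            # 2) pandas API on Spark: distributed across the cluster
psdf["y"] = psdf["x"] * 2             # same pandas-style syntax, executed by Spark
pdf_back = psdf.to_pandas()           # collects everything back into plain pandas on the driver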
12-01-2021 06:15 AM
Thank you for the article. Reading it now 🙂
12-01-2021 07:54 AM
Thanks to all of you for the clarification... it helps a lot. Unfortunately, I'm on an organization cluster, so I can't upgrade and don't have permission to create a new cluster, so I will look into Koalas as an alternative to pyspark.pandas.
12-01-2021 08:05 AM
You can use the free Community Edition of Databricks, which includes the 10.0 runtime.

