11-30-2021 04:15 PM
I would like to better understand the advantage of writing a Python notebook in pyspark vs pandas. Does the entire notebook need to be written in pyspark to realize the performance benefits? I currently have a script using pandas for all my transformations - can I just replace the 'inefficient' blocks with pyspark and keep the smaller/less costly transformations in pandas? Thanks!
11-30-2021 04:17 PM
How do you know the pandas sections are less efficient?
11-30-2021 04:23 PM
I don't. This is more of a hypothetical question. I do have a script where I'm using pandas, but I want to account for my dataset growing larger over time. If my runtime ends up taking minutes... then would I benefit from using pyspark instead? And if yes, can I just replace the most workload-intensive tasks and keep everything else in pandas? Note: this is my first time in Databricks, so some of my assumptions may be off here. Please feel free to correct me if I'm wrong!
12-01-2021 02:31 AM
Ok, basically it comes down to this:
Pandas/Python is pretty good at data processing, as long as it can run on a single node.
If you do not have issues with processing your data on a single node, pandas is fine.
However, when you start getting out-of-memory (OOM) errors etc., it can be a good idea to look at pyspark.pandas.
Spark will use multiple nodes to process the data.
Of course this means you will have to rewrite code. But with the latest additions to Databricks this will not be a daunting task; here is an interesting article:
https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
Besides that, you can still use 'ordinary' pandas or Python, but beware that such code will be executed on the driver only (so in single-node mode).
You can mix pandas and pyspark.pandas, but it is not guaranteed that this will be faster than doing everything in pyspark.pandas, because it breaks Spark's processing logic into multiple parts.
But check out the article and see where it gets you.
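To make this concrete, here is a minimal sketch of moving between plain pandas (driver only) and pyspark.pandas (distributed). It is not from the thread above: the column names and values are made up for illustration, and it assumes Spark 3.2+ (Databricks Runtime 10.0+).

import pandas as pd
import pyspark.pandas as ps

# plain pandas DataFrame -- lives in the driver's memory only
pdf = pd.DataFrame({"id": [1, 2, 1], "value": [10.0, 20.0, 30.0]})

# convert to a pandas-on-Spark DataFrame so the work is distributed across the cluster
psdf = ps.from_pandas(pdf)

# the familiar pandas-style API still works, but now runs on Spark
summary = psdf.groupby("id").sum()

# bring a (small!) result back to plain pandas on the driver if needed
result_pdf = summary.to_pandas()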
12-01-2021 06:43 AM
Thanks for the clarification - exactly what I was looking for!
12-01-2021 03:04 AM
as @Werner Stinckens said "Spark will use multiple nodes to process the data".
If you like to use pandas code, there is the pandas API on Spark (since Spark 3.2). All you need to do is import a different library:
# not this: from pandas import read_csv
from pyspark.pandas import read_csv
pdf = read_csv("data.csv")
Here is more info https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
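As a follow-up, a hedged sketch of what you can do with the resulting DataFrame (the file path and the column name "some_column" are hypothetical): the object returned by pyspark.pandas.read_csv supports the usual pandas-style methods, and you can drop down to a native Spark DataFrame when you need the full PySpark API.

import pyspark.pandas as ps

psdf = ps.read_csv("data.csv")  # hypothetical file path

# familiar pandas-style operations, executed on the cluster
top_rows = psdf.sort_values("some_column", ascending=False).head(10)

# convert to a native Spark DataFrame when you need the full PySpark API
sdf = psdf.to_spark()
sdf.printSchema()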
12-01-2021 06:56 AM
When I run the following code:
import pyspark.pandas as ps
I receive an error: No module named 'pyspark.pandas'
Do you know how I can correct this?
12-01-2021 06:58 AM
You have to run one of the latest versions of the Databricks runtime, with Spark 3.2 (from runtime 10.0 onwards, I think).
Before that, pyspark.pandas was called koalas. So if you are on a lower version, you should use koalas, but it is the same thing.
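For older runtimes, a minimal sketch using koalas instead (assuming koalas is available on the cluster; on runtimes where it is not pre-installed you may need %pip install koalas):

# %pip install koalas  # only if koalas is not already available

import databricks.koalas as ks

# same pandas-style API as pyspark.pandas, just a different import
kdf = ks.read_csv("data.csv")  # hypothetical file path
kdf.head()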
12-01-2021 07:01 AM
Yes, Databricks Runtime 10.0 or later.
12-01-2021 07:09 AM
You need to have runtime 10.0 or 10.1.
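If you are unsure which version you are on, a quick check (assuming a Databricks notebook, where the SparkSession is predefined as spark):

import pyspark

# the pandas API on Spark requires Spark 3.2+, i.e. Databricks Runtime 10.0 or later
print(pyspark.__version__)  # e.g. '3.2.0' on Runtime 10.x
print(spark.version)        # same information via the active SparkSession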
12-01-2021 06:10 AM
It is important to understand the difference between: 1) pandas on a single computer, which is what most people mean when they talk about pandas; 2) pandas on Spark, using the new pandas API in PySpark 3.2. It is not clear which one @Paras Patel is asking about.
For a detailed discussion about pandas on PySpark, see my article https://medium.com/@chuck.connell.3/pandas-on-spark-current-issues-and-workarounds-dc9ed30840ce
12-01-2021 06:15 AM
Thank you for the article. Reading it now!
12-01-2021 07:54 AM
Thanks to all of you for the clarification... it helps a lot. Unfortunately, I'm on an organization cluster, so I can't upgrade and don't have permission to create a new cluster; I'll look into koalas as an alternative to pyspark.pandas.
12-01-2021 08:05 AM
You can use the free Community Edition of Databricks, which includes the 10.0 runtime.