10-27-2021 10:00 AM
I am moving an existing, working pandas program into Databricks. I want to use the new pyspark.pandas library, and change my code as little as possible. It appears that I should do the following:
1) Add from pyspark import pandas as ps at the top
2) Change all occurrences of pd.pandas_function to ps.pandas_function
Is this correct?
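For example, this is the kind of mechanical change I have in mind (toy data, just to show the pattern):

```python
# Before: plain pandas on my laptop
import pandas as pd

pdf = pd.DataFrame({"county": ["A", "A", "B"], "cases": [10, 20, 30]})
print(pdf.groupby("county")["cases"].mean())

# After: pandas API on Spark in Databricks -- same calls, different import
from pyspark import pandas as ps

psdf = ps.DataFrame({"county": ["A", "A", "B"], "cases": [10, 20, 30]})
print(psdf.groupby("county")["cases"].mean())
```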
Labels:
- Data Science
- Pandas
- Pyspark
Accepted Solutions
10-27-2021 08:21 PM
Yes, that is a great start. Right now code coverage is at 83%, and we are shooting for 90%. Please file an issue if you find a gap you need filled. Note that pandas on Spark (Koalas) does some interesting things with the way it distributes the indexes of large dataframes. It may be good to review this talk starting at 20:20: https://databricks.com/session_na20/koalas-making-an-easy-transition-from-pandas-to-apache-spark
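As a rough sketch (not from the talk), the relevant knob is the compute.default_index_type option, which you can inspect and change:

```python
import pyspark.pandas as ps

# See which default index type is in effect; the options are
# "sequence", "distributed-sequence", and "distributed".
print(ps.get_option("compute.default_index_type"))

# "distributed-sequence" keeps a globally increasing index without pulling
# everything to the driver; "distributed" is cheapest, but the index values
# are not ordered or contiguous.
ps.set_option("compute.default_index_type", "distributed-sequence")
```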
10-28-2021 06:38 AM
Thank you. I am porting this code and will file issues as I see them… https://medium.com/@chuck.connell.3/vaccines-vs-mortality-correlation-at-the-county-level-922a10236a...
10-28-2021 02:28 AM
Use import pyspark.pandas as ps, but since code coverage is at 83%, as @Dan Zafar said, there is no guarantee that all of your old code will work. You can find more in this blog post: https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
10-28-2021 06:40 AM
I have read that blog. It was helpful, but it implied that I can move code from laptop pandas to Spark pandas by changing one line of code, which does not seem to be true.
10-28-2021 08:24 AM
That's right, expect some minor refactoring.
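For example (a generic sketch, not a specific gap in your code), a common workaround when a pandas feature isn't covered yet is to round-trip that step through plain pandas:

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})

# Convert the (small enough) piece to plain pandas, apply whatever
# pandas-only step you need, then come back to pandas on Spark.
pdf = psdf.to_pandas()                 # collects to the driver
pdf["b"] = pdf["a"].rolling(2).sum()   # stand-in for a pandas-only step
psdf = ps.from_pandas(pdf)
print(psdf)
```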
10-30-2021 03:04 AM
Make sure to use the 10.0 Runtime, which includes Spark 3.2.
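A quick way to confirm that the cluster you attached to really has Spark 3.2 (and the new module):

```python
import pyspark
import pyspark.pandas as ps  # the pandas API is only bundled as of Spark 3.2

print(pyspark.__version__)   # should report 3.2.x or later on the 10.0 runtime
```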
10-30-2021 05:40 AM
Yes, I did that. I am now porting the code, finding workarounds to problems, and keeping a list of issues. I will write it all up as an article on Medium.
10-30-2021 02:00 PM
Cool, make sure to link it here when you're finished.
10-30-2021 03:30 PM
Yes. Do you want me to create official issues at apache/spark, or let someone on your team do it from my notes?
10-30-2021 04:32 PM
Official issues are the best way to go. That way you can reference them in the future and show off your work!
11-01-2021 08:01 AM
I created some issues, such as this one. Please let me know if I did anything wrong, so I can fix them. Thanks!

