Databricks

cconnell · ‎10-27-2021

I am moving an existing, working pandas program into Databricks. I want to use the new pyspark.pandas library, and change my code as little as possible. It appears that I should do the following:

1) Add from pyspark import pandas as ps at the top

2) Change all occurrences of pd.pandas_function to ps.pandas_function

Is this correct?
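For illustration only, the two steps above could be applied mechanically with a small script. This is a naive sketch, not an official migration tool: the function name `port_to_pandas_on_spark` is made up here, and it assumes the original code uses exactly `import pandas as pd` and the `pd` alias.

```python
import re


def port_to_pandas_on_spark(source: str) -> str:
    """Apply the two mechanical porting steps to a string of pandas code.

    Hypothetical helper; a plain text rewrite that does not understand Python
    syntax, so it will also touch `pd.` inside strings and comments.
    """
    # Step 1: swap the import line for the pandas-on-Spark import.
    source = re.sub(
        r"^import pandas as pd[ \t]*$",
        "from pyspark import pandas as ps",
        source,
        flags=re.MULTILINE,
    )
    # Step 2: rename the `pd.` prefix to `ps.`.
    # The word boundary keeps names such as `upd.` untouched.
    return re.sub(r"\bpd\.", "ps.", source)


if __name__ == "__main__":
    example = "import pandas as pd\n\ndf = pd.DataFrame({'a': [1, 2]})\n"
    print(port_to_pandas_on_spark(example))
```

The rewritten code still has to be run on a cluster to find calls that behave differently or are not implemented.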

Dan_Z · ‎10-27-2021

Yes, that is a great start. Right now coverage of the pandas API is at 83%, and we are shooting for 90%. Please file an issue if you find a gap you need filled. Also note that pandas on Spark (Koalas) does some interesting things with the way it distributes the indexes of large dataframes. It may be good to review the talk starting at 20:20 here: https://databricks.com/session_na20/koalas-making-an-easy-transition-from-pandas-to-apache-spark
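As a rough sketch of the index point: when a pandas-on-Spark dataframe has no index to carry over, a default one is attached, and the `compute.default_index_type` option controls how it is built. The helper name `use_index_type` below is hypothetical, and the sketch assumes a Spark 3.2+ environment where `pyspark.pandas` can be imported.

```python
# Index types accepted by the `compute.default_index_type` option.
INDEX_TYPES = ("sequence", "distributed-sequence", "distributed")


def use_index_type(index_type: str = "distributed") -> None:
    """Set the default index type for pandas API on Spark (hypothetical helper)."""
    if index_type not in INDEX_TYPES:
        raise ValueError(f"index_type must be one of {INDEX_TYPES}")
    # Imported lazily so the validation above works without a Spark install.
    import pyspark.pandas as ps

    ps.set_option("compute.default_index_type", index_type)
```

Roughly, `sequence` gives a pandas-like 0, 1, 2, ... index but may pull the data onto a single node, `distributed-sequence` computes the same kind of index in a distributed way, and `distributed` is the cheapest but yields increasing values that are not consecutive. Code that relies on positional index values is the most likely to need changes.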

cconnell · ‎10-28-2021

Thank you. I am porting this code and will file issues as I see them… https://medium.com/@chuck.connell.3/vaccines-vs-mortality-correlation-at-the-county-level-922a10236a...

Hubert-Dudek · ‎10-28-2021

Use import pyspark.pandas as ps. But since coverage of the pandas API is 83%, as @Dan Zafar said, there is no guarantee that your old code will work. You can find more in this blog post: https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

cconnell · ‎10-28-2021

I have read that blog. It was helpful but implied that I can move code from laptop pandas to spark pandas by changing one line of code, which does not seem true.

Dan_Z · ‎10-28-2021

That's right, expect some minor refactoring.
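One way to scope that refactoring up front, offered as a sketch rather than an official tool, is to list which top-level pandas names the existing code calls, so each can be checked against pyspark.pandas. The function name `pandas_names_used` is made up for this example, and it assumes the code refers to pandas through a single alias such as `pd`.

```python
import ast
from typing import List


def pandas_names_used(source: str, alias: str = "pd") -> List[str]:
    """Return the sorted attribute names accessed directly on the pandas alias.

    Hypothetical helper: it only sees names like `pd.read_csv`, not methods
    called on dataframe objects (for example `df.pivot_table`).
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == alias
        ):
            names.add(node.attr)
    return sorted(names)


if __name__ == "__main__":
    code = "import pandas as pd\ndf = pd.read_csv('x.csv')\nout = pd.concat([df, df])\n"
    print(pandas_names_used(code))
```

On a cluster with `import pyspark.pandas as ps`, each returned name can then be tested with `hasattr(ps, name)`; a name that exists may still behave differently from pandas, so this narrows the search rather than replacing a test run.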

Anonymous · ‎10-30-2021

Make sure to use Databricks Runtime 10.0, which includes Spark 3.2.

cconnell · ‎10-30-2021

Yes, did that. I am now porting the code, finding workarounds to problems, and keeping a list of issues. Will write it all up as an article on Medium.