Databricks Community

cconnell · ‎10-27-2021

I am moving an existing, working pandas program into Databricks. I want to use the new pyspark.pandas library, and change my code as little as possible. It appears that I should do the following:

1) Add from pyspark import pandas as ps at the top

2) Change all occurrences of pd.pandas_function to ps.pandas_function

Is this correct?

Dan_Z · ‎10-27-2021

Yes, that is a great start. Right now coverage of the pandas API is at 83%, and we are shooting for 90%. Please file an issue if you find a gap you need filled. Please note that pandas on Spark (formerly Koalas) does some interesting things with the way it distributes the indexes of large dataframes. It may be good to review the talk starting at 20:20 here: https://databricks.com/session_na20/koalas-making-an-easy-transition-from-pandas-to-apache-spark

cconnell · ‎10-28-2021

Thank you. I am porting this code and will file issues as I see them… https://medium.com/@chuck.connell.3/vaccines-vs-mortality-correlation-at-the-county-level-922a10236a...

Hubert-Dudek · ‎10-28-2021

Use import pyspark.pandas as ps. But since coverage of the pandas API is at 83%, as @Dan Zafar said, there is no guarantee that your old code will work. You can find more in this blog post: https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html

My blog: https://databrickster.medium.com/
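
Since coverage is incomplete, one way to see which gaps might matter before porting is to list the pd.<name> calls a script makes and check each one against pyspark.pandas. The following is a minimal sketch of that idea, not something from this thread: the helper names pandas_calls and missing_in, and the sample source with its counties.csv file name, are made up for illustration. It only catches top-level calls made through the pd alias; DataFrame and Series methods would need a separate check.

```python
import ast
from typing import List


def pandas_calls(source: str, alias: str = "pd") -> List[str]:
    """Return the sorted names accessed directly on the alias (pd.read_csv -> 'read_csv')."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        # Match attribute access of the form <alias>.<name>
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == alias
        ):
            names.add(node.attr)
    return sorted(names)


def missing_in(module, names: List[str]) -> List[str]:
    """Return the names that the given module does not expose as top-level attributes."""
    return [name for name in names if not hasattr(module, name)]


# Hypothetical pandas script; it is only parsed here, never executed.
SAMPLE = '''
import pandas as pd
df = pd.read_csv("counties.csv")
out = pd.concat([df, df])
print(pd.to_datetime("2021-10-27"))
'''

names = pandas_calls(SAMPLE)
print(names)
```

On a cluster with Spark 3.2, after import pyspark.pandas as ps, calling missing_in(ps, names) would list the top-level functions that ps does not expose. An empty list only shows that the names exist, not that they behave identically to pandas.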

cconnell · ‎10-28-2021

I have read that blog. It was helpful but implied that I can move code from laptop pandas to spark pandas by changing one line of code, which does not seem true.

Dan_Z · ‎10-28-2021

That's right, expect some minor refactoring.

Anonymous · ‎10-30-2021

Make sure to use the 10.0 Runtime, which includes Spark 3.2.
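
The pyspark.pandas module first shipped with Spark 3.2, so the import fails on older runtimes. A small guard like the one below can make that failure more obvious. This is only a sketch: the function name supports_pandas_api is hypothetical, and it assumes the version string starts with numeric major and minor components.

```python
def supports_pandas_api(spark_version: str) -> bool:
    """Return True if the version is Spark 3.2 or later, where pyspark.pandas is available."""
    # Compare only the major and minor components, e.g. "3.2.0" -> (3, 2)
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    return (major, minor) >= (3, 2)


print(supports_pandas_api("3.2.0"))
print(supports_pandas_api("3.1.2"))
```

On a cluster you might pass pyspark.__version__ to this function before attempting import pyspark.pandas as ps.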

cconnell · ‎10-30-2021

Yes, did that. I am now porting the code, finding workarounds to problems, and keeping a list of issues. Will write it all up as an article on Medium.

Anonymous · ‎10-30-2021

Cool, make sure to link it here when you're finished.

cconnell · ‎10-30-2021

Yes. Do you want me to create official issues at apache/spark, or let someone on your team do it from my notes?

Anonymous · ‎10-30-2021

Filing official issues is the best way to go. That way you can reference them in the future and show your work off!

cconnell · ‎11-01-2021

I created some issues, such as this one. Please let me know if I did anything wrong, so I can fix them. Thanks!

https://issues.apache.org/jira/browse/SPARK-37180

What is the proper way to import the new pyspark.pandas library?