Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to reuse Pandas code in PySpark?

User16783853906
Contributor III

I have single-threaded Pandas code that is not yet supported by Koalas and is not easy to reimplement in PySpark. I would like to distribute this workload using Spark without rewriting all my Pandas code. Is this possible?

1 ACCEPTED SOLUTION

Accepted Solutions

User16783853906
Contributor III

This is for a specific scenario where the code is not yet supported by Koalas. One approach to consider is using a Pandas UDF, and splitting up the work in a way that allows your processing to move forward. This notebook is a great example of taking single node processing and parallelizing it using a Pandas UDF, although it may not be a perfect fit for your challenge - https://pages.databricks.com/rs/094-YMS-629/images/Fine-Grained-Time-Series-Forecasting.html
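A minimal sketch of that pattern, assuming hypothetical column names ("store", "sales"): keep the single-node pandas logic as a plain function that processes one group's DataFrame, then let Spark run it once per group in parallel via a grouped pandas UDF (applyInPandas).

```python
import pandas as pd

# Single-node pandas logic, written to process one group at a time.
# The column names ("store", "sales") are hypothetical.
def normalize_sales(pdf: pd.DataFrame) -> pd.DataFrame:
    out = pdf.copy()
    out["sales_norm"] = (out["sales"] - out["sales"].mean()) / out["sales"].std()
    return out

# To distribute, split the data by a key and let Spark apply the same
# function to each group on the executors (requires a SparkSession):
#
#   sdf = spark.createDataFrame(big_pandas_df)
#   result = (sdf.groupBy("store")
#                .applyInPandas(normalize_sales,
#                               schema="store string, sales double, sales_norm double"))
```

The function itself stays pure pandas, so it can be developed and unit-tested on a single machine before being handed to Spark.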


3 REPLIES

sean_owen
Honored Contributor II

That is exactly what Koalas is for! It's a reimplementation of most of the pandas API on top of Spark. You should be able to run your pandas code as-is, or with little modification, using Koalas, and let it distribute on Spark. https://docs.databricks.com/languages/koalas.html
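A short sketch of what that looks like in practice. The file path and column names here are hypothetical, and running this requires a Spark environment (such as a Databricks cluster) with Koalas installed:

```python
# Single-node version would start with:
#   import pandas as pd
# Distributed version: only the import changes.
import databricks.koalas as ks

df = ks.read_csv("/data/events.csv")        # hypothetical path; same API shape as pandas.read_csv
daily = df.groupby("date")["amount"].sum()  # familiar pandas-style operations, executed on Spark
print(daily.head())
```

Note that Koalas covers most, but not all, of the pandas API; calls it does not support raise errors, which is exactly the gap the original question is about.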

User16826994223
Honored Contributor III

It has become very simple since Koalas arrived. Instead of import pandas as pd, you just write:

import databricks.koalas as pd 

I kept the alias pd intentionally so that you do not need to change the rest of your code. Run the code as-is; if you hit an issue, the Koalas documentation usually has the answer.

