Data Engineering

How to reuse Pandas code in PySpark?

User16783853906
Contributor III

I have single-threaded Pandas code that is neither supported by Koalas yet nor easy to reimplement in PySpark. I would like to distribute this workload using Spark without rewriting all my Pandas code. Is this possible?

1 ACCEPTED SOLUTION

User16783853906
Contributor III

This applies to the specific scenario where the code is not yet supported by Koalas. One approach to consider is a Pandas UDF: split the work into independent chunks (for example, by a grouping key) so that each chunk can be processed in parallel with your existing Pandas code. This notebook is a good example of taking single-node processing and parallelizing it with a Pandas UDF, although it may not be a perfect fit for your problem: https://pages.databricks.com/rs/094-YMS-629/images/Fine-Grained-Time-Series-Forecasting.html
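To make the Pandas UDF idea a bit more concrete, here is a minimal sketch using PySpark's grouped applyInPandas (available in Spark 3.0 and later). The column names store and sales, the function summarize_group, and the helper run_on_spark are all hypothetical placeholders, not taken from the linked notebook; substitute your own Pandas logic and output schema. It assumes each group fits comfortably in memory on a single worker.

```python
import pandas as pd


def summarize_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Plain pandas logic applied to the rows of one group.

    Hypothetical stand-in for your existing single-node pandas code.
    """
    return pd.DataFrame({
        "store": [pdf["store"].iloc[0]],
        "mean_sales": [float(pdf["sales"].mean())],
    })


def run_on_spark(sdf):
    """Apply summarize_group to each 'store' group of a Spark DataFrame.

    Spark passes every group to the function as a pandas DataFrame on a
    worker, then stitches the returned frames together using the schema.
    """
    return sdf.groupBy("store").applyInPandas(
        summarize_group, schema="store string, mean_sales double"
    )


# The pandas function can be checked locally, without a Spark cluster.
sample = pd.DataFrame({"store": ["a", "a"], "sales": [1.0, 3.0]})
local_result = summarize_group(sample)
```

On a cluster you would call run_on_spark(spark_df) on a Spark DataFrame that has the same columns. Keeping the pandas function free of Spark imports means you can debug it on a small sample before distributing it.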

3 REPLIES

sean_owen
Honored Contributor II

That is exactly what Koalas is for! It's a reimplementation of most of the Pandas API on top of Spark. You should be able to run your Pandas code as-is, or with little modification, using Koalas, and let Spark distribute it. https://docs.databricks.com/languages/koalas.html

User16826994223
Honored Contributor III

This became much simpler once Koalas arrived. Instead of "import pandas as pd", you only need:

import databricks.koalas as pd 

I kept the alias pd intentionally so that you do not need to change the rest of your code. Run the code; if you hit any issues, the Koalas documentation should help you resolve them.
