How to reuse Pandas code in PySpark?

User16783853906 — Tue, 08 Jun 2021 21:44:50 GMT

I have single-threaded Pandas code that is neither supported by Koalas yet nor easy to reimplement in PySpark. I would like to distribute this workload using Spark without rewriting all my Pandas code. Is this possible?

Re: How to reuse Pandas code in PySpark?

sean_owen — Thu, 17 Jun 2021 23:25:44 GMT

That is exactly what Koalas is for! It's a reimplementation of most of the pandas API on top of Spark. You should be able to run your pandas code as-is, or with little modification, using Koalas, and let Spark distribute it. https://docs.databricks.com/languages/koalas.html

Re: How to reuse Pandas code in PySpark?

User16826994223 — Wed, 23 Jun 2021 14:27:38 GMT

It became much simpler once Koalas arrived. Instead of writing import pandas as pd, you just have to do

import databricks.koalas as pd

I kept the alias pd intentionally so that you do not need to change the rest of your code. Run the code; you may hit some issues, but those can usually be resolved with the Koalas documentation, so it's easy.

Re: How to reuse Pandas code in PySpark?

User16783853906 — Wed, 23 Jun 2021 21:28:25 GMT

This is for a specific scenario where the code is not yet supported by Koalas. One approach to consider is using a Pandas UDF and splitting up the work so that each piece can be processed independently. This notebook is a great example of taking single-node processing and parallelizing it using a Pandas UDF, although it may not be a perfect fit for your challenge: https://pages.databricks.com/rs/094-YMS-629/images/Fine-Grained-Time-Series-Forecasting.html
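
To make the Pandas UDF approach a bit more concrete, here is a minimal sketch. It assumes your existing pandas logic can be wrapped in a function that handles one group of rows at a time. The column names (key, value, zscore) and the functions add_group_zscore and run_on_spark are hypothetical, made up for illustration only. groupBy(...).applyInPandas(...) is the grouped-map Pandas UDF API in Spark 3.0 and later, and it needs PyArrow installed on the cluster.

```python
import pandas as pd


def add_group_zscore(pdf: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for existing single-node pandas code (hypothetical example).

    Receives all rows of one group as an ordinary pandas DataFrame and
    returns a pandas DataFrame matching the schema declared below.
    """
    out = pdf.copy()
    std = out["value"].std(ddof=0)  # population standard deviation
    if std == 0:
        # Constant group: avoid dividing by zero.
        out["zscore"] = 0.0
    else:
        out["zscore"] = (out["value"] - out["value"].mean()) / std
    return out[["key", "value", "zscore"]]


def run_on_spark(spark_df):
    """Apply add_group_zscore once per distinct key, in parallel on Spark.

    spark_df is assumed to be a Spark DataFrame with a string column "key"
    and a double column "value".
    """
    return spark_df.groupBy("key").applyInPandas(
        add_group_zscore,
        schema="key string, value double, zscore double",
    )
```

With a Spark DataFrame sdf that has those two columns, run_on_spark(sdf) returns a new Spark DataFrame. Each group is handed to the pandas function as a whole, so every group has to fit in the memory of a single executor; choosing a grouping key that splits the data into reasonably sized pieces is the main design decision.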