Databricks

User16783853906 · 06-08-2021

I have single-threaded Pandas code that is neither supported by Koalas yet nor easy to reimplement in PySpark. I would like to distribute this workload using Spark without rewriting all my Pandas code. Is this possible?

sean_owen · 06-17-2021

That is exactly what Koalas is for! It's a reimplementation of most of the pandas API on top of Spark. You should be able to run your pandas code as-is, or with little modification, using Koalas, and let it distribute on Spark. https://docs.databricks.com/languages/koalas.html

User16826994223 · 06-23-2021

This has become much simpler since Koalas arrived. Instead of importing pandas as pd, you just have to do:

import databricks.koalas as pd

I kept the alias pd intentionally so that you do not need to change the rest of your code. Run the code; if you hit any issues, the Koalas documentation should be able to answer them.
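As a rough illustration of the import swap, here is a minimal sketch. The fallback to plain pandas is my own assumption (so the same script also runs on a machine without Koalas), and the store/sales data is made up for the example.

```python
# Use Koalas when it is installed (e.g. on a Databricks cluster);
# otherwise fall back to plain pandas. The fallback is an assumption
# added for illustration, not part of the suggestion above.
try:
    import databricks.koalas as pd  # distributed, pandas-like API on Spark
except ImportError:
    import pandas as pd  # single-node fallback

# Hypothetical sample data; the rest of the code is ordinary pandas syntax.
df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10.0, 20.0, 5.0]})
totals = df.groupby("store")["sales"].sum().sort_index()
print(totals)
```

Not every pandas method is available in Koalas, so code that relies on unsupported operations will still need another approach.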

User16783853906 · 06-23-2021

This is for a specific scenario where the code is not yet supported by Koalas. One approach to consider is using a Pandas UDF and splitting the work into groups that can each be processed independently. This notebook is a great example of taking single-node processing and parallelizing it using a Pandas UDF, although it may not be a perfect fit for your challenge - https://pages.databricks.com/rs/094-YMS-629/images/Fine-Grained-Time-Series-Forecasting.html
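A minimal sketch of that pattern, assuming Spark 3.0 or later (where `groupBy(...).applyInPandas` is available) and a hypothetical table with `store` and `sales` columns. The function names are made up for illustration; the point is that the per-group function is ordinary pandas code and Spark only handles the splitting.

```python
import pandas as pd


def summarize_store(pdf: pd.DataFrame) -> pd.DataFrame:
    """Existing single-node pandas logic, applied to one group at a time."""
    return pd.DataFrame({
        "store": [pdf["store"].iloc[0]],
        "mean_sales": [float(pdf["sales"].mean())],
    })


def summarize_all_stores(sdf):
    """Run summarize_store once per store on a Spark DataFrame.

    Spark hands each group to the function as a pandas DataFrame and
    stitches the results back together using the declared schema.
    """
    return sdf.groupBy("store").applyInPandas(
        summarize_store, schema="store string, mean_sales double"
    )


# The pandas function can be checked locally without a cluster.
sample = pd.DataFrame({"store": ["a", "a"], "sales": [10.0, 20.0]})
print(summarize_store(sample))
```

On a cluster you would call something like `summarize_all_stores(spark.createDataFrame(pandas_df))`. Keep in mind that each group is loaded into memory on a single executor, so the grouping key should keep groups reasonably small.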

How to reuse Pandas code in PySpark?