- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-12-2021 01:03 AM
Ok let's see.
You have a spark dataframe, which is distributed by nature.
On the other hand you use pandas, which is not distributed.
Also your function is plain python, not pyspark.
This will be processed by the driver, so not distributed.
However, the dataframe itself will be processed by the workers.
So the workers want to do something but the code runs on the driver. It could be that. Not sure though, but the fact it works as single node makes me think your code is not executed in a distributed environment.
That is why I mentioned the collect(). This will collect the spark dataframe on the driver so the workers are bypassed in your case.
Moving from python to pyspark takes some time to understand.
This blog explains some interesting topics:
https://medium.com/hashmapinc/5-steps-to-converting-python-jobs-to-pyspark-4b9988ad027a
Using koalas instead of pandas f.e.