How does Vectorized Pandas UDF work?

User16752246553 — Thu, 10 Jun 2021 17:57:58 GMT

Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel? And is there a way to set the batch size?

Re: How does Vectorized Pandas UDF work?

sajith_appukutt — Fri, 18 Jun 2021 00:23:35 GMT

>How does Vectorized Pandas UDF work?

Here is a video explaining the internals of Pandas UDFs (a.k.a. Vectorized UDFs): https://youtu.be/UZl0pHG-2HA?t=123

They use Apache Arrow to exchange data between the JVM and the Python processes on the driver/executors with near-zero (de)serialization cost.

>Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel?

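Say subtract_mean is a grouped map Pandas UDF. When you run

df.groupby("id").apply(subtract_mean).show()

Spark shuffles the data by id and converts each partition into Arrow record batches before handing it to Python. Partitions are processed in parallel across executor tasks, so depending on the cardinality of id and how the groups are spread over partitions, multiple groups are processed at the same time. Within a single task, the data is processed sequentially.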
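
For reference, here is a minimal sketch of what subtract_mean might look like. The column names id and v, the sample rows, and the helper run_on_spark are assumptions for illustration, not something from the question. It uses applyInPandas, which as far as I know is the Spark 3.x way of writing the same grouped map call.

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one group (one value of "id") as a pandas DataFrame.
    # Assumed column name: "v" is the numeric column to center.
    return pdf.assign(v=pdf.v - pdf.v.mean())


def run_on_spark() -> None:
    # Hypothetical driver code; needs pyspark and pyarrow installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
        ("id", "v"),
    )
    # Each group is passed to subtract_mean as a whole; groups that land in
    # different partitions are handled by different tasks in parallel.
    df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
```

Because subtract_mean is a plain pandas function, you can try it locally on a small pandas DataFrame before running it on a cluster.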

>And is there a way to set the batch size?

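You could configure spark.sql.execution.arrow.maxRecordsPerBatch (the default is 10000 records per batch). Note that this limit applies to scalar Pandas UDFs. For a grouped map, each group is passed to the function as a whole, so the largest group has to fit in memory.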
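
A small sketch of setting it at runtime, assuming you already have a SparkSession called spark. The helper name set_arrow_batch_size and the value 5000 are only illustrative.

```python
def set_arrow_batch_size(spark, max_records: int) -> None:
    # Caps the number of rows per Arrow record batch for this session.
    # "spark" is assumed to be an existing SparkSession.
    spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", str(max_records))
```

Calling set_arrow_batch_size(spark, 5000) before running the UDF would halve the default batch size, which can help if each batch uses too much memory on the Python side.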
