Databricks Community

User16752246553 · ‎06-10-2021

Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel? And is there a way to set the batch size?

sajith_appukutt · ‎06-17-2021

>How does Vectorized Pandas UDF work?

Here is a video explaining the internals of Pandas UDFs (a.k.a. Vectorized UDFs) - https://youtu.be/UZl0pHG-2HA?t=123 . They use Apache Arrow, to exchange data directly between JVM and Python driver/executors with near-zero (de)serialization cost.

>Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel?

If let's say subtract_mean is a grouped map - when you run

df.groupby("id").apply(subtract_mean).show()

partitions in spark are converted into arrow record batches and depending on the cardinality of id, multiple batches would be processed in parallel.

>And is there a way to set the batch size?

You could configure spark.sql.execution.arrow.maxRecordsPerBatch

Databricks Community

How does Vectorized Pandas UDF work?

Databricks AMER Learning Festival | Virtual Training

Introducing the Genie Hub: Ask Questions, Share Builds, and Master Conversational Analytics

🌟 Community Pulse: Your Weekly Roundup! July 13 – 19, 2026

Solution Accelerator Series | Social Determinants of Health

Upcoming Community BrickTalk | Sports Analytics: Turning Tracking Data into Real-Time AI Decisions