Databricks Community

User16752246553 · 06-10-2021

Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel? And is there a way to set the batch size?

sajith_appukutt · 06-17-2021

>How does Vectorized Pandas UDF work?

Here is a video explaining the internals of Pandas UDFs (a.k.a. Vectorized UDFs): https://youtu.be/UZl0pHG-2HA?t=123 . They use Apache Arrow to exchange data between the JVM and the Python workers in columnar batches, which keeps the (de)serialization cost close to zero.

>Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel?

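Suppose subtract_mean is a grouped map UDF. When you run

df.groupby("id").apply(subtract_mean).show()

Spark shuffles the rows by id and sends each partition to a Python worker as Arrow record batches, so every group arrives in the function as one pandas DataFrame. Groups that land in different partitions are processed in parallel by separate tasks, while groups inside the same task are processed one after another. The degree of parallelism therefore depends on the cardinality of id, the number of partitions, and the executor cores available.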
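To make that concrete, here is a minimal sketch of what subtract_mean could look like. The column name v, the sample rows, and the helper run_on_spark are my own assumptions for illustration and are not from the original question. The function itself is plain pandas, so you can try it locally on one group before handing it to Spark.

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Spark calls this once per distinct id and passes the whole group
    # as a single pandas DataFrame. The column name "v" is an assumption.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


# Local check on one hypothetical group, no Spark required.
sample = pd.DataFrame({"id": [1, 1], "v": [1.0, 2.0]})
result = subtract_mean(sample)


def run_on_spark():
    # Imported lazily so the pandas part above works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")
    )
    # applyInPandas (Spark 3.0+) is the newer spelling of
    # groupby(...).apply(<grouped map pandas UDF>).
    df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
```

Calling run_on_spark() on a cluster runs the same function once per id, with groups in different partitions handled by different tasks.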

>And is there a way to set the batch size?

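You can configure spark.sql.execution.arrow.maxRecordsPerBatch, which caps the number of rows in each Arrow record batch (the default is 10,000). Note that this limit applies to scalar Pandas UDFs. For a grouped map, each group is passed to the function whole, so the setting does not split groups and every group has to fit in memory.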
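As a rough sketch of how that could look in a notebook: the value 5000 and the UDF plus_one are arbitrary examples of mine, not recommendations, and on Databricks the spark session already exists, so the getOrCreate line is only there to keep the snippet self-contained.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# Cap each Arrow record batch at 5000 rows (example value only).
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")


@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    # Each call receives one Arrow batch as a pandas Series,
    # so with the setting above it sees at most 5000 rows at a time.
    return s + 1


spark.range(20000).select(plus_one("id")).show(5)
```

Smaller batches lower the peak memory per Python worker at the cost of more round trips, so it is worth measuring before changing the default.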
