>How does Vectorized Pandas UDF work?
Here is a video explaining the internals of Pandas UDFs (a.k.a. Vectorized UDFs): https://youtu.be/UZl0pHG-2HA?t=123 . In short, they use Apache Arrow to exchange data between the JVM and the Python worker processes with near-zero (de)serialization cost.
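As a minimal sketch of what this looks like in practice (Spark 3.x type-hint style; `plus_one` is a hypothetical example UDF): a scalar Pandas UDF receives each Arrow record batch as a pandas Series and operates on it vectorized, instead of being called once per row.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# Spark ships Arrow record batches to the Python worker; the worker
# sees each batch as a pandas Series and applies the function to the
# whole batch at once.
@pandas_udf("long")
def plus_one(batch: pd.Series) -> pd.Series:
    return batch + 1

df.select(plus_one("id")).show(5)
```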
>Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel?
Say `subtract_mean` is a grouped-map Pandas UDF. When you run
`df.groupby("id").apply(subtract_mean).show()`
Spark shuffles the data by `id`, converts each group into Arrow record batches, and, depending on the cardinality of `id`, processes multiple groups in parallel across executor tasks.
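For context, here is a minimal sketch of what `subtract_mean` might look like (a hypothetical implementation in the Spark 2.x GROUPED_MAP style; Spark 3.x prefers `groupby(...).applyInPandas(...)`):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one group as a single pandas DataFrame
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```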
>And is there a way to set the batch size?
You could configure `spark.sql.execution.arrow.maxRecordsPerBatch` (default: 10,000 records). One caveat per the Spark docs: this limit is not applied to grouped-map UDFs, where each group is loaded into memory as a whole, so you have to ensure the grouped data fits.
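For example (the 5,000 here is an illustrative value, not a recommendation):

```python
# Cap each Arrow record batch sent to the Python workers at
# 5,000 rows (the default is 10,000).
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
```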