
How does Vectorized Pandas UDF work?

User16752246553
Databricks Employee

Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel? And is there a way to set the batch size?

1 REPLY

sajith_appukutt
Databricks Employee

>How does Vectorized Pandas UDF work?

Here is a video explaining the internals of Pandas UDFs (a.k.a. Vectorized UDFs): https://youtu.be/UZl0pHG-2HA?t=123 . They use Apache Arrow to exchange data directly between the JVM and the Python processes on the driver/executors, with near-zero (de)serialization cost.

>Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel?

Suppose subtract_mean is a grouped map Pandas UDF. When you run

df.groupby("id").apply(subtract_mean).show()

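Spark shuffles the rows by id, converts the data in each partition into Arrow record batches, and hands every group to subtract_mean as one pandas DataFrame. Within a single task the groups are processed one after another; the parallelism comes from multiple tasks running at the same time, so how many groups are processed in parallel depends on the cardinality of id, the number of partitions, and the available cores.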
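For reference, here is a minimal sketch of what such a grouped map function could look like. The column name v, the sample rows, and the helper run_grouped_map_on_spark are assumptions made up for illustration (the question only mentions id and subtract_mean). It uses applyInPandas, which is the Spark 3.x spelling of groupby("id").apply(...).

```python
import pandas as pd


def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # Receives all rows of one "id" group as a single pandas DataFrame.
    # "v" is a hypothetical value column used only for this sketch.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def run_grouped_map_on_spark():
    # Hypothetical helper; pyspark is imported lazily so subtract_mean
    # can be tried out with plain pandas first.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")
    )
    # Each group is sent to a Python worker via Arrow; different groups
    # can be handled by different tasks at the same time.
    df.groupby("id").applyInPandas(
        subtract_mean, schema="id long, v double"
    ).show()
```

Since the function is plain pandas, you can check it locally on a small DataFrame before running it on a cluster.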

>And is there a way to set the batch size?

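You can configure spark.sql.execution.arrow.maxRecordsPerBatch (the default is 10,000 records per batch). Note that this limit is not applied to the groups of a grouped map UDF: each group is loaded into memory as a whole, so it has to fit in the executor's memory.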
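As a rough sketch of how the setting could be applied, the snippet below sets it on the session and runs a scalar (Series to Series) Pandas UDF, which is the kind of UDF that is called once per record batch. The names plus_one and run_scalar_udf_on_spark, the value 5000, and the use of spark.range are assumptions chosen only for illustration.

```python
import pandas as pd


def plus_one(batch: pd.Series) -> pd.Series:
    # Called once per Arrow record batch, not once per row.
    return batch + 1


def run_scalar_udf_on_spark(batch_size: int = 5000):
    # Hypothetical helper; pyspark is imported lazily so plus_one
    # can be tried out with plain pandas first.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    # Cap the number of rows per Arrow record batch sent to Python.
    spark.conf.set(
        "spark.sql.execution.arrow.maxRecordsPerBatch", str(batch_size)
    )
    plus_one_udf = pandas_udf(plus_one, returnType="long")
    spark.range(20000).select(plus_one_udf("id")).show(5)
```

A smaller batch size lowers the peak memory used by each Python worker at the cost of more round trips; a larger one does the opposite.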