>How does Vectorized Pandas UDF work?
Here is a video explaining the internals of Pandas UDFs (a.k.a. Vectorized UDFs): https://youtu.be/UZl0pHG-2HA?t=123 . In short, they use Apache Arrow to exchange data between the JVM and the Python worker processes with near-zero (de)serialization cost.
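As a minimal sketch of what this looks like in practice (Spark 3.x type-hint style; `plus_one` is a hypothetical example UDF): a scalar Pandas UDF receives each Arrow record batch as a pandas Series and operates on it vectorized, instead of being called once per row.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)

# Spark ships Arrow record batches to the Python worker; the worker
# sees each batch as a pandas Series and applies the function to the
# whole batch at once.
@pandas_udf("long")
def plus_one(batch: pd.Series) -> pd.Series:
    return batch + 1

df.select(plus_one("id")).show(5)
```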
>Do Vectorized Pandas UDFs apply to batches of data sequentially or in parallel?
Say `subtract_mean` is a grouped-map Pandas UDF. When you run
`df.groupby("id").apply(subtract_mean).show()`
Spark shuffles the data by `id`, converts each group into Arrow record batches, and, depending on the cardinality of `id`, processes multiple groups in parallel across executor tasks.
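For context, here is a minimal sketch of what `subtract_mean` might look like (a hypothetical implementation in the Spark 2.x GROUPED_MAP style; Spark 3.x prefers `groupby(...).applyInPandas(...)`):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

@pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one group as a single pandas DataFrame
    return pdf.assign(v=pdf.v - pdf.v.mean())

df.groupby("id").apply(subtract_mean).show()
```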
>And is there a way to set the batch size?
You could configure `spark.sql.execution.arrow.maxRecordsPerBatch` (default: 10,000 records). One caveat per the Spark docs: this limit is not applied to grouped-map UDFs, where each group is loaded into memory as a whole, so you have to ensure the grouped data fits.
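For example (the 5,000 here is an illustrative value, not a recommendation):

```python
# Cap each Arrow record batch sent to the Python workers at
# 5,000 rows (the default is 10,000).
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "5000")
```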