Databricks

277745 · 02-24-2024

Hello

I am trying to set the maximum batch size for a pandas UDF in a Databricks notebook, but in my tests the setting has no effect on the batch size.

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', 1000000)

Should I set it some other way, for example in the cluster configuration?

Thanks in advance for the clarification!

Kaniz · 03-14-2024

Hi @277745, it seems you’re working with a pandas UDF in a Databricks notebook and trying to set the maximum batch size.

Let’s address your query:

Setting Max Batch Size for Pandas UDF:
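- You’ve configured the following Spark settings in the notebook session:
  - spark.conf.set("spark.sql.execution.arrow.enabled", "true")
  - spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', 1000000)
- Since Spark 3.0, spark.sql.execution.arrow.enabled is deprecated in favor of spark.sql.execution.arrow.pyspark.enabled. It controls Arrow for toPandas() and createDataFrame() conversions; pandas UDFs exchange data through Arrow regardless of it.
- However, these settings haven’t had the desired effect on the batch size during your tests.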
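Before changing anything at the cluster level, it can help to measure which batch sizes the UDF actually receives. The following is a minimal diagnostic sketch, not an official recipe: the names tag_with_batch_size and observed_batch_sizes, the row count, and the partition count are all illustrative assumptions, and spark is assumed to be the session that a Databricks notebook provides.

```python
import pandas as pd


def tag_with_batch_size(values: pd.Series) -> pd.Series:
    """Return, for every input row, the size of the batch it arrived in."""
    return pd.Series([len(values)] * len(values), index=values.index, dtype="int64")


def observed_batch_sizes(spark, num_rows=5_000_000, num_partitions=4,
                         max_records=1_000_000):
    """Hypothetical helper: summarize the batch sizes a scalar pandas UDF sees."""
    # pyspark is imported here so the pandas helper above can be tried without Spark.
    from pyspark.sql.functions import pandas_udf

    # Same setting as in the question, applied to the current session.
    spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", str(max_records))

    # The pd.Series -> pd.Series type hints make this a scalar pandas UDF.
    batch_size_udf = pandas_udf(tag_with_batch_size, returnType="long")

    df = spark.range(num_rows).repartition(num_partitions)
    return (
        df.select(batch_size_udf("id").alias("batch_size"))
        .groupBy("batch_size")
        .count()  # number of rows that arrived in batches of this size
        .orderBy("batch_size")
    )
```

In a notebook, observed_batch_sizes(spark).show() prints one row per distinct batch size; dividing count by batch_size gives the number of batches of that size. If no batch_size comes close to the configured limit, the partitions are probably smaller than the limit.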

Cluster Configuration vs. Notebook Configuration:
- The settings you’ve applied with spark.conf.set are scoped to the notebook’s Spark session. To apply them to every notebook attached to the cluster, set them in the cluster configuration instead.
- To set these parameters at the cluster level, follow these steps:
  1. Go to your Databricks workspace.
  2. Navigate to the Clusters tab (labeled Compute in the current UI).
  3. Click on the cluster you’re using, then click Edit.
  4. Under Advanced options, in the Spark Config section, add one key-value pair per line, with the key and value separated by a space:
    - spark.sql.execution.arrow.enabled true
    - spark.sql.execution.arrow.maxRecordsPerBatch 1000000
  5. Click Save to apply the changes to the entire cluster. A running cluster must restart before the new values take effect.
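
Additional Considerations:
- The maxRecordsPerBatch setting is an upper bound on the number of records per Arrow batch passed to a pandas UDF (the default is 10,000), not a target. Batches are cut within each partition and never span partitions, so a partition that holds fewer rows than the limit produces smaller batches. Raising the limit to 1,000,000 changes what you observe only if the partitions are large enough to fill batches beyond the previous limit.
- Grouped operations such as groupBy().applyInPandas() pass each group to the function as a whole, so maxRecordsPerBatch does not apply to them.
- If the batch size still looks wrong after adjusting the cluster configuration, check the number of rows per partition first, then data skew, memory availability, and resource utilization.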
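The effect of partition size on batch size can be illustrated with a small model. This is a simplified sketch that assumes batches are cut purely by row count within each partition; expected_batch_sizes is a hypothetical helper written for this thread, not a Spark API.

```python
def expected_batch_sizes(partition_row_counts, max_records_per_batch):
    """Model the batch sizes produced when each partition is split independently.

    Assumption: batches never span partitions and are capped only by row count.
    """
    if max_records_per_batch <= 0:
        raise ValueError("max_records_per_batch must be positive")

    sizes = []
    for rows in partition_row_counts:
        full_batches, remainder = divmod(rows, max_records_per_batch)
        sizes.extend([max_records_per_batch] * full_batches)
        if remainder:
            sizes.append(remainder)  # the last, partially filled batch
    return sizes


# One large partition: the limit is reached twice, then a smaller tail batch.
print(expected_batch_sizes([2_500_000], 1_000_000))

# Several small partitions: no batch ever reaches the limit.
print(expected_batch_sizes([300_000, 300_000, 300_000], 1_000_000))
```

Under this model, a DataFrame split into many small partitions never produces a batch of 1,000,000 rows, whatever the setting, which matches the symptom described in the question.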

In summary, while notebook-level settings are useful for testing and experimentation, adjusting the cluster configuration will ensure consistent behavior across all notebooks running on that cluster. Feel free to explore this approach, and I hope this clarifies things for you! 😊

If you have any further questions, feel free to ask!
