Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Pandas UDF max batch size not working in notebook

277745
New Contributor

Hello 

I am trying to set the max batch size for a pandas UDF in a Databricks notebook, but in my tests it doesn't have any effect on the batch size:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 1000000)


Should I set it in some other way, e.g. in the cluster configuration?

Thanks in advance for any clarification!

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @277745, it seems you're working with a Pandas UDF in a Databricks notebook and trying to set the maximum batch size.

Let's address your query:

  1. Setting Max Batch Size for Pandas UDF:

    • You've already taken the right steps by configuring the following Spark settings:
      • spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      • spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 1000000)
    • However, it appears that these settings haven't had the desired effect on the batch size during your tests.
  2. Cluster Configuration vs. Notebook Configuration:

    • The settings you've applied are specific to the notebook's Spark session. If you want them to apply globally to every notebook on the cluster, adjust the cluster configuration instead.
    • To set these parameters at the cluster level, follow these steps:
      1. Go to your Databricks workspace.
      2. Navigate to the Clusters tab.
      3. Click on the cluster you're using.
      4. In the Spark Config section, add the following configurations:
        • spark.sql.execution.arrow.enabled with a value of "true"
        • spark.sql.execution.arrow.maxRecordsPerBatch with a value of 1000000
      5. Click Save to apply the changes to the entire cluster.
  3. Additional Considerations:

    • Keep in mind that the maxRecordsPerBatch setting determines the maximum number of records per batch when using Arrow-based Pandas UDFs. However, it doesn't guarantee that every batch will have exactly that many records. The actual batch size may vary based on data distribution and other factors.
    • If you encounter any issues even after adjusting the cluster configuration, consider checking other factors such as data skew, memory availability, and resource utilization.

In summary, while notebook-level settings are useful for testing and experimentation, adjusting the cluster configuration will ensure consistent behavior across all notebooks running on that cluster. Feel free to explore this approach, and I hope this clarifies things for you! 😊
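For reference, the cluster's Spark Config box takes one space-separated key/value pair per line, so the two settings from the steps above would look like this (the first key is the Spark 3 replacement for the deprecated spark.sql.execution.arrow.enabled):

```
spark.sql.execution.arrow.pyspark.enabled true
spark.sql.execution.arrow.maxRecordsPerBatch 1000000
```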

If you have any further questions, feel free to ask!
