Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

Pandas UDF max batch size not working in notebook

277745
New Contributor

Hello 

I am trying to set the max batch size for a pandas UDF in a Databricks notebook, but in my tests it doesn't have any effect on the batch size:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 1000000)


Should I set it some other way, for example during cluster configuration?

Thanks for clarification in advance!

 

1 REPLY

Kaniz_Fatma
Community Manager

Hi @277745, it seems you're working with a pandas UDF in a Databricks notebook and trying to set the maximum batch size.

Let's address your query:

  1. Setting Max Batch Size for Pandas UDF:

    • You've already taken the right steps by configuring the following Spark settings:
      • spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      • spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", 1000000)
    • However, it appears that these settings haven't had the desired effect on the batch size during your tests.
  2. Cluster Configuration vs. Notebook Configuration:

    • The settings you've applied are specific to the notebook session. If you want to set them globally for the entire cluster, you should consider adjusting the cluster configuration.
    • To set these parameters at the cluster level, follow these steps:
      1. Go to your Databricks workspace.
      2. Navigate to the Clusters tab.
      3. Click on the cluster you're using.
      4. In the Spark Config section, add the following configurations:
        • spark.sql.execution.arrow.enabled with a value of "true"
        • spark.sql.execution.arrow.maxRecordsPerBatch with a value of 1000000
      5. Click Save to apply the changes to the entire cluster.
  3. Additional Considerations:

    • Keep in mind that maxRecordsPerBatch sets an upper bound on the number of records per batch for Arrow-based pandas UDFs; it doesn't guarantee that every batch reaches that size. A batch also never spans partitions, so the actual batch size is additionally capped by each partition's row count and may vary with data distribution and other factors.
    • If you encounter issues even after adjusting the cluster configuration, check other factors such as data skew, memory availability, and resource utilization.
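As a concrete sketch of the Spark Config step above: the box takes one space-separated key/value pair per line. (Note that on Spark 3.x clusters the Arrow conversion flag was renamed to spark.sql.execution.arrow.pyspark.enabled, and pandas UDFs use Arrow regardless of that flag.)

```
spark.sql.execution.arrow.enabled true
spark.sql.execution.arrow.maxRecordsPerBatch 1000000
```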

In summary, while notebook-level settings are useful for testing and experimentation, adjusting the cluster configuration will ensure consistent behavior across all notebooks running on that cluster. Feel free to explore this approach, and I hope this clarifies things for you! 😊

If you have any further questions, feel free to ask!
