
Pandas UDF max batch size not working in notebook

277745
New Contributor

Hello 

I am trying to set the maximum batch size for a pandas UDF in a Databricks notebook, but in my tests it has no effect on the batch size.

spark.conf.set("spark.sql.execution.arrow.enabled", "true")

spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', 1000000)

Should I set it another way, for example in the cluster configuration?

Thanks for clarification in advance!

 

1 REPLY

Kaniz
Community Manager

Hi @277745, it seems you’re working with a pandas UDF in a Databricks notebook and trying to set the maximum batch size.

Let’s address your query:

  1. Setting Max Batch Size for Pandas UDF:

    • You’ve already taken the right steps by configuring the following Spark settings:
      • spark.conf.set("spark.sql.execution.arrow.enabled", "true")
      • spark.conf.set('spark.sql.execution.arrow.maxRecordsPerBatch', 1000000)
    • However, it appears that these settings haven’t had the desired effect on the batch size during your tests.
  2. Cluster Configuration vs. Notebook Configuration:

    • The settings you’ve applied are specific to the notebook session. If you want to set them globally for the entire cluster, you should consider adjusting the cluster configuration.
    • To set these parameters at the cluster level, follow these steps:
      1. Go to your Databricks workspace.
      2. Navigate to the Clusters tab.
      3. Click on the cluster you’re using.
      4. In the Spark Config section, add the following configurations:
        • spark.sql.execution.arrow.enabled with a value of "true"
        • spark.sql.execution.arrow.maxRecordsPerBatch with a value of 1000000
      5. Click Save to apply the changes to the entire cluster.
  3. Additional Considerations:

    • Keep in mind that the maxRecordsPerBatch setting is an upper bound on the number of records per batch for Arrow-based Pandas UDFs, not a target. Batches are cut from each partition separately, so a partition that holds fewer rows than the limit produces a smaller batch. The actual batch size therefore depends on how the data is partitioned.
    • If you encounter any issues even after adjusting the cluster configuration, consider checking other factors such as data skew, memory availability, and resource utilization.
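To make the "upper bound" point concrete, here is a minimal sketch rather than a definitive diagnosis. `expected_batch_sizes` is a hypothetical helper that models the batching under one assumption: each partition is cut independently into chunks of at most maxRecordsPerBatch rows. `observed_batch_sizes` is also a hypothetical helper; it uses `DataFrame.mapInPandas` (Spark 3.0+) to report how many rows each pandas batch actually contains, so you can compare the model with what your cluster does.

```python
def expected_batch_sizes(partition_row_counts, max_records_per_batch):
    """Model of Arrow batching, ASSUMING each partition is chunked on its own.

    A limit of zero or less is treated as "no limit", i.e. one batch per
    non-empty partition.
    """
    sizes = []
    for rows in partition_row_counts:
        if rows <= 0:
            continue  # empty partitions contribute no batches
        if max_records_per_batch <= 0:
            sizes.append(rows)
            continue
        full_batches, remainder = divmod(rows, max_records_per_batch)
        sizes.extend([max_records_per_batch] * full_batches)
        if remainder:
            sizes.append(remainder)
    return sizes


def observed_batch_sizes(df):
    """Return the row count of every pandas batch Spark passes to Python for `df`."""
    import pandas as pd  # imported lazily so the model above runs without Spark

    def count_rows(batches):
        # `batches` is an iterator of pandas DataFrames, one per Arrow batch.
        for pdf in batches:
            yield pd.DataFrame({"rows_in_batch": [len(pdf)]})

    counted = df.mapInPandas(count_rows, schema="rows_in_batch long")
    return [row["rows_in_batch"] for row in counted.collect()]


# In a notebook where `spark` already exists (illustrative sizes only):
#   spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000000")
#   df = spark.range(3_000_000).repartition(8)
#   print(sorted(observed_batch_sizes(df), reverse=True))
```

Under this model, 3,000,000 rows spread evenly over 8 partitions give eight batches of 375,000 rows each, so a limit of 1,000,000 is never reached. If the observed sizes look like that, repartitioning into fewer, larger partitions is one thing to try, keeping executor memory in mind.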

In summary, while notebook-level settings are useful for testing and experimentation, adjusting the cluster configuration will ensure consistent behavior across all notebooks running on that cluster. Feel free to explore this approach, and I hope this clarifies things for you! 😊
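Before changing the cluster configuration, it may help to confirm which values the current notebook session actually sees. The sketch below is only an illustration: `effective_arrow_settings` is a hypothetical helper built on `spark.conf.get`, and the second key in the list is the Spark 3.x name for the Arrow switch, included on the assumption that your runtime may use it instead of the older key from this thread.

```python
ARROW_KEYS = (
    "spark.sql.execution.arrow.enabled",          # key used in this thread
    "spark.sql.execution.arrow.pyspark.enabled",  # Spark 3.x name for the same switch
    "spark.sql.execution.arrow.maxRecordsPerBatch",
)


def effective_arrow_settings(spark, keys=ARROW_KEYS):
    """Return the value the given session reports for each key.

    Keys that the running Spark version does not recognise map to None.
    """
    settings = {}
    for key in keys:
        try:
            settings[key] = spark.conf.get(key)
        except Exception:  # unknown key on this Spark version
            settings[key] = None
    return settings


# In a notebook where `spark` already exists:
#   for key, value in effective_arrow_settings(spark).items():
#       print(key, "=", value)
```

If maxRecordsPerBatch already shows the value you set in the notebook, the setting has been applied to the session, and the batch sizes you see are more likely explained by partition sizes than by where the configuration lives.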

If you have any further questions, feel free to ask!