What is the upper bound for dataSkippingNumIndexedCols, to keep stats in the Delta log file?
01-17-2023 01:22 AM
Is there an upper bound on the value I can assign to delta.dataSkippingNumIndexedCols for computing statistics? Is there a tradeoff benchmark available for increasing this number beyond 32?
Labels: DeltaLog, Limit, Number, Statistics, Upper Bound
03-08-2023 08:21 PM
@Chhavi Bansal :
The delta.dataSkippingNumIndexedCols table property controls how many columns Delta Lake collects statistics on (such as per-file minimum values, maximum values, and null counts) when it writes data; data skipping then uses those statistics at query time. The default is 32, and statistics are collected for the first N columns in the table schema, not for N columns of your choosing. There is no hard upper bound on the value, and setting it to -1 collects statistics for all columns. A very large value has a cost, though: every write has to compute more statistics and the per-file entries in the Delta log grow, which can hurt write performance and memory usage.

The optimal value depends on the characteristics of your data and the workload you are running. A reasonable guideline is to set delta.dataSkippingNumIndexedCols high enough to cover the columns you commonly use in filter predicates, and to make sure those columns sit within the first N positions of the schema. You can also adjust the value based on the size of your data and the resources available to your cluster.
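As a rough illustration of the guideline above, here is a minimal Python sketch that works out the smallest value covering a given set of filter columns and builds the matching `ALTER TABLE` statement. The helper names, the table name `events`, and the column names are all made up for the example, and it assumes a flat schema (with nested structs, each leaf field counts as a column). As far as I know, changing the property only affects statistics for newly written files.

```python
from typing import Sequence


def min_indexed_cols(schema_columns: Sequence[str], filter_columns: Sequence[str]) -> int:
    """Smallest N such that the first N schema columns include every filter column.

    Hypothetical helper: assumes a flat schema listed in table order.
    Raises ValueError if a filter column is not in the schema.
    """
    positions = [schema_columns.index(col) for col in filter_columns]
    return max(positions) + 1 if positions else 0


def build_alter_statement(table_name: str, num_indexed_cols: int) -> str:
    """Build the SQL that sets delta.dataSkippingNumIndexedCols on a table.

    -1 means "collect statistics for all columns"; anything lower is rejected.
    """
    if num_indexed_cols < -1:
        raise ValueError("num_indexed_cols must be -1 or greater")
    return (
        f"ALTER TABLE {table_name} "
        f"SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '{num_indexed_cols}')"
    )


if __name__ == "__main__":
    # Illustrative schema; only event_time and country are used in filters.
    schema = ["id", "event_time", "payload", "country", "device"]
    n = min_indexed_cols(schema, ["event_time", "country"])
    statement = build_alter_statement("events", n)
    print(n)
    print(statement)
    # On a cluster with an active SparkSession you would then run:
    #   spark.sql(statement)
```

If the filter columns sit far down a wide schema, moving them toward the front of the table is usually cheaper than raising the property to cover everything in between.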
As for the tradeoff benchmark, I am not aware of any specific benchmark related to this configuration property. However, you can monitor the performance and memory usage of your Delta Lake workload with different values of this configuration property to determine the optimal value for your specific use case.
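In the absence of a published benchmark, one way to compare settings yourself is a small timing harness like the sketch below. It is only an outline: `time_candidates` and `fake_workload` are hypothetical names, and the placeholder workload should be replaced with your own logic (for example, set the property, rewrite a sample of the table, then run a representative filtered query).

```python
import time
from typing import Callable, Dict, Iterable


def time_candidates(
    candidates: Iterable[int],
    run_workload: Callable[[int], None],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return the best wall-clock time in seconds observed for each candidate value.

    run_workload receives the candidate dataSkippingNumIndexedCols value and is
    expected to apply it and run whatever write/read workload you care about.
    """
    results: Dict[int, float] = {}
    for value in candidates:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_workload(value)
            timings.append(time.perf_counter() - start)
        results[value] = min(timings)  # best-of-N reduces noise from warm-up
    return results


if __name__ == "__main__":
    def fake_workload(value: int) -> None:
        # Stand-in for a real Spark job; does trivial CPU work only.
        sum(range(value * 1000))

    for value, seconds in time_candidates([32, 64, 128], fake_workload).items():
        print(f"dataSkippingNumIndexedCols={value}: {seconds:.6f}s")
```

Timing writes and reads separately is worthwhile, since a higher value tends to slow writes while it may speed up selective reads.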