11-10-2021 06:03 PM
I have been testing OPTIMIZE a huge set of data (about 775 million rows) and getting mixed results. When I tried on a 'string' column, the query return in 2.5mins and using the same column as 'integer', using the same query, it return 9.7 seconds. Please advice.
I am using 9.1 LTS on the Azure environment.
11-10-2021 11:54 PM
that depends on the query, the table and what optimize you use (binning, z-order).
Delta lake by default collects statistics for the first 32 columns (which can be changed).
Building statistics for long strings is also more expensive than f.e. for integers.
Then there is also the fact that evaluating numbers is faster than strings.
What could also play is auto scaling on your cluster, or spot instances which are abandoned etc.
So, not easy to pinpoint the difference.
11-10-2021 09:03 PM
Hi @ thbeh ! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.
11-10-2021 11:54 PM
that depends on the query, the table and what optimize you use (binning, z-order).
Delta lake by default collects statistics for the first 32 columns (which can be changed).
Building statistics for long strings is also more expensive than f.e. for integers.
Then there is also the fact that evaluating numbers is faster than strings.
What could also play is auto scaling on your cluster, or spot instances which are abandoned etc.
So, not easy to pinpoint the difference.
11-11-2021 09:52 PM
@Werner Stinckens Thanks for your explanation.
Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections.
Click here to register and join today!
Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.