Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

OPTIMIZE

Anonymous
Not applicable

I have been testing OPTIMIZE on a huge dataset (about 775 million rows) and getting mixed results. When I filtered on a 'string' column, the query returned in 2.5 minutes; using the same column as 'integer', the same query returned in 9.7 seconds. Please advise.

I am using 9.1 LTS on the Azure environment.

1 ACCEPTED SOLUTION


-werners-
Esteemed Contributor III

That depends on the query, the table, and which kind of OPTIMIZE you use (bin-packing or Z-ordering).

By default, Delta Lake collects statistics for the first 32 columns of a table (this number can be changed).
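For example, the number of columns for which statistics are collected can be adjusted through a table property. A minimal sketch, assuming a hypothetical Delta table named `events`:

```sql
-- Hypothetical table name; substitute your own.
-- Collect data-skipping statistics on the first 40 columns instead of the default 32.
ALTER TABLE events
SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '40');
```

Alternatively, reordering the table so that frequently filtered columns fall within the indexed range avoids widening the statistics collection.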

Building statistics for long strings is also more expensive than, for example, for integers.

There is also the fact that comparing numbers is faster than comparing strings.

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-...

Autoscaling on your cluster, or spot instances being reclaimed, could also play a role.

So it is not easy to pinpoint the cause of the difference.
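As a sketch of the two OPTIMIZE variants mentioned above, again assuming a hypothetical Delta table `events` with a filter column `event_id`:

```sql
-- Bin-packing (the default): compacts small files without reordering data.
OPTIMIZE events;

-- Z-ordering: co-locates related values of the listed column(s) in the same
-- files, which improves data skipping for queries that filter on them.
OPTIMIZE events
ZORDER BY (event_id);
```

Z-ordering is most useful for high-cardinality columns that appear often in query predicates; for simple file compaction, plain OPTIMIZE is enough.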


3 REPLIES

Kaniz_Fatma
Community Manager

Hi @thbeh! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise I will get back to you soon. Thanks.


Anonymous
Not applicable

@Werner Stinckens Thanks for your explanation.
