cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

OPTIMIZE

Anonymous
Not applicable

I have been testing OPTIMIZE a huge set of data (about 775 million rows) and getting mixed results. When I tried on a 'string' column, the query return in 2.5mins and using the same column as 'integer', using the same query, it return 9.7 seconds. Please advice.

I am using 9.1 LTS on the Azure environment.

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

that depends on the query, the table and what optimize you use (binning, z-order).

Delta lake by default collects statistics for the first 32 columns (which can be changed).

Building statistics for long strings is also more expensive than f.e. for integers.

Then there is also the fact that evaluating numbers is faster than strings.

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-...

What could also play is auto scaling on your cluster, or spot instances which are abandoned etc.

So, not easy to pinpoint the difference.

View solution in original post

3 REPLIES 3

Kaniz_Fatma
Community Manager
Community Manager

Hi @ thbeh ! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Or else I will get back to you soon. Thanks.

-werners-
Esteemed Contributor III

that depends on the query, the table and what optimize you use (binning, z-order).

Delta lake by default collects statistics for the first 32 columns (which can be changed).

Building statistics for long strings is also more expensive than f.e. for integers.

Then there is also the fact that evaluating numbers is faster than strings.

https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-copy-...

What could also play is auto scaling on your cluster, or spot instances which are abandoned etc.

So, not easy to pinpoint the difference.

Anonymous
Not applicable

@Werner Stinckensโ€‹  Thanks for your explanation.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group