06-14-2024 04:17 AM
Hello,
I have a large Delta table (~29 TB). I implemented Liquid Clustering on this table, but running a simple MIN/MAX query on the cluster column is still extremely slow. I have already run OPTIMIZE on the table. Am I missing something in my implementation?
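For reference, the setup in question looks roughly like this, as a sketch with a hypothetical table name (the column name and types are assumptions):

```sql
-- Sketch of the setup (hypothetical table name "my_events").
-- Define the cluster key with Liquid Clustering:
CREATE TABLE my_events (
  _PartitionColumnUTC_ INT,
  payload STRING
)
CLUSTER BY (_PartitionColumnUTC_);

-- Compact/rewrite files according to the cluster key:
OPTIMIZE my_events;

-- The query that is unexpectedly slow; with per-file min/max
-- statistics available, this should be answered from metadata
-- rather than a full table scan:
SELECT MIN(_PartitionColumnUTC_) AS min_day,
       MAX(_PartitionColumnUTC_) AS max_day
FROM my_events;
```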
06-14-2024 07:15 AM
Hi
This operation should take only seconds, because it can use the table's precomputed statistics. A few things to verify:
If you do need to read all the data: I saw in the image that only 4 tasks are running, which means 4 cores. I would recommend a bigger cluster in memory and cores (scale up) with fewer nodes, to reduce the shuffle.
06-14-2024 04:58 AM - edited 06-14-2024 04:59 AM
What is the data type of the field you're querying?
All I can see is the name "_PartitionColumnUTC_". Judging by the name it is a date/timestamp but this is me making assumptions.
06-14-2024 11:22 PM
Hello, this is a type integer in the format YYYYMMDD
06-15-2024 11:45 AM
That is strange, min/max of integers should be able to be retrieved very quickly, especially if they are partitioned columns. You are 100% sure it is an integer column and not a string? You didn't specify any filter clauses in your queries that would potentially trigger a full table scan?
06-19-2024 11:20 PM
Hi Jaco,
Using the ANALYZE TABLE command fixed the issue; however, I am still experiencing very slow queries on the STRING type of a different cluster key. Does liquid clustering not support the STRING type very well?
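The fix mentioned above, sketched with a hypothetical table name (standard Databricks/Spark SQL syntax):

```sql
-- Recompute table- and column-level statistics so queries like
-- MIN/MAX can be served from metadata (hypothetical table name):
ANALYZE TABLE my_events COMPUTE STATISTICS FOR ALL COLUMNS;
```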
06-19-2024 11:31 PM
Ah, then your table's statistics needed to be recomputed; glad it works now.
As for string types, liquid clustering should work just as well on them.
"Slow" is a bit subjective, though. You haven't yet mentioned your warehouse tier/cluster config; are you using sufficient processing power?