06-15-2025 07:29 AM
Hello,
I have a table with approximately 300 million records. It weighs 3.4 GB and consists of 305 files.
I wanted to create liquid clustering for it and chose a date column as the key for clustering.
When I created a new table with the above details based on the original table, the size of the table increased to nearly 5 GB. Although the number of files indeed decreased to 24, query performance was slower on the new table, and in the performance metrics, I noticed that the amount of data being read was larger compared to the original table without clustering.
What am I missing here? Why is this happening?
Thanks!
06-15-2025 08:31 AM
06-16-2025 05:23 AM
I really appreciate your detailed response.
I tried changing the key to a column with higher cardinality—a column that includes both date and time (timestamp)—to achieve a more even distribution.
Now the size has dropped to 4.3 GiB, and the number of files is 64.
Currently, the queries performed similarly, with a slight advantage for the table with the CLUSTER BY, when filtering on the clustered column. However, the amount of MB read was still higher.
I was under the impression that, unlike PARTITION BY, I don’t need to find a column with an even distribution for CLUSTER BY.
Could you clarify in which cases I would see a significant improvement with a CLUSTER BY table compared to a table with no optimization strategy at all?
Thank you!
06-15-2025 10:25 AM
Hi @OODataEng, could you please elaborate on what exact command you used to create a new table using the older one?
06-15-2025 11:00 AM
Hi
create table x cluster by (col1,col2,col3)
as select * from table y
and than I run optimize command on table x
Thank you
06-15-2025 11:03 AM
I tried it with 1 column as a clustered key and also with 3 columns, both scenarios the size of the table increased
06-16-2025 08:25 AM
Hey @OODDATAEng
To create a new table in Databricks using the schema and data from an existing table, you can use the CREATE TABLE AS SELECT command. This command allows you to define a new table based on the results of a SELECT query executed on the existing table.
Here's a breakdown of the command and its components:
Code
CREATE TABLE new_table_name
AS
SELECT * FROM existing_table_name;
CREATE TABLE new_table_name:
 
					
				
				
			
		
 
					
				
				
			
		
Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!
Sign Up Now