Liquid Clustering With more than 4 columns

Erfan — Mon, 07 Oct 2024 02:29:35 GMT

Hi there,

I’m trying to join a small table (a few million records) with a much larger table (around 1 TB in size, containing a few billion records).

The small table isn’t quite small enough for a broadcast join. Additionally, our join clause involves more than four columns. I attempted to enable Liquid Clustering on the large table, but it only supports up to four clustering columns. I experimented with different four-column combinations as clustering keys, but none of them reduced the join time.

Do you have any recommendations for optimizing a query on a table with Liquid Clustering when the join criteria involve more than four columns?

Re: Liquid Clustering With more than 4 columns

filipniziol — Mon, 07 Oct 2024 06:55:05 GMT

Hi @Erfan ,

You can create an additional column that concatenates the values of the join columns, and then apply Liquid Clustering on that new column.
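
To make the suggestion concrete, here is a rough sketch of what the statements could look like. It is only a small Python helper that builds the SQL text and executes nothing. The table name `big_table`, the column names `c1` to `c5`, the new column name `join_key`, and the `|` separator are all made up for illustration, and the sketch assumes the separator never appears inside the column values. The join would presumably also need to compare `join_key` on both sides for the clustering to be of any use.

```python
def composite_key_sql(table, key_columns, new_column="join_key", sep="|"):
    """Build (but do not run) SQL that adds a concatenated key column
    and clusters the table on it. All names are hypothetical."""
    if len(key_columns) < 2:
        raise ValueError("expected at least two join columns")

    # Cast each column to STRING and replace NULL with an empty string,
    # because concat_ws skips NULL inputs and that could merge distinct keys.
    parts = ", ".join(f"coalesce(cast({c} AS STRING), '')" for c in key_columns)
    key_expr = f"concat_ws('{sep}', {parts})"

    return [
        # 1. Add the new column that will hold the combined key.
        f"ALTER TABLE {table} ADD COLUMNS ({new_column} STRING)",
        # 2. Populate it for the existing rows.
        f"UPDATE {table} SET {new_column} = {key_expr}",
        # 3. Use the single combined column as the clustering key.
        f"ALTER TABLE {table} CLUSTER BY ({new_column})",
        # 4. OPTIMIZE triggers clustering of the data.
        f"OPTIMIZE {table}",
    ]


if __name__ == "__main__":
    for stmt in composite_key_sql("big_table", ["c1", "c2", "c3", "c4", "c5"]):
        print(stmt + ";")
```

New rows would also need the column filled in on write, and whether this actually speeds up the join depends on the data, so treat it as something to test rather than a guaranteed fix.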

Re: Liquid Clustering With more than 4 columns

Erfan — Mon, 07 Oct 2024 07:02:58 GMT

Hi @filipniziol ,

Good idea. I'll try it and will come back with the result. Thanks!

Re: Liquid Clustering With more than 4 columns

Erfan — Wed, 09 Oct 2024 01:53:48 GMT

Unfortunately, since I am not the owner of the data, I am not allowed to add an additional column, so I can't test it. But I guess your idea