Databricks Community

Erfan · 10-06-2024

Hi there,

I’m trying to join a small table (a few million records) with a much larger table (around 1 TB in size, containing a few billion records).

The small table isn’t quite small enough to use Broadcast. Additionally, our join clause involves more than four columns. I attempted to enable Liquid Clustering on the large table, but it only supports up to four columns. I experimented with different combinations of four-column sets for Liquid Clustering, but none of them reduced the join time.

Do you have any recommendations for optimizing a query on a table with Liquid Clustering when the join criteria involve more than four columns?

filipniziol · 10-06-2024

Hi @Erfan ,

You can create an additional column that concatenates the values of the join columns, and then apply Liquid Clustering on that new column.
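As a rough sketch of that idea (not a tested recipe), the Python helper below only builds the Spark SQL statements as strings. The table name `big_table`, the join columns, the key column name `join_key`, the `|` separator, and the `<NULL>` placeholder are all assumptions made for illustration. For the clustering to have a chance of helping, the small table would presumably need the same key column, and the join would need to use it.

```python
from typing import List, Sequence


def composite_key_expr(columns: Sequence[str], sep: str = "|") -> str:
    """Build a Spark SQL expression that concatenates `columns` into one string key."""
    if not columns:
        raise ValueError("at least one column is required")
    # concat_ws skips NULL arguments, which would shift values between positions,
    # so each NULL is replaced with an explicit placeholder (an assumption here).
    # Note: rows with NULL join columns then compare equal on the key, unlike a
    # plain equi-join on the original columns.
    parts = ", ".join(
        f"coalesce(cast({col} AS STRING), '<NULL>')" for col in columns
    )
    return f"concat_ws('{sep}', {parts})"


def clustering_statements(
    table: str, columns: Sequence[str], key_col: str = "join_key"
) -> List[str]:
    """Return the SQL statements that add, fill, and cluster by the composite key."""
    expr = composite_key_expr(columns)
    return [
        # 1. Add the new (hypothetical) key column.
        f"ALTER TABLE {table} ADD COLUMNS ({key_col} STRING)",
        # 2. Populate it; on a ~1 TB table this rewrites a lot of data.
        f"UPDATE {table} SET {key_col} = {expr}",
        # 3. Cluster by the single composite column instead of 4+ columns.
        f"ALTER TABLE {table} CLUSTER BY ({key_col})",
        # 4. Trigger clustering of the existing data.
        f"OPTIMIZE {table}",
    ]


if __name__ == "__main__":
    # Hypothetical table and column names, printed for review.
    # In a notebook, each statement could be run with spark.sql(stmt).
    for stmt in clustering_statements("big_table", ["c1", "c2", "c3", "c4", "c5"]):
        print(stmt)
```

Whether this actually reduces join time depends on the data and on how the join is written, so it is worth comparing query plans and timings before and after.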