Databricks Community

Direo · 09-04-2024

Hello everyone,

I'm exploring ways to perform clustering on a feature store table that I've created using the FeatureEngineeringClient in Databricks, and I'm particularly interested in applying liquid clustering to one of the columns.

Here’s the scenario:

I created a feature store table using the following code:

from databricks.feature_engineering import FeatureEngineeringClient, FeatureLookup

# Initialize the FeatureEngineeringClient
fe = FeatureEngineeringClient()

# Define the feature store table with primary key and schema
fe.create_table(
    name=table_name,
    primary_keys=["wine_id"],
    schema=features_df.schema,
    description="wine features"
)

# Write data to the feature store table
fe.write_table(
    name=table_name,
    df=features_df,
    mode="merge"
)

Now that I have the feature store table in place with various features, I'd like to apply liquid clustering to one of the columns (or multiple columns).

My Question:

How can I implement liquid clustering on this feature store table in Python? I know that I can enable liquid clustering on an existing unpartitioned Delta table using the following syntax:

ALTER TABLE <table_name>
CLUSTER BY (<clustering_columns>)

but that is SQL, and I would like to do this from Python.

Any help or code examples on this would be greatly appreciated!

Thank you!

Sidhant07 · 12-09-2024

Hi,

You can run the same ALTER TABLE statement from Python by passing it to spark.sql:

# Set the table name and clustering columns
table_name = "feature_store_table"
clustering_columns = ["column1", "column2"]

# Build the SQL command
sql_command = f"ALTER TABLE {table_name} CLUSTER BY ({', '.join(clustering_columns)})"

# Execute the SQL command
spark.sql(sql_command)
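
If it helps, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the function names (quote_identifier, build_cluster_by_sql, apply_liquid_clustering) are hypothetical and not part of any Databricks API, the table name and the "region" column in the usage note are placeholders, and the follow-up OPTIMIZE is included on the assumption that you want existing data reclustered right away rather than waiting for later writes or maintenance runs. Identifiers are backtick-quoted so that names containing special characters do not break the statement.

```python
def quote_identifier(name):
    """Backtick-quote each dot-separated part of a table or column name."""
    parts = name.split(".")
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


def build_cluster_by_sql(table_name, clustering_columns):
    """Build the ALTER TABLE ... CLUSTER BY statement as a string."""
    if not clustering_columns:
        raise ValueError("At least one clustering column is required")
    columns = ", ".join(quote_identifier(col) for col in clustering_columns)
    return f"ALTER TABLE {quote_identifier(table_name)} CLUSTER BY ({columns})"


def apply_liquid_clustering(spark, table_name, clustering_columns, optimize=True):
    """Run the CLUSTER BY statement, optionally followed by OPTIMIZE.

    `spark` is the active SparkSession (predefined in Databricks notebooks).
    """
    spark.sql(build_cluster_by_sql(table_name, clustering_columns))
    if optimize:
        # Assumption: OPTIMIZE is run here so existing data is reclustered now.
        spark.sql(f"OPTIMIZE {quote_identifier(table_name)}")
```

In a notebook this would be called as, for example, apply_liquid_clustering(spark, table_name, ["wine_id", "region"]), using the same table_name that was passed to fe.create_table. Running spark.sql(f"DESCRIBE DETAIL {table_name}") afterwards should let you check which clustering columns are set.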

Liquid Clustering on a Feature Store Table Created with FeatureEngineeringClient