<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Cluster by auto pyspark in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/138181#M10979</link>
    <description>&lt;P&gt;&lt;SPAN&gt;This is now supported on DBR 16.4+ for both the DataFrameWriterV1 and DataFrameWriterV2 APIs, as well as the DLT and streaming APIs. More details are here:&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://docs.databricks.com/aws/en/delta/clustering" target="_blank" rel="nofollow noopener noreferrer"&gt;https://docs.databricks.com/aws/en/delta/clustering&lt;/A&gt;&lt;SPAN&gt;. Basically, use the option &lt;/SPAN&gt;&lt;SPAN&gt;`.option("clusterBy.auto", "true")`&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 07 Nov 2025 20:43:21 GMT</pubDate>
    <dc:creator>parimarjan</dc:creator>
    <dc:date>2025-11-07T20:43:21Z</dc:date>
    <item>
      <title>Cluster by auto pyspark</title>
      <link>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/115251#M9333</link>
      <description>&lt;P&gt;I can find documentation for enabling automatic liquid clustering with SQL code: CLUSTER BY AUTO. But how do I do this with PySpark? I know I can do it with spark.sql("ALTER TABLE CLUSTER BY AUTO"), but ideally I want to pass it as an .option().&lt;/P&gt;&lt;P&gt;Thanks in advance.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 09:40:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/115251#M9333</guid>
      <dc:creator>htd350</dc:creator>
      <dc:date>2025-04-11T09:40:13Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster by auto pyspark</title>
      <link>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/115310#M9334</link>
      <description>&lt;P&gt;To enable automatic liquid clustering with PySpark and pass it as an `.option()` during table creation or modification: you currently cannot call a `.clusterBy("AUTO")` method directly in PySpark's `DataFrameWriter` API. However, there are workarounds:&lt;/P&gt;
&lt;P&gt;1. Using SQL via `spark.sql()`&lt;BR /&gt;The simplest way to enable automatic liquid clustering is by executing an SQL statement:&lt;BR /&gt;```python&lt;BR /&gt;spark.sql("ALTER TABLE table_name CLUSTER BY AUTO")&lt;BR /&gt;```&lt;BR /&gt;This enables automatic liquid clustering on an existing Delta table.&lt;/P&gt;
&lt;P&gt;2. Using the DeltaTableBuilder API&lt;BR /&gt;If you're creating a new table programmatically, you can use the DeltaTableBuilder API in PySpark to specify clustering options:&lt;BR /&gt;```python&lt;BR /&gt;from delta.tables import DeltaTable&lt;/P&gt;
&lt;P&gt;DeltaTable.create(spark) \&lt;BR /&gt;.tableName("table_name") \&lt;BR /&gt;.addColumn("col1", "STRING") \&lt;BR /&gt;.addColumn("col2", "INT") \&lt;BR /&gt;.property("delta.autoOptimize.optimizeWrite", "true") \&lt;BR /&gt;.property("delta.autoOptimize.autoCompact", "true") \&lt;BR /&gt;.property("delta.clusterBy.auto", "true") \&lt;BR /&gt;.execute()&lt;BR /&gt;```&lt;BR /&gt;Here, `.property("delta.clusterBy.auto", "true")` ensures that automatic liquid clustering is enabled.&lt;/P&gt;
&lt;P&gt;3. Using `DataFrameWriterV2` for Table Creation&lt;BR /&gt;If you're creating a table from an existing DataFrame, you can use the `DataFrameWriterV2` API:&lt;BR /&gt;```python&lt;BR /&gt;df.writeTo("table_name") \&lt;BR /&gt;.using("delta") \&lt;BR /&gt;.option("clusterBy.auto", "true") \&lt;BR /&gt;.create()&lt;BR /&gt;```&lt;BR /&gt;This approach allows you to specify the `clusterBy.auto` option directly during the write operation.&lt;/P&gt;
&lt;P&gt;Important Notes&lt;BR /&gt;- Automatic liquid clustering requires Databricks Runtime 15.4 LTS or higher.&lt;BR /&gt;- Ensure your table is managed by Unity Catalog if using automatic clustering.&lt;BR /&gt;- For existing tables, clustering does not apply retroactively to old data unless you run `OPTIMIZE FULL`.&lt;/P&gt;</description>
      <pubDate>Fri, 11 Apr 2025 16:24:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/115310#M9334</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-11T16:24:37Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster by auto pyspark</title>
      <link>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/116573#M9888</link>
      <description>&lt;P&gt;How about if I use&amp;nbsp;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/97035"&gt;@Dlt&lt;/a&gt;&lt;/SPAN&gt;&lt;SPAN&gt;.&lt;/SPAN&gt;&lt;SPAN&gt;table&lt;/SPAN&gt;&lt;SPAN&gt;&lt;BR /&gt;&lt;BR /&gt;Is it possible to configure automatic liquid clustering in the &lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;table_properties?&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 25 Apr 2025 12:35:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/116573#M9888</guid>
      <dc:creator>claudiayuan</dc:creator>
      <dc:date>2025-04-25T12:35:34Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster by auto pyspark</title>
      <link>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/116594#M9889</link>
      <description>&lt;P&gt;Not at the moment.&amp;nbsp; You have to use the SQL DDL commands, either at table creation or via an ALTER TABLE command. Hope this helps, Louis.&lt;/P&gt;</description>
      <pubDate>Fri, 25 Apr 2025 15:28:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/116594#M9889</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-25T15:28:06Z</dc:date>
    </item>
    <item>
      <title>Re: Cluster by auto pyspark</title>
      <link>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/138180#M10978</link>
      <description>&lt;P&gt;This is now supported on DBR 16.4+ for both the DataFrameWriterV1 and DataFrameWriterV2 APIs, as well as the DLT and streaming APIs. More details are here:&amp;nbsp;&lt;A href="https://docs.databricks.com/aws/en/delta/clustering" target="_blank"&gt;https://docs.databricks.com/aws/en/delta/clustering&lt;/A&gt;. Basically, use the option &lt;SPAN&gt;`.option("clusterBy.auto", "true")`&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Nov 2025 20:42:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/cluster-by-auto-pyspark/m-p/138180#M10978</guid>
      <dc:creator>parimarjan</dc:creator>
      <dc:date>2025-11-07T20:42:34Z</dc:date>
    </item>
  </channel>
</rss>

