I can find documentation for enabling automatic liquid clustering with SQL: `CLUSTER BY AUTO`. But how do I do this with PySpark? I know I can do it with `spark.sql("ALTER TABLE <table_name> CLUSTER BY AUTO")`, but ideally I want to pass it as an `.option()`.
Thanks in advance.
Accepted Solutions
There is currently no `.clusterBy("AUTO")` method on PySpark's `DataFrameWriter` API, so you cannot enable automatic liquid clustering purely through a writer method. However, there are workarounds:
1. Using SQL via `spark.sql()`
The simplest way to enable automatic liquid clustering is by executing an SQL statement:
```python
spark.sql("ALTER TABLE table_name CLUSTER BY AUTO")
```
This enables automatic liquid clustering on an existing Delta table.
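If the table doesn't exist yet, the same SQL route works at creation time. A minimal sketch, assuming a Unity Catalog managed table (the catalog, schema, table, and column names are placeholders):
```python
# Create a new managed table with automatic liquid clustering
# enabled from the start, using the documented CLUSTER BY AUTO syntax.
spark.sql("""
    CREATE TABLE main.default.table_name (
        col1 STRING,
        col2 INT
    )
    CLUSTER BY AUTO
""")
```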
2. Using the DeltaTableBuilder API
If you're creating a new table programmatically, you can use the DeltaTableBuilder API in PySpark to specify clustering options:
```python
from delta.tables import DeltaTable

# Build the table definition. The two autoOptimize properties are
# standard Delta settings; "delta.clusterBy.auto" is an assumed name
# for the clustering property, so verify it against your Databricks
# Runtime release notes before relying on it.
DeltaTable.create(spark) \
    .tableName("table_name") \
    .addColumn("col1", "STRING") \
    .addColumn("col2", "INT") \
    .property("delta.autoOptimize.optimizeWrite", "true") \
    .property("delta.autoOptimize.autoCompact", "true") \
    .property("delta.clusterBy.auto", "true") \
    .execute()
```
Here, `.property("delta.clusterBy.auto", "true")` is intended to enable automatic liquid clustering; this property name isn't consistently documented, so confirm it for your runtime version. You can verify the result with the check below.
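Whichever route you take, it's worth inspecting the table afterwards to confirm clustering is actually enabled. A quick check using documented SQL (the `table_name` placeholder carries over from above):
```python
# Show the table properties; an entry reflecting automatic
# clustering should appear once the feature is enabled.
spark.sql("SHOW TBLPROPERTIES table_name").show(truncate=False)

# DESCRIBE DETAIL reports the clustering columns selected so far.
spark.sql("DESCRIBE DETAIL table_name") \
    .select("clusteringColumns") \
    .show(truncate=False)
```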
3. Using `DataFrameWriterV2` for Table Creation
If you're creating a table from an existing DataFrame, you can use the `DataFrameWriterV2` API:
```python
# Create the table from a DataFrame. "clusterByAuto" is the option
# name used in Databricks' liquid clustering documentation, but
# confirm it for your runtime version.
df.writeTo("table_name") \
    .using("delta") \
    .option("clusterByAuto", "true") \
    .create()
```
This approach lets you set the `clusterByAuto` option directly during the write operation.
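If the table may already exist, `DataFrameWriterV2` also offers `createOrReplace()`; a minimal sketch, with the same assumed option name:
```python
# Replace the table if it exists, keeping the clustering option.
df.writeTo("table_name") \
    .using("delta") \
    .option("clusterByAuto", "true") \
    .createOrReplace()
```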
Important Notes
- Automatic liquid clustering requires Databricks Runtime 15.4 LTS or above.
- The table must be a Unity Catalog managed table, and automatic clustering is driven by predictive optimization, which must be enabled for the catalog or schema.
- For existing tables, clustering does not apply retroactively to already-written data unless you run `OPTIMIZE FULL` (see the sketch below).
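As noted in the last bullet, previously written data has to be reclustered explicitly. A minimal sketch using the documented `OPTIMIZE ... FULL` syntax:
```python
# Rewrite all existing files so they are clustered as well.
# OPTIMIZE ... FULL can be expensive on large tables; run it once
# after enabling clustering rather than on a schedule.
spark.sql("OPTIMIZE table_name FULL")
```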