Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.

CLUSTER BY AUTO with PySpark

htd350
New Contributor II

I can find documentation for enabling automatic liquid clustering with SQL: CLUSTER BY AUTO. But how do I do this with PySpark? I know I can do it with spark.sql("ALTER TABLE table_name CLUSTER BY AUTO"), but ideally I want to pass it as an .option().

Thanks in advance.

1 ACCEPTED SOLUTION

Accepted Solutions

BigRoux
Databricks Employee

To enable automatic liquid clustering with PySpark and pass it as an `.option()` during table creation or modification, you currently cannot directly use a `.clusterBy("AUTO")` method in PySpark's `DataFrameWriter` API. However, there are workarounds:

1. Using SQL via `spark.sql()`
The simplest way to enable automatic liquid clustering is by executing an SQL statement:
```python
spark.sql("ALTER TABLE table_name CLUSTER BY AUTO")
```
This enables automatic liquid clustering on an existing Delta table.
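Since this is just a SQL statement, it is easy to apply across several existing tables from PySpark. A minimal sketch (the table names below are hypothetical placeholders):

```python
# Hedged sketch: enable automatic liquid clustering on a list of existing
# Delta tables by generating one ALTER TABLE statement per table.
# The table names are hypothetical; replace them with your own.
tables = ["sales.orders", "sales.customers"]
statements = [f"ALTER TABLE {t} CLUSTER BY AUTO" for t in tables]

# On a Databricks cluster you would then execute each statement:
# for stmt in statements:
#     spark.sql(stmt)
```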

2. Using the DeltaTableBuilder API
If you're creating a new table programmatically, you can use the DeltaTableBuilder API in PySpark to specify clustering options:
```python
from delta.tables import DeltaTable

DeltaTable.create(spark) \
    .tableName("table_name") \
    .addColumn("col1", "STRING") \
    .addColumn("col2", "INT") \
    .property("delta.autoOptimize.optimizeWrite", "true") \
    .property("delta.autoOptimize.autoCompact", "true") \
    .property("delta.clusterBy.auto", "true") \
    .execute()
```
Here, `.property("delta.clusterBy.auto", "true")` ensures that automatic liquid clustering is enabled.

3. Using `DataFrameWriterV2` for Table Creation
If you're creating a table from an existing DataFrame, you can use the `DataFrameWriterV2` API:
```python
df.writeTo("table_name") \
    .using("delta") \
    .option("clusterBy.auto", "true") \
    .create()
```
This approach allows you to specify the `clusterBy.auto` option directly during the write operation.

Important Notes
- Automatic liquid clustering requires Databricks Runtime 15.4 LTS or higher.
- Ensure your table is managed by Unity Catalog if using automatic clustering.
- For existing tables, clustering does not apply retroactively to old data unless you run `OPTIMIZE FULL`.
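Building on the last note above, the `OPTIMIZE FULL` rewrite can also be issued from PySpark as a plain SQL statement. A minimal sketch ("table_name" is a placeholder):

```python
# Hedged sketch: OPTIMIZE ... FULL rewrites all existing data so that
# clustering applies retroactively, not just to newly written files.
# "table_name" is a placeholder; substitute your actual table.
table_name = "table_name"
optimize_stmt = f"OPTIMIZE {table_name} FULL"

# On a Databricks cluster:
# spark.sql(optimize_stmt)
```

Note that `OPTIMIZE FULL` can be expensive on large tables, since it may rewrite all existing data files.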

