04-11-2025 02:39 AM - edited 04-11-2025 02:40 AM
I can find documentation for enabling automatic liquid clustering with SQL code: CLUSTER BY AUTO. But how do I do this with PySpark? I know I can do it with spark.sql("ALTER TABLE table_name CLUSTER BY AUTO"), but ideally I want to pass it as an .option().
Thanks in advance.
04-11-2025 09:24 AM
PySpark's `DataFrameWriter` API does not currently expose a `.clusterBy("AUTO")` method, so you cannot enable automatic liquid clustering directly as an `.option()` there. However, there are workarounds:
1. Using SQL via `spark.sql()`
The simplest way to enable automatic liquid clustering is by executing an SQL statement:
```python
spark.sql("ALTER TABLE table_name CLUSTER BY AUTO")
```
This enables automatic liquid clustering on an existing Delta table.
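The same SQL syntax also works at table creation time. A minimal sketch of building that DDL from PySpark (the catalog, schema, and column names are hypothetical; run the `spark.sql()` call on a Databricks cluster with DBR 15.4 LTS or higher):

```python
# Hypothetical Unity Catalog table; CLUSTER BY AUTO is Databricks SQL syntax.
create_sql = (
    "CREATE TABLE main.sales.events (id BIGINT, ts TIMESTAMP) "
    "CLUSTER BY AUTO"
)
# On a Databricks cluster: spark.sql(create_sql)
```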
2. Using the DeltaTableBuilder API
If you're creating a new table programmatically, you can use the DeltaTableBuilder API in PySpark to specify clustering options:
```python
from delta.tables import DeltaTable

DeltaTable.create(spark) \
    .tableName("table_name") \
    .addColumn("col1", "STRING") \
    .addColumn("col2", "INT") \
    .property("delta.autoOptimize.optimizeWrite", "true") \
    .property("delta.autoOptimize.autoCompact", "true") \
    .property("delta.clusterBy.auto", "true") \
    .execute()
```
Here, `.property("delta.clusterBy.auto", "true")` ensures that automatic liquid clustering is enabled.
3. Using `DataFrameWriterV2` for Table Creation
If you're creating a table from an existing DataFrame, you can use the `DataFrameWriterV2` API:
```python
df.writeTo("table_name") \
    .using("delta") \
    .option("clusterBy.auto", "true") \
    .create()
```
This approach allows you to specify the `clusterBy.auto` option directly during the write operation.
Important Notes
- Automatic liquid clustering requires Databricks Runtime 15.4 LTS or higher.
- Ensure your table is managed by Unity Catalog if using automatic clustering.
- For existing tables, clustering does not apply retroactively to old data unless you run `OPTIMIZE FULL`.
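The `OPTIMIZE FULL` command from the last note can likewise be issued through `spark.sql()`. A minimal sketch (the table name is hypothetical; run the call on a Databricks cluster):

```python
# OPTIMIZE ... FULL rewrites all records so existing data is reclustered,
# not just files written after clustering was enabled.
optimize_sql = "OPTIMIZE main.sales.events FULL"
# On a Databricks cluster: spark.sql(optimize_sql)
```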
2 weeks ago
This is now supported on DBR 16.4+ for both the DataFrameWriterV1 and DataFrameWriterV2 APIs, as well as for DLT and the streaming write APIs. More details are here: https://docs.databricks.com/aws/en/delta/clustering . Basically, use the option `.option("clusterBy.auto", "true")`.
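A minimal PySpark sketch of that option on both writer APIs, per the linked docs (the function and table names are illustrative; this assumes DBR 16.4+ and a Unity Catalog managed table):

```python
def write_v1(df, table_name):
    # DataFrameWriter (V1): enable automatic liquid clustering at write time.
    df.write.format("delta") \
        .option("clusterBy.auto", "true") \
        .saveAsTable(table_name)

def write_v2(df, table_name):
    # DataFrameWriterV2: the same option on the writeTo() path.
    df.writeTo(table_name) \
        .using("delta") \
        .option("clusterBy.auto", "true") \
        .createOrReplace()
```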
04-25-2025 05:35 AM
How about if I use
04-25-2025 08:28 AM
Not at the moment. You have to use the SQL DDL commands, either at table creation or via an ALTER TABLE command. Hope this helps, Louis.
2 weeks ago
This is supported now for DBR 16.4+ for both DataframeWriterV1 and DataframeWriterV2 APIs, and also for DLT, and DataStreaming APIs. More details are here: https://docs.databricks.com/aws/en/delta/clustering . Basically using the option, `.option("clusterBy.auto", "true")`