topic Re: Getting concurrent issue on delta table using liquid clustering in Data Engineering

Getting concurrent issue on delta table using liquid clustering

Anand13 — Mon, 02 Jun 2025 12:12:32 GMT

In our project, we are testing liquid clustering using a test table called status_update, where we need to update the status for different market IDs. We are attempting to update the status_update table in parallel using the UPDATE command.

ALTER TABLE status_update CLUSTER BY (mkt_id) spark.sql(f"UPDATE status_update SET status='{status}' WHERE mkt_id={mkt_id}")

However, when running the notebook in parallel for different market IDs, we encounter a concurrency issue.

Re: Getting concurrent issue on delta table using liquid clustering

Walter_C — Mon, 02 Jun 2025 16:38:08 GMT

To address concurrency issues and optimize parallel updates in Spark SQL using the UPDATE command on the status_update table, consider the following strategies:

Row-Level Logical Conflict Resolution: Tables with liquid clustering enabled in Databricks Runtime 13.3 and above support row-level concurrency. This minimizes transaction and clustering conflicts during operations like UPDATE, MERGE, and DELETE. For such tables, enabling row tracking can improve performance and reduce write conflicts related to row-level concurrency
Mitigate Concurrent Updates: Concurrent updates can fail due to version mismatches of the Delta table. To reduce this, avoid running UPDATE commands concurrently. Alternatively, consider staging updates in intermediate tables to ensure atomicity, and prevent queries from scanning the target table multiple times during the update operation
Optimize Liquid Clustering: Liquid clustering improves query performance by organizing data according to clustering keys derived from historical query patterns. It provides incremental clustering, which avoids unnecessary write amplification. Scheduling regular OPTIMIZE jobs (e.g., every one or two hours) can further enhance data layout for better performance during concurrent operations
Cluster Configuration and Schema Settings: Use configurations such as:
- spark.databricks.delta.merge.enableLowShuffle: Enables low-shuffle operations to retain existing data organization while efficiently updating and merging
- delta.enableDeletionVectors: This configuration aids concurrent updates by enabling deletion vectors and row-level concurrency, reducing write conflicts
Avoid Table Scans: To minimize concurrency issues, reduce the number of table scans during operations such as INSERT or UPDATE. Persist intermediate data separately to take the load off the target table and reduce read/write conflicts.

Re: Getting concurrent issue on delta table using liquid clustering

Anand13 — Wed, 11 Jun 2025 11:02:10 GMT

@Walter_C

We are using Liquid Clustering as our first strategy. Our Databricks Runtime is 13.3, and we have a table named status_update containing approximately 30 market IDs, each with a single record. In our pipeline, if any market fails, we need to update the status of the table to 'failed'. We are updating the status in parallel, but we are encountering concurrency issues. Does this mean that Liquid Clustering does not work effectively when we use the UPDATE statement on that table?