cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Getting concurrent issue on delta table using liquid clustering

Anand13
New Contributor II

In our project, we are testing liquid clustering using a test table called status_update, where we need to update the status for different market IDs. We are attempting to update the status_update table in parallel using the UPDATE command.

ALTER TABLE status_update CLUSTER BY (mkt_id) spark.sql(f"UPDATE status_update SET status='{status}' WHERE mkt_id={mkt_id}")

However, when running the notebook in parallel for different market IDs, we encounter a concurrency issue.

2 REPLIES 2

Walter_C
Databricks Employee
Databricks Employee

To address concurrency issues and optimize parallel updates in Spark SQL using the UPDATE command on the status_update table, consider the following strategies:

  1. Row-Level Logical Conflict Resolution: Tables with liquid clustering enabled in Databricks Runtime 13.3 and above support row-level concurrency. This minimizes transaction and clustering conflicts during operations like UPDATE, MERGE, and DELETE. For such tables, enabling row tracking can improve performance and reduce write conflicts related to row-level concurrency

  2. Mitigate Concurrent Updates: Concurrent updates can fail due to version mismatches of the Delta table. To reduce this, avoid running UPDATE commands concurrently. Alternatively, consider staging updates in intermediate tables to ensure atomicity, and prevent queries from scanning the target table multiple times during the update operation

  3. Optimize Liquid Clustering: Liquid clustering improves query performance by organizing data according to clustering keys derived from historical query patterns. It provides incremental clustering, which avoids unnecessary write amplification. Scheduling regular OPTIMIZE jobs (e.g., every one or two hours) can further enhance data layout for better performance during concurrent operations

  4. Cluster Configuration and Schema Settings: Use configurations such as:

    • spark.databricks.delta.merge.enableLowShuffle: Enables low-shuffle operations to retain existing data organization while efficiently updating and merging
    • delta.enableDeletionVectors: This configuration aids concurrent updates by enabling deletion vectors and row-level concurrency, reducing write conflicts
  5. Avoid Table Scans: To minimize concurrency issues, reduce the number of table scans during operations such as INSERT or UPDATE. Persist intermediate data separately to take the load off the target table and reduce read/write conflicts.

Anand13
New Contributor II

@Walter_C 

We are using Liquid Clustering as our first strategy. Our Databricks Runtime is 13.3, and we have a table named status_update containing approximately 30 market IDs, each with a single record. In our pipeline, if any market fails, we need to update the status of the table to 'failed'. We are updating the status in parallel, but we are encountering concurrency issues. Does this mean that Liquid Clustering does not work effectively when we use the UPDATE statement on that table?