Liquid Clustering - Implementing with Spark Streaming’s foreachBatch Upsert

theron — Thu, 26 Dec 2024 15:55:31 GMT

Hi there!

I’d like to use Liquid Clustering in a Spark Streaming process with foreachBatch(upsert). However, I’m not sure of the correct approach.

The Databricks documentation suggests using .clusterBy(key) when writing streaming data. In my case, I'm using foreachBatch with a SQL query that performs a MERGE by a specific key or inserts all records if they don’t match.

Now that the table has been created with Liquid Clustering enabled, what is the right way to set this up? Should I use:

df.writeStream.clusterBy(key).foreachBatch(upsert_method)

Or just:

df.writeStream.foreachBatch(upsert_method)

Also, do I need to run OPTIMIZE FULL frequently, or is it run automatically during the streaming process?

I’m currently using Liquid Clustering with Spark Streaming and the foreachBatch clause, but I'm unsure how exactly to apply clusterBy(key). I saw some references in the documentation, but it's still not clear.

Re: Liquid Clustering - Implementing with Spark Streaming’s foreachBatch Upsert

RiyazAliM — Fri, 27 Dec 2024 03:06:35 GMT

Hi @theron

If you enabled LC when you created the table, you have already specified the clustering column, so I don't see a reason to call .clusterBy(key) again. The second form, df.writeStream.foreachBatch(upsert_method), is enough.
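A rough sketch of that second form, in case it helps. The table name, key column, temp view name and checkpoint path are placeholders I made up, and it assumes a PySpark version where the micro-batch DataFrame exposes sparkSession (3.3 or later):

```python
def build_merge_sql(target_table, source_view, key):
    """Return a MERGE that updates matching rows and inserts the rest."""
    return (
        f"MERGE INTO {target_table} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def make_upsert(target_table, key, source_view="updates"):
    """Build the function handed to foreachBatch (names are hypothetical)."""
    def upsert_method(batch_df, batch_id):
        # Expose the micro-batch to SQL under a temp view name.
        batch_df.createOrReplaceTempView(source_view)
        # Use the micro-batch's own session so the temp view is visible.
        batch_df.sparkSession.sql(
            build_merge_sql(target_table, source_view, key)
        )
    return upsert_method


def start_stream(df, target_table, key, checkpoint_path):
    # No .clusterBy(key) here: the clustering keys are already stored
    # in the definition of the existing target table.
    return (
        df.writeStream
        .foreachBatch(make_upsert(target_table, key))
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

You would call it as start_stream(df, "main.sales.orders", "order_id", "/tmp/checkpoints/orders"), with your own table, key and checkpoint location.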

Let me know if you have any questions.

Cheers!

If you want to create a brand new table with LC enabled, or enable LC while writing to that table, use the .clusterBy(key) method.
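For reference, a minimal sketch of both ways to end up with an LC table. The table name, column list and key are hypothetical, and the streaming .clusterBy call assumes a runtime recent enough to support it on writeStream, so please check your Databricks Runtime version first:

```python
def build_create_table_sql(table, columns_ddl, cluster_key):
    """DDL for a new Delta table with liquid clustering on one key."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({columns_ddl}) "
        f"USING DELTA CLUSTER BY ({cluster_key})"
    )


def build_alter_cluster_sql(table, cluster_key):
    """DDL to set or change the clustering key of an existing table."""
    return f"ALTER TABLE {table} CLUSTER BY ({cluster_key})"


def stream_into_new_clustered_table(df, table, cluster_key, checkpoint_path):
    # Plain streaming append (no foreachBatch). clusterBy only matters
    # here because the write itself creates the table.
    return (
        df.writeStream
        .clusterBy(cluster_key)
        .option("checkpointLocation", checkpoint_path)
        .toTable(table)
    )
```

With foreachBatch and MERGE the target table has to exist already, so in that setup the CREATE TABLE statement is the place where the key gets defined, for example spark.sql(build_create_table_sql("main.sales.orders", "order_id BIGINT, amount DOUBLE", "order_id")).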

OPTIMIZE FULL should only be run when you change the clustering keys or enable clustering for the first time on a Delta table. If the keys have not changed since the last full run, OPTIMIZE FULL does the same work as a plain OPTIMIZE, so there is no need to run it regularly.

Usually, predictive optimization automatically runs OPTIMIZE for all LC-enabled tables. If it is not enabled for your table, or you want to trigger clustering manually, run the OPTIMIZE command explicitly.
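If you do end up triggering it yourself, something like the sketch below could be run from a separate scheduled job rather than inside the stream. The helper names and the table name are made up for illustration, and OPTIMIZE ... FULL assumes a runtime that supports that keyword:

```python
def build_optimize_sql(table, full=False):
    """Plain OPTIMIZE for routine runs, FULL after a clustering key change."""
    return f"OPTIMIZE {table} FULL" if full else f"OPTIMIZE {table}"


def run_optimize(spark, table, full=False):
    # `spark` is the active SparkSession, passed in by the caller.
    return spark.sql(build_optimize_sql(table, full))
```

For example, run_optimize(spark, "main.sales.orders") for the regular runs, and run_optimize(spark, "main.sales.orders", full=True) once after changing the keys.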
