Databricks Community

theron · yesterday

Hi there!

I’d like to use Liquid Clustering in a Spark Streaming process with foreachBatch(upsert). However, I’m not sure of the correct approach.

The Databricks documentation suggests using .clusterBy(key) when writing streaming data. In my case, I'm using foreachBatch with a SQL query that performs a MERGE by a specific key or inserts all records if they don’t match.

Now that the table has been created with Liquid Clustering enabled, what is the right way to set this up? Should I use:

df.writeStream.clusterBy(key).foreachBatch(upsert_method)

Or just:

df.writeStream.foreachBatch(upsert_method)

Also, do I need to run OPTIMIZE FULL frequently, or is it run automatically during the streaming process?

I’m currently using Liquid Clustering with Spark Streaming and the foreachBatch clause, but I'm unsure how exactly to apply clusterBy(key). I saw some references in the documentation, but it's still not clear.

RiyazAli · yesterday

Hi @theron

If you enabled LC when creating the table, you have already specified the clustering column, so I don't see a reason to call .clusterBy(key) again.

Let me know if you have any questions.

Cheers!
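
To make the second form from the question concrete, here is a minimal sketch of a foreachBatch upsert with no clusterBy on the writer, assuming the target Delta table already has its clustering keys defined. The table name `events`, the key `id`, the temp view `updates`, the checkpoint path, and the helpers `build_merge_sql` and `make_upsert` are all hypothetical, and `batch_df.sparkSession` assumes a recent PySpark version (3.3 or later).

```python
def build_merge_sql(target: str, source_view: str, key: str) -> str:
    """Return a MERGE statement that updates matching rows and inserts the rest."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def make_upsert(target: str, key: str, view: str = "updates"):
    """Build a foreachBatch callback bound to one target table and key."""
    def upsert_method(batch_df, batch_id):
        # Expose the micro-batch to SQL under a temporary view name.
        batch_df.createOrReplaceTempView(view)
        # Use the batch's own session so the temp view is visible to the query.
        batch_df.sparkSession.sql(build_merge_sql(target, view, key))
    return upsert_method


# Usage on a cluster (hypothetical names and path), with no clusterBy on the writer:
# (df.writeStream
#    .foreachBatch(make_upsert("events", "id"))
#    .option("checkpointLocation", "/tmp/checkpoints/events")
#    .start())
```

The clustering keys live in the table metadata, so the MERGE inside the callback writes to an already clustered table without repeating them.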

If you want to create a brand new table with LC enabled, or enable LC while writing to that table, use the .clusterBy(key) method.
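
For the case where the table does not have clustering yet, the keys can also be declared in SQL instead of on the writer. The sketch below only builds the statements; the helper names, the `events` table, and its columns are hypothetical, and the statements would be passed to `spark.sql(...)` on a runtime that supports Liquid Clustering.

```python
def create_clustered_table_sql(table: str, columns: dict, cluster_keys: list) -> str:
    """Return a CREATE TABLE statement with a CLUSTER BY clause."""
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    keys = ", ".join(cluster_keys)
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols}) CLUSTER BY ({keys})"


def alter_cluster_keys_sql(table: str, cluster_keys: list) -> str:
    """Return an ALTER TABLE statement that sets or changes clustering keys."""
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(cluster_keys)})"


# Hypothetical example: a new table clustered on id.
ddl = create_clustered_table_sql("events", {"id": "BIGINT", "payload": "STRING"}, ["id"])
```

Once either statement has run, the streaming writer does not need .clusterBy(key).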

OPTIMIZE FULL should only be run when you change the clustering keys or enable clustering for the first time on a Delta table. If you run OPTIMIZE FULL regularly without changing the keys, it behaves the same as running OPTIMIZE.

When predictive optimization is enabled, it automatically runs OPTIMIZE commands on LC-enabled tables. To trigger clustering manually, you can run the OPTIMIZE command explicitly.
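
As a rough sketch of the manual route, the helper below picks between OPTIMIZE and OPTIMIZE FULL depending on whether the clustering keys were just changed or enabled for the first time. The function names and the `keys_changed` flag are hypothetical; `spark` is assumed to be an active SparkSession on a runtime that supports OPTIMIZE FULL.

```python
def optimize_sql(table: str, full: bool = False) -> str:
    """Return an OPTIMIZE statement, adding FULL to recluster all records."""
    return f"OPTIMIZE {table} FULL" if full else f"OPTIMIZE {table}"


def run_clustering(spark, table: str, keys_changed: bool = False) -> None:
    """Trigger clustering manually; use FULL only after a key change or first enablement."""
    spark.sql(optimize_sql(table, full=keys_changed))


# Usage (hypothetical table name):
# run_clustering(spark, "events")                     # routine incremental clustering
# run_clustering(spark, "events", keys_changed=True)  # after ALTER TABLE ... CLUSTER BY
```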

Riz

Liquid Clustering - Implementing with Spark Streaming’s foreachBatch Upsert
