cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Liquid Clustering - Implementing with Spark Streaming’s foreachBatch Upsert

theron
New Contributor

Hi there!

I’d like to use Liquid Clustering in a Spark Streaming process with foreachBatch(upsert). However, I’m not sure of the correct approach.

The Databricks documentation suggests using .clusterBy(key) when writing streaming data. In my case, I'm using foreachBatch with a SQL query that performs a MERGE by a specific key or inserts all records if they don’t match.

Now that the table has been created with Liquid Clustering enabled, what is the right way to set this up? Should I use:

 

 
df.writeStream.clusterBy(key).foreachBatch(upsert_method)

Or just:

df.writeStream.foreachBatch(upsert_method)

Also, do I need to run OPTIMIZE FULL frequently, or is it run automatically during the streaming process?

I’m currently using Liquid Clustering with Spark Streaming and the foreachBatch clause, but I'm unsure how exactly to apply clusterBy(key). I saw some references in the documentation, but it's still not clear.

1 REPLY 1

RiyazAli
Valued Contributor

Hi @theron 

If you have enabled LC while table creation, you must have already specified the cluster column. Thus I don't see a reason to mention  .clusterBy(key) again.

Let me know if any questions

Cheers!

If you want to create a brand new table with LC enabled or enable LC while writing to that table use the .clusterBy(key) method.

OPTIMIZE FULL should only be ran if you change the cluster keys or enabling clustering for the first time on a delta table. If you run OPTIMIZE FULL regularly, it's same as running OPTIMIZE.

Usually, predictive optimization automatically runs OPTIMIZE commands for all LC enabled tables. To trigger it manually, you can run the OPTIMIZE command explicitly.

Riz

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group