Liquid Clustering - Implementing with Spark Streaming’s foreachBatch Upsert
12-26-2024 07:51 AM - edited 12-26-2024 07:55 AM
Hi there!
I’d like to use Liquid Clustering in a Spark Streaming process with foreachBatch(upsert). However, I’m not sure of the correct approach.
The Databricks documentation suggests using .clusterBy(key) when writing streaming data. In my case, I'm using foreachBatch with a SQL query that performs a MERGE on a specific key, inserting any records that don't match.
Now that the table has been created with Liquid Clustering enabled, what is the right way to set this up? Should I use:
df.writeStream.clusterBy(key).foreachBatch(upsert_method)
Or just:
df.writeStream.foreachBatch(upsert_method)
Also, do I need to run OPTIMIZE FULL frequently, or is it run automatically during the streaming process?
I'm currently using Liquid Clustering with Spark Streaming and foreachBatch, but I'm unsure exactly where clusterBy(key) fits in. I saw some references in the documentation, but it's still not clear.
- Labels: Spark
12-26-2024 07:06 PM
Hi @theron
If you enabled LC when creating the table, you have already specified the clustering columns, so there's no reason to call .clusterBy(key) again.
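A minimal sketch of that pattern: the upsert runs inside foreachBatch and the table's existing clustering columns are honored by the MERGE writes, so no .clusterBy appears in the stream wiring. The table name `events`, key `event_id`, view name `updates`, and checkpoint path are all illustrative, not from this thread.

```python
def build_merge_sql(target: str, source_view: str, key: str) -> str:
    """Build the MERGE statement used inside foreachBatch."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET * "
        f"WHEN NOT MATCHED THEN INSERT *"
    )

def upsert_batch(batch_df, batch_id):
    # Register the micro-batch as a temp view, then MERGE by key.
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql(build_merge_sql("events", "updates", "event_id"))

# Wiring on Databricks -- note there is no .clusterBy here, since the
# target table already carries its clustering columns:
# (df.writeStream
#     .foreachBatch(upsert_batch)
#     .option("checkpointLocation", "/tmp/ckpt/events")  # illustrative path
#     .start())
```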
Let me know if you have any questions.
Cheers!
If you want to create a brand-new table with LC enabled, or to enable LC while writing to that table, use the .clusterBy(key) method.
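For that new-table case, a hedged sketch of the writer setup, assuming a runtime where DataStreamWriter.clusterBy is available; the column name and checkpoint path are illustrative.

```python
def streaming_writer(df, cluster_col: str, checkpoint: str):
    """Return a configured streaming writer; caller then invokes .toTable(...)."""
    return (
        df.writeStream
          .clusterBy(cluster_col)                      # declares LC columns on the new table
          .option("checkpointLocation", checkpoint)    # required for streaming writes
    )

# Usage on Databricks (illustrative names):
# streaming_writer(df, "event_id", "/tmp/ckpt/events").toTable("events")
```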
OPTIMIZE FULL should only be run if you change the clustering keys or are enabling clustering for the first time on a Delta table. If you run OPTIMIZE FULL regularly, it's the same as running OPTIMIZE.
Usually, predictive optimization automatically runs OPTIMIZE for all LC-enabled tables. To trigger it manually, run the OPTIMIZE command explicitly.
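A small sketch of the manual trigger, with the FULL variant reserved for key changes as described above; the table name is illustrative.

```python
def optimize_sql(table: str, full: bool = False) -> str:
    """Build an OPTIMIZE statement; use FULL only after changing cluster keys
    or when enabling clustering for the first time."""
    return f"OPTIMIZE {table}" + (" FULL" if full else "")

# On Databricks (predictive optimization usually makes this unnecessary):
# spark.sql(optimize_sql("events"))              # incremental clustering
# spark.sql(optimize_sql("events", full=True))   # after changing cluster keys
```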

