Hi there!
I’d like to use Liquid Clustering in a Spark Structured Streaming process with foreachBatch(upsert). However, I’m not sure of the correct approach.
The Databricks documentation suggests using .clusterBy(key) when writing streaming data. In my case, I'm using foreachBatch with a SQL query that performs a MERGE on a specific key, inserting any records that don't match.
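For reference, my upsert method looks roughly like this sketch. All table, view, and column names (`main.gold.events`, `updates`, `event_id`) are placeholders, not my real schema:

```python
# Hypothetical foreachBatch upsert; table/column names are placeholders.
def build_merge_sql(target_table: str, source_view: str, key: str) -> str:
    """Build the MERGE statement run once per micro-batch:
    update rows whose key matches, insert the rest."""
    return f"""
        MERGE INTO {target_table} AS t
        USING {source_view} AS s
        ON t.{key} = s.{key}
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """


def upsert_method(batch_df, batch_id):
    # Expose the micro-batch DataFrame to SQL as a temp view,
    # then run the MERGE through the batch's own SparkSession.
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql(
        build_merge_sql("main.gold.events", "updates", "event_id")
    )
```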
Now that the table has been created with Liquid Clustering enabled, what is the right way to set this up? Should I use:
df.writeStream.clusterBy(key).foreachBatch(upsert_method)
Or just:
df.writeStream.foreachBatch(upsert_method)
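To make the second option concrete, this is roughly how I'd wire the stream today, with no clusterBy on the writer. The checkpoint path is a placeholder:

```python
def start_upsert_stream(df, upsert_method, checkpoint_path):
    """Wire a streaming DataFrame to a foreachBatch upsert.
    No clusterBy is set on the writer here; the target table's
    clustering keys were declared when the table was created."""
    return (
        df.writeStream
          .foreachBatch(upsert_method)          # MERGE runs per micro-batch
          .option("checkpointLocation", checkpoint_path)
          .start()
    )
```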
Also, do I need to run OPTIMIZE FULL frequently, or is it run automatically during the streaming process?
In short: I'm already running Liquid Clustering with Structured Streaming and foreachBatch, but I'm unsure where clusterBy(key) fits into this setup. I saw some references in the documentation, but it's still not clear to me.