I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?
I am running jobs on Databricks using the Run Submit API with Airflow. I have noticed that, on rare occasions, the same run is executed more than once at the same time. Why?
For Structured Streaming Applications, this would be a nice feature.
Delta Live Tables manages checkpoints for you out of the box, so you don't have to reason about checkpoints at all. I would recommend checking it out!
Borrowed from LinkedIn, here is a SQL query you can use to compare two tables (or DataFrames):
with
hash_src as ( select hash(*) as hash_val from my.source.table ),
hash_tgt as ( select hash(*) as hash_val from my.target.table )
select sum(hash_val) ...
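To make the idea behind the query concrete, here is a minimal Python sketch of the same technique: hash every row, sum the hashes per table, and compare the totals. It does not reproduce the elided part of the query above. The helper names (`row_hash`, `table_fingerprint`, `tables_match`) and the sample rows are hypothetical, and SHA-256 stands in for Spark's `hash` function, so the numeric values will not match what Spark produces.

```python
import hashlib


def row_hash(row):
    """Deterministic 64-bit hash of one row (a tuple of column values)."""
    # repr() keeps types distinguishable (1 vs "1"); the unit separator keeps
    # column boundaries unambiguous.
    payload = "\x1f".join(repr(value) for value in row).encode("utf-8")
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")


def table_fingerprint(rows):
    """Return (row count, sum of row hashes). Independent of row order."""
    count = 0
    total = 0
    for row in rows:
        count += 1
        total += row_hash(row)
    return count, total


def tables_match(source_rows, target_rows):
    """True when both tables have the same row count and the same hash sum."""
    return table_fingerprint(source_rows) == table_fingerprint(target_rows)


# Hypothetical sample data: same rows in a different order, then one changed value.
source = [(1, "alice", 10.5), (2, "bob", None)]
target_same = [(2, "bob", None), (1, "alice", 10.5)]
target_diff = [(1, "alice", 10.5), (2, "bob", 0.0)]

print(tables_match(source, target_same))
print(tables_match(source, target_diff))
```

Because addition is commutative, the fingerprint ignores row order, which is usually what you want when comparing tables. A matching sum is strong evidence rather than proof of equality, since hash collisions are possible, so a row-level diff is still the safer check when the result matters.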
See this chart for a description of how row-level concurrency works. With row-level concurrency, concurrent merge operations still cannot safely modify the same row. Without row-level concurrency, concurrent merge operations cannot safely modify the...
Asynchronous progress tracking is a feature designed for ultra-low-latency use cases. You can read more in the open source SPIP doc here, but the expected gain in time is in the hundreds of milliseconds, which seems insignificant when doing merge ope...