- 558 Views
- 1 replies
- 0 kudos
- 0 kudos
If the data in your table is huge, try to combine OPTIMIZE with WHERE so you only perform OPTIMIZE on a subset of the data rather than all data. see documentation here.
- 0 kudos
If the data in your table is huge, try to combine OPTIMIZE with WHERE so you only perform OPTIMIZE on a subset of the data rather than all data. see documentation here.
I have a table that I need to be continuously streaming into. I know it's best practice to run Optimize on my tables periodically. But if I never stop writing to the table, how and when can I run OPTIMIZE against it?
If the streaming job is making bling appends to the delta table, then it's perfectly fine to run OPTIMIZE query in parallel.However, if the streaming job is performing MERGE or UPDATE then it can conflict with the OPTIMIZE operations. In such cases w...
I have a daily OPTIMIZE job running, however, the number of files in the storage is not going down. Looks like the optimize is not helping to reduce the files.
The files are not physically removed from the Storage by the optimize command. A VACUUM command has to be executed to achieve the same
I have a master delta table that is continuously getting written by a streaming job. I have optimize writes enabled and in addition, I run the OPTIMIZE command every 3 hours. However, I think the downstream streaming jobs which are streaming the data...
This is working as expected. For Delta streaming, the data files created in the first place will be used for streaming. The optimized files are not considered the downstream streaming job. This is the reason it's not recommended to run VACUUM with f...
I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?
From my Data+AI talk on Operating and Supporting Delta lake in production
Its typically a good idea to run optimize aligned with the frequency of updates to the Delta Table. However you also don't want to over do as there's a cost/performance trade-off. Unless there are very frequent updates to the table that can cause sma...
Optimize merges small files into larger ones and can involve shuffling and creation of large in-memory partitions. Thus its recommended to use a memory optimized executor configuration to prevent spilling to disk. IN additional use of autoscaling wil...
If I run OPTIMIZE on a Delta Lake table, will it prevent me from time travelling to a version before OPTIMIZE was run?
No, you will still be able to time travel to versions previous to the OPTIMIZE command. OPTIMIZE is just another transaction like MERGE, UPDATE, etc. Check out these docs to learn more about retention periods and the VACUUM command.
are there any other parameters to consider running OPTIMIZE depending cloud vendor?
The optimize is not dependent on the cloud provider whatsoever. Optimize will produce the same results regardless of the underlying storage. It is idempotent, meaning if it is run twice on the same dataset the the second execution has no effect.
Hi everyone. I am trying to learn the keyword OPTIMIZE from this blog using scala: https://docs.databricks.com/delta/optimizations/optimization-examples.html#delta-lake-on-databricks-optimizations-scala-notebook. But my local spark seems not able t...
Hi Jigao, OPTIMIZE isn't in the open source delta API, so won't run on your local Spark instance - https://docs.delta.io/latest/api/scala/io/delta/tables/index.html?search=optimize