Databricks Community

sriram_kumar · ‎06-06-2023

Hi Team,

We have few prod tables which are created in s3 bucket, that have grown now very large, these tables are getting real time data continuously from round the clock databricks workflows; we would like run the optimization commands(optimize, zorderby), without stopping/pausing the jobs, Could you please suggest if there is a way to accomplish this?

Thanks in advance !!

Sandeep · ‎06-07-2023

@Sriram Kumar Are they just inserts? Then you can optimize it without affecting it: https://docs.databricks.com/optimizations/isolation-level.html#write-conflicts-on-databricks.

Vartika · ‎06-09-2023

Hi @Sriram Kumar,

Hope all is well!

Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help.

We'd love to hear from you.

Thanks!

Anonymous · ‎06-13-2023

@Sriram Kumar :

To run optimization commands like OPTIMIZE and ZORDER BY on large tables in an S3 bucket without stopping or pausing the Databricks workflows that continuously update the tables, you can follow these steps:

Create a copy of the table: Since you can't perform optimization commands directly on a table that is actively being updated, you can create a copy of the table with a different name. This will allow you to optimize the copy while the original table continues to receive real-time data.
Set up a process to keep the copy synchronized: You'll need to establish a process to keep the copy table synchronized with the original table. This can be achieved using techniques like incremental data updates or periodic synchronization jobs, depending on your specific requirements and the data update patterns.
Optimize the copy table: Once you have a synchronized copy of the table, you can run the optimization commands (OPTIMIZE and ZORDER BY) on the copy table without affecting the continuous data updates on the original table.
Swap the tables: After optimizing the copy table, you can perform a table swap operation to switch the original table with the optimized copy. This can be done by renaming the tables or updating the table pointers, depending on your specific setup.

Hope this helps!

Anonymous · ‎06-14-2023

Hi @Sriram Kumar

We haven't heard from you since the last response from @Suteja Kanuri . Kindly share the information with us, and in return, we will provide you with the necessary solution.

Thanks and Regards