Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to run OPTIMIZE on a real-time Delta table

sriram_kumar
New Contributor II

Hi Team,

We have a few prod tables, created in an S3 bucket, that have now grown very large. These tables receive real-time data continuously from round-the-clock Databricks workflows. We would like to run the optimization commands (OPTIMIZE, ZORDER BY) without stopping or pausing the jobs. Could you please suggest a way to accomplish this?

Thanks in advance !!

4 REPLIES

Sandeep
Contributor III

@Sriram Kumar​ Are they just inserts? If so, you can run OPTIMIZE without conflicting with them: https://docs.databricks.com/optimizations/isolation-level.html#write-conflicts-on-databricks.
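Since appends do not conflict with OPTIMIZE under Delta's default isolation level, compaction for an append-only table can typically run from a separate job while the ingest workflows keep writing. A minimal sketch — the table name `events` and columns `event_time`/`event_date` are placeholders, not from the thread:

```sql
-- Run from a separate job/cluster while streaming inserts continue.
-- `events`, `event_time`, and `event_date` are placeholder names.
OPTIMIZE events
ZORDER BY (event_time);

-- Optionally, bound the work per run by compacting only recent data:
OPTIMIZE events
WHERE event_date >= current_date() - INTERVAL 7 DAYS
ZORDER BY (event_time);
```

Scheduling this as its own periodic workflow keeps compaction cost predictable and avoids touching the ingest jobs at all.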

Vartika
Moderator

Hi @Sriram Kumar​,

Hope all is well!

Just wanted to check in on whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!

Anonymous
Not applicable

@Sriram Kumar​ :

To run optimization commands like OPTIMIZE and ZORDER BY on large tables in an S3 bucket without stopping or pausing the Databricks workflows that continuously update the tables, you can follow these steps:

  1. Create a copy of the table: Since concurrent writes can conflict with optimization commands on a table that is actively being updated, you can create a copy of the table with a different name. This allows you to optimize the copy while the original table continues to receive real-time data.
  2. Set up a process to keep the copy synchronized: You'll need to establish a process to keep the copy table synchronized with the original table. This can be achieved using techniques like incremental data updates or periodic synchronization jobs, depending on your specific requirements and the data update patterns.
  3. Optimize the copy table: Once you have a synchronized copy of the table, you can run the optimization commands (OPTIMIZE and ZORDER BY) on the copy table without affecting the continuous data updates on the original table.
  4. Swap the tables: After optimizing the copy table, you can perform a table swap operation to switch the original table with the optimized copy. This can be done by renaming the tables or updating the table pointers, depending on your specific setup.
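The copy/sync/swap steps above could be sketched with Delta's DEEP CLONE and a rename-based swap. All table and column names below are placeholders, and this assumes managed tables (a path-based S3 table cannot simply be renamed); the clone must be re-synced until the final cutover:

```sql
-- 1. Create a copy of the table (placeholder names throughout).
CREATE OR REPLACE TABLE events_copy DEEP CLONE events;

-- 2. Keep the copy synchronized: re-running DEEP CLONE against an
--    existing clone incrementally copies only the changes since the
--    previous sync, so this can run on a schedule.
CREATE OR REPLACE TABLE events_copy DEEP CLONE events;

-- 3. Optimize the copy without touching the live table.
OPTIMIZE events_copy ZORDER BY (event_time);

-- 4. Swap: during a brief write pause, sync once more, then rename.
ALTER TABLE events RENAME TO events_old;
ALTER TABLE events_copy RENAME TO events;
```

Note that step 4 still requires a short pause or cutover window so no writes land between the final sync and the rename, so this sketch trades continuous availability on the original name for a much shorter interruption than optimizing in place would need if writes conflicted.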

Hope this helps!

Anonymous
Not applicable

Hi @Sriram Kumar​ 

We haven't heard from you since the last response from @Suteja Kanuri​. Kindly share an update with us so we can help you reach a solution.

Thanks and Regards
