cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

To do Optimization on the real time delta table

sriram_kumar
New Contributor II

Hi Team,

We have few prod tables which are created in s3 bucket, that have grown now very large, these tables are getting real time data continuously from round the clock databricks workflows; we would like run the optimization commands(optimize, zorderby), without stopping/pausing the jobs, Could you please suggest if there is a way to accomplish this? 

Thanks in advance !!

4 REPLIES 4

Sandeep
Contributor III

@Sriram Kumarโ€‹ Are they just inserts? Then you can optimize it without affecting it: https://docs.databricks.com/optimizations/isolation-level.html#write-conflicts-on-databricks.

Vartika
Databricks Employee
Databricks Employee

Hi @Sriram Kumarโ€‹,

Hope all is well!

Just wanted to check in if you were able to resolve your issue and would you be happy to share the solution or mark an answer as best? Else please let us know if you need more help. 

We'd love to hear from you.

Thanks!

Anonymous
Not applicable

@Sriram Kumarโ€‹ :

To run optimization commands like OPTIMIZE and ZORDER BY on large tables in an S3 bucket without stopping or pausing the Databricks workflows that continuously update the tables, you can follow these steps:

  1. Create a copy of the table: Since you can't perform optimization commands directly on a table that is actively being updated, you can create a copy of the table with a different name. This will allow you to optimize the copy while the original table continues to receive real-time data.
  2. Set up a process to keep the copy synchronized: You'll need to establish a process to keep the copy table synchronized with the original table. This can be achieved using techniques like incremental data updates or periodic synchronization jobs, depending on your specific requirements and the data update patterns.
  3. Optimize the copy table: Once you have a synchronized copy of the table, you can run the optimization commands (OPTIMIZE and ZORDER BY) on the copy table without affecting the continuous data updates on the original table.
  4. Swap the tables: After optimizing the copy table, you can perform a table swap operation to switch the original table with the optimized copy. This can be done by renaming the tables or updating the table pointers, depending on your specific setup.

Hope this helps!

Anonymous
Not applicable

Hi @Sriram Kumarโ€‹ 

We haven't heard from you since the last response from @Suteja Kanuriโ€‹ โ€‹ . Kindly share the information with us, and in return, we will provide you with the necessary solution.

Thanks and Regards

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group