Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
I see that Delta Lake has an OPTIMIZE command and also table properties for Auto Optimize. What are the differences between these and when should I use one over the other?
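A minimal sketch of the two approaches, assuming a Delta table named events (hypothetical name): Auto Optimize is opted into per table via properties and tunes files as they are written, while OPTIMIZE is a compaction you run (or schedule) yourself and is the only way to apply ZORDER.

```python
# Hypothetical table name "events". Auto Optimize: set once, applies on every write.
spark.sql("""
  ALTER TABLE events SET TBLPROPERTIES (
    'delta.autoOptimize.optimizeWrite' = 'true',  -- coalesce partitions before writing
    'delta.autoOptimize.autoCompact'   = 'true'   -- compact small files after writing
  )
""")

# Manual OPTIMIZE: bin-packs toward larger files and can co-locate data.
spark.sql("OPTIMIZE events ZORDER BY (event_date)")
```

Roughly, auto compaction targets smaller files than a manual OPTIMIZE (which aims at about 1 GB, as noted further down this thread list), so many teams enable the properties and still run OPTIMIZE periodically.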
What about REORG on a Delta table? https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/delta-reorg-table Does it help, or does it make sense to run REORG, then OPTIMIZE, then VACUUM every week? (The docs describe it as: "Reorganize a Delta Lake table by rewriting files to purge ...")
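For a concrete picture, here is a minimal weekly-maintenance sketch, assuming a table named sales (hypothetical); REORG ... APPLY (PURGE) mainly pays off after DROP COLUMN or many soft deletes, so it is optional if neither applies:

```python
# Hypothetical weekly maintenance job for a Delta table named "sales".
spark.sql("REORG TABLE sales APPLY (PURGE)")  # rewrite files to purge soft-deleted data
spark.sql("OPTIMIZE sales")                   # bin-pack small files
spark.sql("VACUUM sales RETAIN 168 HOURS")    # drop unreferenced files past the 7-day retention
```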
From the 10.4 LTS runtime we have low shuffle merge, so MERGE is faster. But what about the MERGE INTO statement that we run in a SQL notebook in Databricks? Is there any performance difference between the Databricks PySpark ".merge" function vs Databricks...
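As far as I know, both surfaces drive the same Delta MERGE execution (including low shuffle merge on DBR 10.4 LTS+), so any difference should be syntax, not performance. A sketch with hypothetical table names target/updates and key column id:

```python
from delta.tables import DeltaTable

# SQL form, as run from a SQL notebook cell.
spark.sql("""
  MERGE INTO target t
  USING updates s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

# Equivalent PySpark form; it resolves to the same Delta MERGE plan.
(DeltaTable.forName(spark, "target").alias("t")
    .merge(spark.table("updates").alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```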
I have a bunch of data frames from different data sources. They are all time series data in order of a column timestamp, which is an int32 Unix timestamp. I can join them together by this and another column join_idx which is basically an integer inde...
@Erik Louie :If the data frames have different time zones, you can use Databricks' timezone conversion function to convert them to a common time zone. You can use the from_utc_timestamp or to_utc_timestampfunction to convert the timestamp column to ...
Hi! I'm optimizing several TB of partitioned data with ZSTD level 9 compression. The amount of shuffle write surprises me; it could make sense because of ZORDER, but I want to be sure that I'm not missing something. Here is some context: Could I be missing something...
I have started getting an error message when running the following optimize command: deltaTable.optimize().executeCompaction()
Error: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Number of records changed after Optimi...
@Dean Lovelace: The error message suggests that the number of records in the Delta table changed after the optimize() command was run. The optimize() command is used to improve the performance of Delta tables by removing small files and compacting l...
Optimize and compaction are operations commonly used in Apache Spark to improve the performance of data storage and processing. Databricks, a cloud-based platform for Apache Spark, provides support for these operations on v...
Trying to optimize a Delta table with the following stats:
size: 212,848 blobs, 31,162,417,246,985 bytes
command: OPTIMIZE <table> ZORDER BY (X, Y, Z)
In the Spark UI I can see all the work divided into batches, and each batch starts with 400 tasks to collect data. But ...
I am learning how to optimize Spark applications with experiments from the Spark UI Simulator. There is experiment #1596 about data skew, and in command 2 there is a comment about how many partitions will be set as default: // Factor of 8 cores and greater ...
Hi @Bartosz Maciejewski, generally we arrive at the number of shuffle partitions using the following method:
Input data size: 100 GB
Ideal partition target size: 128 MB
Cores: 8
Ideal number of partitions = (100 × 1024) / 128 = 800
To utilize the...
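The same arithmetic as a tiny sketch, rounding up to a multiple of the core count, which is what the simulator's "factor of 8 cores" comment is getting at:

```python
import math

def shuffle_partitions(input_gb, target_mb=128, cores=8):
    # Ideal count = input size / target partition size, rounded up to a multiple of cores.
    ideal = (input_gb * 1024) / target_mb      # 100 GB -> 800
    return math.ceil(ideal / cores) * cores

spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions(100))  # sets 800
```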
Greetings, I have been reading the excellent article at https://docs.databricks.com/security/privacy/gdpr-delta.html ...
@Hubert Dudek thanks for the hint. Exactly as written in the article, VACUUM is required after the GDPR delete operation; however, do we need to OPTIMIZE ZORDER the table again, or is the ordering maintained?
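My understanding is that the clustering is not maintained: a GDPR DELETE rewrites every file containing matching rows, and the rewritten files are no longer Z-ordered, so re-running OPTIMIZE ZORDER before VACUUM is the safe pattern. A sketch with hypothetical names:

```python
spark.sql("DELETE FROM customers WHERE customer_id = 42")  # GDPR erasure (hypothetical predicate)
spark.sql("OPTIMIZE customers ZORDER BY (customer_id)")    # re-cluster the rewritten files
spark.sql("VACUUM customers RETAIN 168 HOURS")             # physically remove the old files
```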
Hi, I have Databricks running on AWS, and I'm looking for a way to know when is a good time to run OPTIMIZE on partitioned tables. Taking into account that it's an expensive process, especially on big tables, how could I know if it's a good time to run it ...
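One heuristic (an assumption on my part, not an official rule) is to check the file count and average file size via DESCRIBE DETAIL and only pay for OPTIMIZE when there are many small files:

```python
# Hypothetical table name and thresholds; tune both to your workload.
detail = spark.sql("DESCRIBE DETAIL my_table").first()
avg_file_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)

# Many files averaging far below the ~1 GB OPTIMIZE target -> worth compacting.
if detail["numFiles"] > 1000 and avg_file_mb < 128:
    spark.sql("OPTIMIZE my_table")
```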
I have a table on which I do an upsert, i.e. MERGE INTO table_name ... After which I run OPTIMIZE table_name, which throws an error:
java.util.concurrent.ExecutionException: io.delta.exceptions.ConcurrentDeleteReadException: This transaction attempted to read...
You can try to change the isolation level: https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/isolation-level. In a MERGE it is good to specify all partitions in the merge condition. The error can also happen when the script is run concurrently.
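Both suggestions as a sketch (table, partition column, and values are hypothetical). Note that WriteSerializable is already the default isolation level, so the ALTER only matters if the table was previously set to the stricter Serializable:

```python
# Relax isolation so OPTIMIZE and MERGE conflict less often.
spark.sql("ALTER TABLE events SET TBLPROPERTIES ('delta.isolationLevel' = 'WriteSerializable')")

# Scope the merge to an explicit partition so a concurrent OPTIMIZE on other
# partitions does not touch the same files.
spark.sql("""
  MERGE INTO events t
  USING updates s
  ON t.id = s.id AND t.event_date = '2023-01-15'  -- partition predicate in the condition
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```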
Hi, I have several Delta tables on an Azure ADLS Gen2 storage account, running Databricks Runtime 7.3. There are only write/read operations on the Delta tables, and no updates/deletes. As part of the release pipeline, the commands below are executed in a new notebook in...
Auto optimize is sufficient unless you run into performance issues. Then I would trigger an OPTIMIZE, which will generate files of 1 GB (larger than the standard size of auto optimize), and of course the Z-Order if necessary. The suggestion to ...
For context, I am running Spark on the Databricks platform and using Delta tables (S3). Let's assume we have a table called table_one. I create a view called view_one using the table and then call view_one. Next, I create another view, called view_two, based o...
Hi @John Constantine, the following notebook URL will help you better understand the difference between lazy transformations and actions in Spark. You will be able to compare the physical query plans and understand better what is going on when you e...
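A quick sketch of the point, reusing the names from the question (the filter column x is made up): creating a view stores only the query text, so no job runs until an action, and the physical plan of view_two scans table_one directly.

```python
spark.table("table_one").createOrReplaceTempView("view_one")  # lazy: nothing executes
spark.sql("CREATE OR REPLACE TEMP VIEW view_two AS SELECT * FROM view_one WHERE x > 0")

# Still lazy; explain() prints the physical plan without running the query.
spark.sql("SELECT * FROM view_two").explain()  # plan shows a scan of table_one
```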
I have been testing OPTIMIZE on a huge set of data (about 775 million rows) and getting mixed results. When I tried it on a 'string' column, the query returned in 2.5 minutes; using the same column as 'integer', with the same query, it returned in 9.7 seconds. Pl...
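One way to make that comparison fair is to materialize a typed copy of the column and Z-order the table on it; numeric min/max statistics usually allow tighter file skipping than long strings, which may explain the gap. A sketch with made-up names:

```python
from pyspark.sql import functions as F

# Hypothetical source table "events" with a string column "id_str".
df = spark.table("events").withColumn("id_int", F.col("id_str").cast("int"))
df.write.format("delta").mode("overwrite").saveAsTable("events_typed")

spark.sql("OPTIMIZE events_typed ZORDER BY (id_int)")
# Run the same lookup against the string and the integer column to compare.
spark.sql("SELECT count(*) FROM events_typed WHERE id_int = 12345").show()
```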