Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

ros
by New Contributor III
  • 1604 Views
  • 2 replies
  • 2 kudos

merge vs MERGE INTO

From the 10.4 LTS version onward we have low shuffle merge, so merge is faster. But what about the MERGE INTO statement that we run in a SQL notebook in Databricks? Is there any performance difference when we use the Databricks PySpark ".merge" function vs databricks...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Roshan RC, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers you...

  • 2 kudos
1 More Replies
Erik_L
by Contributor II
  • 2284 Views
  • 3 replies
  • 1 kudos

Resolved! How to keep data in time-based localized clusters after joining?

I have a bunch of data frames from different data sources. They are all time series data ordered by a column timestamp, which is an int32 Unix timestamp. I can join them together by this and another column join_idx, which is basically an integer inde...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Erik Louie: If the data frames have different time zones, you can use Databricks' timezone conversion functions to convert them to a common time zone. You can use the from_utc_timestamp or to_utc_timestamp function to convert the timestamp column to ...

  • 1 kudos
2 More Replies
alejandrofm
by Valued Contributor
  • 3158 Views
  • 2 replies
  • 2 kudos

Resolved! Lot of write shuffle on optimize + ZORDER, is it normal?

Hi! I'm optimizing several TB of partitioned data with ZSTD level 9. The amount of shuffle write surprises me; it could make sense because of ZORDER, but I want to be sure that I'm not missing something. Here is some context: Could I be missing something...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Alejandro Martinez, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best an...

  • 2 kudos
1 More Replies
Dean_Lovelace
by New Contributor III
  • 4257 Views
  • 3 replies
  • 0 kudos

Delta Table Optimize Error

I have started getting an error message when running the following optimize command: deltaTable.optimize().executeCompaction() Error: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: Number of records changed after Optimi...

Latest Reply
Anonymous
Not applicable
  • 0 kudos

@Dean Lovelace: The error message suggests that the number of records in the Delta table changed after the optimize() command was run. The optimize() command is used to improve the performance of Delta tables by removing small files and compacting l...

  • 0 kudos
2 More Replies
thushar
by Contributor
  • 8490 Views
  • 5 replies
  • 0 kudos

Optimize & Compaction

Hi, from which Databricks runtime onward are optimize and compaction supported?

Latest Reply
Joe_Suarez
New Contributor III
  • 0 kudos

Optimize and compaction are operations commonly used in Apache Spark for optimizing and improving the performance of data storage and processing. Databricks, which is a cloud-based platform for Apache Spark, provides support for these operations on v...

  • 0 kudos
4 More Replies
MaximS
by New Contributor
  • 1355 Views
  • 1 replies
  • 1 kudos

OPTIMIZE command failed to complete on partitioned dataset

Trying to optimize a Delta table with the following stats: size: 212,848 blobs, 31,162,417,246,985 bytes; command: OPTIMIZE <table> ZORDER BY (X, Y, Z). In the Spark UI I can see all the work divided into batches, and each batch starts with 400 tasks to collect data. But ...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 1 kudos

Can you share some sample datasets for this, so that we can debug and help you accordingly? Thanks, Aviral

  • 1 kudos
Bartek
by Contributor
  • 5678 Views
  • 3 replies
  • 9 kudos

Resolved! Number of partitions in Spark UI Simulator experiment

I am learning how to optimize Spark applications with experiments from Spark UI Simulator. There is experiment #1596 about data skew, and in command 2 there is a comment about how many partitions will be set as default: // Factor of 8 cores and greater ...

Latest Reply
UmaMahesh1
Honored Contributor III
  • 9 kudos

Hi @Bartosz Maciejewski, generally we arrive at the number of shuffle partitions using the following method. Input data size: 100 GB. Ideal partition target size: 128 MB. Cores: 8. Ideal number of partitions = (100*1024)/128 = 800. To utilize the...

  • 9 kudos
2 More Replies
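The partition arithmetic in the reply above can be sketched in a few lines of Python. This is a minimal illustration, not the poster's code: the function name is hypothetical, 1 GB is taken as 1024 MB, and rounding up to a multiple of the core count is an assumption about where the truncated reply was heading.

```python
import math


def ideal_shuffle_partitions(input_size_gb, target_partition_mb=128, cores=8):
    """Estimate a shuffle partition count from input size and a target partition size."""
    input_size_mb = input_size_gb * 1024  # assumes 1 GB = 1024 MB
    raw = math.ceil(input_size_mb / target_partition_mb)
    # Assumption: round up to a multiple of the core count so the
    # final wave of tasks does not leave cores idle.
    return math.ceil(raw / cores) * cores


print(ideal_shuffle_partitions(100))  # 800
```

The result could then be applied with spark.conf.set("spark.sql.shuffle.partitions", n) on a cluster; treat the number as a starting point to validate in the Spark UI rather than a fixed rule.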
cristianc
by Contributor
  • 2436 Views
  • 5 replies
  • 3 kudos

Is it required to run OPTIMIZE after doing GDPR DELETEs?

Greetings, I have been reading the excellent article from https://docs.databricks.com/security/privacy/gdpr-delta.html?_ga=2.130942095.1400636634.1649068106-1416403472.1644480995&_gac=1.24792648.1647880283.CjwKCAjwxOCRBhA8EiwA0X8hi4Jsx2PulVs_FGMBdByBk...

Latest Reply
cristianc
Contributor
  • 3 kudos

@Hubert Dudek, thanks for the hint. Exactly as written in the article, VACUUM is required after the GDPR delete operation. However, do we need to run OPTIMIZE ZORDER on the table again, or is the ordering maintained?

  • 3 kudos
4 More Replies
alejandrofm
by Valued Contributor
  • 2190 Views
  • 3 replies
  • 1 kudos

Resolved! Recommendations to execute OPTIMIZE on tables

Hi, I have Databricks running on AWS, and I'm looking for a way to know when is a good time to run OPTIMIZE on partitioned tables. Taking into account that it's an expensive process, especially on big tables, how could I know if it's a good time to run it ...

Latest Reply
Anonymous
Not applicable
  • 1 kudos

@Alejandro Martinez​ - If Jose's answer resolved your question, would you be happy to mark his answer as best? That helps other members find the answer more quickly.

  • 1 kudos
2 More Replies
Constantine
by Contributor III
  • 3679 Views
  • 1 replies
  • 2 kudos

Resolved! OPTIMIZE throws an error after doing MERGE on the table

I have a table on which I do an upsert, i.e. MERGE INTO table_name ... After which I run OPTIMIZE table_name, which throws an error: java.util.concurrent.ExecutionException: io.delta.exceptions.ConcurrentDeleteReadException: This transaction attempted to read...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 2 kudos

You can try to change the isolation level: https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/isolation-level In a merge it is good to specify all partitions in the merge conditions. It can also happen when the script is running concurrently.

  • 2 kudos
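As a rough illustration of the advice to name the partitions in the merge condition, the sketch below builds a MERGE INTO statement whose ON clause also restricts the target partition column. The helper name, table names, and column names are hypothetical, and the values are interpolated without escaping, so this is only suitable for trusted inputs.

```python
from typing import Sequence


def merge_with_partition_filter(
    target: str,
    source: str,
    key: str,
    partition_col: str,
    partitions: Sequence[str],
) -> str:
    """Build a MERGE INTO statement that pins the touched target partitions."""
    values = ", ".join(f"'{p}'" for p in partitions)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        # The extra predicate narrows the files the merge has to read,
        # which reduces the chance of conflicting with a concurrent job.
        f"ON t.{key} = s.{key} AND t.{partition_col} IN ({values})\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


sql = merge_with_partition_filter(
    "events", "updates", "id", "event_date", ["2023-01-01", "2023-01-02"]
)
# On a cluster the statement could be run with spark.sql(sql).
```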
guruv
by New Contributor III
  • 4406 Views
  • 4 replies
  • 2 kudos

Resolved! delta table autooptimize vs optimize command

Hi, I have several Delta tables on an Azure ADLS Gen2 storage account, running Databricks Runtime 7.3. There are only write/read operations on the Delta tables and no update/delete. As part of the release pipeline, the commands below are executed in a new notebook in...

Latest Reply
-werners-
Esteemed Contributor III
  • 2 kudos

The auto optimize is sufficient, unless you run into performance issues. Then I would trigger an optimize. This will generate files of 1 GB (so larger than the standard size of auto optimize). And of course the Z-Order if necessary. The suggestion to ...

  • 2 kudos
3 More Replies
Constantine
by Contributor III
  • 9028 Views
  • 4 replies
  • 4 kudos

Resolved! How does Spark do lazy evaluation?

For context, I am running Spark on the Databricks platform and using Delta tables (S3). Let's assume we have a table called table_one. I create a view called view_one using the table and then call view_one. Next, I create another view, called view_two based o...

Latest Reply
jose_gonzalez
Databricks Employee
  • 4 kudos

Hi @John Constantine, the following notebook URL will help you to better understand the difference between lazy transformations and actions in Spark. You will be able to compare the physical query plans and better understand what is going on when you e...

  • 4 kudos
3 More Replies
Anonymous
by Not applicable
  • 1983 Views
  • 2 replies
  • 2 kudos

Resolved! OPTIMIZE

I have been testing OPTIMIZE on a huge set of data (about 775 million rows) and getting mixed results. When I tried on a 'string' column, the query returned in 2.5 minutes, and using the same column as 'integer', with the same query, it returned in 9.7 seconds. Pl...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

@Werner Stinckens​  Thanks for your explanation.

  • 2 kudos
1 More Replies
Hubert-Dudek
by Esteemed Contributor III
  • 10270 Views
  • 5 replies
  • 17 kudos

Resolved! Optimize and Vacuum - which is the best order of operations?

Optimize -> Vacuum or Vacuum -> Optimize

Latest Reply
-werners-
Esteemed Contributor III
  • 17 kudos

I optimize first, as Delta Lake knows which files are relevant for the optimize. That way I have my optimized data available faster. Then a vacuum. Seemed logical to me, but I might be wrong. Never actually thought about it.

  • 17 kudos
4 More Replies
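The order described in the reply above (OPTIMIZE first, VACUUM second) can be sketched as a small helper that emits the two statements in that sequence. The helper name and the table and column names are hypothetical; OPTIMIZE ... ZORDER BY and VACUUM ... RETAIN n HOURS are the Delta Lake SQL forms this sketch relies on.

```python
from typing import List, Optional, Sequence


def maintenance_statements(
    table: str,
    zorder_cols: Sequence[str] = (),
    retain_hours: Optional[int] = None,
) -> List[str]:
    """Return the maintenance SQL in the order optimize -> vacuum."""
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += f" ZORDER BY ({', '.join(zorder_cols)})"
    vacuum = f"VACUUM {table}"
    if retain_hours is not None:
        vacuum += f" RETAIN {retain_hours} HOURS"
    # Compaction comes first so queries benefit from the rewritten files
    # before any cleanup of unreferenced files happens.
    return [optimize, vacuum]


statements = maintenance_statements("events", ["user_id"], 168)
# On a cluster: for stmt in statements: spark.sql(stmt)
```

Note that VACUUM only deletes files older than the retention threshold, so the small files replaced by a fresh OPTIMIZE are not removed right away in either ordering.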
irfanaziz
by Contributor II
  • 1694 Views
  • 2 replies
  • 3 kudos

Does anyone know why the optimize does not complete?

I feel there is some issue with a few partitions of the Delta table. The optimize runs fine and completes within a few minutes for other partitions, but for this particular partition the optimize keeps running forever. OPTIMIZE delta.`/mnt/prod-abc/Ini...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@nafri A​ - Thank you for letting us know.

  • 3 kudos
1 More Replies