Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Forum Posts

alejandrofm
by Valued Contributor
  • 3209 Views
  • 2 replies
  • 2 kudos

Resolved! Lots of shuffle write on OPTIMIZE + ZORDER, is it normal?

Hi! I'm optimizing several TB of partitioned data with ZSTD level 9. The amount of shuffle write surprises me; it could make sense because of ZORDER, but I want to be sure that I'm not missing something. Here is some context: Could I be missing something...

Latest Reply
Anonymous
Not applicable
  • 2 kudos

Hi @Alejandro Martinez, thank you for posting your question in our community! We are happy to assist you. To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best an...

1 More Reply
User16835756816
by Valued Contributor
  • 2860 Views
  • 3 replies
  • 1 kudos

How can I optimize my data pipeline?

Delta Lake provides optimizations that can help you accelerate your data lake operations. Here's how you can improve query speed by optimizing the layout of data in storage. There are two ways you can optimize your data pipeline: 1) Notebook Optimizat...

Latest Reply
Hubert-Dudek
Esteemed Contributor III
  • 1 kudos

Some tips from me: Look for data skew; some partitions can be huge and some small because of incorrect partitioning. You can use the Spark UI to check that, but also debug your code a bit (getNumPartitions()), especially since SQL can divide it unequally into parti...

2 More Replies
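A minimal sketch of what Hubert's getNumPartitions() tip could look like in a notebook. The table name "events" and the use of spark_partition_id to estimate rows per partition are assumptions for illustration, not from the thread:

    from pyspark.sql.functions import spark_partition_id, count

    # `spark` is the notebook's SparkSession; "events" is a hypothetical Delta table
    df = spark.table("events")

    # How many partitions Spark uses when reading the table
    print(df.rdd.getNumPartitions())

    # Approximate rows per partition; a few huge partitions next to many tiny ones suggests skew
    (df.withColumn("pid", spark_partition_id())
       .groupBy("pid")
       .agg(count("*").alias("rows"))
       .orderBy("rows", ascending=False)
       .show(20))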
MaximS
by New Contributor
  • 1373 Views
  • 1 reply
  • 1 kudos

OPTIMIZE command failed to complete on partitioned dataset

Trying to optimize a Delta table with the following stats: size: 212,848 blobs, 31,162,417,246,985 bytes; command: OPTIMIZE <table> ZORDER BY (X, Y, Z). In the Spark UI I can see all the work divided into batches, and each batch starts with 400 tasks to collect data. But ...

Latest Reply
Aviral-Bhardwaj
Esteemed Contributor III
  • 1 kudos

Can you share some sample datasets for this, so that we can debug and help you accordingly? Thanks, Aviral

NOOR_BASHASHAIK
by Contributor
  • 4011 Views
  • 4 replies
  • 4 kudos

Azure Databricks VM type for OPTIMIZE with ZORDER on a single column

Dears, I was trying to check which Azure Databricks VM type is best suited for executing OPTIMIZE with ZORDER on a single timestamp-valued (but string data type) column for around 5000+ tables in the Delta Lake. I chose Standard_F16s_v2 with 6 workers & 1...

Latest Reply
jose_gonzalez
Databricks Employee
  • 4 kudos

Hi, the Standard_F16s_v2 is a compute-optimized machine type. On the other hand, for Delta OPTIMIZE (both bin-packing and Z-Ordering), we recommend the Standard_DS_v2-series. Also, follow Hubert's recommendations.

3 More Replies
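For context, a rough sketch of how OPTIMIZE with ZORDER might be driven across many tables from a notebook, as described in this thread. The table list and the event_ts column name are hypothetical:

    # Hypothetical list; in practice this could be built from the metastore
    tables = ["sales.orders_2021", "sales.orders_2022"]  # ... up to the ~5000 tables mentioned

    for t in tables:
        # ZORDER on the single timestamp-like (string) column discussed above
        spark.sql(f"OPTIMIZE {t} ZORDER BY (event_ts)")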
AP
by New Contributor III
  • 4025 Views
  • 5 replies
  • 3 kudos

Resolved! AutoOptimize, OPTIMIZE command and VACUUM command: order and production implementation best practices

So Databricks gives us a great toolkit in the form of OPTIMIZE and VACUUM. But in terms of operationalizing them, I am really confused about the best practice. Should we enable "optimized writes" by setting the following at a workspace level? spark.conf.set...

Latest Reply
Anonymous
Not applicable
  • 3 kudos

@AKSHAY PALLERLA Just checking in to see if you got a solution to the issue you shared above. Let us know! Thanks to @Werner Stinckens for jumping in, as always!

4 More Replies
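A hedged sketch of one common ordering discussed in this thread: enable optimized writes for newly written data, compact with OPTIMIZE, and only then VACUUM. The session-level config keys, retention value, and table name below are assumptions to verify against your runtime's documentation:

    # Session-level settings (these can also be set as table properties or at workspace level)
    spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
    spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

    # ... ETL writes to the table happen here ...

    spark.sql("OPTIMIZE my_db.my_table")                 # compact small files first
    spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS")  # then remove files no longer referenced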
PJ
by New Contributor III
  • 2067 Views
  • 3 replies
  • 3 kudos

Resolved! How should you optimize <1GB delta tables?

I have seen the following documentation that details how you can work with the OPTIMIZE function to improve storage and querying efficiency. However, most of the documentation focuses on big data, 10 GB or larger. I am working with a ~7 million row ...

Latest Reply
PJ
New Contributor III
  • 3 kudos

Thank you @Hubert Dudek!! So I gather from your response that it's totally fine to have a Delta table that lives in one file that's roughly 211 MB. And I can use OPTIMIZE in conjunction with ZORDER to filter on a frequently filtered, high cardina...

2 More Replies
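A small sketch of how one might confirm that such a table really is backed by a single file and then compact it; the table and column names are made up for illustration:

    # DESCRIBE DETAIL reports file count and size for a Delta table
    detail = spark.sql("DESCRIBE DETAIL small_table").select("numFiles", "sizeInBytes").first()
    print(detail)

    # For a ~211 MB table, one OPTIMIZE with ZORDER on the frequently filtered,
    # high-cardinality column is usually all that's needed
    spark.sql("OPTIMIZE small_table ZORDER BY (customer_id)")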
brickster_2018
by Databricks Employee
  • 840 Views
  • 1 reply
  • 0 kudos
Latest Reply
amr
Databricks Employee
  • 0 kudos

If the data in your table is huge, try combining OPTIMIZE with a WHERE clause so you only run OPTIMIZE on a subset of the data rather than all of it. See the documentation here.

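A sketch of amr's suggestion, assuming a table partitioned by a date column (OPTIMIZE's WHERE clause can only filter on partition columns); the table and column names are hypothetical:

    # Compact only recent partitions instead of the whole multi-TB table
    spark.sql("""
        OPTIMIZE my_db.big_table
        WHERE date >= '2023-01-01'
        ZORDER BY (user_id)
    """)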
User16826992666
by Valued Contributor
  • 2204 Views
  • 1 reply
  • 0 kudos

Resolved! How often should I run OPTIMIZE on my Delta Tables?

I know it's important to periodically run Optimize on my Delta tables, but how often should I be doing this? Am I supposed to do this after every time I load data?

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

It would depend on how frequently you update the table and how often you read it. If you have a daily ETL job updating a delta table, it might make sense to run OPTIMIZE at the end of it so that subsequent reads would benefit from the performance imp...

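A sketch of what running OPTIMIZE at the end of a daily ETL job could look like; the source path, table, and column names are made up for illustration:

    def daily_etl(spark, run_date):
        # Load and append the day's data (source format assumed to be JSON)
        new_rows = spark.read.format("json").load(f"/raw/events/{run_date}")
        new_rows.write.format("delta").mode("append").saveAsTable("analytics.events")

        # Compact the newly written small files so subsequent reads benefit immediately
        spark.sql("OPTIMIZE analytics.events ZORDER BY (event_ts)")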
User16790091296
by Contributor II
  • 1228 Views
  • 1 reply
  • 0 kudos

What’s the best instance type to run OPTIMIZE (bin-packing and Z-Ordering) on?

I've been doing some research on optimizing data storage while implementing delta, however, I'm not sure which instance type would be best for this.

Latest Reply
sajith_appukutt
Honored Contributor II
  • 0 kudos

OPTIMIZE, as you alluded, has two operations: bin-packing and multi-dimensional clustering (ZORDER). Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect. Z-Ordering is not idempotent b...

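A small sketch of the idempotence point, assuming a table named my_table, a made-up column user_id, and the metrics struct that OPTIMIZE returns on Databricks:

    # Bin-packing only (no ZORDER): a second run finds nothing left to compact
    first = spark.sql("OPTIMIZE my_table").select("metrics.numFilesAdded").first()
    second = spark.sql("OPTIMIZE my_table").select("metrics.numFilesAdded").first()
    print(first, second)  # the second run typically reports 0 files added

    # With ZORDER the data is re-clustered each time, so repeated runs do rewrite files
    spark.sql("OPTIMIZE my_table ZORDER BY (user_id)")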