cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Lot of write shuffle on optimize + ZORDER, is it normal?

alejandrofm
Valued Contributor

Hi! I'm optimizing several Tb of partitioned data on ZSTD lvl 9.

It surprises me the level of shuffle write, it could make sense because of ZORDER but I want to be sure that I'm not missing something, here is some context:image 

image.pngCould I be missing something or is this the expected behavior?

Thanks!

----Edit----

Also a lot of computation in "locality" = ANY

image

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@Alejandro Martinez​ :

It is normal to see a high level of shuffle write when optimizing partitioned data using ZORDER. ZORDER is an optimization technique that reorders the data within each partition based on the values of a specified column or columns. This helps to improve the performance of certain queries, but it requires the data to be shuffled around during the optimization process.

The amount of shuffle write can depend on factors such as the number of partitions, the size of the data, and the number of columns being used for ZORDER. In general, the more data being shuffled around, the more shuffle write you will see.

Regarding the high level of computation in "locality" = ANY, it means that the tasks are being scheduled on any available worker nodes in the cluster, regardless of their physical location. This can be a good thing as it allows for better utilization of resources, but it may result in higher network traffic and slower performance if the data needs to be transferred across nodes.

Overall, it seems like the behavior you are seeing is expected when optimizing partitioned data using ZORDER. However, if you are experiencing performance issues, you may want to experiment with different optimization techniques or partitioning schemes to see if you can improve the performance of your queries.

View solution in original post

2 REPLIES 2

Anonymous
Not applicable

@Alejandro Martinez​ :

It is normal to see a high level of shuffle write when optimizing partitioned data using ZORDER. ZORDER is an optimization technique that reorders the data within each partition based on the values of a specified column or columns. This helps to improve the performance of certain queries, but it requires the data to be shuffled around during the optimization process.

The amount of shuffle write can depend on factors such as the number of partitions, the size of the data, and the number of columns being used for ZORDER. In general, the more data being shuffled around, the more shuffle write you will see.

Regarding the high level of computation in "locality" = ANY, it means that the tasks are being scheduled on any available worker nodes in the cluster, regardless of their physical location. This can be a good thing as it allows for better utilization of resources, but it may result in higher network traffic and slower performance if the data needs to be transferred across nodes.

Overall, it seems like the behavior you are seeing is expected when optimizing partitioned data using ZORDER. However, if you are experiencing performance issues, you may want to experiment with different optimization techniques or partitioning schemes to see if you can improve the performance of your queries.

Anonymous
Not applicable

Hi @Alejandro Martinez​ 

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group