cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Lot of write shuffle on optimize + ZORDER, is it normal?

alejandrofm
Valued Contributor

Hi! I'm optimizing several Tb of partitioned data on ZSTD lvl 9.

It surprises me the level of shuffle write, it could make sense because of ZORDER but I want to be sure that I'm not missing something, here is some context:image 

image.pngCould I be missing something or is this the expected behavior?

Thanks!

----Edit----

Also a lot of computation in "locality" = ANY

image

1 ACCEPTED SOLUTION

Accepted Solutions

Anonymous
Not applicable

@Alejandro Martinez​ :

It is normal to see a high level of shuffle write when optimizing partitioned data using ZORDER. ZORDER is an optimization technique that reorders the data within each partition based on the values of a specified column or columns. This helps to improve the performance of certain queries, but it requires the data to be shuffled around during the optimization process.

The amount of shuffle write can depend on factors such as the number of partitions, the size of the data, and the number of columns being used for ZORDER. In general, the more data being shuffled around, the more shuffle write you will see.

Regarding the high level of computation in "locality" = ANY, it means that the tasks are being scheduled on any available worker nodes in the cluster, regardless of their physical location. This can be a good thing as it allows for better utilization of resources, but it may result in higher network traffic and slower performance if the data needs to be transferred across nodes.

Overall, it seems like the behavior you are seeing is expected when optimizing partitioned data using ZORDER. However, if you are experiencing performance issues, you may want to experiment with different optimization techniques or partitioning schemes to see if you can improve the performance of your queries.

View solution in original post

2 REPLIES 2

Anonymous
Not applicable

@Alejandro Martinez​ :

It is normal to see a high level of shuffle write when optimizing partitioned data using ZORDER. ZORDER is an optimization technique that reorders the data within each partition based on the values of a specified column or columns. This helps to improve the performance of certain queries, but it requires the data to be shuffled around during the optimization process.

The amount of shuffle write can depend on factors such as the number of partitions, the size of the data, and the number of columns being used for ZORDER. In general, the more data being shuffled around, the more shuffle write you will see.

Regarding the high level of computation in "locality" = ANY, it means that the tasks are being scheduled on any available worker nodes in the cluster, regardless of their physical location. This can be a good thing as it allows for better utilization of resources, but it may result in higher network traffic and slower performance if the data needs to be transferred across nodes.

Overall, it seems like the behavior you are seeing is expected when optimizing partitioned data using ZORDER. However, if you are experiencing performance issues, you may want to experiment with different optimization techniques or partitioning schemes to see if you can improve the performance of your queries.

Anonymous
Not applicable

Hi @Alejandro Martinez​ 

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.