Databricks Community

alejandrofm · ‎04-20-2023

Hi! I'm optimizing several TB of partitioned data compressed with ZSTD level 9.

The amount of shuffle write surprises me. It could make sense because of ZORDER, but I want to be sure that I'm not missing something. Here is some context:

Could I be missing something or is this the expected behavior?

Thanks!

----Edit----

Also, a lot of the computation runs with "locality" = ANY.

Anonymous · ‎04-20-2023

@Alejandro Martinez:

It is normal to see a high level of shuffle write when running OPTIMIZE with ZORDER on partitioned data. ZORDER rewrites the data files within each partition so that rows with similar values in the specified column or columns end up in the same files. This improves data skipping for queries that filter on those columns, but it requires the data to be shuffled during the optimization.
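To make the reason for the shuffle more concrete, here is a minimal Python sketch of the general Z-order (Morton) curve idea: the bits of two column values are interleaved into a single sort key. This illustrates the technique only and is not Delta Lake's actual implementation; the function name, the column meanings, and the sample rows are hypothetical.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x goes to even position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y goes to odd position 2i + 1
    return z


# Hypothetical rows (customer_id, day), in the order they currently sit in files.
rows = [(3, 0), (0, 1), (2, 2), (1, 3), (0, 0), (3, 3), (1, 0), (2, 1)]

# Sorting by the interleaved key puts rows that are close in BOTH columns
# next to each other, which is what makes file-level data skipping effective.
z_sorted = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

# Count how many rows end up at a different position than before.
moved = sum(1 for before, after in zip(rows, z_sorted) if before != after)
print(z_sorted)
print(f"{moved} of {len(rows)} rows changed position")
```

Because the new order depends on the values of all ZORDER columns at once, almost any row can land in a different output file than the one it started in. On a cluster, that global reordering is what shows up as shuffle write.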

The amount of shuffle write depends on factors such as the number of partitions, the size of the data, and the number of columns used for ZORDER. In general, the more data is rewritten, the more shuffle write you will see.
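Since the shuffle grows with the amount of data being rewritten, one option worth trying is to run OPTIMIZE on a subset of partitions at a time by adding a WHERE clause on the partition column. The sketch below only builds the SQL string; the helper name, the table `events`, and the columns `event_date` and `customer_id` are hypothetical, and it assumes `event_date` is a partition column.

```python
from typing import Optional, Sequence


def build_optimize_sql(
    table: str,
    zorder_cols: Sequence[str],
    partition_filter: Optional[str] = None,
) -> str:
    """Build an OPTIMIZE ... ZORDER BY statement, optionally limited to some partitions."""
    if not zorder_cols:
        raise ValueError("at least one ZORDER column is required")
    sql = f"OPTIMIZE {table}"
    if partition_filter:
        # The predicate should reference partition columns only.
        sql += f" WHERE {partition_filter}"
    sql += f" ZORDER BY ({', '.join(zorder_cols)})"
    return sql


# Hypothetical table and column names, for illustration only.
stmt = build_optimize_sql("events", ["customer_id"], "event_date >= '2023-04-01'")
print(stmt)
# On a cluster you would then run: spark.sql(stmt)
```

Processing a few partitions per run does not reduce the total work, but it keeps each job's shuffle smaller and makes it easier to see how shuffle write scales with the data size.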

Regarding the high level of computation with "locality" = ANY: it means those tasks were scheduled on whichever worker nodes were available, regardless of where their input data is located. This allows better utilization of resources, but it can mean more network traffic and slower tasks when the data has to be transferred across nodes.

Overall, the behavior you are seeing looks expected when optimizing partitioned data with ZORDER. However, if you are experiencing performance issues, you may want to experiment with different optimization techniques or partitioning schemes to see whether they improve the performance of your queries.
