
Z-ordering optimization with multithreading

yliu
New Contributor III

Hi, 

I am wondering if multithreading will help with the performance for z-ordering optimization on multiple delta tables.

We periodically optimize thousands of tables, and the job easily takes a few days to finish, so we are looking for a way to optimize several tables in parallel. Does it make sense to use multithreading here to speed up the process? We did a few rounds of testing in our dev environment, and optimization with multithreading seems to perform better. But we can't be sure, because the tables in dev are not updated very frequently, so sometimes the optimization does not actually write anything.

Thank you in advance for your help!

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz
Community Manager

Hi @yliu, yes, multithreading can help with the performance of Z-ordering optimization across multiple Delta tables. The OPTIMIZE command itself has also been improved: it now commits batches as soon as possible instead of at the end, and the default number of threads that OPTIMIZE runs in parallel has been reduced, which is a strict performance increase for large tables.

However, how much multithreading helps will depend on the characteristics of your tables and your compute environment. For example, if your tables are very large and have many columns, multithreading might speed up the process significantly. If your tables are smaller and have fewer columns, the improvement might be less noticeable.

Furthermore, OPTIMIZE with ZORDER BY now uses Hilbert space-filling curves by default, which provide better clustering characteristics than Z-order in higher dimensions. This can speed up read queries by letting them skip more data than Z-order would.
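For readers unfamiliar with the term, here is a minimal sketch of what a Z-order (Morton) curve does: it interleaves the bits of several column values into one sort key, so rows that are close in every dimension tend to land close together on disk. This is only an illustration of the general idea for two integer columns. The function name `z_order_key` is made up for this example, and it is not Delta Lake's actual implementation.

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order key.

    Bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


# Sorting points by their key walks the grid in the characteristic "Z" pattern.
points = [(x, y) for y in range(2) for x in range(2)]
ordered = sorted(points, key=lambda p: z_order_key(p[0], p[1]))
```

A Hilbert curve plays the same role but orders the cells so that consecutive keys are always neighbouring cells, which is why it tends to cluster better as the number of columns grows.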

So, based on your testing in a dev environment and the provided information, it does make sense to use multithreading to speed up the Z-ordering optimization process.

However, as the tables in dev are not updated very frequently, it would be advisable to continue monitoring and testing this approach in your production environment to ensure it continues to provide the desired performance improvements.
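As a rough illustration of the approach discussed above, here is a minimal sketch that submits one `OPTIMIZE ... ZORDER BY` statement per table from a thread pool. It assumes you pass in a callable that executes a SQL string (in a Databricks notebook that would typically be `spark.sql`). The function name `optimize_tables`, the table names, and the default of 8 workers are placeholders for this example, not recommendations.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def optimize_tables(run_sql, tables, max_workers=8):
    """Run OPTIMIZE ... ZORDER BY for several tables concurrently.

    run_sql: callable that executes one SQL string (assumed: spark.sql).
    tables:  dict mapping table name -> list of Z-order columns.
    Returns a dict mapping table name -> None on success, or the exception raised.
    """
    def optimize_one(table, columns):
        run_sql(f"OPTIMIZE {table} ZORDER BY ({', '.join(columns)})")

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(optimize_one, table, columns): table
            for table, columns in tables.items()
        }
        # Collect outcomes as they finish so one failing table does not stop the rest.
        for future in as_completed(futures):
            results[futures[future]] = future.exception()
    return results
```

Usage would look like `optimize_tables(spark.sql, {"db.events": ["user_id", "event_date"]}, max_workers=4)`. Each thread mostly waits for the cluster to finish its statement, so a sensible `max_workers` depends on how much cluster capacity is free rather than on the number of driver cores. It is worth measuring a few values on production-like tables.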


4 REPLIES 4

yliu
New Contributor III

Thank you for the detailed explanation and quick response! I will proceed with multithreading then and keep monitoring it. Thanks a lot :))

yliu
New Contributor III

Hi Kaniz, 

A follow-up question on this topic: how does Python multithreading work with Spark? We are trying to understand how to choose the number of threads to maximize performance. And which one is better, multithreading or multiprocessing? Some of our tests show that multiprocessing is faster. Could you shed some light on this?

Thank you! 

Kaniz
Community Manager

Hi @yliu , 

Python's multithreading is constrained by the Global Interpreter Lock (GIL), which matters when you combine it with Spark:
- The GIL allows only one thread to execute Python bytecode at a time within a single process.
- As a result, multithreading in Python does not give true parallelism for CPU-bound Python code.
- Multithreading can still help with I/O-bound or network-bound programs, because a thread that is waiting on I/O releases the GIL.
- Multiprocessing can bypass the GIL and achieve true parallelism by using multiple processes.
- With multiprocessing, each Python process gets its own interpreter and memory space.
- Multiprocessing is suitable for CPU-bound tasks.
- The parallelism Spark uses for data processing can be configured when the SparkContext is created.
- The number of tasks can be adjusted based on the available resources and the type of operations.
- For heavy computation, keeping the number of tasks less than or equal to the number of cores is beneficial.
- For I/O-bound or network-bound tasks, having more tasks than cores can improve performance.
- In short, prefer multiprocessing for data-intensive, CPU-bound Python work and multithreading for I/O-bound or network-bound work.
- Spark manages resources and task distribution across the cluster, so how you structure the Spark computations matters more than Python-level parallelism.
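To make the I/O-bound point in the list above concrete, here is a small sketch that compares running several blocking calls one after another with running them in a thread pool. `time.sleep` stands in for a call that waits on a remote system (for example, a thread waiting for the cluster to finish a command). The function names are made up for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_call(seconds):
    # Stand-in for waiting on a remote system; sleeping releases the GIL,
    # so other threads can run while this one waits.
    time.sleep(seconds)
    return seconds


def run_sequential(n, seconds):
    """Run n blocking calls one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        blocking_call(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    """Run n blocking calls in n threads and return the elapsed time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(blocking_call, [seconds] * n))
    return time.perf_counter() - start
```

Calling `run_sequential(4, 0.2)` and `run_threaded(4, 0.2)` should show the threaded version taking roughly the time of a single call, because the waits overlap. If you replace the sleep with a pure-Python CPU loop, the threaded version should no longer be faster, which is the case where multiprocessing helps.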
