Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Z-ordering optimization with multithreading

yliu
New Contributor III

Hi, 

I am wondering if multithreading will help with the performance for z-ordering optimization on multiple delta tables.

We periodically optimize thousands of tables, and the job easily takes a few days to finish, so we are looking for a way to optimize a number of tables in parallel. Does it make sense to use multithreading here to speed up the process? We did a few rounds of testing in our dev environment, and optimization with multithreading seems to do a better job. But we can't be sure, because the tables in dev are not updated very frequently, so sometimes the optimization does not actually write anything.

Thank you in advance for your help!

Accepted Solutions

Kaniz_Fatma
Community Manager

Hi @yliu, using multithreading can indeed help with the performance of Z-ordering optimization on multiple Delta tables. The OPTIMIZE command has also reportedly been improved to commit batches as soon as possible instead of at the end, and the default number of threads OPTIMIZE runs in parallel has been reduced, which is a strict performance increase for large tables.

However, the effectiveness of multithreading will depend on the specific characteristics of your tables and your computing environment. For example, if your tables are very large and have many columns, multithreading might speed up the process significantly. If your tables are smaller and have fewer columns, the improvement might be less noticeable.

Furthermore, the OPTIMIZE operation now uses Hilbert space-filling curves by default, which provide better clustering characteristics than Z-order in higher dimensions. This approach can speed up read queries by skipping more data than Z-order.
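
For readers new to the term, the idea behind a Z-order key can be sketched in a few lines. This is a generic illustration of bit interleaving for two integer columns, not Delta Lake's implementation; the function name `z_order_key` and the 16-bit width are assumptions made for the example.

```python
def z_order_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one sort key.

    Rows that are close in both x and y tend to get nearby keys, which is
    what lets a reader skip files that cannot contain the requested range.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key


# Sorting a 2x2 grid by key visits the cells in a "Z" shape:
# (0, 0), (1, 0), (0, 1), (1, 1)
cells = sorted(((x, y) for x in range(2) for y in range(2)),
               key=lambda c: z_order_key(*c))
```

A Hilbert curve serves the same purpose with a different traversal order that avoids the long jumps a Z-shaped path makes between quadrants.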

So, based on your testing in a dev environment and the provided information, it does make sense to use multithreading to speed up the Z-ordering optimization process.

However, as the tables in dev are not updated very frequently, it would be advisable to continue monitoring and testing this approach in your production environment to ensure it continues to provide the desired performance improvements.
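
As a rough illustration of what optimizing several tables from a thread pool could look like, here is a minimal sketch. It assumes a callable that executes one SQL statement (on Databricks that would typically be `spark.sql`); the helper name `optimize_tables`, the table names, and the column lists are hypothetical, and the right `max_workers` value has to be found by testing on your own cluster.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def optimize_tables(run_sql, zorder_by_table, max_workers=4):
    """Issue OPTIMIZE ... ZORDER BY for each table from a pool of threads.

    run_sql:          callable that executes one SQL string (assumed: spark.sql).
    zorder_by_table:  dict mapping table name -> list of Z-order columns.
    Returns (succeeded, failed), where failed maps table name -> exception.
    """
    def optimize_one(table):
        cols = ", ".join(zorder_by_table[table])
        run_sql(f"OPTIMIZE {table} ZORDER BY ({cols})")
        return table

    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(optimize_one, t): t for t in zorder_by_table}
        for fut in as_completed(futures):
            table = futures[fut]
            try:
                succeeded.append(fut.result())
            except Exception as exc:
                # Record the failure and keep going with the other tables.
                failed[table] = exc
    return succeeded, failed
```

In a notebook this might be called as `optimize_tables(spark.sql, {"db.events": ["event_date", "user_id"]}, max_workers=8)`. Collecting failures per table, rather than stopping at the first error, lets a long run finish and be retried only for the tables that failed.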

yliu
New Contributor III

Thank you for the detailed explanation and quick response! I will proceed with multithreading then and keep monitoring it. Thanks a lot :))

yliu
New Contributor III

Hi Kaniz, 

A follow-up question on this topic: how does Python multithreading work with Spark? We are trying to understand how to choose the number of threads to maximize performance. And which one is better, multithreading or multiprocessing? Some of our tests show that multiprocessing is faster. Could you shed some light?

Thank you! 

Kaniz_Fatma
Community Manager

Hi @yliu , 

Python's multithreading is constrained by the Global Interpreter Lock (GIL), which matters when you combine it with Spark:
- The GIL allows only one thread to execute Python bytecode at a time in a single process.
- As a result, multithreading in Python does not lead to true parallelism for CPU-bound code.
- Multithreading can still be helpful for I/O-bound or network-bound programs, where threads spend most of their time waiting.
- Multiprocessing can bypass the GIL and achieve true parallelism by using multiple processes.
- Each Python process gets its own interpreter and memory space in multiprocessing.
- Multiprocessing is suitable for CPU-bound and data-intensive tasks.
- The number of tasks for data processing in Spark can be specified while creating the SparkContext.
- The number of tasks can be adjusted based on available resources and the type of operations.
- For heavy computation, having a number of tasks less than or equal to the number of cores is beneficial.
- For I/O-bound or network-bound tasks, having more tasks than cores can improve performance.
- Spark manages resources and task distribution, so structuring Spark computations is more critical than Python-level parallelism.
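
The point about I/O-bound work can be checked with a small timing sketch. It uses `time.sleep` as a stand-in for a driver-side call that waits on remote work; a blocking wait like this releases the GIL, which is typically also the case while a Python thread waits for a submitted Spark statement to finish. The function names and the durations are assumptions chosen for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_call(seconds):
    # Stand-in for a call that mostly waits (the GIL is released during sleep).
    time.sleep(seconds)
    return seconds


def elapsed_with_threads(n_calls, seconds, max_workers):
    """Return the wall-clock time to run n_calls blocking calls on a pool."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(blocking_call, [seconds] * n_calls))
    return time.perf_counter() - start


if __name__ == "__main__":
    # Try a few pool sizes and compare the elapsed times.
    for workers in (1, 2, 4):
        print(workers, round(elapsed_with_threads(4, 0.2, workers), 2))
```

Running the same kind of measurement against a handful of real tables, with increasing `max_workers`, is one practical way to pick a thread count: past the point where the cluster is saturated, more threads stop reducing the total time.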