
Performance Tuning of Databricks Notebook

HariharaSam
Contributor

Hi everyone,

I am trying to run a Databricks notebook in parallel using ThreadPoolExecutor.

Can anyone suggest how to reduce the time taken, based on the findings below?

Current Performance:

Time taken - 25 minutes

ThreadPoolExecutor max_workers - 24

Current cluster configuration:

DBR - 9.1 LTS

Min workers - 2

Max workers - 6

Number of cores - 4 per worker

Memory - 14 GB per worker

Auto Scaling enabled

I tried increasing the number of workers to 18, hoping it would reduce the time taken, but it didn't actually help.

Any thoughts on how to reduce the time?
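For reference, this is roughly the pattern I am running; the notebook path, parameter name, and input values below are placeholders:

    from concurrent.futures import ThreadPoolExecutor

    # Each thread triggers one run of the child notebook with its own parameter.
    # dbutils.notebook.run blocks until that run finishes.
    notebook_path = "/Shared/child_notebook"          # placeholder path
    input_values = ["val1", "val2", "val3"]           # placeholder inputs

    def run_one(value):
        return dbutils.notebook.run(notebook_path, 3600, {"input_value": value})

    with ThreadPoolExecutor(max_workers=24) as pool:
        results = list(pool.map(run_one, input_values))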

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

ThreadPoolExecutor will not help, as Databricks/Spark will still process the work job by job.

So please analyze in the Spark UI what is consuming the most time.

There are a lot of optimization tips; which ones apply depends on the dataset (size, transformations, etc.):

  • Look for data skew: some partitions can be huge and some small because of incorrect partitioning. You can use the Spark UI for that, but also debug your code a bit (for example, call getNumPartitions() on the underlying RDD); SQL/JDBC sources in particular can divide the data into partitions unevenly (the connector has settings such as lowerBound, upperBound, and numPartitions). You could aim for a number of partitions equal to the workers' total cores multiplied by some factor X, so the partitions are processed step by step in the queue (see the partition-inspection and JDBC read sketches after this list),
  • Check for data spills in the Spark UI, as they mean shuffle partitions are being written from RAM to disk and back (the 25th, 50th, and 75th percentiles of the task metrics should be similar). Increase the number of shuffle partitions if they frequently have to be written to disk,
  • Increase the number of shuffle partitions: spark.sql.shuffle.partitions defaults to 200, so try something bigger; you can calculate it as the data size divided by the target partition size (see the shuffle-partition sketch after this list),
  • Increase the size of the driver to be two times bigger than the executors (but to find the optimal size, please analyze the load: in Databricks, the cluster's Metrics tab has Ganglia, or even better, integrate Datadog with the cluster),
  • Check wide transformations, the ones that need to shuffle data between partitions, and group them so that only one shuffle is needed,
  • If you need to filter data, do it right after reading from SQL if possible; the predicate will then be pushed down, adding a WHERE clause to the SQL query (see the JDBC read sketch after this list),
  • Make sure that everything runs in a distributed way, especially UDFs. It would help to use vectorized pandas UDFs so that they run on the executors (see the pandas UDF sketch after this list), but even pandas UDFs can be slow; you can instead try registering the function in the metastore as a SQL function and using it in your SQL queries,
  • Don't use collect() and similar actions, as they do not run in a distributed way but instead load everything into an object in the driver's memory (it is like executing the notebook locally),
  • Regarding infrastructure, use more workers and check that your ADLS is connected via Private Link.
  • You can also use premium ADLS, which is faster.
  • Sometimes I process big data as a stream, as it is easier with big data sets. In that scenario you would need Kafka (Confluent Cloud is excellent) between SQL and Databricks. If you use the stream for one-time processing, please use the AvailableNow trigger instead of Once.
  • When you read data, remember that reading from big, unpartitioned files (around 200 MB each) will be faster. So for a Delta table you can run OPTIMIZE before reading (see the OPTIMIZE sketch after this list). Of course, if you only read, for example, the last X days, partitioning per day will be faster, but running OPTIMIZE can still help.
  • Regarding writing, ideally you get as many files as cores/partitions by default, so every core works on writing one file in parallel; later you can merge them with OPTIMIZE. Please check that every file is of a similar size; if not, you probably have skewed partitions. On a huge dataset it is sometimes good to add a salt number and partition by it to make the partitions equal (you can get the number of cores from the sc.defaultParallelism property; see the salting sketch after this list).
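As a rough illustration of the skew check above (df stands for whatever DataFrame you are processing):

    # How many partitions the DataFrame actually has
    print(df.rdd.getNumPartitions())

    from pyspark.sql.functions import spark_partition_id

    # Row count per partition; very uneven counts indicate skew
    df.groupBy(spark_partition_id().alias("partition_id")).count().show()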
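A sketch of a partitioned JDBC read with a pushed-down filter; the URL, table, and column names are only examples:

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://myserver;databaseName=mydb")
          .option("dbtable", "dbo.sales")
          .option("partitionColumn", "id")    # numeric column to split the read on
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "24")      # e.g. total worker cores * X
          .load())

    # Filtering right after the read lets Spark push the predicate down,
    # so a WHERE clause is added to the generated SQL query.
    df = df.filter("order_date >= '2022-01-01'")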
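For the shuffle-partition setting, a minimal example of the calculation (the numbers are made up):

    # ~100 GB of shuffle data / ~200 MB target partition size ≈ 500 partitions
    spark.conf.set("spark.sql.shuffle.partitions", 500)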
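A minimal vectorized pandas UDF sketch (the column name and logic are placeholders):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Runs on the executors and processes whole batches instead of one row at a time
    @pandas_udf("double")
    def times_two(v: pd.Series) -> pd.Series:
        return v * 2.0

    df = df.withColumn("doubled", times_two("some_numeric_column"))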
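And a sketch of the salting and OPTIMIZE ideas (table and column names are placeholders):

    from pyspark.sql.functions import col, floor, rand

    # Salt a skewed key so rows spread evenly across partitions;
    # sc.defaultParallelism is the total number of cores on the cluster.
    num_buckets = sc.defaultParallelism
    df = df.withColumn("salt", floor(rand() * num_buckets))
    df = df.repartition(num_buckets, col("salt"))

    # Compact a Delta table into larger files before heavy reads
    spark.sql("OPTIMIZE my_database.my_table")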

Hi @Hubert Dudek,

You have mentioned that ThreadPoolExecutor will not help. I want to run the same Databricks notebook for 100 different input values, and running them in sequence takes too long to complete.

So how can I achieve this scenario?

Hubert-Dudek
Esteemed Contributor III

Orchestrate everything as one workload (one job with multiple tasks), and every notebook run will have different parameters (something like the image below). You can create one ***** task that all the others depend on, so that they use the same machine (another setup, where you use a pool of servers and every task uses a different machine, is also possible).

If you really want to have 100 notebooks running in parallel on 1 cluster, you can set a unique scheduler pool for every notebook execution so they have reserved resources (just the name of the pool needs to be different):

sc.setLocalProperty("spark.scheduler.pool", "somename")
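A minimal sketch of one way to do this, assuming the pool name is passed to each notebook run as a parameter (the widget name is made up):

    # At the top of the child notebook: read a unique pool name passed in as a
    # parameter and pin this run's Spark jobs to that fair-scheduler pool.
    pool_name = dbutils.widgets.get("pool_name")   # e.g. "pool_17", unique per run
    sc.setLocalProperty("spark.scheduler.pool", pool_name)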

(image attachment: image.png)
