
Performance Tuning of Databricks Notebook

HariharaSam
Contributor

Hi everyone,

I am trying to run a Databricks notebook in parallel using ThreadPoolExecutor.

Can anyone suggest how to reduce the time taken, based on the findings below?

Current Performance:

Time taken - 25 minutes

ThreadPoolExecutor max_workers - 24

Current cluster configuration:

DBR - 9.1 LTS

Min workers - 2

Max workers - 6

Number of cores - 4 per worker

Memory - 14 GB per worker

Auto Scaling enabled

I tried increasing the number of workers to 18, hoping it would reduce the time taken, but it didn't actually help.

Any thoughts on how to reduce the time?
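For reference, this is roughly the pattern I am running; the notebook path, parameter name, and input values below are placeholders:

    from concurrent.futures import ThreadPoolExecutor

    # Each thread triggers one run of the child notebook with its own parameter.
    # dbutils.notebook.run blocks until that run finishes.
    notebook_path = "/Shared/child_notebook"          # placeholder path
    input_values = ["val1", "val2", "val3"]           # placeholder inputs

    def run_one(value):
        return dbutils.notebook.run(notebook_path, 3600, {"input_value": value})

    with ThreadPoolExecutor(max_workers=24) as pool:
        results = list(pool.map(run_one, input_values))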

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

ThreadPoolExecutor will not help, as Databricks/Spark will still process the work job by job.

So please analyze in the Spark UI what is consuming the most time.

There are a lot of optimization tips; which ones apply depends on the dataset (size, transformations, etc.):

  • Look for data skew: some partitions can be huge and some small because of incorrect partitioning. You can use the Spark UI for that, but also debug your code a bit (for example, call getNumPartitions() on the underlying RDD); SQL/JDBC sources in particular can divide the data into partitions unevenly (the connector has settings such as lowerBound, upperBound, and numPartitions). You could aim for a number of partitions equal to the workers' total cores multiplied by some factor X, so the partitions are processed step by step in the queue (see the partition-inspection and JDBC read sketches after this list),
  • Check for data spills in the Spark UI, as they mean shuffle partitions are being written from RAM to disk and back (the 25th, 50th, and 75th percentiles of the task metrics should be similar). Increase the number of shuffle partitions if they frequently have to be written to disk,
  • Increase the number of shuffle partitions: spark.sql.shuffle.partitions defaults to 200, so try something bigger; you can calculate it as the data size divided by the target partition size (see the shuffle-partition sketch after this list),
  • Increase the size of the driver to be two times bigger than the executors (but to find the optimal size, please analyze the load: in Databricks, the cluster's Metrics tab has Ganglia, or even better, integrate Datadog with the cluster),
  • Check wide transformations, the ones that need to shuffle data between partitions, and group them so that only one shuffle is needed,
  • If you need to filter data, do it right after reading from SQL if possible; the predicate will then be pushed down, adding a WHERE clause to the SQL query (see the JDBC read sketch after this list),
  • Make sure that everything runs in a distributed way, especially UDFs. It would help to use vectorized pandas UDFs so that they run on the executors (see the pandas UDF sketch after this list), but even pandas UDFs can be slow; you can instead try registering the function in the metastore as a SQL function and using it in your SQL queries,
  • Don't use collect() and similar actions, as they do not run in a distributed way but instead load everything into an object in the driver's memory (it is like executing the notebook locally),
  • Regarding infrastructure, use more workers and check that your ADLS is connected via Private Link.
  • You can also use premium ADLS, which is faster.
  • Sometimes I process big data as a stream, as it is easier with big data sets. In that scenario you would need Kafka (Confluent Cloud is excellent) between SQL and Databricks. If you use the stream for one-time processing, please use the AvailableNow trigger instead of Once.
  • When you read data, remember that reading from big, unpartitioned files (around 200 MB each) will be faster. So for a Delta table you can run OPTIMIZE before reading (see the OPTIMIZE sketch after this list). Of course, if you only read, for example, the last X days, partitioning per day will be faster, but running OPTIMIZE can still help.
  • Regarding writing, ideally you get as many files as cores/partitions by default, so every core works on writing one file in parallel; later you can merge them with OPTIMIZE. Please check that every file is of a similar size; if not, you probably have skewed partitions. On a huge dataset it is sometimes good to add a salt number and partition by it to make the partitions equal (you can get the number of cores from the sc.defaultParallelism property; see the salting sketch after this list).
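As a rough illustration of the skew check above (df stands for whatever DataFrame you are processing):

    # How many partitions the DataFrame actually has
    print(df.rdd.getNumPartitions())

    from pyspark.sql.functions import spark_partition_id

    # Row count per partition; very uneven counts indicate skew
    df.groupBy(spark_partition_id().alias("partition_id")).count().show()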
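A sketch of a partitioned JDBC read with a pushed-down filter; the URL, table, and column names are only examples:

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:sqlserver://myserver;databaseName=mydb")
          .option("dbtable", "dbo.sales")
          .option("partitionColumn", "id")    # numeric column to split the read on
          .option("lowerBound", "1")
          .option("upperBound", "1000000")
          .option("numPartitions", "24")      # e.g. total worker cores * X
          .load())

    # Filtering right after the read lets Spark push the predicate down,
    # so a WHERE clause is added to the generated SQL query.
    df = df.filter("order_date >= '2022-01-01'")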
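For the shuffle-partition setting, a minimal example of the calculation (the numbers are made up):

    # ~100 GB of shuffle data / ~200 MB target partition size ≈ 500 partitions
    spark.conf.set("spark.sql.shuffle.partitions", 500)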
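A minimal vectorized pandas UDF sketch (the column name and logic are placeholders):

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Runs on the executors and processes whole batches instead of one row at a time
    @pandas_udf("double")
    def times_two(v: pd.Series) -> pd.Series:
        return v * 2.0

    df = df.withColumn("doubled", times_two("some_numeric_column"))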
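And a sketch of the salting and OPTIMIZE ideas (table and column names are placeholders):

    from pyspark.sql.functions import col, floor, rand

    # Salt a skewed key so rows spread evenly across partitions;
    # sc.defaultParallelism is the total number of cores on the cluster.
    num_buckets = sc.defaultParallelism
    df = df.withColumn("salt", floor(rand() * num_buckets))
    df = df.repartition(num_buckets, col("salt"))

    # Compact a Delta table into larger files before heavy reads
    spark.sql("OPTIMIZE my_database.my_table")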

Hi @Hubert Dudek,

You have mentioned that ThreadPoolExecutor will not help. I want to run the same Databricks notebook for 100 different input values, and running them in sequence takes too long to complete.

So how can I achieve this scenario?

Hubert-Dudek
Esteemed Contributor III

Orchestrate everything as one workload (one job with multiple tasks), and every notebook run will have different parameters (something like the image below). You can create one ***** task that all the others depend on, so that they use the same machine (another setup, where you use a pool of servers and every task uses a different machine, is also possible).

If you really want to have 100 notebooks running in parallel on 1 cluster, you can set a unique scheduler pool for every notebook execution so they have reserved resources (just the name of the pool needs to be different):

sc.setLocalProperty("spark.scheduler.pool", "somename")
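A minimal sketch of one way to do this, assuming the pool name is passed to each notebook run as a parameter (the widget name is made up):

    # At the top of the child notebook: read a unique pool name passed in as a
    # parameter and pin this run's Spark jobs to that fair-scheduler pool.
    pool_name = dbutils.widgets.get("pool_name")   # e.g. "pool_17", unique per run
    sc.setLocalProperty("spark.scheduler.pool", pool_name)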

(image attachment: image.png)
