05-04-2022 04:18 AM
Hello all,
I have a large number of lightweight notebooks to run, so I am taking a concurrent approach and launching notebook runs in parallel with dbutils.notebook.run.
The more I increase parallelism, the longer each notebook takes.
In particular, the duration of the cell containing the imports grows with parallelism, up to 20-30 seconds:
import org.apache.spark.sql.functions.{col, lit, to_date, date_format}
import org.apache.spark.sql.types.{DateType, TimestampType, IntegerType}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException
The same problem occurs with the cell containing the implicits import:
import spark.implicits._
FYI, I cap parallelism at half the available driver cores (e.g. a parallelism of 16 on an F32s driver).
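Roughly, the launch pattern is something like the sketch below (a minimal sketch, not my exact code): the notebook paths are placeholders and the child notebooks are assumed to take no arguments.
// Minimal sketch of the concurrent launch pattern described above.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val notebookPaths = Seq("/Shared/light_notebook_1", "/Shared/light_notebook_2")  // placeholder paths
val parallelism = 16  // half the driver cores on an F32s driver

// Bound concurrency with a fixed thread pool on the driver
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(parallelism))

// Each Future launches one child notebook via dbutils.notebook.run
val runs = notebookPaths.map { path =>
  Future(dbutils.notebook.run(path, 0, Map.empty[String, String]))
}

// Block until all child notebooks finish and collect their exit values
val results = Await.result(Future.sequence(runs), Duration.Inf)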
Is there any strategy that can tackle this issue?
Thank you in advance,
05-05-2022 11:33 PM
@Pantelis Maroudis , every notebook will create its own SparkContext, and every context means overhead.
The number of cores is not the only constraint; memory and disks matter too.
This approach also puts a heavier burden on the driver.
05-07-2022 04:24 AM
@Pantelis Maroudis , yes, as @Werner Stinckens said, this is parallelism on the driver: the work is still submitted to the queue as Spark jobs and sent to the workers, where each CPU core processes one partition at a time... I used a ThreadPool often in the past, but I stopped because it makes little sense when your code is written correctly (i.e. designed to run on the executors, not on the driver) 🙂
05-18-2022 04:49 AM
Hello @Hubert Dudek ,
Thank you for the response and the help! Yes, I tried using scheduler.pool, but as you said, the pool only governs Spark resources. In my case the bottleneck is actually the driver scheduling the notebooks, not Spark's scheduling. As proof, I observed the same behavior with notebooks that did not interact with Spark at all.
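For anyone landing here, the scheduler.pool attempt was along these lines (a rough sketch; the pool name is arbitrary, and it is not guaranteed the property even carries into the child notebook's own jobs). It only shapes how Spark jobs share the executors, which is why it did not help with the driver-side bottleneck.
// Rough sketch of the fair-scheduler-pool attempt; pool name is arbitrary.
// setLocalProperty is thread-local, so it is set inside the thread that launches the run.
import scala.concurrent.{ExecutionContext, Future}

def runInPool(poolName: String, notebookPath: String)(implicit ec: ExecutionContext): Future[String] =
  Future {
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", poolName)
    try dbutils.notebook.run(notebookPath, 0, Map.empty[String, String])
    finally spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
  }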
05-18-2022 04:48 AM
Hello @Kaniz Fatma , yes, it is clear.
Following some tests on my side with a dummy notebook that does nothing but import a few modules and sleep for 15 seconds (so nothing to do with Spark), I found that even with a 32-core driver the saturation point is around 6 concurrent notebooks. So it is not simply a question of available cores per notebook, because that is close to the saturation point of a 16- or 8-core driver as well.
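The timing test was essentially along these lines (a rough sketch; the notebook path is a placeholder, and the sleep-only notebook just imports a few modules and calls Thread.sleep(15000)).
// Sketch of the timing test: measure the wall-clock time of n concurrent runs
// of a do-nothing notebook. "/Shared/sleep_only" is a placeholder path.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

def timeConcurrentRuns(n: Int, path: String = "/Shared/sleep_only"): Long = {
  val pool = Executors.newFixedThreadPool(n)
  implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
  val start = System.nanoTime()
  val runs = (1 to n).map(_ => Future(dbutils.notebook.run(path, 0, Map.empty[String, String])))
  Await.result(Future.sequence(runs), Duration.Inf)
  pool.shutdown()
  (System.nanoTime() - start) / 1000000L  // elapsed milliseconds
}

// With no launch overhead these would all stay near 15 s; past roughly 6 concurrent runs they climb.
Seq(2, 4, 6, 8, 16).foreach(n => println(s"n=$n -> ${timeConcurrentRuns(n)} ms"))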