
Slow imports for concurrent notebooks

pantelis_mare
Contributor III

Hello all,

I have a large number of lightweight notebooks to run, so I am taking the concurrent approach, launching notebook runs in parallel with dbutils.notebook.run.
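
Roughly, the launcher looks like the following (a minimal sketch only: the notebook paths under /Shared/light-jobs, the pool size of 16 and the 1-hour timeout are placeholders, not my real values):

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import java.util.concurrent.Executors

// Bound the parallelism with a fixed-size thread pool on the driver
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

// Placeholder paths for the light notebooks
val notebookPaths = (1 to 100).map(i => s"/Shared/light-jobs/job_$i")

// Each dbutils.notebook.run call blocks its thread until the child notebook finishes
val runs = notebookPaths.map { path =>
  Future {
    dbutils.notebook.run(path, 3600, Map.empty[String, String])
  }
}

// Wait for all runs and collect their exit values
val results = Await.result(Future.sequence(runs), Duration.Inf)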

The more I increase parallelism, the longer each notebook takes.

I observe that the duration of the cell containing the imports increases with parallelism, up to 20-30 seconds:

import org.apache.spark.sql.functions.{col, lit, to_date, date_format}
import org.apache.spark.sql.types.{DateType, TimestampType, IntegerType}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException

I see the same problem with the cell containing the implicits import:

import spark.implicits._

FYI, I am using a parallelism of at most half the available driver cores (e.g. a parallelism of 16 on an F32s driver).

Is there any strategy that can tackle this issue?

Thank you in advance,


5 REPLIES

-werners-
Esteemed Contributor III

@Pantelis Maroudis, every notebook will create its own Spark context, and every context means overhead.

The number of cores is not the only constraint; memory and disks matter as well.

This approach also puts a heavier burden on the driver.

Hubert-Dudek
Esteemed Contributor III

@Pantelis Maroudis, yes, as @Werner Stinckens said, this is parallelism on the driver: the work is still submitted as Spark jobs to a queue for the workers, and every CPU works step by step on one partition at a time. I used a ThreadPool often in the past, but I stopped because it makes little sense when your code is written correctly (i.e. designed to run on the executors, not on the driver) 🙂

  • reserve some resources for every notebook by using separate scheduler pools: spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool name") (see the sketch after this list)
  • alternatively, you can just set the notebooks to run in parallel using jobs/tasks: one ***** task, with all the other tasks depending on that one task (as in the attached image)
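
As a rough sketch of the first bullet (assuming the cluster uses the fair scheduler, i.e. spark.scheduler.mode is FAIR, and with "light-jobs" as a made-up pool name), each child notebook could start with:

// Run this notebook's Spark jobs in a dedicated fair-scheduler pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "light-jobs")

// ... the notebook's actual work goes here ...

// Reset to the default pool so later cells are not pinned to it
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)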

Hello @Hubert Dudek,

Thank you for the response and the help! Yes, I tried to use the scheduler pool, but as you said, the pool applies to Spark resources. In my case the bottleneck is actually the driver scheduling the notebooks, not the Spark scheduling. As proof, I observed the same behavior with notebooks that did not interact with Spark at all.

Kaniz
Community Manager

Hi @Pantelis Maroudis, just a friendly follow-up. Do you still need help, or did @Hubert Dudek's and @Werner Stinckens's responses help you find the solution? Please let us know.

pantelis_mare
Contributor III

Hello @Kaniz Fatma, yes, it is clear.

Following some tests on my side using a ***** notebook that does nothing but import a few things and sleep for 15 seconds (so nothing to do with Spark), I figured that even with a 32-core driver the fatigue point is close to 6 concurrent notebooks. This means it is not even a question of available cores per notebook, because that is close to the fatigue point of a 16- or 8-core driver as well.
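
For reference, the test notebook was essentially just this (a rough sketch of what is described above, not the exact code):

// Imports only, no Spark work at all
import org.apache.spark.sql.functions.{col, lit, to_date, date_format}
import org.apache.spark.sql.types.{DateType, TimestampType, IntegerType}
import org.apache.spark.sql.{DataFrame, Row}

// Simulate 15 seconds of "work" on the driver
Thread.sleep(15000)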
