Data Engineering

Slow imports for concurrent notebooks

pantelis_mare
Contributor III

Hello all,

I have a large number of lightweight notebooks to run, so I am taking a concurrent approach and launching notebook runs in parallel with dbutils.notebook.run.

The more I increase the parallelism, the longer each notebook takes to run.

I observe that the duration of the cell containing the imports grows with parallelism, up to 20-30 seconds:

import org.apache.spark.sql.functions.{col, lit, to_date, date_format}
import org.apache.spark.sql.types.{DateType, TimestampType, IntegerType}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.analysis.NoSuchTableException

The same problem occurs with the cell containing the implicits import statement:

import spark.implicits._

FYI, I use a parallelism of at most half the available driver cores (e.g. a parallelism of 16 on an F32s driver).
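For reference, a minimal sketch of the launching pattern described above (the notebook paths, timeout, and pool size are illustrative; dbutils is provided by the Databricks notebook environment):

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Bounded thread pool so that at most 16 notebook runs are in flight at once
val pool = Executors.newFixedThreadPool(16)
implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

// Illustrative list of light notebooks to run
val notebookPaths = Seq("/Shared/light_notebook_001", "/Shared/light_notebook_002")

// Launch each notebook run on the pool; 3600 is a per-run timeout in seconds
val runs = notebookPaths.map { path =>
  Future {
    dbutils.notebook.run(path, 3600, Map.empty[String, String])
  }
}

// Block until every run has finished, then release the threads
val results = Await.result(Future.sequence(runs), Duration.Inf)
pool.shutdown()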

Is there any strategy that can tackle this issue?

Thank you in advance,

1 ACCEPTED SOLUTION


-werners-
Esteemed Contributor III

@Pantelis Maroudis, every notebook will create its own SparkContext, and every context means overhead.

The number of cores is not the only metric that matters; memory and disks do as well.

This approach also puts a heavier burden on the driver.


4 REPLIES


Hubert-Dudek
Esteemed Contributor III

@Pantelis Maroudis, yes, as @Werner Stinckens said, this is parallelism on the driver: the work is still sent to the workers as Spark jobs in the queue, and every CPU works through one partition at a time anyway. I used ThreadPool often in the past, but I stopped because it makes little sense when your code is written correctly (i.e. designed to run on the executors, not on the driver) 🙂 A couple of suggestions:

  • For every notebook, reserve some resources by using a separate scheduler pool: spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool name") (see the sketch after this list).
  • Alternatively, run them in parallel using jobs/tasks: one dummy task, with all other tasks depending on that one task, as shown in the attached image.
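A minimal sketch of the scheduler-pool suggestion (the pool name is illustrative); it would go at the top of each concurrently launched notebook so that its Spark jobs are submitted from a dedicated fair-scheduler pool rather than the default one:

// Tag all Spark jobs submitted from this notebook's thread with a dedicated pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "notebook-pool-1")

// ... the notebook's Spark work runs here ...

// Setting the property back to null reverts subsequent jobs to the default pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)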

pantelis_mare
Contributor III

Hello @Hubert Dudek,

Thank you for the response and the help! Yes, I tried to use the scheduler pool, but as you said, the pool governs the Spark resources. In my case the bottleneck is actually the driver scheduling the notebooks, not the Spark scheduling. As proof, I observed the same behavior with notebooks that did not interact with Spark at all.

pantelis_mare
Contributor III

Hello @Kaniz Fatma, yes it is clear.

Following some tests on my side with a dummy notebook that does nothing but import a few libraries and sleep for 15 seconds (so nothing to do with Spark), I found that even with a 32-core driver the fatigue point is close to 6 concurrent notebooks. So it is not even a question of available cores per notebook, because that is close to the fatigue point of a 16- or 8-core driver as well.
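For completeness, the dummy notebook used in this test was essentially just the import cells plus a sleep, something along these lines (no Spark interaction at all):

// Dummy notebook body: imports only, then sleep for 15 seconds
import org.apache.spark.sql.functions.{col, lit, to_date, date_format}
import org.apache.spark.sql.types.{DateType, TimestampType, IntegerType}
import org.apache.spark.sql.{DataFrame, Row}

Thread.sleep(15000L)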
