Databricks Academy Learners

How will you decide on the cluster configuration for your project? What factors will you consider before choosing an optimal cluster? Let's have a discussion.

Harun
Honored Contributor
8 REPLIES

Anonymous
Not applicable
  • For machine learning, use the ML Runtime.
  • For deep learning, use the ML Runtime with GPU.
  • If you're doing SQL + ETL, use Photon.
  • Try to get as much RAM as you have data, plus extra RAM for overhead. If you have a 1 TB dataset, try to get 1.5 TB of RAM.
  • Scale up before you scale out. Fewer machines means less network traffic on shuffles.
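The RAM rule of thumb above can be turned into a quick back-of-the-envelope calculation. This is only a minimal sketch: the `size_cluster` helper and the per-node RAM figures are hypothetical, and the 1.5x headroom factor is simply the "1 TB of data, 1.5 TB of RAM" example restated.

```python
import math


def size_cluster(dataset_gb: float, node_ram_gb: float, headroom: float = 1.5) -> dict:
    """Estimate total RAM and worker count for a dataset of a given size.

    headroom=1.5 mirrors the "1 TB of data -> 1.5 TB of RAM" rule of thumb;
    it is an assumption, not a measured requirement.
    """
    if dataset_gb <= 0 or node_ram_gb <= 0:
        raise ValueError("sizes must be positive")
    target_ram_gb = dataset_gb * headroom
    workers = math.ceil(target_ram_gb / node_ram_gb)
    return {"target_ram_gb": target_ram_gb, "workers": workers}


# Hypothetical node sizes: the same 1 TB dataset on small vs. large workers.
small_nodes = size_cluster(1000, node_ram_gb=64)
large_nodes = size_cluster(1000, node_ram_gb=256)
print(small_nodes["workers"], large_nodes["workers"])
```

Larger nodes reach the same RAM target with fewer workers, which is the "scale up before you scale out" point: fewer machines take part in each shuffle.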

Pat
Honored Contributor III

I like the scale-up before the scale-out.

I used to run multiple clusters, but it does make sense. I like this page:

https://docs.databricks.com/clusters/cluster-config-best-practices.html

Thanks,

Pat.

Harun
Honored Contributor

Thanks @Joseph Kambourakis for the input.

kpendergast
Contributor

The biggest factor is the cost of compute. I start simple and adjust as needed. However, if one block of code is causing a performance issue, that needs to be addressed first, as no cluster can make bad code better.

In general, I analyze the overall runtime of a workflow and test different cluster sizes and instance types. After a few runs, I check the metrics to see how the cluster is performing during the job and adjust the instance types as necessary.

Some cases are special and need to be configured for the code you will be running. JDBC jobs, for example, need to be configured for the number of cores if you want the ETL to run on all nodes.
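To illustrate the JDBC point: Spark's JDBC source reads through a single connection unless it is given partitioning options (`partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`). Here is a minimal sketch that matches the partition count to the total number of worker cores. The `jdbc_read_options` helper, the URL, the table, and the column name are all hypothetical placeholders.

```python
def jdbc_read_options(url, table, partition_column, lower, upper,
                      workers, cores_per_worker):
    """Build Spark JDBC reader options with one partition per worker core.

    Assumes partition_column is a numeric, date, or timestamp column whose
    values are spread fairly evenly between lower and upper.
    """
    total_cores = workers * cores_per_worker
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(total_cores),
    }


# Hypothetical cluster: 4 workers with 8 cores each.
opts = jdbc_read_options(
    "jdbc:postgresql://db.example.com:5432/sales",  # placeholder URL
    "orders", "order_id", 1, 1_000_000,
    workers=4, cores_per_worker=8,
)
print(opts["numPartitions"])
# On a cluster: df = spark.read.format("jdbc").options(**opts).load()
```

Each partition opens its own database connection, so in practice the partition count may need to be capped at whatever the source database can handle.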

For BI platforms and Databricks SQL warehouses, these clusters need to be monitored at the query level. If a query runs for several hours but the execution time is only a few minutes, I'd create a smaller cluster for it, as most of the time is spent waiting on the BI platform to ingest the data.

For ML it all depends on the models and data. Start simple and adjust as needed. Some libraries and packages may need GPUs and some may not need more than a single instance.

For what it's worth, some operations store a lot of data on the driver (master) node. I set the Spark config spark.driver.maxResultSize so that all but 1 GB of the driver's memory is available for results.
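As a rough illustration of that setting, here is a small sketch that derives a `spark.driver.maxResultSize` value from a driver memory figure by leaving 1 GB in reserve. The helper names are hypothetical, and the driver memory you pass in is an assumption: the usable JVM heap is usually smaller than the instance's advertised RAM.

```python
def max_result_size(driver_memory_gb: int, reserve_gb: int = 1) -> str:
    """Return a Spark size string that leaves reserve_gb of driver memory free."""
    if driver_memory_gb <= reserve_gb:
        raise ValueError("driver memory must exceed the reserve")
    return f"{driver_memory_gb - reserve_gb}g"


def spark_conf_line(driver_memory_gb: int) -> str:
    """Format the setting as a 'key value' line for a cluster's Spark config."""
    return f"spark.driver.maxResultSize {max_result_size(driver_memory_gb)}"


# Hypothetical driver with 16 GB available to Spark.
print(spark_conf_line(16))
```

A large limit only raises the ceiling on results collected to the driver; it does not add memory, so very large collects can still run the driver out of memory.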

ranged_coop
Valued Contributor II

Can you please help with the points below?

  1. How do you decide between a job cluster and an interactive cluster? I have a scenario where an hourly job typically runs for more than 35 minutes, while our other jobs run for a much shorter time. In such cases, how would you decide between them, and is it still cheaper to go for separate job clusters?
  2. Is there a standard way to check how much peak memory a job uses? For example, one of the jobs I have come across has source data of around 700 MB, references DBMS tables as views (which I assume are loaded into memory; can anyone confirm?), and has some looping ETL logic written in PySpark. In such a case, what would be the ideal way to identify how much peak memory this job uses and then size a cluster based on that?

Harun
Honored Contributor

@Bharath Kumar Ramachandran going with a job cluster will be cheaper, I believe.
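One way to sanity-check this for the hourly 35-minute job is to compare billed hours. This is a minimal sketch in which the rates are unit placeholders and not real Databricks prices. The function names and the `startup_minutes` parameter are hypothetical, and it assumes the interactive cluster stays up around the clock.

```python
def job_cluster_cost(runs_per_day, minutes_per_run, hourly_rate,
                     startup_minutes=0, days=30):
    """Cost of a job cluster that exists only while each run is active.

    startup_minutes is any billed per-run overhead (an assumption; set it
    to whatever you observe for your own clusters).
    """
    hours = runs_per_day * (minutes_per_run + startup_minutes) / 60 * days
    return hours * hourly_rate


def always_on_cost(hourly_rate, days=30):
    """Cost of an interactive cluster that is never terminated."""
    return 24 * days * hourly_rate


# Placeholder rates: the interactive rate is assumed to be twice the job rate.
job = job_cluster_cost(runs_per_day=24, minutes_per_run=35, hourly_rate=1.0)
interactive = always_on_cost(hourly_rate=2.0)
print(job, interactive)
```

Plug in your own rates and run times. The gap narrows if the interactive cluster auto-terminates between runs or is shared by many short jobs.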

Ajay-Pandey
Esteemed Contributor III

In my project, we generally decide on the cluster based on the data, the complexity of the code, and time.

Harun
Honored Contributor

@Ajay Pandey​ great
