How will you decide your cluster configuration for your project? What factors will you consider before choosing an optimal cluster? Let's have a discussion.
12-07-2022 06:15 AM

12-07-2022 08:49 AM
- Machine learning: use the ML runtime
- Deep learning: use the ML runtime with GPUs
- SQL + ETL: use Photon
- Try to get as much RAM as you have data, plus extra for overhead. For example, with a 1 TB dataset, aim for roughly 1.5 TB of RAM across the cluster.
- Scale up before you scale out: fewer, larger machines mean less network traffic during shuffles. (A sketch of how these guidelines might look in a cluster spec follows below.)
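
A minimal sketch of how these guidelines could translate into a cluster definition, expressed as the `new_cluster` payload used by the Databricks Jobs/Clusters APIs. The runtime version, instance type, and worker counts are illustrative assumptions, not recommendations; size them against your own data volume (roughly 1.5x the dataset size in total RAM):

```python
# Illustrative cluster spec for a SQL + ETL job (values are assumptions).
etl_cluster_spec = {
    "spark_version": "11.3.x-scala2.12",   # standard runtime; pick an *-ml or
                                           # *-gpu-ml version for ML / deep learning
    "runtime_engine": "PHOTON",            # Photon for SQL + ETL workloads
    "node_type_id": "i3.2xlarge",          # memory-optimized workers (~61 GB RAM each)
    "driver_node_type_id": "i3.2xlarge",
    "autoscale": {"min_workers": 2, "max_workers": 8},
}
```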
12-07-2022 10:35 AM
I like the scale-up before scale-out advice.
I used to run multiple clusters, but it does make sense. I like this page:
https://docs.databricks.com/clusters/cluster-config-best-practices.html
Thanks,
Pat.
12-08-2022 12:57 AM
Thanks @Joseph Kambourakis for the input.
12-07-2022 10:13 AM
The biggest factor is compute cost. I start simple and adjust as needed. However, if one block of code is creating a performance issue, that needs to be addressed first; no cluster can make bad code better.
In general, I analyze the overall runtime of a workflow and test different cluster sizes and instance types. After a few runs I check the metrics, see how the cluster performs during the job, and adjust the instance types as necessary.
Some cases are special and need to be configured for the code you will be running. JDBC jobs, for example, need their partition count configured to match the number of cores if you want the ETL to run across all nodes (see the sketch below).
For BI platforms and Databricks SQL warehouses, these clusters need to be monitored at the query level. If a query runs for several hours but its execution time is only a few minutes, I'd create a smaller cluster for it, since most of the time is spent waiting on the BI platform to ingest the data.
For ML, it all depends on the models and data. Start simple and adjust as needed. Some libraries and packages may need GPUs, and some may not need more than a single instance.
For what it's worth, some operations store a lot of data on the driver node, so I set spark.driver.maxResultSize so that collected results leave the driver some memory headroom (sketch below).
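
A minimal sketch of a partitioned JDBC read in a Databricks notebook, where `numPartitions` spreads the read across executor cores. The connection URL, table, secret scope/key, and bounds are placeholders; `numPartitions` is typically sized to roughly (workers x cores per worker):

```python
# Hedged sketch: partitioned JDBC read so the ETL uses all executor cores.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/sales")   # placeholder connection
    .option("dbtable", "public.orders")                      # placeholder table
    .option("user", "etl_user")
    .option("password", dbutils.secrets.get("etl", "db-password"))  # placeholder scope/key
    .option("partitionColumn", "order_id")   # numeric column with even distribution
    .option("lowerBound", "1")
    .option("upperBound", "10000000")
    .option("numPartitions", "32")           # e.g. 8 workers x 4 cores
    .load()
)
```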
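
A minimal sketch of setting that config. On Databricks it would normally go in the cluster's Spark config field rather than in code; the 4g value is purely illustrative:

```python
from pyspark.sql import SparkSession

# Hedged sketch: cap the size of results collected back to the driver.
# Equivalent cluster Spark config line: spark.driver.maxResultSize 4g
spark = (
    SparkSession.builder
    .config("spark.driver.maxResultSize", "4g")  # illustrative limit
    .getOrCreate()
)
```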
12-08-2022 12:28 AM
Can you please help with the points below?
- How do you decide between a job cluster and an interactive cluster? I have a scenario with an hourly job that runs for more than 35 minutes, plus other jobs that run for much less time. In such cases, how would you decide between them, and is it still cheaper to go with separate job clusters?
- Is there a standard way to check how much peak memory a job uses? For example, one of the jobs I have come across has source data of around 700 MB, references DBMS tables as views (which I assume are loaded into memory? Can anyone confirm?), and has some looping ETL logic written in PySpark... In such a case, what would be the ideal way to identify the job's peak memory usage and then size a cluster based on that?
12-08-2022 12:59 AM
@Bharath Kumar Ramachandran going with a job cluster will be cheaper, I believe.
12-08-2022 01:14 AM
In my project, we generally decide on the cluster based on the data volume, the complexity of the code, and the required runtime.
12-08-2022 08:51 AM
@Ajay Pandey great

