Databricks

Harun · ‎12-07-2022

Anonymous · ‎12-07-2022

Machine learning: use the ML Runtime.
Deep learning: use the ML Runtime with GPU.
If you're doing SQL + ETL, use Photon.
Try to get as much RAM as you have data, plus extra RAM for overhead. If you have a 1 TB dataset, try to get 1.5 TB of RAM.
Scale up before you scale out. Fewer machines means less network traffic on shuffles.
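
To make the RAM rule of thumb concrete, here is a minimal sketch of the arithmetic. The 1.5x overhead factor comes from the example above; the function name and the 256 GB and 64 GB node sizes are only illustrative assumptions, not recommendations for any particular instance type.

```python
import math


def nodes_needed(dataset_gb: float, node_ram_gb: float,
                 overhead_factor: float = 1.5) -> int:
    """Return how many worker nodes are needed to hold the dataset plus overhead.

    overhead_factor=1.5 mirrors the "1 TB of data -> 1.5 TB of RAM" rule of thumb.
    """
    target_ram_gb = dataset_gb * overhead_factor
    return math.ceil(target_ram_gb / node_ram_gb)


# A 1 TB (1000 GB) dataset needs about 1500 GB of RAM in total.
# Hypothetical node sizes: larger nodes cover it with far fewer machines,
# which is the "scale up before you scale out" idea.
print(nodes_needed(1000, 256))  # few large nodes
print(nodes_needed(1000, 64))   # many small nodes
```

With the same total RAM, the first option spreads a shuffle across fewer machines, so less data has to cross the network.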

Pat · ‎12-07-2022

I like the scale-up before the scale-out.

I used to run multiple clusters, but it does make sense. I like this page:

https://docs.databricks.com/clusters/cluster-config-best-practices.html

Thanks,

Pat.

Harun · ‎12-08-2022

Thanks @Joseph Kambourakis for the inputs

kpendergast · ‎12-07-2022

The biggest factor is the cost of compute. I start simple and adjust as needed. However, if one block of code is creating a performance issue, that needs to be addressed first, as no cluster can make bad code better.

In general, I analyze the overall runtime of a workflow and test different cluster sizes and instance types. After a few runs I check the metrics to see how the cluster is performing during the job and adjust the instance types as necessary.
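
One way to compare such test runs is to turn each into an approximate cost. This is only a sketch: the `Run` class, the labels, the hourly rates, and the runtimes are all made-up placeholders, and real pricing has more components than a flat per-node rate.

```python
from dataclasses import dataclass


@dataclass
class Run:
    label: str
    nodes: int
    hourly_rate_per_node: float  # hypothetical price, not a real quote
    runtime_minutes: float       # measured wall-clock time of the workflow

    @property
    def cost(self) -> float:
        """Approximate cost: nodes x hourly rate x hours of runtime."""
        return self.nodes * self.hourly_rate_per_node * self.runtime_minutes / 60


def cheapest(runs):
    """Return the run with the lowest approximate cost."""
    return min(runs, key=lambda r: r.cost)


# Placeholder measurements from three trial configurations.
runs = [
    Run("4 x small", nodes=4, hourly_rate_per_node=0.50, runtime_minutes=60),
    Run("2 x large", nodes=2, hourly_rate_per_node=1.00, runtime_minutes=45),
    Run("8 x small", nodes=8, hourly_rate_per_node=0.50, runtime_minutes=40),
]

for r in runs:
    print(f"{r.label}: {r.cost:.2f}")
print("cheapest:", cheapest(runs).label)
```

The fastest run is not necessarily the cheapest one, which is why it helps to look at both numbers side by side.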

Some cases are special and need to be configured for the code you will be running. JDBC jobs, for example, need to be configured for the number of cores if you want the ETL to run on all nodes.
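
As a sketch of the JDBC case: a Spark JDBC read uses a single partition unless partitioning options are supplied, so one approach is to set `numPartitions` to the total number of worker cores. The option names below are standard Spark JDBC options; the helper function, URL, table, column, bounds, and the 4 workers x 8 cores figure are hypothetical placeholders.

```python
def jdbc_read_options(url: str, table: str, partition_column: str,
                      lower_bound: int, upper_bound: int,
                      total_cores: int) -> dict:
    """Build Spark JDBC options that split the read into one partition per core.

    partitionColumn must be a numeric, date, or timestamp column; the bounds
    only decide how the range is split, they do not filter rows.
    """
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(total_cores),
    }


# Hypothetical cluster: 4 workers with 8 cores each.
opts = jdbc_read_options(
    url="jdbc:postgresql://example-host:5432/exampledb",  # placeholder URL
    table="public.orders",                                # placeholder table
    partition_column="order_id",                          # placeholder column
    lower_bound=1,
    upper_bound=1_000_000,
    total_cores=4 * 8,
)
print(opts["numPartitions"])

# On a cluster with a SparkSession named `spark` (credentials omitted here):
# df = spark.read.format("jdbc").options(**opts).load()
```

Keep in mind that every partition opens its own connection, so the source database also has to tolerate that many concurrent queries.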

For BI platforms and Databricks SQL warehouses, these clusters need to be monitored at the query level. If a query runs for several hours but the execution time is only a few minutes, I'd create a smaller cluster for it, as most of the time is spent waiting on the BI platform to ingest the data.

For ML it all depends on the models and data. Start simple and adjust as needed. Some libraries and packages may need GPUs and some may not need more than a single instance.

For what it's worth, some operations will store a lot of info on the master (driver) node. I set the Spark config spark.driver.maxResultSize so that results can use all but 1 GB of the driver's memory.
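
A small sketch of that setting, reading "all but 1 GB" as driver memory minus a 1 GB reserve. The helper name and the 16 GB driver size are assumptions for illustration; `spark.driver.maxResultSize` is the real Spark property, and it is typically set in the cluster's Spark config rather than inside a running notebook.

```python
def max_result_size(driver_memory_gb: int, reserve_gb: int = 1) -> str:
    """Return a Spark size string leaving `reserve_gb` of driver memory free."""
    if driver_memory_gb <= reserve_gb:
        raise ValueError("driver memory must be larger than the reserve")
    return f"{driver_memory_gb - reserve_gb}g"


# Hypothetical driver with 16 GB of memory.
value = max_result_size(16)

# Line to paste into the cluster's Spark config (key and value separated by a space).
print(f"spark.driver.maxResultSize {value}")
```

Raising this limit only helps when large results genuinely have to come back to the driver; avoiding big collect() calls in the first place is usually the safer fix.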

ranged_coop · ‎12-08-2022

Can you please help with the points below?

How do you decide between a job cluster and an interactive cluster? I have a scenario where an hourly job runs for roughly 35 minutes or more, and other jobs run for a much shorter time. In such cases, how would you decide between them, and is it still cheaper to go for separate job clusters?
Is there a standard way to check how much peak memory a job uses? For example, one of the jobs I have come across has source data of around 700 MB, references DBMS tables as views (which I assume are loaded into memory; can anyone confirm?), and has some looping ETL logic written in PySpark. In such a case, what would be the ideal way to identify how much peak memory this job uses and then create a cluster based on that?