Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

The possibility of determining the workload dynamically and spinning up a cluster based on the workload

Arunsundar
New Contributor III

Hi Team,

Good morning.

I would like to understand whether it is possible to determine the workload automatically through code (for example, for a data load from a file to a table, determine the file size as a kind of benchmark we can check) and, based on that, spin up an optimal cluster type with control over the minimum/maximum number of workers required to complete the workload efficiently.
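
Something along the lines of the sketch below is what I have in mind; the path, size thresholds, node type, runtime version, and workspace URL are only placeholders, not actual values from our setup.

import requests

def total_input_size_bytes(path):
    # Sum the sizes of the files directly under the input path
    # (dbutils is available inside Databricks notebooks).
    return sum(f.size for f in dbutils.fs.ls(path))

def autoscale_for(size_bytes):
    # Hypothetical mapping from input size to an autoscale range.
    gb = size_bytes / (1024 ** 3)
    if gb < 10:
        return {"min_workers": 2, "max_workers": 4}
    elif gb < 100:
        return {"min_workers": 4, "max_workers": 8}
    return {"min_workers": 8, "max_workers": 16}

size = total_input_size_bytes("dbfs:/mnt/raw/incoming/")   # placeholder path
cluster_spec = {
    "cluster_name": "workload-sized-cluster",
    "spark_version": "13.3.x-scala2.12",                   # placeholder runtime
    "node_type_id": "Standard_DS3_v2",                     # placeholder node type
    "autoscale": autoscale_for(size),
}

# The same spec could be posted to the Clusters API or used as a job's new_cluster.
requests.post(
    "https://<workspace-url>/api/2.0/clusters/create",     # placeholder workspace URL
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)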

I would also like to understand whether cluster selection can only be done by running the workload in a trial-and-error fashion, attaching various cluster types in the Dev environment and arriving at the optimal cluster that we then attach in higher environments.

Kindly let me know if you have any further questions.

Thanks

4 REPLIES

Debayan
Databricks Employee

Hi, I didn't quite get your question, could you please elaborate? Do you want to determine the workload through the code you deploy?

Please tag @Debayan in your next response, which will notify me. Thank you!

Arunsundar
New Contributor III

Yes @Debayan Mukherjee, I need to determine the workload through the code and spin up the necessary/optimal cluster based on that workload.

Debayan
Databricks Employee

Hi, and how will you be running the workload through the code? Will there be any resource involved, or how will it work?

pvignesh92
Honored Contributor

Hi @Arunsundar Muthumanickam,

When you say workload, I believe you might be handling different volumes of data between the Dev and Prod environments. If you are using a Databricks cluster and do not have a clear idea of how the volumes might turn out in different environments, enabling cluster autoscaling with a minimum and maximum number of workers would be an ideal choice, as more workers can be added depending on your workload (number of partitions).
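
As a quick check (the path and format below are just placeholders for your own source), you can look at how many partitions your load produces, since that determines how many tasks the cluster has to schedule:

df = spark.read.format("parquet").load("dbfs:/mnt/raw/sample/")   # placeholder source
print(df.rdd.getNumPartitions())   # number of partitions, i.e. parallel tasks per stage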

If your workload has a shuffle phase (i.e. joins, groupBy, etc.), please check whether you can tweak the shuffle partition setting (spark.sql.shuffle.partitions), or you can set it to auto so that the Spark optimizer can adjust it based on your partition sizes.
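
For example (the auto value assumes a Databricks runtime with auto-optimized shuffle; on plain Spark you would set an explicit number instead):

# Let adaptive query execution adjust shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.enabled", "true")

# On Databricks you can set the shuffle partition count to "auto";
# otherwise, pick an explicit number suited to your data volume.
spark.conf.set("spark.sql.shuffle.partitions", "auto")
# spark.conf.set("spark.sql.shuffle.partitions", "200")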

Below is some sample code showing how you can check the distribution of data across your partitions.

from pyspark.sql.functions import spark_partition_id, asc

# df is the DataFrame you loaded; this counts the rows in each Spark partition
# so you can spot skew (a few very large partitions vs. many small ones).
df\
    .withColumn("partitionId", spark_partition_id())\
    .groupBy("partitionId")\
    .count()\
    .orderBy(asc("count"))\
    .show()