Data Engineering

The possibility of determining the workload dynamically and spinning up a cluster based on it

Arunsundar
New Contributor III

Hi Team,

Good morning.

I would like to understand whether it is possible to determine the workload automatically through code (for example, a data load from a file to a table, where we determine the file size as a kind of benchmark), and based on that spin up an optimal cluster type, with control over the minimum/maximum number of workers required to complete the workload efficiently.
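
To make it a bit more concrete, the rough sketch below is the kind of thing I have in mind (it assumes a Databricks notebook where dbutils is available; the landing path, size thresholds, and worker counts are only placeholders, not recommendations):

# Rough sketch only: estimate the input volume and map it to an autoscale range.
# The path below is a hypothetical landing location; dbutils.fs.ls is not recursive.
total_bytes = sum(f.size for f in dbutils.fs.ls("dbfs:/mnt/raw/incoming/"))
total_gb = total_bytes / (1024 ** 3)

# Arbitrary thresholds; these would need benchmarking against real workloads.
if total_gb < 5:
    min_workers, max_workers = 1, 2
elif total_gb < 50:
    min_workers, max_workers = 2, 8
else:
    min_workers, max_workers = 4, 16

print(f"Input ~{total_gb:.1f} GB -> autoscale {min_workers}-{max_workers} workers")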

I would also like to understand whether cluster selection can only be done by trial and error, i.e. running the workload against various cluster types in the Dev environment and arriving at the optimal cluster that we then attach in higher environments.

Kindly let me know if you have any further questions.

Thanks

5 REPLIES

Debayan
Esteemed Contributor III

Hi, I didn't get your question, could you please elaborate? Do you want to determine the workload through the code you deploy?

Please tag @Debayan with your next response, which will notify me. Thank you!

Arunsundar
New Contributor III

Yes @Debayan Mukherjee, I need to determine the workload through the code and spin up the necessary/optimal cluster based on it.

Debayan
Esteemed Contributor III

Hi, and how will you be running the workload through the code? Will there be any resources involved, or how is it set up?

pvignesh92
Honored Contributor

Hi @Arunsundar Muthumanickam,

When you say workload, I believe you might be handling different volumes of data between the Dev and Prod environments. If you are using a Databricks cluster and do not have a good idea of how the volumes might turn out in different environments, enabling cluster autoscaling with min and max workers would be an ideal choice, as more workers are added depending on your workload (number of partitions).
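
For reference, autoscaling is just a min/max worker range on the cluster definition. A minimal sketch of a job cluster spec with autoscaling (the runtime version, node type, and worker counts below are illustrative, not recommendations) could look like this:

# Illustrative "new_cluster" spec with autoscaling, as you would pass it to the
# Databricks Jobs/Clusters API; adjust the values for your cloud and workload.
new_cluster = {
    "spark_version": "12.2.x-scala2.12",   # example runtime version
    "node_type_id": "Standard_DS3_v2",     # example Azure node type
    "autoscale": {
        "min_workers": 2,                  # floor for smaller Dev volumes
        "max_workers": 8,                  # ceiling for larger Prod volumes
    },
}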

If your workload has a shuffle phase (i.e. joins, groupBy, etc.), please check whether you can tweak the spark.sql.shuffle.partitions setting, or set it to auto so that the Spark optimizer can adjust it based on your partition sizes.
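
For example, something like the following enables adaptive query execution and lets the runtime pick the shuffle partition count (the "auto" value is Databricks-specific and needs a recent runtime):

# Let adaptive query execution coalesce/size shuffle partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Databricks-specific: "auto" lets the runtime choose the shuffle partition count.
spark.conf.set("spark.sql.shuffle.partitions", "auto")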

Below is some sample code showing how you can get the distribution of data across your partitions.

from pyspark.sql.functions import spark_partition_id, asc

# Count the rows in each Spark partition to see how evenly the data is distributed.
(
    df
    .withColumn("partitionId", spark_partition_id())
    .groupBy("partitionId")
    .count()
    .orderBy(asc("count"))
    .show()
)

Kaniz
Community Manager

Hi @Arunsundar Muthumanickam, we haven't heard from you since the last responses from @Vigneshraja Palaniraj and @Debayan Mukherjee, and I was checking back to see if their suggestions helped you.

Otherwise, if you have a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
