03-12-2023 09:16 PM
Hi Team,
Good morning.
I would like to understand whether it is possible to determine the workload automatically through code (for a data load from a file to a table, determine the file size, a kind of benchmark that we can check), and based on that spin up an optimal cluster type, with control over the minimum/maximum number of workers required to complete the workload efficiently.
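To illustrate what I have in mind, a rough sketch could look like the below (the path and the size thresholds are hypothetical placeholders; dbutils.fs.ls is only used here to sum up the input file sizes in a flat folder):
# Hypothetical sketch: estimate the workload from the total input file size,
# then map that size to a suggested autoscale range for the cluster.
input_path = "dbfs:/mnt/raw/sales/"  # placeholder path

total_bytes = sum(f.size for f in dbutils.fs.ls(input_path))  # non-recursive listing
total_gb = total_bytes / (1024 ** 3)

# Illustrative thresholds only, not a recommendation.
if total_gb < 10:
    min_workers, max_workers = 2, 4
elif total_gb < 100:
    min_workers, max_workers = 4, 8
else:
    min_workers, max_workers = 8, 16

print(f"Input size: {total_gb:.2f} GB -> suggested autoscale range {min_workers}-{max_workers} workers")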
I would also like to understand whether cluster selection can only be done by trial and error, i.e. running the workload against various cluster types in the Dev environment and arriving at the optimal cluster that we then attach in higher environments.
Kindly let me know if you have any further questions.
Thanks
03-12-2023 11:16 PM
Hi, I didn't get your question, could you please elaborate? Do you want to get the workload through the code you deploy?
Please tag @Debayan with your next response, which will notify me. Thank you!
03-12-2023 11:18 PM
Yes @Debayan Mukherjee, I need to get the workload through the code and spin up the necessary/optimal cluster based on that workload.
03-15-2023 10:41 PM
Hi, and how will you be running the workload through the code? Will there be any resource involved, or how is it set up?
03-13-2023 10:40 AM
Hi @Arunsundar Muthumanickam ,
When you say workload, I believe you might be handling different volumes of data between the Dev and Prod environments. If you are using a Databricks cluster and do not have a clear idea of how the volumes might turn out in different environments, enabling Cluster Autoscaling with min and max workers would be an ideal choice, as more workers can be added depending on your workload (number of partitions).
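For example, a minimal sketch of creating such a cluster through the Clusters API could look like the below (the workspace URL, token, node type, and runtime version are placeholders you would replace with values valid for your workspace):
import requests

# Placeholders - replace with your workspace URL and a valid personal access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Cluster spec with autoscaling: workers are added/removed between min and max as needed.
cluster_spec = {
    "cluster_name": "autoscaling-etl-cluster",
    "spark_version": "13.3.x-scala2.12",  # example runtime version
    "node_type_id": "i3.xlarge",          # example node type, depends on your cloud
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())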
If your workload has a shuffle phase (i.e. joins, groupBy, etc.), please check if you can tweak the shuffle partition count (spark.sql.shuffle.partitions), or you can set it to auto so that the Spark optimizer can adjust it based on your partition sizes.
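For example (a sketch only; the explicit partition count is illustrative, and setting the value to auto assumes Adaptive Query Execution is available on your Databricks Runtime):
# Option 1: set an explicit shuffle partition count (illustrative value).
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Option 2 (Databricks Runtime): enable AQE and let it choose the shuffle partition count.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "auto")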
Below is some sample code showing how you can get the distribution of data across your partitions.
from pyspark.sql.functions import spark_partition_id, asc

# Count the rows in each Spark partition and show the smallest partitions first.
df\
.withColumn("partitionId", spark_partition_id())\
.groupBy("partitionId")\
.count()\
.orderBy(asc("count"))\
.show()