Databricks Community

Arunsundar · ‎03-12-2023

Hi Team,

Good morning.

I would like to understand whether the workload can be determined automatically through code (for example, for a data load from a file to a table, determining the file size as a kind of benchmark), so that we can spin up an optimal cluster type with control over the minimum and maximum number of workers needed to complete the workload efficiently.

I would also like to understand whether the cluster can only be determined by trial and error, that is, by running the workload on various cluster types in the Dev environment and arriving at the optimal cluster to attach in higher environments.

Kindly let me know if you have any further questions.

Thanks

Debayan · ‎03-12-2023

Hi, I didn't get your question. Could you please elaborate? Do you want to determine the workload through the code you deploy?

Please tag @Debayan in your next response so that I am notified. Thank you!

Arunsundar · ‎03-12-2023

Yes @Debayan Mukherjee, we need to determine the workload through the code and spin up the necessary/optimal cluster based on that workload.

Debayan · ‎03-15-2023

Hi, and how will you be running the workload through the code? Will any resource be involved, or how will it work?

pvignesh92 · ‎03-13-2023

Hi @Arunsundar Muthumanickam ,

When you say workload, I believe you are handling different volumes of data in the Dev and Prod environments. If you are using a Databricks cluster and do not have a good idea of how the volumes will turn out in different environments, enabling cluster autoscaling with min and max workers would be an ideal choice, as more workers can be added depending on your workload (number of partitions).
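As a rough illustration of the autoscaling suggestion, here is a minimal sketch that turns an input size into autoscaling bounds. The function name `autoscale_cluster_spec`, the 50 GiB-per-worker rule of thumb, and the worker cap are all hypothetical assumptions chosen for illustration, not Databricks recommendations. The `autoscale`, `min_workers`, and `max_workers` field names follow the cluster spec format of the Databricks Clusters API as I understand it; the runtime version and node type are left as placeholders.

```python
import math


def autoscale_cluster_spec(input_bytes,
                           bytes_per_worker=50 * 1024**3,  # assumed rule of thumb, tune per workload
                           min_workers=2,
                           max_workers_cap=20):
    """Build a cluster spec fragment whose autoscaling bounds depend on input size."""
    # Workers needed if each worker handles roughly bytes_per_worker of input.
    needed = math.ceil(input_bytes / bytes_per_worker)
    # Never go below min_workers or above the configured cap.
    max_workers = max(min_workers, min(needed, max_workers_cap))
    return {
        "spark_version": "<runtime-version>",  # placeholder, fill in for your workspace
        "node_type_id": "<node-type>",         # placeholder, fill in for your cloud
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


# Example: a 200 GiB input file
spec = autoscale_cluster_spec(200 * 1024**3)
print(spec["autoscale"])
```

The input size would have to come from wherever the file lives, for example the `size` field returned by `dbutils.fs.ls` in a notebook. Any such heuristic still needs validating against real runs, so it narrows the trial-and-error search rather than replacing it.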

If your workload has a shuffle phase (joins, groupBy, etc.), please check whether you can tweak the number of shuffle partitions (`spark.sql.shuffle.partitions`), or set it to auto so that the Spark optimizer can adjust it according to your partition sizes.
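To make the shuffle tuning concrete, here is a small sketch that derives a shuffle partition count from an estimated shuffle size. I am reading the number being tuned as `spark.sql.shuffle.partitions`. The helper name `suggest_shuffle_partitions` and the 128 MiB target partition size are assumptions for illustration only.

```python
import math


def suggest_shuffle_partitions(shuffle_bytes,
                               target_partition_bytes=128 * 1024**2,  # assumed target size
                               minimum=1):
    """Estimate a shuffle partition count so each partition is near the target size."""
    return max(minimum, math.ceil(shuffle_bytes / target_partition_bytes))


# Example: roughly 10 GiB of shuffled data
n = suggest_shuffle_partitions(10 * 1024**3)
print(n)

# In a notebook with an active SparkSession named `spark` (not created here):
# spark.conf.set("spark.sql.shuffle.partitions", str(n))
# or, on Databricks, let the optimizer pick the value:
# spark.conf.set("spark.sql.shuffle.partitions", "auto")
```

A fixed estimate like this is only a starting point. Letting Spark adapt the partition count at runtime is usually simpler when the data volume differs between environments.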

Below is some sample code showing how you can get the distribution of data across your partitions.

from pyspark.sql.functions import spark_partition_id, asc

# df is an existing DataFrame; count the rows in each Spark partition
df\
    .withColumn("partitionId", spark_partition_id())\
    .groupBy("partitionId")\
    .count()\
    .orderBy(asc("count"))\
    .show()
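Once the per-partition counts are available, a single skew number can be easier to compare across runs than the full table. The helper below is a hypothetical sketch, not part of any Spark API. It reports the ratio of the largest partition to the mean partition size, where 1.0 means the rows are evenly spread.

```python
def partition_skew(counts):
    """Return max/mean of per-partition row counts; 1.0 means perfectly even."""
    total = sum(counts)
    if not counts or total == 0:
        return 1.0  # nothing to compare, treat as even
    mean = total / len(counts)
    return max(counts) / mean


# Example with made-up counts: one partition holds most of the rows
print(partition_skew([10, 10, 100]))

# With a real DataFrame `df` (not defined here), the counts could be collected as:
# from pyspark.sql.functions import spark_partition_id
# counts = [r["count"] for r in df.groupBy(spark_partition_id()).count().collect()]
```

Note that empty partitions do not appear in the grouped result, so the ratio only reflects partitions that contain at least one row.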

The possibility of finding the workload dynamically and spin up the cluster based on the workload
