Data Engineering

The possibility of determining the workload dynamically and spinning up a cluster based on it

Arunsundar
New Contributor III

Hi Team,

Good morning.

I would like to understand whether it is possible to determine the workload automatically through code (for example, a data load from a file to a table, where we determine the file size as a kind of benchmark), and based on that spin up an optimal cluster type, with control over the minimum/maximum number of workers required to complete the workload efficiently.
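
To make it a bit more concrete, the rough sketch below is the kind of thing I have in mind (it assumes a Databricks notebook where dbutils is available; the landing path, size thresholds, and worker counts are only placeholders, not recommendations):

# Rough sketch only: estimate the input volume and map it to an autoscale range.
# The path below is a hypothetical landing location; dbutils.fs.ls is not recursive.
total_bytes = sum(f.size for f in dbutils.fs.ls("dbfs:/mnt/raw/incoming/"))
total_gb = total_bytes / (1024 ** 3)

# Arbitrary thresholds; these would need benchmarking against real workloads.
if total_gb < 5:
    min_workers, max_workers = 1, 2
elif total_gb < 50:
    min_workers, max_workers = 2, 8
else:
    min_workers, max_workers = 4, 16

print(f"Input ~{total_gb:.1f} GB -> autoscale {min_workers}-{max_workers} workers")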

I would also like to understand whether cluster selection can only be done by trial and error, i.e. running the workload against various cluster types in the Dev environment and arriving at the optimal cluster that we then attach in higher environments.

Kindly let me know if you have any further questions.

Thanks

5 REPLIES

Debayan
Esteemed Contributor III

Hi, I didn't get your question, could you please elaborate? Do you want to determine the workload through the code you deploy?

Please tag @Debayan with your next response, which will notify me. Thank you!

Arunsundar
New Contributor III

Yes @Debayan Mukherjee, I need to determine the workload through the code and spin up the necessary/optimal cluster based on it.

Debayan
Esteemed Contributor III

Hi, and how will you be running the workload through the code? Will there be any resources involved, or how is it set up?

pvignesh92
Honored Contributor

Hi @Arunsundar Muthumanickam,

When you say workload, I believe you might be handling different volumes of data between the Dev and Prod environments. If you are using a Databricks cluster and do not have a good idea of how the volumes might turn out in different environments, enabling cluster autoscaling with min and max workers would be an ideal choice, as more workers are added depending on your workload (number of partitions).
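
For reference, autoscaling is just a min/max worker range on the cluster definition. A minimal sketch of a job cluster spec with autoscaling (the runtime version, node type, and worker counts below are illustrative, not recommendations) could look like this:

# Illustrative "new_cluster" spec with autoscaling, as you would pass it to the
# Databricks Jobs/Clusters API; adjust the values for your cloud and workload.
new_cluster = {
    "spark_version": "12.2.x-scala2.12",   # example runtime version
    "node_type_id": "Standard_DS3_v2",     # example Azure node type
    "autoscale": {
        "min_workers": 2,                  # floor for smaller Dev volumes
        "max_workers": 8,                  # ceiling for larger Prod volumes
    },
}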

If your workload has a shuffle phase (i.e. joins, groupBy, etc.), please check whether you can tweak the spark.sql.shuffle.partitions setting, or set it to auto so that the Spark optimizer can adjust it based on your partition sizes.
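
For example, something like the following enables adaptive query execution and lets the runtime pick the shuffle partition count (the "auto" value is Databricks-specific and needs a recent runtime):

# Let adaptive query execution coalesce/size shuffle partitions automatically.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Databricks-specific: "auto" lets the runtime choose the shuffle partition count.
spark.conf.set("spark.sql.shuffle.partitions", "auto")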

Below is some sample code showing how you can get the distribution of data across your partitions.

from pyspark.sql.functions import spark_partition_id, asc

# Count the rows in each Spark partition to see how evenly the data is distributed.
(
    df
    .withColumn("partitionId", spark_partition_id())
    .groupBy("partitionId")
    .count()
    .orderBy(asc("count"))
    .show()
)

Kaniz
Community Manager

Hi @Arunsundar Muthumanickam, we haven't heard from you since the last responses from @Vigneshraja Palaniraj and @Debayan Mukherjee, and I was checking back to see if their suggestions helped you.

Otherwise, if you have a solution, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.
