Predicting compute required to run Spark jobs

kseyser — Sun, 19 May 2024 08:25:10 GMT

I'm working on a project to predict the compute (cores) required to run Spark jobs. Has anyone worked on this or something similar before? How did you get started?

Re: Predicting compute required to run Spark jobs

Yeshwanth — Mon, 20 May 2024 07:47:43 GMT

@kseyser good day,

This documentation might help with your use case: https://docs.databricks.com/en/compute/cluster-config-best-practices.html#compute-sizing-considerations

Kind regards,

Yesh

Re: Predicting compute required to run Spark jobs

kseyser — Tue, 21 May 2024 22:38:20 GMT

Hi @Yeshwanth, thank you for pointing me to the documentation. I don't know much about compute sizing, so I'm still figuring things out. Is there a straightforward (standard) way to calculate the compute (number of cores and memory) required to run Spark jobs, based on the data volume of each job, the frequency of the jobs, and the number of jobs? I read that data is generally partitioned into 128 MB chunks and that executor memory is divided into 300 MB of reserved memory, 60% execution memory, and 40% storage memory. How would this help me calculate the compute for a dataset of, say, 1.5 TB?
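
To make the arithmetic in the question concrete, here is a rough back-of-the-envelope sketch in Python. It is not a standard formula, only one way to turn the numbers above into a starting estimate. It assumes the 128 MB figure is Spark's default input split size (spark.sql.files.maxPartitionBytes) and the 300 MB figure is the reserved memory. For the memory split it uses the defaults as I understand them from the Spark tuning docs, which differ slightly from the 60/40 description above: spark.memory.fraction = 0.6 of (heap minus 300 MB) is shared by execution and storage, and spark.memory.storageFraction = 0.5 of that share is protected for storage. The function names, the 8 GB executor heap, and the "waves" target (how many rounds of tasks each core is allowed to process) are hypothetical choices for illustration only.

```python
import math

MB = 1024 ** 2
TB = 1024 ** 4


def estimate_partitions(data_bytes, partition_bytes=128 * MB):
    """Number of input partitions (= tasks in the read stage)."""
    return math.ceil(data_bytes / partition_bytes)


def cores_for_waves(num_partitions, waves):
    """Cores needed so the stage finishes in `waves` rounds of tasks.

    `waves` is an assumed tuning target, not a Spark setting.
    """
    return math.ceil(num_partitions / waves)


def unified_memory_mb(executor_heap_mb, memory_fraction=0.6, reserved_mb=300):
    """Memory shared by execution and storage inside one executor heap."""
    return (executor_heap_mb - reserved_mb) * memory_fraction


def storage_memory_mb(executor_heap_mb, storage_fraction=0.5):
    """Portion of unified memory protected from eviction for cached data."""
    return unified_memory_mb(executor_heap_mb) * storage_fraction


# 1.5 TB of input at 128 MB per partition
partitions = estimate_partitions(1.5 * TB)

# Assumed target: each core processes 48 tasks one after another
cores = cores_for_waves(partitions, waves=48)

# Assumed executor heap of 8 GB
unified = unified_memory_mb(8 * 1024)
storage = storage_memory_mb(8 * 1024)

print(f"partitions={partitions}, cores={cores}")
print(f"unified={unified:.1f} MB, protected storage={storage:.1f} MB")
```

With these assumptions, 1.5 TB maps to 12,288 partitions, and the core count follows from how many waves of tasks you are willing to wait for. Shuffles, joins, skew, compression, and file format all change real usage, so treat the result as a first guess to validate against the Spark UI rather than a prediction.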