Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to Programmatically Retrieve Cluster Memory Usage?

Akuhei05
New Contributor II

Hi!

I need help with the following:

  1. Programmatically retrieve the maximum memory configured for the cluster attached to the notebook/job. I think this is achievable through the system tables or the Clusters API, but I'm open to other suggestions.
  2. Execute a job on this cluster and, upon its completion, determine the amount of memory utilized during the job, and get this information programmatically inside a simple notebook. Note: Ganglia UI is out of the question, as we are using LTS 13.3. We also have a Spark-based listener implemented whose logs are ingested into ADX, but I haven't found a metric like this among them.

Could you provide guidance so that I can create a Delta table that includes these statistics?

Thank you!

3 REPLIES

anardinelli
Contributor

Hi @Akuhei05, how are you?

For the first topic, you can add a cell to your notebook that reads the Spark configuration for maximum executor memory on the attached cluster every time it runs. For this, please see below:

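# Per-executor memory configured on the attached cluster (e.g. '8g')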
spark_memory = spark.sparkContext.getConf().get('spark.executor.memory')
print(spark_memory)
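If you also need the total memory provisioned across the whole cluster (you mentioned the Clusters API in your first point), a rough sketch with the Databricks Python SDK could look like the one below. The cluster-tag key and the way node-type sizes are looked up are assumptions to verify against your workspace and SDK version:

# Sketch: estimate total configured cluster memory via the Clusters API (databricks-sdk).
# Assumes the notebook's Spark conf exposes the cluster id under
# "spark.databricks.clusterUsageTags.clusterId" and that the cluster has a fixed worker count.
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

cluster_id = spark.conf.get("spark.databricks.clusterUsageTags.clusterId")
cluster = w.clusters.get(cluster_id=cluster_id)

# Map node type ids to their memory size in MB
node_memory_mb = {nt.node_type_id: nt.memory_mb for nt in w.clusters.list_node_types().node_types}

driver_mb = node_memory_mb.get(cluster.driver_node_type_id, 0)
worker_mb = node_memory_mb.get(cluster.node_type_id, 0)
total_mb = driver_mb + worker_mb * (cluster.num_workers or 0)

print(f"Configured cluster memory: {total_mb} MB "
      f"({cluster.num_workers} workers x {worker_mb} MB + {driver_mb} MB driver)")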

For the second point, when you say "determine the amount of memory utilized during the job", do you mean the maximum used in total, the usage per worker, or the sum across workers?

Best,

Alessandro

Akuhei05
New Contributor II

Hi Alessandro,

Thank you for your help and suggestion! 

For the second point, I'm looking to analyze memory utilization over the duration of the job. Specifically, I want to know the average and total memory used during a single job run compared to the total memory available on that specific cluster (set by prior configuration). However, any additional useful metrics (such as per-worker figures) that I can access in the notebook would also be appreciated.

I'm thinking of creating a Delta table to save these statistics to. I'd like to run performance tests for specific use cases and see how certain metrics change with different cluster types for a given number of records, to establish a baseline. Later, we plan to integrate this into our CI/CD pipeline to optionally track, at an approximate level, how much our changes affect the baseline performance.
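Roughly, what I have in mind is something like the sketch below: snapshot per-executor memory from the standard Spark monitoring REST API at the end of a run and append it to a Delta table. The table name is hypothetical, and the memoryUsed/maxMemory fields report Spark storage memory rather than the full JVM heap, so this still needs verification on our side:

# Sketch: snapshot per-executor memory at the end of a run and append it to a Delta table.
# Uses the Spark UI's monitoring REST API, reachable from the driver via sc.uiWebUrl.
import datetime
import requests

sc = spark.sparkContext
executors = requests.get(
    f"{sc.uiWebUrl}/api/v1/applications/{sc.applicationId}/executors"
).json()

rows = [
    {
        "run_ts": datetime.datetime.utcnow().isoformat(),
        "executor_id": e["id"],
        "max_memory_bytes": e.get("maxMemory", 0),    # storage memory limit
        "memory_used_bytes": e.get("memoryUsed", 0),  # storage memory in use
    }
    for e in executors
]

df = spark.createDataFrame(rows)
df.write.mode("append").saveAsTable("perf_tests.job_memory_stats")  # hypothetical table name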

anardinelli
Contributor

Great use case!

Have you heard about Prometheus with Spark 3.0? It's a tool that can export live metrics for your jobs and runs, writing them to a sink that you can read as a stream. I've personally never used it for this exact use case, but with it you can monitor every metric, write it out, and build insights from it (such as averages and totals) in a separate pipeline, which can finally become a table.

To better understand, you can check these links below:

1. Session on how to enable and use Prometheus in Databricks: https://www.youtube.com/watch?v=FDzm3MiSfiE

2. Spark official guide: https://spark.apache.org/docs/3.1.1/monitoring.html
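
If you try it, the Spark-side setup is just a couple of static Spark confs in the cluster configuration; the sketch below shows the general idea, with the exact names to be double-checked against the Spark monitoring guide linked above:

# Sketch: confs to add in the cluster's "Spark config" so Spark 3.x exposes
# Prometheus-format metrics (verify against the monitoring guide linked above):
#
#   spark.ui.prometheus.enabled true
#   spark.executor.processTreeMetrics.enabled true
#
# Executor metrics are then served by the driver UI and can be polled, e.g.:
import requests

base_url = spark.sparkContext.uiWebUrl
metrics_text = requests.get(f"{base_url}/metrics/executors/prometheus").text
print(metrics_text[:500])  # first few Prometheus-format lines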

Best,

Alessandro

 
