Data Engineering

Dynamically change spark.task.cpus

Thor
New Contributor III

Hello,

I'm facing a problem with large tarballs to decompress: to make them fit in memory, I had to stop Spark from processing too many files at the same time, so I changed the following property on my cluster of 8-core VMs:

spark.task.cpus 4 

This setting is the threshold below which I get spill or OOM errors when decompressing the tarballs.

But for the next stages of my pipeline, I would like to use the cluster at its full capacity by setting it back to:

spark.task.cpus 1
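
For reference, with 8-core workers spark.task.cpus 4 means each executor can run at most 8 / 4 = 2 tasks at a time, which is what limits how many tarballs get decompressed concurrently, whereas setting it back to 1 would allow all 8 task slots per executor. A quick Python sketch (using the notebook's spark session; the fallback defaults in the calls are just illustrative assumptions) to check the effective values:

task_cpus = int(spark.conf.get("spark.task.cpus", "1"))            # Spark's default is 1 if unset
executor_cores = int(spark.conf.get("spark.executor.cores", "8"))  # assuming the 8-core VMs above
print(f"task slots per executor: {executor_cores // task_cpus}")   # 8 // 4 = 2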

Currently, as a workaround, I store the intermediate results and read the data with another cluster that has the proper setting.
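
Roughly, that workaround looks like this (the path and the follow-up transformation below are just placeholders, not my real code):

# Job 1, on the cluster with spark.task.cpus 4: persist the decompressed output
decompressed_df.write.mode("overwrite").parquet("dbfs:/tmp/decompressed_stage")

# Job 2, on a second cluster with spark.task.cpus 1: continue the pipeline at full parallelism
df = spark.read.parquet("dbfs:/tmp/decompressed_stage")
result = df.groupBy("source_file").count()   # placeholder for the remaining transformations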

My question is: can I dynamically change spark.task.cpus for each stage or transformation?

 

Same problem with no answer:

https://stackoverflow.com/questions/40759007/dynamic-cpus-per-task-in-spark

1 ACCEPTED SOLUTION


jose_gonzalez
Moderator

Hi @Thor,

Spark does not offer the capability to dynamically modify configuration settings, such as spark.task.cpus, for individual stages or transformations while the application is running. Once a configuration property is set for a Spark application, it remains constant throughout its entire execution. This is a cluster-level setting, and the only way to change it is to edit the configuration and restart your cluster.

If you're looking for a more flexible approach to resource allocation, you could explore Spark's built-in dynamic allocation feature (spark.dynamicAllocation.enabled), combined with tuning properties such as spark.executor.cores and spark.executor.memory. This combination allows Spark to automatically adapt the number of executors to the workload. Note, however, that even with this approach you still cannot modify spark.task.cpus on a per-stage basis.
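
For illustration only (the values below are placeholders, and releasing executors safely also requires shuffle tracking or an external shuffle service), these properties would typically go in the cluster's Spark config:

spark.dynamicAllocation.enabled true
spark.dynamicAllocation.shuffleTracking.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 8
spark.executor.cores 4
spark.executor.memory 14g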

I hope this helps.

