Job stuck while utilizing all workers

PrebenOlsen — Wed, 17 Apr 2024 07:51:01 GMT

Hi!

I started a job yesterday. It was iterating over data, two months at a time, and writing to a table. It did this successfully for 4 of the 6 time periods. The 5th period, however, got stuck 5 hours in.

I can find one Failed Stage that reads:
org.apache.spark.SparkException: Failed to fetch spark://10.139.64.10:35257/jars/org_apache_sedona_sedona_sql_3_0_2_12_1_3_1_incubating.jar during dependency update
[at scala/java/spark..., threadpoolexecutor/executor...]
Caused by: java.io.IOException: No such file or directory

The pattern for the entire job was to use few workers with low CPU and memory usage on read, then scale up to 14 workers with high CPU and memory usage on write. However, once it got stuck on the 5th period, the worker count, CPU, and memory all stayed consistently high, incurring hundreds of dollars of cost over the next several hours.

While all the workers were active, only one of them had an active task.
"WAITING on java.util.concurrent.locks.ReentrantLock"

Is this a Databricks issue?

Re: Job stuck while utilizing all workers

-werners- — Wed, 17 Apr 2024 08:18:54 GMT

As Spark is lazily evaluated, using only small clusters for reads and large ones for writes is not something that will happen.
The data is only read when you apply an action (a write, for example).
That being said: I have no knowledge of a bug in Databricks that causes clusters to get stuck and keep consuming DBUs.
I think your code might be the issue here, as you mention 'iterating over data'. That is something that should be avoided as much as possible (it is not always possible though).
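
To illustrate the lazy-evaluation point, here is a toy model in plain Python. It is not the Spark API, just generators standing in for a table scan and a filter, and the names `read_rows`, `where_month` and `read_log` are made up for this sketch. The idea it shows is that defining the pipeline does no work; the read only happens once something consumes the result.

```python
# Toy model of lazy evaluation in plain Python (NOT the Spark API).
# All names here are hypothetical and exist only for illustration.

read_log = []  # records the moment each "row" is actually read


def read_rows(months):
    """Generator standing in for a table scan; nothing runs until it is consumed."""
    for month in months:
        read_log.append(month)  # side effect marks the real read
        yield {"month": month, "value": len(month)}


def where_month(rows, wanted):
    """Lazy filter, comparable in spirit to a transformation."""
    return (row for row in rows if row["month"] == wanted)


# Building the pipeline does no work yet.
pipeline = where_month(read_rows(["January", "February", "March"]), "January")
reads_before_action = list(read_log)

# Consuming the pipeline plays the role of an action:
# the read and the filter run together at this point.
result = list(pipeline)
reads_after_action = list(read_log)

print(reads_before_action)  # nothing has been read at definition time
print(result)
print(reads_after_action)   # every row was scanned once the result was consumed
```

In Spark terms, the loop that builds the pipeline corresponds to the transformations, and `list(pipeline)` corresponds to the write. That is why the read and the write cannot really be split across differently sized clusters.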

Re: Job stuck while utilizing all workers

PrebenOlsen — Wed, 17 Apr 2024 08:24:38 GMT

Hi Werners, I agree with your explanation of read and write, but that's what the GUI shows. For each iteration (spark.read.table.where(col("month") == "January"), then February in the next iteration), it spends about 30 minutes on only 3 workers before it finally scales up to 14 workers for the next 60 minutes. What is it doing in those 30 minutes?

The code is very heavy, as it iterates over the data (within a period of time) to repeatedly remove rows based on some criteria. I'll start a new thread about this.