Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Job stuck while utilizing all workers

PrebenOlsen
New Contributor III

Hi!

I started a job yesterday. It iterated over the data two months at a time and wrote to a table. It completed 4 of the 6 time periods successfully. The 5th period, however, got stuck 5 hours in.

I can find one failed stage, which reads:
org.apache.spark.SparkException: Failed to fetch spark://10.139.64.10:35257/jars/org_apache_sedona_sedona_sql_3_0_2_12_1_3_1_incubating.jar during dependency update
[at scala/java/spark..., threadpoolexecutor/executor...]
Caused by: java.io.IOException: No such file or directory


The pattern for the entire job was to use few workers with low CPU and memory usage on read, then scale up to 14 workers with high CPU and memory usage on write. However, once it got stuck on the 5th period, the worker count, CPU, and memory usage stayed consistently high, incurring hundreds of dollars in cost over the next several hours.

While all the workers were active, only one of them had an active task.
"WAITING on java.util.concurrent.locks.ReentrantLock"

Is this a Databricks issue? 

2 REPLIES

-werners-
Esteemed Contributor III

Because Spark evaluates lazily, using a small cluster for reads and a large one for writes is not something that actually happens.
The data is read when you apply an action (a write, for example).
That being said: I am not aware of a bug in Databricks that causes clusters to get stuck while continuing to consume DBUs.
I think your code might be the issue here, as you mention 'iterating over data'. That should be avoided as much as possible (though it is not always possible).
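
To illustrate the lazy-evaluation point, here is a minimal plain-Python analogy (not Spark code, and not a description of how Spark is implemented). The names read_rows, where_month and write are made up for this sketch; they only stand in for a table read, a filter and a write action. Building the pipeline does no work, and the read only starts once the write consumes it:

```python
# Plain-Python analogy of lazy evaluation using generators (NOT Spark code).
# All function names here are hypothetical stand-ins.
events = []

def read_rows():
    """Stands in for a table read: the body runs only when rows are consumed."""
    events.append("read started")
    for month in ["January", "February", "March"]:
        yield {"month": month}

def where_month(rows, month):
    """Stands in for a filter: returns a lazy generator, nothing runs yet."""
    return (row for row in rows if row["month"] == month)

def write(rows):
    """Stands in for the write action: consuming the rows triggers the read."""
    events.append("write started")
    result = list(rows)
    events.append("write finished")
    return result

plan = where_month(read_rows(), "January")  # only builds the pipeline
events_before_action = list(events)         # still empty: no read has happened
written = write(plan)                       # the action pulls the data through
```

After this runs, events is ['write started', 'read started', 'write finished']: the read shows up inside the write rather than as a separate earlier phase.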

Hi Werners, I agree with your explanation of read and write, but that's what the GUI looks like. For each iteration (spark.read.table.where(col("month") == "January"), then February in the next iteration), it spends about 30 minutes on only 3 workers, until it finally scales up to 14 workers for the next 60 minutes. What is it doing in those 30 minutes?

The code is very heavy, as it iterates over the data (within a period of time) to repeatedly remove rows based on some criteria. I'll make a new thread about this.
