Hi!
Started a job yesterday. It was iterating over data, two months at a time, and writing to a table. It successfully completed 4 of the 6 time periods, but the 5th one got stuck about 5 hours in.
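For context, the driver loop is roughly the following. This is only a minimal sketch of the structure; the table names, date column, and start date are placeholders, not the real job:

import java.time.LocalDate
import org.apache.spark.sql.SparkSession

// Sketch: iterate over six two-month windows and append each slice to a target table.
// "source_table", "target_table", "event_date", and the start date are placeholders.
val spark = SparkSession.builder().getOrCreate()

val start = LocalDate.of(2022, 1, 1)
val windows = (0 until 6).map(i => (start.plusMonths(2L * i), start.plusMonths(2L * (i + 1))))

windows.foreach { case (from, to) =>
  spark.read.table("source_table")
    .filter(s"event_date >= '$from' AND event_date < '$to'")
    .write
    .mode("append")
    .saveAsTable("target_table")
}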
I can find one Failed Stage that reads:
org.apache.spark.SparkException: Failed to fetch spark://10.139.64.10:35257/jars/org_apache_sedona_sedona_sql_3_0_2_12_1_3_1_incubating.jar during dependency update
[at scala/java/spark..., threadpoolexecutor/executor...]
Caused by: java.io.IOException: No such file or directory
The pattern for the entire job was to use a few workers with low CPU and memory for the read, then scale up to 14 workers with high CPU and memory for the write. However, once it got stuck on the 5th period, worker count, CPU, and memory stayed consistently high, incurring hundreds of dollars of cost over the next several hours.
While all the workers were active, only one of them had an active task, and its thread showed:
"WAITING on java.util.concurrent.locks.ReentrantLock"
Is this a Databricks issue?