Always configure job timeouts and notifications. They help you spot slowness early, whatever its cause. But alerting alone is not enough: you also need to investigate and fix whatever is actually making the job slow.
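For example, on Databricks a run-duration cap plus a failure alert can be set on the job itself. Here is a minimal sketch of a Jobs API-style settings payload - the job name, email address, and timeout value are placeholders, so check the exact field names against your workspace's Jobs API version.

```python
# Hedged sketch: a Databricks Jobs API-style settings payload (field names assumed
# from Jobs API 2.1; adjust to your API version). A timeout_seconds cap turns a
# stuck run into a failed run, and the on_failure notification surfaces it.
job_settings = {
    "name": "nightly_etl",                       # hypothetical job name
    "timeout_seconds": 3 * 60 * 60,              # kill runs that exceed 3 hours
    "email_notifications": {
        "on_failure": ["data-team@example.com"]  # alerted when a run fails or times out
    },
}

# The payload would typically be sent to the Jobs API, e.g. with requests:
# requests.post(f"{host}/api/2.1/jobs/create", headers=auth_headers, json=job_settings)
```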
- The first step is to identify the problem: compare run times of the same job across different runs (see the sketch after this list).
- Next, dig into the details of the job: check the SQL query plan, the read time, the cloud storage request duration, and so on in the Spark UI.
- External factors such as storage and network can also affect the run time. The logs and a few system-level commands can tell you whether the VM is still up and healthy.
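To compare run times stage by stage, the Spark UI exposes the same numbers over its REST API. Here is a rough sketch that pulls per-stage run times for a good and a bad run and ranks the biggest regressions - the UI address and application IDs are placeholders, and on Databricks you would usually read the same numbers off the Stages tab instead.

```python
# Hedged sketch: fetch per-stage timings from the Spark UI REST API for two runs
# of the same job and compare them side by side.
import requests

SPARK_UI = "http://localhost:4040"   # assumption: Spark UI reachable at this address

def stage_runtimes(app_id):
    """Return {stage name: executor run time in ms} for one application."""
    stages = requests.get(f"{SPARK_UI}/api/v1/applications/{app_id}/stages").json()
    return {s["name"]: s["executorRunTime"] for s in stages}

good = stage_runtimes("app-20240101-0001")   # a run with normal duration (placeholder id)
bad = stage_runtimes("app-20240102-0001")    # the slow run (placeholder id)

# Print stages whose run time grew the most between the two runs.
for name in sorted(bad, key=lambda n: bad[n] - good.get(n, 0), reverse=True):
    print(f"{name[:60]:60s} good={good.get(name, 0):>10} ms  bad={bad[name]:>10} ms")
```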
While comparing the two runs (good and bad), try answering the following (a few quick checks are sketched after the list):
- Is this an intermittent issue, or did performance degrade after a certain point in time?
- If it has been consistently slow since a certain date, was there a DBR (Databricks Runtime) change?
- Is the data volume the same?
- Have the cluster configs changed?
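Some of these questions can be answered straight from a notebook. The sketch below assumes a Databricks notebook where spark and dbutils are already available, and uses a placeholder input path; it captures the Spark version, the input data volume, and the effective cluster config so the two runs can be diffed.

```python
# Hedged sketch: quick environment checks worth capturing for both the good and
# the bad run. spark.version is standard; the input path is a placeholder.
print("Spark version:", spark.version)        # compare with the good run (DBR change?)

# Data volume: compare input sizes between the two runs.
input_path = "/mnt/raw/events/2024-01-02/"    # placeholder path for this run's input
files = dbutils.fs.ls(input_path)             # dbutils is provided by Databricks notebooks
print("input files:", len(files), "total bytes:", sum(f.size for f in files))

# Cluster config: dump the effective Spark conf so the two runs can be diffed.
for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(key, "=", value)
```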
While comparing the DAGs and SQL plans:
- Look for the stages that took the most time.
- Apply filters early and reduce the data size.
- Check the join strategies and stage metrics (see the sketch after this list).
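To make that concrete, here is a sketch of the plan-level fixes those bullets point at. The table and column names are made up, but the pattern is the same: filter early, select only what you need, broadcast the small side of a join, then read the plan.

```python
# Hedged sketch: filter pushdown, column pruning, and a broadcast join hint,
# followed by an explain() to inspect scan sizes and the chosen join strategy.
from pyspark.sql.functions import broadcast, col

events = spark.table("raw.events")   # placeholder large fact table
users = spark.table("raw.users")     # placeholder small dimension table

result = (
    events
    .filter(col("event_date") == "2024-01-02")   # push the filter as early as possible
    .select("user_id", "event_type")             # drop unused columns to cut scan size
    .join(broadcast(users), "user_id")           # broadcast the small side to avoid a shuffle
)

result.explain(mode="formatted")   # check the scan size and the join type in the physical plan
```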
In the logs:
- Check for errors and warnings.
- Jump to the timestamp when the stage was delayed.
- Compare it with a run that finished in the expected time (a log-filtering sketch follows).
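If the logs are large, a small script can narrow them down to the warnings and errors inside the window when the stage stalled. This sketch assumes a local log file and the default log4j-style timestamp format - adjust both to wherever your platform delivers its logs.

```python
# Hedged sketch: print WARN/ERROR lines that fall inside the time window when the
# slow stage was running. Log path, window, and timestamp format are assumptions.
from datetime import datetime

LOG_PATH = "driver.log"                          # placeholder log file
WINDOW_START = datetime(2024, 1, 2, 3, 15)       # when the slow stage was submitted
WINDOW_END = datetime(2024, 1, 2, 3, 45)         # when it should have finished

with open(LOG_PATH) as fh:
    for line in fh:
        if " WARN " not in line and " ERROR " not in line:
            continue
        try:
            # assumes lines start with a "YY/MM/DD HH:MM:SS" timestamp (log4j default)
            ts = datetime.strptime(line[:17], "%y/%m/%d %H:%M:%S")
        except ValueError:
            continue
        if WINDOW_START <= ts <= WINDOW_END:
            print(line.rstrip())
```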
Bonus tip - Enable speculative execution so Spark re-runs slow tasks in parallel on other executors: spark.speculation=true.
Keep a lookout for my next post with more tuning tips for slow tasks and jobs.
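Here is one way to wire that up at session creation time (on Databricks you would usually put these in the cluster's Spark config instead). The multiplier and quantile values below are just illustrative; note that speculation helps with straggler nodes, not with data skew, and it costs extra resources.

```python
# Hedged sketch: enabling speculative execution when building the session.
# spark.speculation.* are standard Spark settings; the values are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("speculation-demo")                      # hypothetical app name
    .config("spark.speculation", "true")              # re-launch slow tasks on other executors
    .config("spark.speculation.multiplier", "1.5")    # "slow" = 1.5x the median task time
    .config("spark.speculation.quantile", "0.75")     # only after 75% of tasks have finished
    .getOrCreate()
)
```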