How to improve Spark UI Job Description for pyspark?

igorgatis
New Contributor II

I find it quite hard to understand the Spark UI for my pyspark pipelines. For example, when one writes `spark.read.table("sometable").show()` it shows:

[screenshot: igorgatis_0-1697034219608.png]

I learned that the `DataFrame` API may actually spawn jobs before running the actual job. In the example above, job 15 collects data which is then used by job 16. In both cases, the description gives no clue about what is going on.

Clicking on the job 15 link shows a stage that looks like this:

[screenshot: igorgatis_1-1697034492125.png]

Its link leads to:

[screenshot: igorgatis_2-1697034528335.png]

Job 16 is quite similar, though it at least mentions the table name. Things get messier as the DAG gets more complex.

Is there a recommended way to improve this? I'm aware of `setJobDescription` and `setLocalProperty` (with `callSite.short` and `callSite.long`), but dealing with them directly is also not easy.
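To illustrate what I mean by dealing with them directly: every action needs manual bookkeeping around it, roughly like this (the labels are just examples):

```python
# Roughly the manual bookkeeping I mean; the labels are just examples.
sc = spark.sparkContext

sc.setJobDescription("preview sometable")
sc.setLocalProperty("callSite.short", "preview sometable")
sc.setLocalProperty("callSite.long", "pipeline step: preview sometable")

spark.read.table("sometable").show()

# Reset afterwards, otherwise later jobs inherit the same labels.
sc.setJobDescription(None)
sc.setLocalProperty("callSite.short", None)
sc.setLocalProperty("callSite.long", None)
```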

 


Kaniz
Community Manager

Hi @igorgatis,

The Spark UI can be complex, especially for users who are new to the framework, but a few tips and techniques can make it easier to read.

  1. Use meaningful job and stage names. When you submit work to Spark, you can attach a description so each job or stage in the UI reflects what it is doing. This makes jobs much easier to identify and their progress easier to track.

  2. Use the naming APIs: in PySpark or Scala you can call setJobDescription on the SparkContext to label the jobs triggered by a block of code, and setName on an RDD to label it in the UI. This helps you quickly identify which part of the pipeline is being executed and what it is doing (see the first sketch after this list).

  3. Enable query logging: Spark can log the SQL queries it executes along with their associated execution details, which makes it easier to map what you see in the UI back to the query that produced it.

  4. Monitoring Spark jobs: the detail page for each job can be opened by selecting its job ID in the UI. Once there, you can: (a) view information about the job, such as its name, application ID, start and end times, and duration; (b) review the list of stages involved in the job and examine their progress and relevant metrics; (c) review task details, including task IDs, start and end times, and status; (d) look at the DAG visualization, which shows how the stages of the job are connected.

  5. Understanding the Catalyst optimizer: one area that typically causes confusion is how Catalyst works. Catalyst takes your SQL, DataFrame, and Dataset code and converts it into physical Spark jobs, and it can be difficult to know whether a query has been optimized the way you expect until runtime. You can use the EXPLAIN operator to see how Spark translates a query and which plan it chooses; it provides detailed information about the query optimization. Note that different Spark versions format EXPLAIN output differently (see the second sketch after this list).

  6. Debugging configurations: Spark provides a number of settings you can use to debug issues. By adjusting them you can capture more information about what Spark is doing, which may help you understand what is going on under the hood. For instance, raising the log level (e.g. spark.sparkContext.setLogLevel("DEBUG")) surfaces much more detail in the driver and executor logs about scheduling, I/O, and per-stage timing.

  7. Using log analytics: for long-running and complex Spark jobs, it can be beneficial to collect and analyze the log data. Tools like Azure Log Analytics or Elasticsearch can ingest Spark logs and let you identify performance issues, trace data movement through your application, and drill down into job failures or other errors.
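To make points 1 and 2 concrete, here is a minimal PySpark sketch of the naming APIs; the table name `events`, the labels, and the cached RDD are placeholders for whatever your pipeline actually does:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# The description set here shows up in the "Description" column of the Jobs
# page for every job triggered while it is in effect.
sc.setJobDescription("Load events table and preview 20 rows")
spark.read.table("events").show()

# RDDs can also be given a name, which appears on the Storage page once cached.
rdd = sc.parallelize(range(1000)).setName("demo_numbers")
rdd.cache().count()

# Clear the description so later, unrelated jobs are not mislabeled.
sc.setJobDescription(None)
```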
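And a small sketch of point 5; `sometable` comes from the question and the filter condition is made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("sometable").where("id > 100")

df.explain()                  # physical plan only
df.explain(True)              # parsed, analyzed, optimized and physical plans
df.explain(mode="formatted")  # formatted physical plan with node details (Spark 3.0+)

# The SQL EXPLAIN statement returns the same information as a result set.
spark.sql("EXPLAIN FORMATTED SELECT * FROM sometable WHERE id > 100").show(truncate=False)
```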

Finally, you can add appropriate logging statements and use setJobDescription, or setLocalProperty with callSite.short and callSite.long, so that the UI and the debug logs show exactly which part of the pipeline each job and stage belongs to.
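For example, here is a minimal sketch of wrapping a block of code so every job it spawns gets a readable description and call site; the `job_group` helper and its labels are just an illustration, not a Spark API:

```python
from contextlib import contextmanager
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

@contextmanager
def job_group(description, call_site=None):
    """Label every job triggered inside the block in the Spark UI."""
    sc.setJobDescription(description)
    if call_site:
        # callSite.short appears in the job/stage tables,
        # callSite.long in the expanded details.
        sc.setLocalProperty("callSite.short", call_site)
        sc.setLocalProperty("callSite.long", call_site)
    try:
        yield
    finally:
        # Restore defaults so later jobs are not mislabeled.
        sc.setJobDescription(None)
        sc.setLocalProperty("callSite.short", None)
        sc.setLocalProperty("callSite.long", None)

# Both jobs spawned by show() carry the same readable label in the UI.
with job_group("Preview sometable", call_site="pipeline.py:preview"):
    spark.read.table("sometable").show()
```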

jose_gonzalez
Moderator

Hi @igorgatis,

A polite reminder: have you had a chance to review my colleague's reply? Please let us know whether it helps resolve your query.
