Dear all,
What are some proven ways of capturing data processing metrics (number of rows processed/updated/inserted, number of micro-batches, etc.) in a PySpark/SQL notebook, irrespective of whether it uses Auto Loader, Structured Streaming, DLT, and so on?
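To make it concrete, for the Structured Streaming side I am picturing something along the lines of a StreamingQueryListener that records per-micro-batch row counts. This is only a rough sketch, assuming a runtime with PySpark 3.4+ (where the Python listener API exists); the print would be replaced by whatever sink feeds the dashboard:

```python
from pyspark.sql.streaming import StreamingQueryListener

class MetricsListener(StreamingQueryListener):
    """Logs per-micro-batch metrics for streaming queries on this cluster."""

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        p = event.progress
        # p.batchId is the micro-batch number, p.numInputRows the rows in that batch;
        # in practice these would be written to a metrics table or dashboard sink.
        print(f"query={p.name} batch={p.batchId} rows={p.numInputRows} at={p.timestamp}")

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass

# `spark` is the session already available in a Databricks notebook.
spark.streams.addListener(MetricsListener())
```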
At the moment we profile the tables with the DESCRIBE HISTORY command and capture these metrics as a reactive step. But I would like to capture the same information in real time and build an operations dashboard, so operators can proactively spot tables with high latency on a particular day (for example, a sudden spike in record volume that then impacts a whole chain of downstream tables).
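For reference, the reactive step today is essentially just pulling operationMetrics out of the Delta history; a minimal sketch (the table name is a placeholder, and the metric keys vary by operation so some columns come back null):

```python
from pyspark.sql import functions as F

# Reactive profiling as we do it today: read operationMetrics from DESCRIBE HISTORY.
hist = spark.sql("DESCRIBE HISTORY my_catalog.my_schema.my_table")

metrics = hist.select(
    "version",
    "timestamp",
    "operation",
    F.col("operationMetrics")["numOutputRows"].alias("numOutputRows"),
    F.col("operationMetrics")["numTargetRowsInserted"].alias("numTargetRowsInserted"),
    F.col("operationMetrics")["numTargetRowsUpdated"].alias("numTargetRowsUpdated"),
)
metrics.show(truncate=False)
```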
Appreciate your thoughts.
Br,
Noor.