Saving PySpark standard out and standard error logs to cloud object storage

sage5616
Valued Contributor

I am running my PySpark data pipeline code on a standard Databricks cluster. I need to save all Python/PySpark standard output and standard error messages to a file in an Azure Blob Storage account.

When I run my Python code locally, I can see all messages, including errors, in the terminal and save them to a log file. How can something similar be accomplished with Databricks and Azure Blob Storage for PySpark data pipeline code? Can this be done?

Hubert-Dudek
Databricks MVP

You can write a script that retrieves the job run output through the REST API and saves it to Blob storage: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsExport

You can also have cluster logs delivered to DBFS through the cluster settings, but the REST API gives you exactly what you need, since you need standard output.
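The REST API approach described above could be sketched roughly as follows. This is a minimal illustration, not a tested integration: it assumes the Jobs API is reachable at version 2.1, that the runs export endpoint returns a JSON object with a "views" list whose items carry a "content" field, and that a personal access token is available. The host, token, run ID, and output directory are all hypothetical placeholders, and the helper names are invented for this example.

```python
import json
import pathlib
import urllib.parse
import urllib.request


def build_export_request(host, token, run_id):
    """Build the GET request for the runs export endpoint (API version 2.1 is an assumption)."""
    query = urllib.parse.urlencode({"run_id": run_id, "views_to_export": "ALL"})
    url = f"{host.rstrip('/')}/api/2.1/jobs/runs/export?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_run_export(host, token, run_id):
    """Call the endpoint and return the decoded JSON body."""
    with urllib.request.urlopen(build_export_request(host, token, run_id)) as resp:
        return json.load(resp)


def save_views(export, out_dir):
    """Write each exported view to its own file and return the written paths.

    Assumes the response shape {"views": [{"content": "..."}, ...]}.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, view in enumerate(export.get("views", [])):
        path = out / f"run_view_{i}.html"
        path.write_text(view.get("content", ""), encoding="utf-8")
        paths.append(path)
    return paths


# Hypothetical usage (placeholders, not real values):
# export = fetch_run_export("https://<workspace-host>", "<token>", 12345)
# save_views(export, "/dbfs/mnt/<mount-name>/job-logs")
```

If the Blob container is mounted to DBFS, pointing out_dir at the mount path should be enough to land the files in Blob storage; otherwise an Azure storage client would be needed for the upload step.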


sage5616
Valued Contributor

This is the approach I am currently taking. It is documented here: https://stackoverflow.com/questions/62774448/how-to-capture-cells-output-in-databricks-notebook

import sys
from IPython.utils.capture import CapturedIO

# wrap the notebook's current stdout and stderr streams
capture = CapturedIO(sys.stdout, sys.stderr)
...
...
# at the end of desired output:
cmem = capture.stdout

I then write the contents of the cmem variable to a file in Blob storage, which is mounted to DBFS.

It would be good to see a working example of the REST API approach that @Hubert Dudek mentioned above.
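Writing the captured text to the mounted container might look like the sketch below. It assumes the mount is visible through the local file API (typically under /dbfs/mnt/ on a standard cluster); the mount name, the write_log helper, and the file naming scheme are hypothetical choices made for this example.

```python
import datetime
import pathlib


def write_log(text, log_dir, prefix="pipeline"):
    """Write text to a UTC-timestamped .log file under log_dir and return its path."""
    directory = pathlib.Path(log_dir)
    directory.mkdir(parents=True, exist_ok=True)
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = directory / f"{prefix}_{stamp}.log"
    path.write_text(text, encoding="utf-8")
    return path


# Hypothetical usage with a placeholder mount name:
# write_log(cmem, "/dbfs/mnt/<mount-name>/logs")
```

A timestamp in the file name keeps successive runs from overwriting each other's logs.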

dasroya
New Contributor II

This does not work on Databricks Runtime 11.0.

Code:

from IPython.utils.capture import CapturedIO
import sys

capture = CapturedIO(sys.stdout, sys.stderr)
print("asdfghjkjhgf")
cmem = capture.stdout
print(cmem)

Output:

asdfghjkjhgf

AttributeError: 'OutStream' object has no attribute 'getvalue'
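
The error above appears to come from CapturedIO.stdout calling getvalue() on the stream it was given. That method exists on io.StringIO buffers but not on the notebook's OutStream object. One possible workaround, sketched below with the standard library only, is to redirect output into StringIO buffers for the duration of a function call. The run_and_capture helper and pipeline_step function are hypothetical names for this example, and the sketch only captures output written through Python's sys.stdout and sys.stderr, not JVM-side Spark logs.

```python
import contextlib
import io
import sys


def run_and_capture(func, *args, **kwargs):
    """Run func and return (result, captured stdout text, captured stderr text)."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        result = func(*args, **kwargs)
    return result, out.getvalue(), err.getvalue()


def pipeline_step():
    """Stand-in for real pipeline code; prints to both streams."""
    print("asdfghjkjhgf")
    print("a warning", file=sys.stderr)
    return 1


result, cmem, cerr = run_and_capture(pipeline_step)
```

While the redirection is active, nothing is shown in the cell output, and if func raises an exception the buffers are discarded with it, so a try/finally around the call would be needed to keep partial output.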