Re: Saving PySpark standard out and standard error...

sage5616 · ‎07-05-2022

I am running my PySpark data pipeline code on a standard Databricks cluster. I need to save all Python/PySpark standard output and standard error messages to a file in an Azure Blob Storage account.

When I run my Python code locally, I can see all messages, including errors, in the terminal and save them to a log file. How can something similar be accomplished with Databricks and Azure Blob Storage for PySpark data pipeline code? Can this be done?

Hubert-Dudek · ‎07-06-2022

You can write a script that exports the job run output via the REST API and saves it to Blob storage: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsExport

You can also configure cluster log delivery to DBFS in the cluster settings, but the REST API gives you exactly what you need, since you want the standard output.

My blog: https://databrickster.medium.com/

sage5616 · ‎07-08-2022

This is the approach I am currently taking. It is documented here: https://stackoverflow.com/questions/62774448/how-to-capture-cells-output-in-databricks-notebook

import sys
from IPython.utils.capture import CapturedIO
capture = CapturedIO(sys.stdout, sys.stderr)
...
...
# at the end of desired output:
cmem = capture.stdout

I write the contents of the cmem variable to a file in Blob storage, which is mounted to DBFS.

It would be good to see a working example of the REST API approach that @Hubert Dudek mentioned above.

dasroya · ‎11-17-2022

This does not work on Databricks Runtime 11.0.

Code:

from IPython.utils.capture import CapturedIO
import sys

capture = CapturedIO(sys.stdout, sys.stderr)
print("asdfghjkjhgf")
cmem = capture.stdout
print(cmem)

Output:

asdfghjkjhgf

AttributeError: 'OutStream' object has no attribute 'getvalue'
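The traceback is consistent with `CapturedIO` calling `getvalue()` on whatever stream objects it is given: `io.StringIO` has that method, while the notebook's `OutStream` does not. One possible workaround, sketched here with the standard library only, is to redirect output into `StringIO` buffers for the duration of a call. The `run_captured` and `demo` names are made up for illustration. Note the assumptions: this only captures Python-level writes to `sys.stdout` and `sys.stderr` on the driver, not JVM or executor output, and if the wrapped function raises, the captured text is not returned.

```python
import contextlib
import io
import sys


def run_captured(func, *args, **kwargs):
    """Run func and return (result, stdout_text, stderr_text)."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        result = func(*args, **kwargs)
    return result, out.getvalue(), err.getvalue()


def demo():
    """Stand-in for a pipeline step that prints to both streams."""
    print("asdfghjkjhgf")
    print("a warning", file=sys.stderr)
    return 1


result, stdout_text, stderr_text = run_captured(demo)
```

While the redirection is active, nothing appears in the cell output; print `stdout_text` afterwards if the messages should still be visible.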

Saving PySpark standard out and standard error logs to cloud object storage