
Saving PySpark standard out and standard error logs to cloud object storage

sage5616 — Tue, 05 Jul 2022 18:01:48 GMT

I am running my PySpark data pipeline code on a standard Databricks cluster. I need to save all Python/PySpark standard output and standard error messages to a file in an Azure Blob Storage account.

When I run my Python code locally, I can see all messages, including errors, in the terminal and save them to a log file. How can something similar be accomplished with Databricks and Azure Blob Storage for PySpark data pipeline code? Can this be done?

Re: Saving PySpark standard out and standard error logs to cloud object storage

Hubert-Dudek — Wed, 06 Jul 2022 13:09:34 GMT

You can write a script that retrieves the job run output via the REST API and saves it to Blob Storage: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsExport

You can also configure cluster logging to DBFS in the cluster settings, but the REST API gives you exactly what you need (since you need standard output).
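A minimal sketch of the REST API approach, assuming the `runs/export` endpoint from the link above is reachable under the `/api/2.1/jobs/` path and that you authenticate with a personal access token. The host, token, and run ID are placeholders, and the helper names are made up for illustration. As far as I understand, this endpoint returns the rendered notebook views as HTML rather than a raw stdout stream, so check the response shape against the API docs before relying on it.

```python
import json
import urllib.parse
import urllib.request


def build_export_url(host: str, run_id: int, views: str = "ALL") -> str:
    """Build the URL for the Jobs API runs/export endpoint (API version assumed)."""
    query = urllib.parse.urlencode({"run_id": run_id, "views_to_export": views})
    return f"{host.rstrip('/')}/api/2.1/jobs/runs/export?{query}"


def fetch_run_export(host: str, token: str, run_id: int) -> dict:
    """Call the endpoint with a bearer token and return the parsed JSON body."""
    request = urllib.request.Request(
        build_export_url(host, run_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def views_to_files(export: dict) -> dict:
    """Map each exported view to a file name and its HTML content."""
    return {view["name"] + ".html": view["content"] for view in export.get("views", [])}
```

The result of `views_to_files(fetch_run_export(host, token, run_id))` can then be written to a mounted Blob Storage path or uploaded with whatever storage client you already use.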

Re: Saving PySpark standard out and standard error logs to cloud object storage

sage5616 — Fri, 08 Jul 2022 15:28:18 GMT

This is the approach I am currently taking. It is documented here: https://stackoverflow.com/questions/62774448/how-to-capture-cells-output-in-databricks-notebook

import sys
from IPython.utils.capture import CapturedIO
capture = CapturedIO(sys.stdout, sys.stderr)
...
...
# at the end of desired output:
cmem = capture.stdout

I am writing the contents of the cmem variable to a file in Blob Storage, which is mounted to DBFS.

It would be good to see a working example of the REST API approach that @Hubert Dudek mentioned above.
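For the write step, something like the following should work, assuming the container is mounted and visible to the driver through the local `/dbfs/...` path. The `write_log` helper and the mount path in the usage note are hypothetical.

```python
from pathlib import Path


def write_log(text: str, log_path: str) -> int:
    """Write captured output to log_path, creating parent folders if needed.

    Returns the number of characters written.
    """
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path.write_text(text, encoding="utf-8")
```

Usage would look like `write_log(cmem, "/dbfs/mnt/logs/pipeline_stdout.log")`, where `/dbfs/mnt/logs` stands in for your own mount point.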

Re: Saving PySpark standard out and standard error logs to cloud object storage

dasroya — Fri, 18 Nov 2022 06:55:19 GMT

This does not work on Databricks Runtime 11.0.

Code:

from IPython.utils.capture import CapturedIO
import sys

capture = CapturedIO(sys.stdout, sys.stderr)
print("asdfghjkjhgf")
cmem = capture.stdout
print(cmem)

Output:

asdfghjkjhgf

AttributeError: 'OutStream' object has no attribute 'getvalue'
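The error seems to come from the fact that `CapturedIO` reads its buffers with `getvalue()`, which `io.StringIO` provides but the notebook's `sys.stdout` object (an `OutStream`) does not. One way around it, sketched here with only the standard library, is to redirect stdout and stderr into `StringIO` buffers for the duration of a call. The `run_captured` and `demo_step` names are made up for this example. Note that this only captures what Python code on the driver writes; JVM and executor logs would not show up here.

```python
import contextlib
import io
import sys


def run_captured(func, *args, **kwargs):
    """Run func and collect everything it writes to stdout and stderr."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        result = func(*args, **kwargs)
    return result, out.getvalue(), err.getvalue()


def demo_step():
    """Stand-in for a pipeline step that prints to both streams."""
    print("asdfghjkjhgf")
    print("something went wrong", file=sys.stderr)
    return 1


result, captured_out, captured_err = run_captured(demo_step)
```

After this runs, `captured_out` and `captured_err` are plain strings that can be written to a file on the mounted storage.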