Saving PySpark standard out and standard error logs to cloud object storage
07-05-2022 11:01 AM
I am running my PySpark data pipeline code on a standard Databricks cluster. I need to save all Python/PySpark standard output and standard error messages to a file in an Azure Blob Storage account.
When I run my Python code locally, I can see all messages, including errors, in the terminal and save them to a log file. How can something similar be accomplished with Databricks and Azure Blob Storage for PySpark data pipeline code? Can this be done?
Labels: Azure databricks, Pyspark
07-06-2022 06:09 AM
You can write a script that exports the job run output through the REST API and saves it to Blob storage: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsExport
You can also configure cluster log delivery to DBFS in the cluster settings, but with the REST API you can get exactly what you need (the standard output).
My blog: https://databrickster.medium.com/
07-08-2022 08:28 AM
This is the approach I am currently taking. It is documented here: https://stackoverflow.com/questions/62774448/how-to-capture-cells-output-in-databricks-notebook
import sys
from IPython.utils.capture import CapturedIO
capture = CapturedIO(sys.stdout, sys.stderr)
...
...
# at the end of desired output:
cmem = capture.stdout
I am writing the contents of the cmem variable to a file in Blob storage, which is mounted to DBFS.
It would be good to see a working example of the REST API approach that @Hubert Dudek mentioned above.
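A minimal sketch of what the REST API approach could look like, not a tested pipeline. It assumes the `runs/export` endpoint linked above is reachable at `/api/2.1/jobs/runs/export`, that a personal access token is used as a bearer token, and that the response carries a `views` list whose items have a `content` field. The function names, the output file naming, and the `out_dir` path are hypothetical. Note that the export returns the rendered views of the run (HTML), not a raw stdout stream.

```python
import json
import urllib.parse
import urllib.request


def build_export_request(host, token, run_id, views="ALL"):
    """Build the GET request for the Jobs runs/export endpoint (assumed API 2.1 path)."""
    query = urllib.parse.urlencode({"run_id": run_id, "views_to_export": views})
    url = f"{host.rstrip('/')}/api/2.1/jobs/runs/export?{query}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def export_run_views(host, token, run_id, out_dir):
    """Download the exported views of one run and write each one to out_dir."""
    with urllib.request.urlopen(build_export_request(host, token, run_id)) as resp:
        payload = json.load(resp)
    paths = []
    for i, view in enumerate(payload.get("views", [])):
        # Hypothetical file naming; out_dir could be a DBFS-mounted Blob path.
        path = f"{out_dir}/run_{run_id}_view_{i}.html"
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(view["content"])
        paths.append(path)
    return paths
```

Calling `export_run_views(host, token, run_id, out_dir)` with a directory on the mounted container would place one file per exported view there. The workspace URL, token, and run ID have to come from your own environment.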
11-17-2022 10:55 PM
This does not work on Databricks Runtime 11.0.
Code:
from IPython.utils.capture import CapturedIO
import sys
capture = CapturedIO(sys.stdout, sys.stderr)
print("asdfghjkjhgf")
cmem = capture.stdout
print(cmem)
Output:
asdfghjkjhgf
AttributeError: 'OutStream' object has no attribute 'getvalue'
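The error is consistent with how `CapturedIO` reads its buffers: its `stdout` property calls `getvalue()` on the object it was given, and the notebook's `sys.stdout` is an `OutStream`, not a `StringIO`. A possible workaround, sketched here with the standard library only, is to redirect the streams into `io.StringIO` buffers while the pipeline code runs. The helper `run_and_capture` and the sample function `noisy_step` are hypothetical names, and this only captures text written through Python's `sys.stdout` and `sys.stderr` on the driver, not JVM or executor logs.

```python
import contextlib
import io
import sys


def run_and_capture(func, *args, **kwargs):
    """Run func while collecting what it writes to sys.stdout and sys.stderr."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        result = func(*args, **kwargs)
    return result, out.getvalue(), err.getvalue()


def noisy_step():
    """Stand-in for a pipeline step that prints to both streams."""
    print("asdfghjkjhgf")
    print("warning: something odd", file=sys.stderr)
    return 1


result, captured_out, captured_err = run_and_capture(noisy_step)
```

The two captured strings can then be written to a file on the mounted Blob path, as in the earlier post. While the redirect is active nothing appears in the cell output, so print the captured text afterwards if you also want to see it in the notebook.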