Data Engineering

Saving PySpark standard out and standard error logs to cloud object storage

sage5616
Valued Contributor

I am running my PySpark data pipeline code on a standard Databricks cluster. I need to save all Python/PySpark standard output and standard error messages to a file in an Azure BLOB account.

When I run my Python code locally I can see all messages including errors in the terminal and save them to a log file. How can something similar be accomplished with Databricks and Azure BLOB for PySpark data pipeline code? Can this be done?

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can write a script that exports the job run output via the REST API and saves it to BLOB: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsExport

You can also enable cluster log delivery to DBFS in the cluster settings, but the REST API gives you exactly what you need (since you need standard output).
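A rough sketch of what such a script might look like, using only the Python standard library. It assumes the runs export endpoint linked above is reachable at `/api/2.1/jobs/runs/export` on your workspace and that it returns a JSON object with a `views` list; the function names, the host, and the token handling are hypothetical. The exported content is the rendered notebook view, so you may still need to extract the output text from it yourself.

```python
import json
import urllib.parse
import urllib.request


def build_export_url(host: str, run_id: int, views: str = "ALL",
                     api_version: str = "2.1") -> str:
    """Build the runs export URL (assumed endpoint path, see lead-in)."""
    query = urllib.parse.urlencode({"run_id": run_id, "views_to_export": views})
    return f"{host.rstrip('/')}/api/{api_version}/jobs/runs/export?{query}"


def extract_views(payload: dict) -> dict:
    """Map each exported view name to its content (assumed response shape)."""
    return {
        view.get("name", f"view_{index}"): view.get("content", "")
        for index, view in enumerate(payload.get("views", []))
    }


def export_run(host: str, token: str, run_id: int) -> dict:
    """Call the export endpoint with a personal access token and return the views."""
    request = urllib.request.Request(
        build_export_url(host, run_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request) as response:
        return extract_views(json.load(response))
```

Each value returned by `export_run` could then be written to a file on the mounted BLOB container like any other text.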

sage5616
Valued Contributor

This is the approach I am currently taking. It is documented here: https://stackoverflow.com/questions/62774448/how-to-capture-cells-output-in-databricks-notebook

import sys
from IPython.utils.capture import CapturedIO
capture = CapturedIO(sys.stdout, sys.stderr)
...
...
# at the end of desired output:
cmem = capture.stdout

I am writing the contents of the cmem variable to a file in BLOB. BLOB is mounted to DBFS.
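For the write step, something along these lines might work. It is only a sketch: `write_log` is a hypothetical helper, and the directory `/dbfs/mnt/logs` in the usage note is an assumed mount point, not one taken from this thread.

```python
from datetime import datetime, timezone
from pathlib import Path


def write_log(text: str, log_dir: str, prefix: str = "pipeline") -> Path:
    """Write captured text to a timestamped .log file and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = Path(log_dir) / f"{prefix}_{stamp}.log"
    # Create the directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target
```

With the BLOB container mounted, a call such as `write_log(cmem, "/dbfs/mnt/logs")` would place the file in the container, assuming the mount is exposed under `/dbfs` on the driver.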

It would be good to see a working example supporting the REST API approach that @Hubert Dudek mentioned above.

dasroya
New Contributor II

This does not work on Databricks Runtime 11.0.

Code:

from IPython.utils.capture import CapturedIO
import sys

capture = CapturedIO(sys.stdout, sys.stderr)
print("asdfghjkjhgf")
cmem = capture.stdout
print(cmem)

Output:

asdfghjkjhgf

AttributeError: 'OutStream' object has no attribute 'getvalue'
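The error is consistent with how `CapturedIO` reads its streams: its `stdout` property calls `getvalue()` on whatever object it was given, and the notebook's `sys.stdout` here is an `OutStream` without that method. One possible workaround, sketched with the standard library only, is to redirect output into `io.StringIO` buffers for the duration of a call. `run_captured` and `noisy_step` are hypothetical names, and this only sees text written through Python's `sys.stdout` and `sys.stderr`; output produced elsewhere (for example by the JVM side of Spark) would likely not be captured.

```python
import contextlib
import io
import sys


def run_captured(func, *args, **kwargs):
    """Run func and return (result, captured stdout, captured stderr)."""
    out, err = io.StringIO(), io.StringIO()
    # Both redirects are undone when the with block exits, even on error.
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        result = func(*args, **kwargs)
    return result, out.getvalue(), err.getvalue()


def noisy_step():
    """Stand-in for a pipeline step that prints to both streams."""
    print("asdfghjkjhgf")
    print("a warning", file=sys.stderr)
    return 1


result, captured_out, captured_err = run_captured(noisy_step)
```

If `func` raises, the exception propagates and the buffers are discarded, so a pipeline that needs the log of a failed run would have to wrap the call in its own `try`/`finally`.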
