Saving PySpark standard out and standard error logs to cloud object storage

sage5616
Valued Contributor

I am running my PySpark data pipeline code on a standard Databricks cluster. I need to save all Python/PySpark standard output and standard error messages to a file in an Azure Blob Storage account.

When I run my Python code locally, I can see all messages, including errors, in the terminal and save them to a log file. How can something similar be accomplished with Databricks and Azure Blob Storage for PySpark data pipeline code? Can this be done?

4 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can write a script that exports the job run output via the REST API and saves it to Blob Storage: https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsRunsExport

You can also enable cluster log delivery to DBFS in the cluster settings, but with the REST API you can get exactly what you need (since you want standard output). A sketch of the export call is below.
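For example, a minimal sketch of that export call might look like the following; the workspace URL, token environment variable, run_id, and DBFS output path are placeholders, not values from this thread:

import os
import requests

# Placeholders: use your own workspace URL, a personal access token, and
# the run_id of the job run whose output you want to export
host = "https://<your-workspace>.azuredatabricks.net"
token = os.environ["DATABRICKS_TOKEN"]
run_id = 12345

# Jobs Runs Export endpoint: returns the run's notebook views as HTML
resp = requests.get(
    f"{host}/api/2.0/jobs/runs/export",
    headers={"Authorization": f"Bearer {token}"},
    params={"run_id": run_id, "views_to_export": "ALL"},
)
resp.raise_for_status()

# Save each exported view to a Blob container mounted on DBFS
# (the /mnt/logs mount point is a hypothetical example)
for i, view in enumerate(resp.json()["views"]):
    with open(f"/dbfs/mnt/logs/run_{run_id}_view_{i}.html", "w") as f:
        f.write(view["content"])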

Kaniz
Community Manager

Hi @Michael Okulik​, we haven't heard from you since @Hubert Dudek​'s last response, and I was checking back to see if his suggestions helped you. If you found another solution, please share it with the community so it can help others.

Also, please don't forget to click the "Select As Best" button whenever a reply helps resolve your question.

sage5616
Valued Contributor

This is the approach I am currently taking. It is documented here: https://stackoverflow.com/questions/62774448/how-to-capture-cells-output-in-databricks-notebook

import sys
from IPython.utils.capture import CapturedIO

# Start capturing everything written to stdout/stderr from this point on
capture = CapturedIO(sys.stdout, sys.stderr)
...
...
# at the end of the desired output, read back what was captured:
cmem = capture.stdout

I am writing the contents of the cmem variable to a file in Blob Storage. The Blob container is mounted to DBFS. A sketch of that write step is below.
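As a minimal sketch of that write step (the /mnt/logs mount point and file name are hypothetical; replace them with your own mount):

# Hypothetical mount point; replace with the path where your Blob container is mounted
log_path = "/dbfs/mnt/logs/pipeline_stdout.log"

# Write the captured stdout contents to the mounted Blob container
with open(log_path, "w") as f:
    f.write(cmem)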

It would be good to see a working example of the REST API approach that @Hubert Dudek​ mentioned above.

dasroya
New Contributor II

This does not work on Databricks Runtime 11.0.

Code:

from IPython.utils.capture import CapturedIO
import sys

capture = CapturedIO(sys.stdout, sys.stderr)
print("asdfghjkjhgf")
cmem = capture.stdout
print(cmem)

Output:

asdfghjkjhgf

AttributeError: 'OutStream' object has no attribute 'getvalue'
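
A runtime-independent workaround is to capture output with the standard library instead of IPython's CapturedIO; this is a minimal sketch of that alternative, not a fix for CapturedIO itself:

import io
from contextlib import redirect_stdout

# Redirect stdout into a plain StringIO buffer instead of relying on
# IPython's OutStream, which lacks getvalue() on this runtime
buffer = io.StringIO()
with redirect_stdout(buffer):
    print("asdfghjkjhgf")

cmem = buffer.getvalue()
print(cmem)  # asdfghjkjhgf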
