topic Re: Inconsistent Cluster Log Persistence to Volume/S3 (stderr, stdout, log4j-active.log) in Data Engineering

Inconsistent Cluster Log Persistence to Volume/S3 (stderr, stdout, log4j-active.log)

ccsalt — Tue, 19 May 2026 09:45:52 GMT

Saving logs from an all-purpose cluster to a Volume or S3 is not consistent: stderr, stdout, and log4j-active.log get overwritten whenever the cluster is restarted within the same hour (between minutes 01 and 59).

Tested case:
A job is configured to start every 20 minutes, for example: 10:10 -> 10:30 -> 10:50.
Cluster logs (stderr, stdout, log4j-active.log) are overwritten at each restart, because the cluster does not reach the exact hour (10:00, 11:00) when automatic log rotation happens.

In the Databricks UI, the logs appear later, but with the same name (for example log4j-active.log).
The issue is that, although they seem visible in the UI, in Volume the files are overwritten and information is lost.

Does anyone have an idea of how I can still preserve all logs?

Thanks!

Re: Inconsistent Cluster Log Persistence to Volume/S3 (stderr, stdout, log4j-active.log)

aleksandra_ch — Thu, 21 May 2026 14:56:06 GMT

Hi @ccsalt ,

This is a known limitation. Log rotation (renaming to log4j-YYYY-MM-DD-HH.log.gz) only happens on the hour boundary. The active log file log4j-active.log always has the same name, so it is overwritten if the cluster restarts within the same hour.
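To make the hour-boundary condition concrete, here is a minimal sketch that checks whether a cluster session spans an exact hour. The function name `crosses_hour_boundary` and the 5-minute session length are assumptions for illustration only; the 10:10 / 10:30 / 10:50 start times come from the tested case in the original post.

```python
from datetime import datetime, timedelta


def crosses_hour_boundary(start: datetime, end: datetime) -> bool:
    """Return True if an exact hour (HH:00:00) falls within (start, end]."""
    # First exact hour strictly after the start of the session.
    next_hour = start.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return next_hour <= end


if __name__ == "__main__":
    # Start times from the tested case; the 5-minute duration is hypothetical.
    for hh, mm in [(10, 10), (10, 30), (10, 50)]:
        start = datetime(2026, 5, 19, hh, mm)
        end = start + timedelta(minutes=5)
        print(start.strftime("%H:%M"), "->", end.strftime("%H:%M"),
              "crosses hour:", crosses_hour_boundary(start, end))
```

Under these assumptions none of the three sessions reaches an hour boundary, which matches the behaviour described: the active file is never rotated and is overwritten on the next start.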

As a workaround:

Use a cluster-scoped init script to rename the active log file before the cluster starts;
Or switch to a job cluster, if possible. Each run will write its logs to a separate folder.

Hope it helps.

Best regards,

Re: Inconsistent Cluster Log Persistence to Volume/S3 (stderr, stdout, log4j-active.log)

ccsalt — Fri, 22 May 2026 11:26:51 GMT

Hi @aleksandra_ch

Do you happen to have an example for the workaround “Use a cluster-scoped init script to rename the active log file before the cluster starts”?

I’ve tried many variants, but without success. I keep running into an “operation not permitted” error because the Volumes directory is empty at startup (more precisely, during the init script phase).

I also tried using /databricks/driver/, but the logs directory is not available until about 1 minute after startup. When it does become available, it only contains stdout and stderr. By the time log4j-active.log appears, it has already been reset.

Thank you!

Re: Inconsistent Cluster Log Persistence to Volume/S3 (stderr, stdout, log4j-active.log)

aleksandra_ch — Fri, 22 May 2026 16:58:37 GMT

Hi @ccsalt ,

Can you try this one:

#!/bin/bash
# Cluster log preservation init script.
#
# Cluster log delivery overwrites stdout/stderr/log4j-active.log/stacktrace.log
# on every cluster restart. This script runs at startup and makes a timestamped
# copy of those files so the previous session's logs are kept.
#
# To use: edit LOG_BASE below to match your cluster's cluster_log_conf destination,
# then attach this script as an init script.

set -uo pipefail

# >>> EDIT THIS to your cluster_log_conf destination <<<
LOG_BASE="/Volumes/<catalog>/<schema>/<volume>/<subdir>"

TS="$(date -u +%Y%m%dT%H%M%SZ)"
DRIVER_LOG_DIR="${LOG_BASE}/${DB_CLUSTER_ID}/driver"

echo "[preserve_logs] cluster=${DB_CLUSTER_ID} ts=${TS}"
echo "[preserve_logs] checking ${DRIVER_LOG_DIR}"

if [ ! -d "${DRIVER_LOG_DIR}" ]; then
  echo "[preserve_logs] no prior driver log dir, nothing to preserve"
  exit 0
fi

for f in stdout stderr log4j-active.log stacktrace.log; do
  src="${DRIVER_LOG_DIR}/${f}"
  dst="${DRIVER_LOG_DIR}/${f}.preserved-${TS}"
  if [ -f "${src}" ]; then
    cp "${src}" "${dst}" && echo "[preserve_logs] preserved ${f} -> $(basename ${dst})"
  fi
done

echo "[preserve_logs] done"
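For anyone who wants to dry-run the copy-and-rename logic against a scratch directory before attaching the init script, the same steps can be sketched in Python. This is an illustrative port, not part of the script above: the function name `preserve_logs` and the optional `ts` argument are assumptions, and it only touches whatever directory you pass in.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path

# Same file names the init script looks for.
LOG_FILES = ("stdout", "stderr", "log4j-active.log", "stacktrace.log")


def preserve_logs(driver_log_dir, ts=None):
    """Copy each existing log file to '<name>.preserved-<ts>' in the same folder.

    Returns the list of copies that were created (empty if the folder is missing).
    """
    ts = ts or datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    log_dir = Path(driver_log_dir)
    preserved = []
    if not log_dir.is_dir():
        return preserved  # no prior driver log dir, nothing to preserve
    for name in LOG_FILES:
        src = log_dir / name
        if src.is_file():
            dst = log_dir / f"{name}.preserved-{ts}"
            shutil.copy2(src, dst)  # copy, so the original file stays in place
            preserved.append(dst)
    return preserved
```

Calling `preserve_logs("/tmp/fake_driver_logs")` on a folder containing a few dummy files should leave the originals untouched and add one timestamped copy per file.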

Best regards,

Re: Inconsistent Cluster Log Persistence to Volume/S3 (stderr, stdout, log4j-active.log)

ccsalt — Wed, 27 May 2026 05:50:10 GMT

Hi @aleksandra_ch,

Unfortunately, the Volumes directory is only accessible through the Databricks interface. At the OS level, it is not accessible, even when running as root (FUSE limitation).

I have also observed intermittent periods where, for certain job runs, logs are not updated in Volumes at all; that is, they are not copied from /databricks/driver/logs to Volumes.
To mitigate this, I implemented an alternative approach that performs direct log synchronization to S3 (the same backing location used by the Volume) at a predefined interval.

It would be very helpful to have either:

A shutdown_script feature as an alternative to init_script, or
A built-in log synchronization process executed automatically before cluster termination.

Please find the script used below:

#!/bin/bash
set -e

# ===============================================================
# 1. BASIC CONFIGURATION (S3)
# ===============================================================
S3_BUCKET="<<bucket>>"
S3_BASE_PREFIX="data/<<catalog>>/<<schemas>>/__unitystorage/schemas/<<schema_id>>/volumes/<<volume_id>>"

# Automatically fetch cluster metadata at startup
CLUSTER_ID="${DB_CLUSTER_ID:-unknown_cluster}"
DT=$(date +%Y-%m-%d)
HH=$(date +%H)
MM=$(date +%M)

# ===============================================================
# 2. CREATE THE PYTHON UTILITY FOR DIRECT S3 SYNC
# ===============================================================
cat << 'EOF' > /usr/local/bin/sync_single_run.py
import boto3
import os
import sys

bucket_name = sys.argv[1]
base_prefix = sys.argv[2]
cluster_id = sys.argv[3]
dt = sys.argv[4]
hh = sys.argv[5]
mm = sys.argv[6]

s3_target_prefix = f"{base_prefix}/{cluster_id}/driver"
local_log_dir = "/databricks/driver/logs"
log_files = ["stdout", "stderr", "log4j-active.log", "stacktrace.log"]

s3 = boto3.client('s3')

# Map each local log file to an S3 name stamped with the cluster start time
def get_s3_name(file_name):
    if file_name == "log4j-active.log":
        return f"log4j-{dt}-{hh}-{mm}.log"
    if file_name == "stacktrace.log":
        return f"{dt}-{hh}-{mm}.stacktrace.log"
    if file_name == "stdout":
        return f"stdout--{dt}--{hh}-{mm}.log"
    if file_name == "stderr":
        return f"stderr--{dt}--{hh}-{mm}.log"
    return file_name

for file_name in log_files:
    local_path = os.path.join(local_log_dir, file_name)
    if os.path.exists(local_path) and os.path.getsize(local_path) > 0:
        s3_key = f"{s3_target_prefix}/{get_s3_name(file_name)}"
        try:
            s3.upload_file(local_path, bucket_name, s3_key)
        except Exception:
            pass  # Ignore temporary errors to avoid blocking script execution
EOF
chmod +x /usr/local/bin/sync_single_run.py

# ===============================================================
# 3. CREATE AND START THE BASH DAEMON (LIVE SYNC WATCHER)
# ===============================================================
# The heredoc is unquoted on purpose: the variables below are expanded now,
# so every sync of this run writes to the same start-time-stamped S3 keys.
cat << EOF > /tmp/run_daemon.sh
#!/bin/bash
# Let the cluster finish its initial boot phase before the first sync
sleep 30
while true; do
  python3 /usr/local/bin/sync_single_run.py "$S3_BUCKET" "$S3_BASE_PREFIX" "$CLUSTER_ID" "$DT" "$HH" "$MM" > /dev/null 2>&1
  sleep 300 # Run every 300 seconds
done
EOF
chmod +x /tmp/run_daemon.sh

# Launch the daemon in the background as an independent process
nohup /bin/bash /tmp/run_daemon.sh > /dev/null 2>&1 &

echo "The permanent Live Sync system (every 300s) with Databricks pattern was installed successfully!"
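One possible refinement, sketched here as an assumption rather than something tested on a cluster: the loop above re-uploads every non-empty log file on each pass, even if nothing changed. A helper like the hypothetical `pending_uploads` below compares each file's size and modification time against what was last uploaded, so only changed files are sent. It needs the state to live in one long-running Python process, so the sleep loop would have to move from bash into Python for this to work.

```python
import os


def pending_uploads(paths, uploaded):
    """Return {path: (size, mtime_ns)} for non-empty files that changed.

    `uploaded` maps path -> signature recorded after the last successful
    upload; it is not modified here, so a failed upload is retried next pass.
    """
    pending = {}
    for path in paths:
        try:
            st = os.stat(path)
        except OSError:
            continue  # file not there (yet), skip it
        signature = (st.st_size, st.st_mtime_ns)
        if st.st_size > 0 and uploaded.get(path) != signature:
            pending[path] = signature
    return pending
```

In a sync loop, the caller would upload each key of the returned dict and, only after a successful upload, store its signature with `uploaded[path] = signature`.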