Databricks Community

ccsalt · 3 weeks ago

Saving logs from an all-purpose cluster to a Volume or S3 is not reliable: stderr, stdout, and log4j-active.log are overwritten whenever the cluster is restarted within the same hour (between minutes 01 and 59).

Tested case:
A job is configured to start every 20 minutes, for example: 10:10 -> 10:30 -> 10:50.
Cluster logs (stderr, stdout, log4j-active.log) are overwritten at each restart, because the cluster does not reach the exact hour (10:00, 11:00) when automatic log rotation happens.

In the Databricks UI, the logs appear later, but with the same name (for example log4j-active.log).
The issue is that, although they seem visible in the UI, in Volume the files are overwritten and information is lost.

Does anyone have an idea of how I can still preserve all logs?

Thanks!

aleksandra_ch · 3 weeks ago

Hi @ccsalt ,

This is a known limitation. Log rotation (renaming to log4j-YYYY-MM-DD-HH.log.gz) only happens on the hour boundary. The active log file, log4j-active.log, always has the same name and is overwritten if the cluster restarts within the same hour.
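To make the timing concrete, here is a minimal Python sketch of the hour-boundary condition. It assumes, as stated in this thread, that rotation happens only at the top of the hour. The function name `crosses_hour_boundary`, the example date, and the 5-minute run length are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def crosses_hour_boundary(start: datetime, end: datetime) -> bool:
    """Return True if a top-of-hour instant falls in the interval (start, end]."""
    # First top-of-hour instant strictly after `start`.
    next_hour = start.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)
    return next_hour <= end


# The 20-minute schedule from the question; the 5-minute run length is an assumption.
for minute in (10, 30, 50):
    start = datetime(2026, 1, 1, 10, minute)
    end = start + timedelta(minutes=5)
    print(start.strftime("%H:%M"), crosses_hour_boundary(start, end))
```

With the 10:10 / 10:30 / 10:50 schedule, no run reaches a top-of-hour instant, so under this assumption none of them would trigger a rotation before the next restart.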

As a workaround:

Use a cluster-scoped init script to rename the active log file before the cluster starts;
Or switch to a job cluster, if possible. Each run will write its logs to a separate folder.

Hope it helps.

Best regards,

ccsalt · 3 weeks ago

Hi @aleksandra_ch

Do you happen to have an example for the workaround “Use a cluster-scoped init script to rename the active log file before the cluster starts”?

I’ve tried many variants, but without success. I keep running into an “operation not permitted” error because the Volumes directory is empty at startup (more precisely, during the init script phase).

I also tried using /databricks/driver/, but the logs directory is not available until about 1 minute after startup. When it does become available, it only contains stdout and stderr. By the time log4j-active.log appears, it has already been reset.

Thank you!

aleksandra_ch · 3 weeks ago

Hi @ccsalt ,

Can you try this one:

#!/bin/bash
# Cluster log preservation init script.
#
# Cluster log delivery overwrites stdout/stderr/log4j-active.log/stacktrace.log
# on every cluster restart. This script runs at startup and makes a timestamped
# copy of those files so the previous session's logs are kept.
#
# To use: edit LOG_BASE below to match your cluster's cluster_log_conf destination,
# then attach this script as an init script.

set -uo pipefail

# >>> EDIT THIS to your cluster_log_conf destination <<<
LOG_BASE="/Volumes/<catalog>/<schema>/<volume>/<subdir>"

TS="$(date -u +%Y%m%dT%H%M%SZ)"
DRIVER_LOG_DIR="${LOG_BASE}/${DB_CLUSTER_ID}/driver"

echo "[preserve_logs] cluster=${DB_CLUSTER_ID} ts=${TS}"
echo "[preserve_logs] checking ${DRIVER_LOG_DIR}"

if [ ! -d "${DRIVER_LOG_DIR}" ]; then
  echo "[preserve_logs] no prior driver log dir, nothing to preserve"
  exit 0
fi

for f in stdout stderr log4j-active.log stacktrace.log; do
  src="${DRIVER_LOG_DIR}/${f}"
  dst="${DRIVER_LOG_DIR}/${f}.preserved-${TS}"
  if [ -f "${src}" ]; then
    cp "${src}" "${dst}" && echo "[preserve_logs] preserved ${f} -> $(basename "${dst}")"
  fi
done

echo "[preserve_logs] done"

Best regards,

ccsalt · 2 weeks ago

Hi @aleksandra_ch,

Unfortunately, the Volumes directory is only accessible through the Databricks interface. At the OS level, it is not accessible, even when running as root (FUSE limitation).
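One way to confirm this kind of access problem is to probe the path before relying on it. The sketch below is only an illustration and not a Databricks API: `can_write_to` is a hypothetical helper, and the directory you pass would be your own cluster_log_conf destination.

```python
import os
import tempfile


def can_write_to(directory: str) -> bool:
    """Return True if a temporary file can be created in `directory`."""
    if not os.path.isdir(directory):
        return False
    try:
        # The temporary file is removed automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers permission errors such as "operation not permitted".
        return False
```

A script could call this once at startup and fall back to another destination, such as a direct S3 upload, when it returns False.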

I have also observed intermittent periods where, for certain job runs, logs are not updated in Volumes at all: they are never copied from /databricks/driver/logs to Volumes.
To mitigate this, I implemented an alternative approach that performs direct log synchronization to S3 (the same backing location used by the Volume) at a predefined interval.

It would be very helpful to have either:

A shutdown_script feature as an alternative to init_script, or
A built-in log synchronization process executed automatically before cluster termination.

Please find the script used below:

#!/bin/bash
set -e

# ===============================================================
# 1. BASIC CONFIGURATION (S3)
# ===============================================================
S3_BUCKET="<<bucket>>"
S3_BASE_PREFIX="data/<<catalog>>/<<schemas>>/__unitystorage/schemas/<<schema_id>>/volumes/<<volume_id>>"

# Automatically fetch cluster metadata at startup
CLUSTER_ID="${DB_CLUSTER_ID:-unknown_cluster}"
DT=$(date +%Y-%m-%d)
HH=$(date +%H)
MM=$(date +%M)

# ===============================================================
# 2. CREATE THE PYTHON UTILITY FOR DIRECT S3 SYNC
# ===============================================================
cat << 'EOF' > /usr/local/bin/sync_single_run.py
import boto3
import os
import sys

bucket_name = sys.argv[1]
base_prefix = sys.argv[2]
cluster_id = sys.argv[3]
dt = sys.argv[4]
hh = sys.argv[5]
mm = sys.argv[6]

s3_target_prefix = f"{base_prefix}/{cluster_id}/driver"
local_log_dir = "/databricks/driver/logs"
log_files = ["stdout", "stderr", "log4j-active.log", "stacktrace.log"]

s3 = boto3.client('s3')

def get_s3_name(file_name):
    if file_name == "log4j-active.log": return f"log4j-{dt}-{hh}-{mm}.log"
    if file_name == "stacktrace.log": return f"{dt}-{hh}-{mm}.stacktrace.log"
    if file_name == "stdout": return f"stdout--{dt}--{hh}-{mm}.log"
    if file_name == "stderr": return f"stderr--{dt}--{hh}-{mm}.log"
    return file_name

for file_name in log_files:
    local_path = os.path.join(local_log_dir, file_name)
    if os.path.exists(local_path) and os.path.getsize(local_path) > 0:
        s3_key = f"{s3_target_prefix}/{get_s3_name(file_name)}"
        try:
            s3.upload_file(local_path, bucket_name, s3_key)
        except Exception:
            pass  # Ignore upload errors so a transient failure does not stop the sync loop
EOF

chmod +x /usr/local/bin/sync_single_run.py

# ===============================================================
# 3. CREATE AND START THE BASH DAEMON (LIVE SYNC WATCHER)
# ===============================================================
cat << EOF > /tmp/run_daemon.sh
#!/bin/bash
# Let the cluster finish its initial boot phase before the first sync
sleep 30

while true; do
  python3 /usr/local/bin/sync_single_run.py "$S3_BUCKET" "$S3_BASE_PREFIX" "$CLUSTER_ID" "$DT" "$HH" "$MM" > /dev/null 2>&1
  sleep 300  # Run every 300 seconds
done
EOF

chmod +x /tmp/run_daemon.sh

# Launch the daemon in the background as an independent process
nohup /bin/bash /tmp/run_daemon.sh > /dev/null 2>&1 &

echo "The permanent Live Sync system (every 300s) with Databricks pattern was installed successfully!"
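
The fixed `sleep 30` in the daemon is a guess at how long the log directory takes to appear; earlier in this thread it was observed to take about a minute. One alternative is to poll for the file with a timeout. This is a sketch only: `wait_for_file` and its default values are hypothetical, and it has not been run on a cluster.

```python
import os
import time


def wait_for_file(path: str, timeout_s: float = 120.0, poll_s: float = 2.0) -> bool:
    """Poll until `path` is a non-empty regular file, or until the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if os.path.isfile(path) and os.path.getsize(path) > 0:
                return True
        except OSError:
            pass  # File vanished or is unreadable between checks; keep polling.
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

For example, the sync utility could call `wait_for_file("/databricks/driver/logs/stdout")` before its first upload and skip that cycle if the call returns False.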
