3 weeks ago
Saving logs from an all-purpose cluster to Volume or S3 is not consistent, because stderr, stdout, and log4j-active.log get overwritten when the cluster is restarted between minutes 01 and 59.
Tested case:
A job is configured to start every 20 minutes, for example: 10:10 -> 10:30 -> 10:50.
Cluster logs (stderr, stdout, log4j-active.log) are overwritten at each restart, because the cluster does not reach the exact hour (10:00, 11:00) when automatic log rotation happens.
In the Databricks UI, the logs appear later, but with the same name (for example log4j-active.log).
The issue is that, although they seem visible in the UI, in Volume the files are overwritten and information is lost.
Does anyone have an idea of how I can still preserve all logs?
Thanks!
3 weeks ago
Hi @ccsalt ,
This is a known limitation. Log rotation (renaming to log4j-YYYY-MM-DD-HH.log.gz) only happens on the hour boundary. The active log file, log4j-active.log, always has the same name, so it is overwritten if the cluster restarts before the next hour boundary.
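To make the hour-boundary condition concrete, here is a minimal Python sketch. It is a simplified model, not Databricks code: the helper name crosses_hour_boundary and the example dates are assumptions chosen for illustration. It only checks whether a cluster session spans a full-hour mark, which is when rotation would archive the active file.

```python
from datetime import datetime


def crosses_hour_boundary(start: datetime, stop: datetime) -> bool:
    """Return True if a full-hour mark falls in the interval (start, stop]."""
    # Truncate the stop time to the most recent full hour.
    last_hour_mark = stop.replace(minute=0, second=0, microsecond=0)
    return last_hour_mark > start


# Hypothetical sessions modelled on the 20-minute schedule from the question.
short_run = (datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 10, 25))
late_run = (datetime(2024, 1, 1, 10, 50), datetime(2024, 1, 1, 11, 5))

print(crosses_hour_boundary(*short_run))  # session ends before 11:00
print(crosses_hour_boundary(*late_run))   # session is still up at 11:00
```

Under this model, a session that starts and stops inside the same hour never reaches a rotation point, so its only log4j output is log4j-active.log.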
As a workaround, use a cluster-scoped init script to rename the active log file before the cluster starts.
Hope it helps.
Best regards,
3 weeks ago - last edited 3 weeks ago
Do you happen to have an example for the workaround “Use a cluster-scoped init script to rename the active log file before the cluster starts”?
I’ve tried many variants, but without success. I keep running into an “operation not permitted” error because the Volumes directory is empty at startup (more precisely, during the init script phase).
I also tried using /databricks/driver/, but the logs directory is not available until about 1 minute after startup. When it does become available, it only contains stdout and stderr. By the time log4j-active.log appears, it has already been reset.
Thank you!
3 weeks ago
Hi @ccsalt ,
Can you try this one:
#!/bin/bash
# Cluster log preservation init script.
#
# Cluster log delivery overwrites stdout/stderr/log4j-active.log/stacktrace.log
# on every cluster restart. This script runs at startup and makes a timestamped
# copy of those files so the previous session's logs are kept.
#
# To use: edit LOG_BASE below to match your cluster's cluster_log_conf destination,
# then attach this script as an init script.
set -uo pipefail
# >>> EDIT THIS to your cluster_log_conf destination <<<
LOG_BASE="/Volumes/<catalog>/<schema>/<volume>/<subdir>"
TS="$(date -u +%Y%m%dT%H%M%SZ)"
DRIVER_LOG_DIR="${LOG_BASE}/${DB_CLUSTER_ID}/driver"
echo "[preserve_logs] cluster=${DB_CLUSTER_ID} ts=${TS}"
echo "[preserve_logs] checking ${DRIVER_LOG_DIR}"
if [ ! -d "${DRIVER_LOG_DIR}" ]; then
  echo "[preserve_logs] no prior driver log dir, nothing to preserve"
  exit 0
fi
for f in stdout stderr log4j-active.log stacktrace.log; do
  src="${DRIVER_LOG_DIR}/${f}"
  dst="${DRIVER_LOG_DIR}/${f}.preserved-${TS}"
  if [ -f "${src}" ]; then
    cp "${src}" "${dst}" && echo "[preserve_logs] preserved ${f} -> $(basename "${dst}")"
  fi
done
echo "[preserve_logs] done"
Best regards,
2 weeks ago
Hi @aleksandra_ch,
Unfortunately, the Volumes directory is only accessible through the Databricks interface. At the OS level, it is not accessible, even when running as root (FUSE limitation).
I have also observed intermittent periods where, for certain job runs, logs are not updated in Volumes at all; specifically, they are not copied from /databricks/driver/logs to Volumes.
To mitigate this, I implemented an alternative approach that performs direct log synchronization to S3 (the same backing location used by the Volume) at a predefined interval.
It would be very helpful to have either:
#!/bin/bash
set -e
# ===============================================================
# 1. BASIC CONFIGURATION (S3)
# ===============================================================
S3_BUCKET="<<bucket>>"
S3_BASE_PREFIX="data/<<catalog>>/<<schemas>>/__unitystorage/schemas/<<schema_id>>/volumes/<<volume_id>>"
# Automatically fetch cluster metadata at startup
CLUSTER_ID="${DB_CLUSTER_ID:-unknown_cluster}"
DT=$(date +%Y-%m-%d)
HH=$(date +%H)
MM=$(date +%M)
# ===============================================================
# 2. CREATE THE PYTHON UTILITY FOR DIRECT S3 SYNC
# ===============================================================
cat << 'EOF' > /usr/local/bin/sync_single_run.py
import boto3
import os
import sys
bucket_name = sys.argv[1]
base_prefix = sys.argv[2]
cluster_id = sys.argv[3]
dt = sys.argv[4]
hh = sys.argv[5]
mm = sys.argv[6]
s3_target_prefix = f"{base_prefix}/{cluster_id}/driver"
local_log_dir = "/databricks/driver/logs"
log_files = ["stdout", "stderr", "log4j-active.log", "stacktrace.log"]
s3 = boto3.client('s3')
def get_s3_name(file_name):
    # Map each live log file to a name stamped with the cluster start time
    if file_name == "log4j-active.log": return f"log4j-{dt}-{hh}-{mm}.log"
    if file_name == "stacktrace.log": return f"{dt}-{hh}-{mm}.stacktrace.log"
    if file_name == "stdout": return f"stdout--{dt}--{hh}-{mm}.log"
    if file_name == "stderr": return f"stderr--{dt}--{hh}-{mm}.log"
    return file_name
for file_name in log_files:
    local_path = os.path.join(local_log_dir, file_name)
    if os.path.exists(local_path) and os.path.getsize(local_path) > 0:
        s3_key = f"{s3_target_prefix}/{get_s3_name(file_name)}"
        try:
            s3.upload_file(local_path, bucket_name, s3_key)
        except Exception:
            pass # Ignore temporary errors to avoid blocking script execution
EOF
chmod +x /usr/local/bin/sync_single_run.py
# ===============================================================
# 3. CREATE AND START THE BASH DAEMON (LIVE SYNC WATCHER)
# ===============================================================
cat << EOF > /tmp/run_daemon.sh
#!/bin/bash
# Let the cluster finish its initial boot phase before the first sync
sleep 30
while true; do
  python3 /usr/local/bin/sync_single_run.py "$S3_BUCKET" "$S3_BASE_PREFIX" "$CLUSTER_ID" "$DT" "$HH" "$MM" > /dev/null 2>&1
  sleep 300 # Run every 300 seconds
done
EOF
chmod +x /tmp/run_daemon.sh
# Launch the daemon in the background as an independent process
nohup /bin/bash /tmp/run_daemon.sh > /dev/null 2>&1 &
echo "The permanent Live Sync system (every 300s) with Databricks pattern was installed successfully!"
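The file-naming scheme used by the script above can be checked locally, without boto3 or S3 access. Below is a minimal sketch that assumes the same patterns as get_s3_name; the function name build_s3_key and the example prefix and cluster ID are hypothetical.

```python
from datetime import datetime


def build_s3_key(base_prefix: str, cluster_id: str, file_name: str, started: datetime) -> str:
    """Build the S3 object key for one driver log file, stamped with the cluster start time."""
    dt = started.strftime("%Y-%m-%d")
    hh = started.strftime("%H")
    mm = started.strftime("%M")
    # Same name mapping as get_s3_name in the sync script.
    names = {
        "log4j-active.log": f"log4j-{dt}-{hh}-{mm}.log",
        "stacktrace.log": f"{dt}-{hh}-{mm}.stacktrace.log",
        "stdout": f"stdout--{dt}--{hh}-{mm}.log",
        "stderr": f"stderr--{dt}--{hh}-{mm}.log",
    }
    return f"{base_prefix}/{cluster_id}/driver/{names.get(file_name, file_name)}"


# Two hypothetical cluster starts within the same hour get distinct keys.
first = build_s3_key("logs", "0101-abc", "log4j-active.log", datetime(2024, 1, 1, 10, 10))
second = build_s3_key("logs", "0101-abc", "log4j-active.log", datetime(2024, 1, 1, 10, 30))
print(first)
print(second)
```

Because the timestamp is fixed once at cluster start, every sync during a session overwrites the same object with the latest content, while the next restart writes to a new key.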