10-04-2022 11:27 AM
Hi, I'm seeing a lot of directories (some empty, some not) under paths like:
xxxxxx.jobs/FileStore/job-actionstats/
xxxxxx.jobs/FileStore/job-result/
xxxxxx.jobs/command-results/
Can I create a lifecycle rule to delete old objects (files/directories)? After how many days should they expire? What is the best practice for this case?
Are there other directories that need a lifecycle configuration?
Thanks!
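For reference, a lifecycle rule like the one below is roughly what I have in mind, assuming the workspace storage is an S3 bucket (the bucket name, rule ID, and 7-day expiration are placeholders I made up; the prefix is one of the paths above):
# Sketch only: bucket name, rule ID and expiration days are placeholders
aws s3api put-bucket-lifecycle-configuration \
  --bucket <my-root-bucket> \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-job-results",
        "Filter": { "Prefix": "xxxxxx.jobs/FileStore/job-result/" },
        "Status": "Enabled",
        "Expiration": { "Days": 7 }
      }
    ]
  }'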
10-18-2022 07:10 AM
A year ago during office hours, I asked for purging cluster logs to be added to the API, and it was not even considered. I was thinking of setting up Selenium to do that.
You can limit logging for the cluster by adjusting log4j, for example by putting an .sh script like the one below on DBFS as an init script for the cluster (you additionally need to specify which log properties should be adjusted for the driver and executors):
#!/bin/bash
echo "Executing on Driver: $DB_IS_DRIVER"
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties"
else
LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties"
fi
echo "Adjusting log4j.properties here: ${LOG4J_PATH}"
echo "log4j.<custom-prop>=<value>" >> ${LOG4J_PATH}
In the notebook, you can disable logging by using:
sc.setLogLevel("OFF");
Additionally, in the cluster configuration you can set the following retention properties for Delta files:
spark.databricks.delta.logRetentionDuration 3 days
spark.databricks.delta.deletedFileRetentionDuration 3 days
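If you would rather apply those two settings programmatically instead of through the cluster UI, a rough sketch against the Clusters 2.0 edit endpoint could look like this (host, token, cluster id, Spark version, node type, and worker count are placeholders, and the retention values just mirror the ones above; note that clusters/edit replaces the whole cluster spec, so in practice you would send your cluster's full existing configuration):
# Sketch only: all identifiers below are placeholders for your own cluster spec
curl -X POST "https://${DATABRICKS_HOST}/api/2.0/clusters/edit" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -d '{
    "cluster_id": "<cluster-id>",
    "spark_version": "<spark-version>",
    "node_type_id": "<node-type-id>",
    "num_workers": 2,
    "spark_conf": {
      "spark.databricks.delta.logRetentionDuration": "3 days",
      "spark.databricks.delta.deletedFileRetentionDuration": "3 days"
    }
  }'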
10-13-2022 02:51 PM
Hi! I didn't know that; purging right now. Is there a way to schedule it so logs are retained for less time, say keeping only the last 7 days of everything?
Thanks!
10-28-2022 08:38 PM
Not the best solution, and I won't implement something like this in prod, but it's the best answer available. Thanks!