Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Orphan (?) files on Databricks S3 bucket

alejandrofm
Valued Contributor

Hi, I'm seeing a lot of empty (and non-empty) directories under paths like:

xxxxxx.jobs/FileStore/job-actionstats/

xxxxxx.jobs/FileStore/job-result/

xxxxxx.jobs/command-results/

Can I create a lifecycle rule to delete old objects (files/directories)? After how many days? What is the best practice for this case?

Are there other directories that need a lifecycle configuration?

Thanks!
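
For reference, such a rule could be applied with the AWS CLI. This is a minimal sketch only: the rule ID, the 30-day retention, and the choice of prefix are illustrative placeholders, and put-bucket-lifecycle-configuration replaces the bucket's entire existing lifecycle configuration, so any rules you want to keep must be included.

#!/bin/bash
# Sketch: expire objects under one of the prefixes above after 30 days.
# Rule ID, prefix, and day count are illustrative.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-command-results",
      "Status": "Enabled",
      "Filter": { "Prefix": "command-results/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket xxxxxx.jobs \
  --lifecycle-configuration file://lifecycle.json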


4 REPLIES

Hubert-Dudek
Databricks MVP

In the Admin console, there are options to clean storage that you can use.

For everything that is configurable (DB locations, checkpoints), please manage retention with your own storage controls.

[Screenshot: Admin console storage clean-up options]


My blog: https://databrickster.medium.com/

alejandrofm
Valued Contributor

Hi! I didn't know that; I'm purging right now. Is there a way to schedule that so logs are retained for a shorter time? For example, could I keep only the last 7 days of everything?

Thanks!

Hubert-Dudek
Databricks MVP (Accepted Solution)

A year ago during office hours, I asked for the ability to purge cluster logs through the API, and it was not even considered. I have thought about setting up Selenium to automate that.

You can limit logging for the cluster by adjusting log4j. For example, put a .sh script like the one below on DBFS and set it as the cluster's init script (you additionally need to specify which log properties to adjust for the driver and the executors):

#!/bin/bash
# Init script: append a custom log4j property to the driver's or executor's
# log4j.properties, depending on where the script runs.
echo "Executing on Driver: $DB_IS_DRIVER"
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties"
else
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties"
fi
echo "Adjusting log4j.properties here: ${LOG4J_PATH}"
echo "log4j.<custom-prop>=<value>" >> "${LOG4J_PATH}"
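
To use it, the script has to be uploaded to DBFS and referenced as an init script in the cluster configuration. A minimal sketch with the Databricks CLI, where the local file name and the DBFS destination path are illustrative:

# Upload the script to DBFS (file name and destination path are illustrative),
# then reference the DBFS path under Init Scripts in the cluster configuration.
databricks fs cp ./adjust-log4j.sh dbfs:/databricks/init-scripts/adjust-log4j.sh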

In the notebook, you can disable logging by using:

sc.setLogLevel("OFF");

Additionally, in the cluster configuration you can set shorter retention for Delta files:

spark.databricks.delta.logRetentionDuration 3 days
spark.databricks.delta.deletedFileRetentionDuration 3 days


My blog: https://databrickster.medium.com/

alejandrofm
Valued Contributor

Not the best solution, and I won't implement something like this in prod, but it's the best answer. Thanks!