Orphan (?) files on Databricks S3 bucket

alejandrofm
Valued Contributor

Hi, I'm seeing a lot of empty (and non-empty) directories on paths like:

xxxxxx.jobs/FileStore/job-actionstats/

xxxxxx.jobs/FileStore/job-result/

xxxxxx.jobs/command-results/

Can I create a lifecycle rule to delete old objects (files/directories)? After how many days? What is the best practice for this case?

Are there other directories that need a lifecycle configuration?

Thanks!

4 REPLIES

Hubert-Dudek
Esteemed Contributor III

In the Admin console, there are storage purge options that you can use.

For everything that is configurable (database locations, checkpoints), manage retention directly through your storage controls (for example, S3 lifecycle rules).

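As a minimal sketch of the storage-side approach, assuming an AWS S3 root bucket and the AWS CLI: the bucket name, prefix, and 7-day expiration below are placeholders to adjust for your workspace.

#!/bin/bash
# Sketch only: define a lifecycle rule that expires old command results after 7 days.
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-old-command-results",
      "Filter": { "Prefix": "command-results/" },
      "Status": "Enabled",
      "Expiration": { "Days": 7 }
    }
  ]
}
EOF

# Apply the rule; note this replaces the bucket's existing lifecycle configuration.
aws s3api put-bucket-lifecycle-configuration \
  --bucket <your-databricks-root-bucket> \
  --lifecycle-configuration file://lifecycle.json

Adding one rule per prefix (job-result/, job-actionstats/, etc.) keeps the retention of each directory independently tunable.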

alejandrofm
Valued Contributor

Hi! I didn't know about that; I'm purging right now. Is there a way to schedule it so logs are retained for a shorter period? For example, I'd like to keep only the last 7 days of everything.

Thanks!

Hubert-Dudek
Esteemed Contributor III

During office hours a year ago, I asked for a "purge cluster logs" option to be added to the API, and it was not even considered. I have thought about setting up Selenium to automate it.

You can limit logging for the cluster by adjusting log4j. For example, put an .sh script like the one below on DBFS and set it as the cluster's init script (you additionally need to specify which log properties should be adjusted for the driver and executors):

#!/bin/bash
# Cluster init script: append custom log4j properties on both the driver and the executors.
echo "Executing on Driver: $DB_IS_DRIVER"
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties"
else
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties"
fi
echo "Adjusting log4j.properties here: ${LOG4J_PATH}"
# Replace <custom-prop> and <value> with the log4j property you want to override.
echo "log4j.<custom-prop>=<value>" >> ${LOG4J_PATH}

In the notebook, you can disable logging by using:

sc.setLogLevel("OFF");

Additionally, in the cluster's Spark config, you can set shorter retention for Delta files:

spark.databricks.delta.logRetentionDuration 3 days
spark.databricks.delta.deletedFileRetentionDuration 3 days

alejandrofm
Valued Contributor

Not the ideal solution, and I won't implement something like this in prod, but it's the best answer available. Thanks!
