Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Orphan (?) files on Databricks S3 bucket

alejandrofm
Valued Contributor

Hi, I'm seeing a lot of empty (and non-empty) directories under paths like:

xxxxxx.jobs/FileStore/job-actionstats/

xxxxxx.jobs/FileStore/job-result/

xxxxxx.jobs/command-results/

Can I create a lifecycle rule to delete old objects (files/directories)? After how many days? What is the best practice for this case?

Are there other directories that need a lifecycle configuration?

Thanks!


4 REPLIES

Hubert-Dudek
Esteemed Contributor III

In the Admin Console, there are storage purge options that you can use.

For everything that is configurable (database locations, checkpoints), manage retention with your own storage controls.

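For example, a lifecycle rule on one of those prefixes could be set with boto3 along the lines of the sketch below. The bucket name, prefix, and 7-day expiration are placeholders taken from the paths in the question; verify which prefixes are safe to expire first, and note that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so include every rule you want to keep.

import boto3

s3 = boto3.client("s3")

# Example only: bucket, prefix, and retention days are placeholders to adapt.
# Warning: this call overwrites any existing lifecycle rules on the bucket.
s3.put_bucket_lifecycle_configuration(
    Bucket="xxxxxx.jobs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-job-results",
                "Filter": {"Prefix": "FileStore/job-result/"},
                "Status": "Enabled",
                "Expiration": {"Days": 7},
            }
        ]
    },
)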

alejandrofm
Valued Contributor

Hi! I didn't know that; I'm purging right now. Is there a way to schedule that so logs are retained for a shorter time? For example, I might want to keep only the last 7 days of everything.

Thanks!

Hubert-Dudek
Esteemed Contributor III (Accepted Solution)

A year ago, during office hours, I asked for purging cluster logs to be added to the API, but it was not even considered. I've thought about setting up Selenium to automate it.

You can limit logging for the cluster by adjusting log4j. For example, put a .sh script like the one below on DBFS and set it as the cluster's init script (you additionally need to specify which log properties should be adjusted for drivers and executors):

#!/bin/bash
# Pick the log4j.properties file for the driver or the executor,
# depending on which node this init script is running on.
echo "Executing on Driver: $DB_IS_DRIVER"
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties"
else
  LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties"
fi
echo "Adjusting log4j.properties here: ${LOG4J_PATH}"
# Append the desired log4j property (replace <custom-prop> and <value>).
echo "log4j.<custom-prop>=<value>" >> ${LOG4J_PATH}

In the notebook, you can disable logging by using:

sc.setLogLevel("OFF");

Additionally, in the cluster's Spark config, you can shorten the retention for Delta files:

spark.databricks.delta.logRetentionDuration 3 days
spark.databricks.delta.deletedFileRetentionDuration 3 days
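Note that shortening the deleted-file retention does not free S3 space by itself; old data files are only removed when VACUUM runs on the table. As a sketch, the same retention can also be set per table and the table then vacuumed (the table name below is a placeholder):

# Example: per-table retention instead of a cluster-wide setting.
# "my_db.my_table" is a placeholder table name.
spark.sql("""
  ALTER TABLE my_db.my_table SET TBLPROPERTIES (
    'delta.logRetentionDuration' = 'interval 3 days',
    'delta.deletedFileRetentionDuration' = 'interval 3 days'
  )
""")
# VACUUM removes data files older than the configured retention.
spark.sql("VACUUM my_db.my_table")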

alejandrofm
Valued Contributor

Not the ideal solution, and I won't implement something like this in prod, but it's the best answer available. Thanks!
