10-04-2022 11:27 AM
Hi, I'm seeing a lot of directories (some empty, some not) under paths like:
xxxxxx.jobs/FileStore/job-actionstats/
xxxxxx.jobs/FileStore/job-result/
xxxxxx.jobs/command-results/
Can I create a lifecycle rule to delete old objects (files/directories)? After how many days should they expire? What is the best practice for this case?
Are there other directories that need a lifecycle configuration?
Thanks!
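For reference, a lifecycle rule like the one below is roughly what I have in mind, assuming the workspace storage is an S3 bucket (the bucket name, rule ID, and 7-day expiration are placeholders I made up; the prefix is one of the paths above):
# Sketch only: bucket name, rule ID and expiration days are placeholders
aws s3api put-bucket-lifecycle-configuration \
  --bucket <my-root-bucket> \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-job-results",
        "Filter": { "Prefix": "xxxxxx.jobs/FileStore/job-result/" },
        "Status": "Enabled",
        "Expiration": { "Days": 7 }
      }
    ]
  }'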
10-18-2022 07:10 AM
A year ago during office hours, I asked for purging cluster logs to be added to the API, and it was not even considered. I was thinking of setting up Selenium to do that.
You can limit logging for the cluster by adjusting log4j, for example by putting an .sh script like the one below on DBFS as an init script for the cluster (you additionally need to specify which log properties should be adjusted for the driver and executors):
#!/bin/bash
echo "Executing on Driver: $DB_IS_DRIVER"
if [[ $DB_IS_DRIVER = "TRUE" ]]; then
LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/driver/log4j.properties"
else
LOG4J_PATH="/home/ubuntu/databricks/spark/dbconf/log4j/executor/log4j.properties"
fi
echo "Adjusting log4j.properties here: ${LOG4J_PATH}"
echo "log4j.<custom-prop>=<value>" >> ${LOG4J_PATH}
In the notebook, you can disable logging by using:
sc.setLogLevel("OFF");
Additionally, in the cluster configuration you can set the following retention properties for Delta files:
spark.databricks.delta.logRetentionDuration 3 days
spark.databricks.delta.deletedFileRetentionDuration 3 days
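If you would rather apply those two settings programmatically instead of through the cluster UI, a rough sketch against the Clusters 2.0 edit endpoint could look like this (host, token, cluster id, Spark version, node type, and worker count are placeholders, and the retention values just mirror the ones above; note that clusters/edit replaces the whole cluster spec, so in practice you would send your cluster's full existing configuration):
# Sketch only: all identifiers below are placeholders for your own cluster spec
curl -X POST "https://${DATABRICKS_HOST}/api/2.0/clusters/edit" \
  -H "Authorization: Bearer ${DATABRICKS_TOKEN}" \
  -d '{
    "cluster_id": "<cluster-id>",
    "spark_version": "<spark-version>",
    "node_type_id": "<node-type-id>",
    "num_workers": 2,
    "spark_conf": {
      "spark.databricks.delta.logRetentionDuration": "3 days",
      "spark.databricks.delta.deletedFileRetentionDuration": "3 days"
    }
  }'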
10-13-2022 02:51 PM
Hi! I didn't know that; purging right now. Is there a way to schedule it so logs are retained for less time, say keeping only the last 7 days of everything?
Thanks!
10-28-2022 08:38 PM
Not the best solution, and I won't implement something like this in prod, but it's the best answer available. Thanks!