08-01-2018 09:36 PM
Is there a way to prevent the _SUCCESS and _committed files in my output? It's a tedious task to navigate to all the partitions and delete the files.
Note: the final output is stored in Azure ADLS.
08-03-2018 04:15 AM
This was recommended on Stack Overflow, though I haven't tested it with ADLS yet.
# Tell the Hadoop output committer not to write the _SUCCESS marker
sc._jsc.hadoopConfiguration().set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Note that this setting may affect the whole cluster.
You could also add a dbutils.fs.rm step to remove any files that were created.
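For the cleanup approach, here is a minimal sketch using dbutils.fs; the output path and the recursive helper are illustrative, not part of the original reply:

# Recursively delete Spark marker files under an output directory.
# The path below is a placeholder; point it at your ADLS mount.
output_path = "dbfs:/mnt/adls/output"

def remove_marker_files(path):
    # Recurse into partition subdirectories and delete any
    # _SUCCESS, _committed_* or _started_* files found.
    for f in dbutils.fs.ls(path):
        if f.isDir():
            remove_marker_files(f.path)
        elif f.name.startswith(("_SUCCESS", "_committed", "_started")):
            dbutils.fs.rm(f.path)

remove_marker_files(output_path)

This leaves the data files themselves untouched.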
cheers,
Andrew
08-07-2018 04:30 AM
This solution works in my local IntelliJ setup, but not in the Databricks notebook setup.
08-07-2018 08:46 AM
Did you try with a new Databricks cluster using initialization scripts?
https://docs.databricks.com/user-guide/clusters/init-scripts.html
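If you go the init-script route, one pattern from Databricks knowledge-base examples is to have the script drop a Spark config fragment into /databricks/driver/conf/. The script name, conf path, and [driver] fragment format below are assumptions based on those examples and may vary by runtime version; treat this as a sketch, not a verified recipe:

# Hypothetical: write a cluster-scoped init script to DBFS that sets the
# committer property via a driver conf fragment (paths/format assumed).
dbutils.fs.put("dbfs:/databricks/init-scripts/no-success-files.sh", """#!/bin/bash
cat << 'EOF' > /databricks/driver/conf/00-no-success-files.conf
[driver] {
  "spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs" = "false"
}
EOF
""", True)

You would then attach the script to the cluster in its init-scripts settings and restart the cluster.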
01-24-2020 04:53 AM
A combination of the three properties below will disable writing all of the transactional files whose names start with "_".
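The three property settings did not survive in this copy of the post. Two of them appear elsewhere in this thread; the third (overriding the commit protocol so the _started_*/_committed_* files are not written) is the setting commonly cited alongside them and is an assumption here, not confirmed by this thread:

# Assumed third property: fall back to stock Spark's commit protocol
# so the _started_* / _committed_* transactional files are not written
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
# Suppress the _SUCCESS marker (same property as the 2018 reply above)
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
# Suppress the Parquet _metadata / _common_metadata summary files
spark.conf.set("parquet.enable.summary-metadata", "false")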
05-12-2020 07:22 PM
This is very helpful, thanks for the information. To add to it: if somebody wants to disable this at the cluster level for Spark 2.4.5, they can edit the Spark cluster -> Advanced Options and add the properties above, but there you need to use the <variable> <value> format, like below:
parquet.enable.summary-metadata false
If you want to set it in a Databricks notebook, you can do it like this:
spark.conf.set("parquet.enable.summary-metadata", "false")
06-04-2022 11:57 AM
To remove existing _SUCCESS, _committed, and _started files from a Delta table location, you can run VACUUM:
spark.sql("VACUUM '<file-location>' RETAIN 0 HOURS")