Avoiding metadata information when sending data to GCS

aswinvishnu
New Contributor II

Hi all,

I have a use case where I need to push table data to a GCS bucket:

query = "${QUERY}"

df = spark.sql(query)

gcs_path = "${GCS_PATH}"

df.write.option("maxRecordsPerFile", int("${MAX_RECORDS_PER_FILE}")).mode("${MODE}").json(gcs_path)

This pushes the results of the query to GCS, but it also generates some metadata files in the target location:
'_started_...'

'_committed_..'

I want to avoid this because I can't easily do post-processing in the bucket. Any help is appreciated.

Thanks,

Aswin Vishnu

cgrant
Databricks Employee

Databricks has a special DBIO protocol that uses the _started and _committed files to transactionally write to cloud storage.

You can disable this by setting the following Spark config:

spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

Also, you can read more about DBIO here


aswinvishnu
New Contributor II

Thanks a lot @cgrant. This removed the '_started_...' and '_committed_..' files, but a _SUCCESS file was still generated.

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

This removed the _SUCCESS files as well.
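
For anyone landing here later, the two settings from this thread can be applied together before the write. This is only a minimal sketch: the helper name disable_marker_files and the COMMIT_SETTINGS dict are made up for illustration, and it assumes `spark` is the active SparkSession (as it is in a Databricks notebook). Keep in mind that switching away from the DBIO commit protocol also gives up the transactional write behavior cgrant described.

```python
# Settings quoted in this thread; the dict and helper names are hypothetical.
COMMIT_SETTINGS = {
    # Use the standard Hadoop commit protocol instead of DBIO, so the
    # _started_... / _committed_... files are not written.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol",
    # Stop the output committer from writing the _SUCCESS marker file.
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def disable_marker_files(spark):
    """Apply both settings to the given session via spark.conf.set."""
    for key, value in COMMIT_SETTINGS.items():
        spark.conf.set(key, value)
```

Calling disable_marker_files(spark) once, before the df.write line in the original post, should be enough for the rest of that session.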