Databricks Community

aswinvishnu · ‎05-17-2025

Hi all,

I have a use case where I need to push table data to a GCS bucket:

query = "${QUERY}"
df = spark.sql(query)
gcs_path = "${GCS_PATH}"
df.write.option("maxRecordsPerFile", int("${MAX_RECORDS_PER_FILE}")).mode("${MODE}").json(gcs_path)

This pushes the query results to GCS, but it also generates some metadata files in the output location:

'_started_...'
'_committed_..'

I want to avoid these files, as I can't easily post-process the contents of the bucket. Any help is appreciated.

Thanks,

Aswin Vishnu

cgrant · ‎05-19-2025

Databricks uses a special DBIO commit protocol that relies on the _started and _committed files to write transactionally to cloud storage.

You can disable this by setting the following Spark config:

spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")

Also, you can read more about DBIO here
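
Since the setting above applies to the whole Spark session, one option is to scope it to a single write and restore the previous value afterwards, so other writes keep the DBIO behaviour. The following is only a minimal sketch: `hadoop_commit_protocol` is a hypothetical helper name, and it assumes `spark` is the active session and that your cluster allows this config to be changed at runtime.

```python
from contextlib import contextmanager

COMMIT_PROTOCOL_KEY = "spark.sql.sources.commitProtocolClass"
HADOOP_COMMIT_PROTOCOL = (
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol"
)


@contextmanager
def hadoop_commit_protocol(spark):
    """Hypothetical helper: use the Hadoop commit protocol only inside the block."""
    # Remember whatever was configured before (None if nothing was set).
    previous = spark.conf.get(COMMIT_PROTOCOL_KEY, None)
    spark.conf.set(COMMIT_PROTOCOL_KEY, HADOOP_COMMIT_PROTOCOL)
    try:
        yield
    finally:
        # Restore the earlier state so later writes are unaffected.
        if previous is None:
            spark.conf.unset(COMMIT_PROTOCOL_KEY)
        else:
            spark.conf.set(COMMIT_PROTOCOL_KEY, previous)
```

Usage would look like `with hadoop_commit_protocol(spark): df.write.mode(mode).json(gcs_path)`. Keep in mind that writes made inside the block no longer go through the DBIO transactional commit described above.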

aswinvishnu · ‎05-20-2025

Thanks a lot @cgrant. This removed the '_started_...' and '_committed_..' files, but a _SUCCESS file was still generated. Setting the following removed the _SUCCESS file as well:

spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
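
For anyone landing here later, the two settings from this thread can be combined with the original write roughly as follows. This is a sketch rather than a tested recipe: `write_json_without_markers` and its parameter names are hypothetical, `spark` is assumed to be the active session, and the query, path, mode, and record limit are whatever your job supplies.

```python
# The two settings discussed in this thread.
NO_MARKER_CONF = {
    # Use the Hadoop commit protocol, so no _started_/_committed_ files are written.
    "spark.sql.sources.commitProtocolClass": (
        "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol"
    ),
    # Do not write the _SUCCESS marker when the job finishes.
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def write_json_without_markers(spark, query, gcs_path, mode, max_records_per_file):
    """Hypothetical helper: run a query and write JSON files without marker files."""
    # Apply both settings to the session before writing.
    for key, value in NO_MARKER_CONF.items():
        spark.conf.set(key, value)

    df = spark.sql(query)
    (
        df.write
        .option("maxRecordsPerFile", int(max_records_per_file))
        .mode(mode)
        .json(gcs_path)
    )
```

Both settings stay active on the session after the call, so subsequent writes in the same session are affected as well.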


Avoiding metadata information when sending data to GCS
