<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Avoiding metadata information when sending data to GCS in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/avoiding-metadata-information-when-sending-data-to-gcs/m-p/119777#M45969</link>
    <description>&lt;P&gt;Thanks a lot&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/33816"&gt;@cgrant&lt;/a&gt;&amp;nbsp;. This removed '_started_...' and '_committed_..', but still generated the _SUCCESS file.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")&lt;/LI-CODE&gt;&lt;P&gt;This removed the _SUCCESS files as well.&lt;/P&gt;</description>
    <pubDate>Tue, 20 May 2025 14:41:52 GMT</pubDate>
    <dc:creator>aswinvishnu</dc:creator>
    <dc:date>2025-05-20T14:41:52Z</dc:date>
    <item>
      <title>Avoiding metadata information when sending data to GCS</title>
      <link>https://community.databricks.com/t5/data-engineering/avoiding-metadata-information-when-sending-data-to-gcs/m-p/119543#M45905</link>
      <description>&lt;P&gt;Hi all,&lt;BR /&gt;&lt;BR /&gt;I have a use case where I need to push the table data to a GCS bucket,&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;query = "${QUERY}"

df = spark.sql(query)

gcs_path = "${GCS_PATH}"

df.write.option("maxRecordsPerFile", int("${MAX_RECORDS_PER_FILE}")).mode("${MODE}").json(gcs_path)&lt;/LI-CODE&gt;&lt;P&gt;This can push the results of the query to GCS, but this is generating some metadata files in the location&lt;BR /&gt;'_started_...'&lt;/P&gt;&lt;P&gt;'_committed_..'&lt;/P&gt;&lt;P&gt;I want to avoid this as I can't easily do post-processing in the bucket. Any help is appreciated.&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Aswin Vishnu&lt;/P&gt;</description>
      <pubDate>Sun, 18 May 2025 03:46:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/avoiding-metadata-information-when-sending-data-to-gcs/m-p/119543#M45905</guid>
      <dc:creator>aswinvishnu</dc:creator>
      <dc:date>2025-05-18T03:46:45Z</dc:date>
    </item>
    <item>
      <title>Re: Avoiding metadata information when sending data to GCS</title>
      <link>https://community.databricks.com/t5/data-engineering/avoiding-metadata-information-when-sending-data-to-gcs/m-p/119686#M45943</link>
      <description>&lt;P&gt;Databricks has a special DBIO protocol that uses the _started and _committed files to transactionally write to cloud storage.&lt;/P&gt;
&lt;P&gt;You can disable this by setting the below spark config&lt;/P&gt;
&lt;PRE class="cm-s-eclipse capture-run-mode" data-reactid=".0.2.0.0.0.1:$0006b195-5e3a-4100-9c14-503c5e7d0e93.4.0.0.0"&gt;&lt;SPAN class="cm-comment"&gt;spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")&lt;/SPAN&gt;&lt;/PRE&gt;
&lt;P&gt;Also, you can read more about DBIO &lt;A href="https://www.databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html" target="_self"&gt;here&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 04:59:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/avoiding-metadata-information-when-sending-data-to-gcs/m-p/119686#M45943</guid>
      <dc:creator>cgrant</dc:creator>
      <dc:date>2025-05-20T04:59:05Z</dc:date>
    </item>
    <item>
      <title>Re: Avoiding metadata information when sending data to GCS</title>
      <link>https://community.databricks.com/t5/data-engineering/avoiding-metadata-information-when-sending-data-to-gcs/m-p/119777#M45969</link>
      <description>&lt;P&gt;Thanks a lot&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/33816"&gt;@cgrant&lt;/a&gt;&amp;nbsp;. This removed '_started_...' and '_committed_..', but still generated the _SUCCESS file.&lt;/P&gt;&lt;LI-CODE lang="python"&gt;spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")&lt;/LI-CODE&gt;&lt;P&gt;This removed the _SUCCESS files as well.&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 14:41:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/avoiding-metadata-information-when-sending-data-to-gcs/m-p/119777#M45969</guid>
      <dc:creator>aswinvishnu</dc:creator>
      <dc:date>2025-05-20T14:41:52Z</dc:date>
    </item>
  </channel>
</rss>

