05-17-2025 08:46 PM
Hi all,
I have a use case where I need to push table data to a GCS bucket:
query = "${QUERY}"
df = spark.sql(query)
gcs_path = "${GCS_PATH}"
df.write.option("maxRecordsPerFile", int("${MAX_RECORDS_PER_FILE}")).mode("${MODE}").json(gcs_path)
This pushes the results of the query to GCS, but it also generates some metadata files in the location:
'_started_...'
'_committed_..'
I want to avoid this because I can't easily do post-processing in the bucket. Any help is appreciated.
Thanks,
Aswin Vishnu
05-19-2025 09:59 PM
Databricks has a special DBIO protocol that uses the _started and _committed files to transactionally write to cloud storage.
You can disable this by setting the following Spark config:
spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol")
Also, you can read more about DBIO here
05-20-2025 07:41 AM
Thanks a lot @cgrant. This removed the '_started_...' and '_committed_..' files, but a _SUCCESS file was still generated.
spark.conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
This removed the _SUCCESS file as well.
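For anyone landing here later, the two settings from this thread can be combined into one step that runs before the write. The following is only a minimal sketch: the config keys and values are the ones quoted above, while the helper name `disable_marker_files` and the constant names are hypothetical and not part of any Databricks or Spark API. It assumes a `spark` session object like the one available in a Databricks notebook.

```python
# Config keys and values are the ones quoted in this thread.
COMMIT_PROTOCOL_KEY = "spark.sql.sources.commitProtocolClass"
COMMIT_PROTOCOL_CLASS = (
    "org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol"
)
SUCCESS_MARKER_KEY = "mapreduce.fileoutputcommitter.marksuccessfuljobs"


def disable_marker_files(spark):
    """Hypothetical helper: apply both settings to the given session."""
    # Switches away from the DBIO commit protocol, which is what
    # writes the _started_... and _committed_... files.
    spark.conf.set(COMMIT_PROTOCOL_KEY, COMMIT_PROTOCOL_CLASS)
    # Stops the Hadoop output committer from writing _SUCCESS.
    spark.conf.set(SUCCESS_MARKER_KEY, "false")


# Usage in a notebook where `spark`, `df` and `gcs_path` already exist:
# disable_marker_files(spark)
# df.write.mode("${MODE}").json(gcs_path)
```

Since the reply above describes DBIO as the mechanism for transactional writes to cloud storage, switching it off presumably gives up that guarantee, so it may be worth applying these settings only in the job that needs a clean bucket.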