Databricks Community

Eyespoop · ‎06-23-2022

Currently I am having some issues with the writing of the parquet file in the Storage Container. I do have the codes running but whenever the dataframe writer puts the parquet to the blob storage instead of the parquet file type, it is created as a folder type with many files content to it.

One note to it, I tried searching various ways in the internets that it seems this is default creation when using pyspark and I can see in the folder created there was a file parquet with a snappy add into it (refer to the screenshots below)

If this is a default creation to pyspark code, how can I write a single parquet format that will do some splitting or creation of folder? any recommendations? on how to do it?

Anonymous · ‎06-24-2022

When you write a file, it uses the default compression if you don't specify it. The default compression is snappy, so that's expected + desired behavior.

Parquet is meant to be splittable. It also needs to create the other files that begin with the underscore to ensure you don't get partial or broken writes.

What exactly are you trying to do?

View solution in original post

Anonymous · ‎06-24-2022

When you write a file, it uses the default compression if you don't specify it. The default compression is snappy, so that's expected + desired behavior.

Parquet is meant to be splittable. It also needs to create the other files that begin with the underscore to ensure you don't get partial or broken writes.

What exactly are you trying to do?

Eyespoop · ‎06-27-2022

Already found out that this is already the behaviour, so to make it work, currently all the wrangled folders are being deleted and the file parquet contents inside the wrangled folder are already being moved outside the folder while renaming it. That's the solution I see from my goal to dump a single parquet file on the container with no wrangled folders.

Thank you @Joseph Kambourakis

User16764241763 · ‎06-27-2022

Hello @Karl Saycon

Can you try setting this config to prevent additional parquet summary and metadata files from being written? The result from dataframe write to storage should be a single file.

https://community.databricks.com/s/question/0D53f00001HKHiNCAX/how-do-i-prevent-success-and-committe...

A combination of below three properties will help to disable writing all the transactional files which start with "_".

We can disable the transaction logs of spark parquet write using "spark.sql.sources.commitProtocolClass = org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol". This will help to disable the "committed<TID>" and "started<TID>" files but still _SUCCESS, _common_metadata and _metadata files will generate.
We can disable the _common_metadata and _metadata files using "parquet.enable.summary-metadata=false".
We can also disable the _SUCCESS file using "mapreduce.fileoutputcommitter.marksuccessfuljobs=false".

amarv · ‎07-28-2025

This is my approach:

from databricks.sdk.runtime import dbutils
from pyspark.sql.types import DataFrame

output_base_url = "abfss://..."


def write_single_parquet_file(df: DataFrame, filename: str):
    print(f"Writing '{filename}.parquet' to ABFS")
    filepath = f"{output_base_url}/{filename}.parquet"
    temp_filepath = f"{output_base_url}/temp_{filename}.parquet"

    # Write a temporary folder containing one Parquet file
    df.repartition(1).write.format("parquet").save(temp_filepath, mode="overwrite")

    # Find the Parquet file in the temporary folder
    files = dbutils.fs.ls(temp_filepath)
    output_file = next(x for x in files if x.name.startswith("part-"))

    # Delete filepath, if it exists
    try:
        dbutils.fs.ls(filepath)
        # Filepath exists
        print(f"Deleting old {filepath}")
        dbutils.fs.rm(filepath, recurse=True)
    except Exception as e:
        if "java.io.FileNotFoundException" not in str(e):
            raise e

    # Move the Parquet file to the final location
    dbutils.fs.mv(output_file.path, filepath)

    # Delete temporary folder
    dbutils.fs.rm(temp_filepath, True)
    print(f"Wrote '{filepath}'")