Modern data platforms rely heavily on distributed processing frameworks like Apache Spark and Databricks to efficiently process massive datasets. While this distributed architecture enables high scalability and performance, it also introduces certain operational challenges — particularly when it comes to managing output files.
One of the most common situations data engineers encounter is writing a DataFrame to storage systems such as Amazon S3 or Azure Data Lake Storage: instead of producing a single file, Spark generates multiple part-* files inside a directory. Although this behavior aligns perfectly with Spark's distributed design, it can create confusion and inefficiencies in real-world workflows.
In many business scenarios, teams require a single output file with a meaningful name — whether it is for reporting, sharing data with external partners, feeding legacy systems, or enabling simple manual validation. Searching through directories to locate the correct part-* file is not only inconvenient but also introduces unnecessary operational overhead.
In this article, we will explore a simple yet effective pattern for generating a single Parquet or CSV file with a business-friendly name in Databricks. This approach preserves the scalability of Spark processing while delivering cleaner, more predictable outputs that are easier for both engineers and business users to consume.
In Databricks and Spark-based systems, writing data using the default behavior often results in multiple part-* files with system-generated names. This creates operational challenges for downstream consumers. Teams frequently struggle with identifying the correct output file, managing file handoffs, implementing reprocessing logic, and supporting automation workflows that expect a single, consistently named file.
This becomes especially problematic when files are shared with external teams, consumed by legacy systems, or manually validated. In such cases, users are forced to inspect folders and identify the correct part-* file instead of directly accessing a meaningful file name.
Writing a single Parquet or CSV file with a business-friendly name helps address these issues by simplifying data consumption, improving traceability, and reducing operational overhead. It enables predictable file paths for automation, easier debugging and validation, cleaner handoffs to stakeholders, and better alignment with enterprise data governance practices — while still leveraging Spark’s distributed processing capabilities upstream.
If you’ve worked with Apache Spark / Databricks, you’ve likely encountered this scenario:
You write a DataFrame to S3 or ADLS
Spark creates a folder:
summary_variables/
├── part-00000-3c4a.snappy.parquet
└── _SUCCESS
This is Spark’s default behavior — it is distributed by design.
However, in many real-world scenarios this becomes problematic: downstream consumers have to inspect the directory and hunt for the right part-* file.
What we actually want is something like:
db_user_data_variables_20250118_143215.parquet
A single file with a clear, meaningful name that can be easily consumed by both systems and users.
Spark always writes data in parallel.
Each executor writes its own output → hence multiple part-* files.
Even if your DataFrame is small, Spark does not know that upfront.
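To make the one-file-per-partition rule concrete, here is a pure-Python illustration of the naming scheme (no Spark required). The task id and extension are simplified assumptions; real Spark file names also embed a per-job UUID and task attempt details:

```python
def expected_part_names(num_partitions, task_id="3c4a", ext="snappy.parquet"):
    """Sketch of Spark's output naming: one part-NNNNN file per partition."""
    return [f"part-{i:05d}-{task_id}.{ext}" for i in range(num_partitions)]

print(expected_part_names(3))
# -> ['part-00000-3c4a.snappy.parquet',
#     'part-00001-3c4a.snappy.parquet',
#     'part-00002-3c4a.snappy.parquet']
```

A DataFrame with three partitions therefore lands as three part-* files, regardless of how small the data is.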
The Idea: Write → Rename → Clean-up
The trick is simple and reliable: write to a temporary directory, locate the single part-* file Spark produced, move it to the final path under a meaningful name, and delete the temporary directory.
This gives you a single, predictably named file while the heavy processing still runs on Spark.
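The same write → rename → clean-up flow can be sketched on a local filesystem using only Python's standard library in place of dbutils.fs (the function and paths here are illustrative, not part of the Databricks code below):

```python
import os
import shutil

def rename_single_output(tmp_dir, final_dir, final_name):
    """Find the single part-* file in tmp_dir, move it to final_dir/final_name,
    then delete tmp_dir -- mirroring steps 2-4 of the Databricks version."""
    parts = [f for f in os.listdir(tmp_dir) if f.startswith("part-")]
    assert len(parts) == 1, "expected exactly one part file"
    os.makedirs(final_dir, exist_ok=True)
    dest = os.path.join(final_dir, final_name)
    shutil.move(os.path.join(tmp_dir, parts[0]), dest)
    shutil.rmtree(tmp_dir)
    return dest
```

On Databricks the moves go through dbutils.fs instead, but the logic is identical.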
Step-by-Step Implementation: Parquet
import datetime
timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
tmp_path = "s3://dl-databricks/credit_report/data_variables/_tmp_db"
final_path = "s3://dl-databricks/credit_report/data_variables/"
final_file = f"db_user_data_variables_{timestamp}.parquet"
# Step 1: Write Spark DataFrame to temporary path
(
result_spark_df
.coalesce(1) # Force single part file
.write
.mode("overwrite")
.parquet(tmp_path)
)
# Step 2: Find the generated part file
part_file = [
f.path for f in dbutils.fs.ls(tmp_path)
if f.name.startswith("part-")
][0]
# Step 3: Rename the file to a meaningful name
dbutils.fs.mv(part_file, f"{final_path}{final_file}")
# Step 4: Clean up temp directory
dbutils.fs.rm(tmp_path, recurse=True)
print(f"Parquet written successfully: {final_path}{final_file}")
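One fragile spot in Step 2 is the bare `[0]` index: if the write failed and no part file exists, it raises an opaque IndexError. A small helper (a sketch, not part of the original code) makes the failure explicit and also catches the unexpected case of multiple part files:

```python
def pick_part_file(entries):
    """Return the single 'part-*' name from a directory listing,
    failing loudly if zero or several are present."""
    parts = [e for e in entries if e.startswith("part-")]
    if len(parts) != 1:
        raise FileNotFoundError(
            f"expected exactly one part-* file, found {len(parts)}: {parts}")
    return parts[0]

print(pick_part_file(["_SUCCESS", "part-00000-3c4a.snappy.parquet"]))
# -> part-00000-3c4a.snappy.parquet
```

On Databricks you would pass it the names from `dbutils.fs.ls(tmp_path)`.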
Why This Works
coalesce(1) guarantees that Spark emits exactly one part-* file, and dbutils.fs.mv then moves it to the final name (on object stores like S3 this is a copy plus delete). The heavy lifting still happens in distributed Spark; only the final, already-small output is funneled into one file. The same pattern applies to CSV output.
Step-by-Step Implementation: CSV
import datetime
timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
tmp_path = "s3://dl-databricks/credit_report/data_variables/_tmp_cb_csv"
final_path = "s3://dl-databricks/credit_report/data_variables/"
final_file = f"db_user_data_variables_{timestamp}.csv"
(
result_spark_df
.coalesce(1)
.write
.mode("overwrite")
.option("header", "true")
.csv(tmp_path)
)
part_file = [
f.path for f in dbutils.fs.ls(tmp_path)
if f.name.startswith("part-")
][0]
dbutils.fs.mv(part_file, f"{final_path}{final_file}")
dbutils.fs.rm(tmp_path, recurse=True)
print(f"CSV written successfully: {final_path}{final_file}")
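Since the Parquet and CSV versions differ only in extension and writer call, the file-naming convention is worth centralizing. A small sketch (the function name and prefix are illustrative):

```python
import datetime

def business_file_name(prefix, ext, now=None):
    """Build a business-friendly file name such as
    'db_user_data_variables_20250118_143215.parquet'."""
    now = now or datetime.datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.{ext}"

print(business_file_name("db_user_data_variables", "csv",
                         datetime.datetime(2025, 1, 18, 14, 32, 15)))
# -> db_user_data_variables_20250118_143215.csv
```

Passing `now` explicitly also makes the naming easy to unit-test.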
This approach works for any Spark file format: Parquet, CSV, JSON, ORC, and Avro are all supported.
Use coalesce(1) only when your goal is to generate a single output file for convenience, not for performance. Avoid it in performance-critical or large-scale workloads: coalesce(1) funnels all the data through a single task on one executor, which can create a serious bottleneck or even fail the job if the dataset is too large for that task to handle.
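One lightweight safeguard is to check the row count before coalescing and fall back to normal parallel output when the data is too large. This is a sketch; the threshold is an assumption you would tune for your cluster, and `df.count()` does add an extra pass over the data:

```python
def single_file_partitions(row_count, max_rows=5_000_000):
    """Return 1 when the dataset is small enough to coalesce into a single
    output file; otherwise raise so the caller keeps parallel output."""
    if row_count > max_rows:
        raise ValueError(
            f"{row_count} rows exceeds single-file threshold {max_rows}; "
            "skip coalesce(1) and keep multiple part files")
    return 1
```

Usage would look like `df.coalesce(single_file_partitions(df.count()))`, with the ValueError routed to the parallel-write path.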
This is a simple but powerful Databricks pattern that many teams struggle with initially. Once you adopt it, output handoffs become predictable, downstream automation can rely on stable file names, and nobody has to dig through directories for part-* files.
If you found this useful, feel free to share it with your Databricks community.
Regards,
@Hari_Vignesh_R (Senior Data Engineer)