Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Hari_Vignesh_R
Databricks Partner


Introduction

Modern data platforms rely heavily on distributed processing frameworks like Apache Spark and Databricks to efficiently process massive datasets. While this distributed architecture enables high scalability and performance, it also introduces certain operational challenges — particularly when it comes to managing output files.

One of the most common situations data engineers encounter is writing a DataFrame to storage systems such as Amazon S3 or Azure Data Lake Storage. Instead of producing a single file, Spark generates multiple part-* files inside a directory. Although this behavior aligns perfectly with Spark’s distributed design, it can create confusion and inefficiencies in real-world workflows.

In many business scenarios, teams require a single output file with a meaningful name — whether it is for reporting, sharing data with external partners, feeding legacy systems, or enabling simple manual validation. Searching through directories to locate the correct part-* file is not only inconvenient but also introduces unnecessary operational overhead.

In this article, we will explore a simple yet effective pattern for generating a single Parquet or CSV file with a business-friendly name in Databricks. This approach preserves the scalability of Spark processing while delivering cleaner, more predictable outputs that are easier for both engineers and business users to consume.

Problem Statement

In Databricks and Spark-based systems, writing data using the default behavior often results in multiple part-* files with system-generated names. This creates operational challenges for downstream consumers. Teams frequently struggle with identifying the correct output file, managing file handoffs, implementing reprocessing logic, and supporting automation workflows that expect a single, consistently named file.

This becomes especially problematic when files are shared with external teams, consumed by legacy systems, or manually validated. In such cases, users are forced to inspect folders and identify the correct part-* file instead of directly accessing a meaningful file name.

Writing a single Parquet or CSV file with a business-friendly name helps address these issues by simplifying data consumption, improving traceability, and reducing operational overhead. It enables predictable file paths for automation, easier debugging and validation, cleaner handoffs to stakeholders, and better alignment with enterprise data governance practices — while still leveraging Spark’s distributed processing capabilities upstream.

If you’ve worked with Apache Spark / Databricks, you’ve likely encountered this scenario:

You write a DataFrame to S3 or ADLS

Spark creates a folder:

summary_variables/
 ├── part-00000-3c4a.snappy.parquet
 ├── _SUCCESS

This is Spark’s default behavior — it is distributed by design.

However, in many real-world scenarios, this becomes problematic:

  • Hard to identify what data was written
  • Painful for manual validation
  • Confusing for downstream users
  • Not ideal for reports, extracts, or partner deliveries

What we actually want is something like:

db_user_data_variables_20250118_143215.parquet

A single file with a clear, meaningful name that can be easily consumed by both systems and users.

Why Spark Creates Part Files

Spark always writes data in parallel.
 Each executor writes its own output → hence multiple part-* files.

Even if your DataFrame is small, Spark does not know that upfront.

The Idea: Write → Rename → Clean-up

The trick is simple and reliable:

  1. Force Spark to write a single part file
  2. Write to a temporary directory
  3. Rename the generated part-* file
  4. Delete the temporary directory

This gives you:

  • A single file
  • A meaningful name
  • Full compatibility with Spark

Solution for Parquet (Recommended for Analytics)

Step-by-Step Implementation:

import datetime

timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')

tmp_path = "s3://dl-databricks/credit_report/data_variables/_tmp_db"
final_path = "s3://dl-databricks/credit_report/data_variables/"
final_file = f"db_user_data_variables_{timestamp}.parquet"

# Step 1: Write Spark DataFrame to temporary path
(
    result_spark_df
    .coalesce(1)           # Force single part file
    .write
    .mode("overwrite")
    .parquet(tmp_path)
)

# Step 2: Find the generated part file
part_file = [
    f.path for f in dbutils.fs.ls(tmp_path)
    if f.name.startswith("part-")
][0]

# Step 3: Rename the file to a meaningful name
dbutils.fs.mv(part_file, f"{final_path}{final_file}")  # final_path already ends with "/"

# Step 4: Clean up temp directory
dbutils.fs.rm(tmp_path, recurse=True)

print(f"Parquet written successfully: {final_path}{final_file}")

Why This Works

  • .coalesce(1) ensures only one executor writes output
  • Spark still writes a part-* file
  • dbutils.fs.mv() renames it cleanly
  • Temporary directory is removed → no clutter                                                           
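One caveat in Step 2: the `[0]` indexing fails with a bare IndexError if the write produced no part file. A slightly defensive variant is sketched below in plain Python; `find_part_file` is a hypothetical helper, and it only assumes entries with `.name` and `.path` attributes, like the FileInfo objects returned by `dbutils.fs.ls`:

```python
def find_part_file(entries):
    """Return the path of the single part-* file among directory entries.

    `entries` can be any iterable of objects with .name and .path attributes,
    such as the FileInfo objects returned by dbutils.fs.ls on Databricks.
    Raises a clear error when zero or multiple part files are present.
    """
    matches = [e.path for e in entries if e.name.startswith("part-")]
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected exactly one part-* file, found {len(matches)}"
        )
    return matches[0]
```

On Databricks, Step 2 then becomes `part_file = find_part_file(dbutils.fs.ls(tmp_path))`, which fails loudly if the write did not produce exactly one part file.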

Solution for a Single CSV File with a Meaningful Name

Step-by-Step Implementation:

import datetime

timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')

tmp_path = "s3://dl-databricks/credit_report/data_variables/_tmp_cb_csv"
final_path = "s3://dl-databricks/credit_report/data_variables/"
final_file = f"db_user_data_variables_{timestamp}.csv"

(
    result_spark_df
    .coalesce(1)
    .write
    .mode("overwrite")
    .option("header", "true")
    .csv(tmp_path)
)

part_file = [
    f.path for f in dbutils.fs.ls(tmp_path)
    if f.name.startswith("part-")
][0]

dbutils.fs.mv(part_file, f"{final_path}{final_file}")  # final_path already ends with "/"
dbutils.fs.rm(tmp_path, recurse=True)

print(f"CSV written successfully: {final_path}{final_file}")

Supported Formats

This approach works for any Spark file format:

  • Parquet
  • CSV
  • JSON
  • ORC
  • Avro
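Since only the write step differs between formats, the whole pattern can be collected into one sketch. `write_single_file` is a hypothetical helper, not a library API; the `fs` parameter stands in for `dbutils.fs` so the rename logic stays visible in one place:

```python
def write_single_file(df, tmp_path, final_path, final_file,
                      fmt="parquet", options=None, fs=None):
    """Write `df` as one file of `fmt` (parquet, csv, json, orc, avro)
    under a business-friendly name. On Databricks, pass fs=dbutils.fs."""
    writer = df.coalesce(1).write.mode("overwrite")
    for key, value in (options or {}).items():
        writer = writer.option(key, value)   # e.g. {"header": "true"} for CSV
    writer.format(fmt).save(tmp_path)

    # Rename the single part-* file, then drop the temp directory
    part_file = next(f.path for f in fs.ls(tmp_path)
                     if f.name.startswith("part-"))
    fs.mv(part_file, f"{final_path}{final_file}")
    fs.rm(tmp_path, recurse=True)
```

On Databricks you would call it as, for example, `write_single_file(result_spark_df, tmp_path, final_path, final_file, fmt="csv", options={"header": "true"}, fs=dbutils.fs)`. Avro availability can depend on your runtime's packages.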

When Should You Use This?

Use coalesce(1) only when your goal is to generate a single output file for convenience—not for performance.

Good Use Cases

Use it when you need a single output file for convenience:

  • Reports
  • Partner deliveries
  • Ad hoc extracts
  • Debugging & validation
  • BI tool ingestion
  • One-time exports

Avoid Using It For

Avoid in performance-critical or large-scale workloads:

  • Very large datasets
  • High-volume pipelines
  • Streaming outputs

Important Note

coalesce(1) forces all data into a single executor, which can lead to performance bottlenecks or even job failures if the dataset is large.
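A rough guard can estimate the output size before collapsing everything to one file. The sketch and the 1 GiB threshold below are illustrative assumptions, not Spark rules:

```python
def safe_to_coalesce(num_rows, avg_row_bytes, max_single_file_bytes=1024**3):
    """Heuristic: allow coalesce(1) only when the estimated output fits in
    one reasonably sized file (default threshold 1 GiB, an arbitrary choice)."""
    return num_rows * avg_row_bytes <= max_single_file_bytes

# ~2M rows at ~200 bytes each is roughly 400 MB -> acceptable for one file
print(safe_to_coalesce(2_000_000, 200))  # True
```

In a pipeline, you might compute `num_rows` with `df.count()` and fall back to a normal multi-file write when the guard returns False.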

Key Takeaways

  • Spark always writes part files by design
  • You cannot disable this behavior

What You Can Control

  • Control the number of output files
  • Rename them
  • Clean up the output directory

Benefits You Gain

  • Cleaner storage
  • Better observability
  • Human-readable outputs

Final Thoughts

This is a simple but powerful Databricks pattern that many teams struggle with initially.

Once you adopt this:

  • No more guessing which data was written
  • No more opening random part-* files
  • Cleaner handoffs to business & external teams

If you found this useful, feel free to share it with your Databricks community.

Regards,
@Hari_Vignesh_R (Senior Data Engineer)
