Modern data platforms rely heavily on distributed processing frameworks like Apache Spark and Databricks to efficiently process massive datasets. While this distributed architecture enables high scalability and performance, it also introduces certain operational challenges — particularly when it comes to managing output files.
One of the most common situations data engineers encounter is writing a DataFrame to storage systems such as Amazon S3 or Azure Data Lake Storage: instead of producing a single file, Spark generates multiple part-* files inside a directory. Although this behavior aligns perfectly with Spark's distributed design, it can create confusion and inefficiencies in real-world workflows.
In many business scenarios, teams require a single output file with a meaningful name — whether it is for reporting, sharing data with external partners, feeding legacy systems, or enabling simple manual validation. Searching through directories to locate the correct part-* file is not only inconvenient but also introduces unnecessary operational overhead.
In this article, we will explore a simple yet effective pattern for generating a single Parquet or CSV file with a business-friendly name in Databricks. This approach preserves the scalability of Spark processing while delivering cleaner, more predictable outputs that are easier for both engineers and business users to consume.
In Databricks and Spark-based systems, writing data using the default behavior often results in multiple part-* files with system-generated names. This creates operational challenges for downstream consumers. Teams frequently struggle with identifying the correct output file, managing file handoffs, implementing reprocessing logic, and supporting automation workflows that expect a single, consistently named file.
This becomes especially problematic when files are shared with external teams, consumed by legacy systems, or manually validated. In such cases, users are forced to inspect folders and identify the correct part-* file instead of directly accessing a meaningful file name.
Writing a single Parquet or CSV file with a business-friendly name helps address these issues by simplifying data consumption, improving traceability, and reducing operational overhead. It enables predictable file paths for automation, easier debugging and validation, cleaner handoffs to stakeholders, and better alignment with enterprise data governance practices — while still leveraging Spark’s distributed processing capabilities upstream.
If you’ve worked with Apache Spark / Databricks, you’ve likely encountered this scenario:
You write a DataFrame to S3 or ADLS
Spark creates a folder:
summary_variables/
├── part-00000-3c4a.snappy.parquet
└── _SUCCESS
This is Spark’s default behavior — it is distributed by design.
However, in many real-world scenarios this becomes problematic: downstream consumers have to inspect the directory and hunt for the right part-* file.
What we actually want is something like:
db_user_data_variables_20250118_143215.parquet
A single file with a clear, meaningful name that can be easily consumed by both systems and users.
Spark always writes data in parallel.
Each executor writes its own output → hence multiple part-* files.
Even if your DataFrame is small, Spark does not know that upfront.
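To make the one-file-per-partition rule concrete, here is a pure-Python illustration of the naming scheme (no Spark required). The task id and extension are simplified assumptions; real Spark file names also embed a per-job UUID and task attempt details:

```python
def expected_part_names(num_partitions, task_id="3c4a", ext="snappy.parquet"):
    """Sketch of Spark's output naming: one part-NNNNN file per partition."""
    return [f"part-{i:05d}-{task_id}.{ext}" for i in range(num_partitions)]

print(expected_part_names(3))
# -> ['part-00000-3c4a.snappy.parquet',
#     'part-00001-3c4a.snappy.parquet',
#     'part-00002-3c4a.snappy.parquet']
```

A DataFrame with three partitions therefore lands as three part-* files, regardless of how small the data is.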
The Idea: Write → Rename → Clean-up
The trick is simple and reliable: write to a temporary directory, locate the single part-* file Spark produced, move it to the final path under a meaningful name, and delete the temporary directory.
This gives you a single, predictably named file while the heavy processing still runs on Spark.
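The same write → rename → clean-up flow can be sketched on a local filesystem using only Python's standard library in place of dbutils.fs (the function and paths here are illustrative, not part of the Databricks code below):

```python
import os
import shutil

def rename_single_output(tmp_dir, final_dir, final_name):
    """Find the single part-* file in tmp_dir, move it to final_dir/final_name,
    then delete tmp_dir -- mirroring steps 2-4 of the Databricks version."""
    parts = [f for f in os.listdir(tmp_dir) if f.startswith("part-")]
    assert len(parts) == 1, "expected exactly one part file"
    os.makedirs(final_dir, exist_ok=True)
    dest = os.path.join(final_dir, final_name)
    shutil.move(os.path.join(tmp_dir, parts[0]), dest)
    shutil.rmtree(tmp_dir)
    return dest
```

On Databricks the moves go through dbutils.fs instead, but the logic is identical.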
Step-by-Step Implementation: Parquet
import datetime
timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
tmp_path = "s3://dl-databricks/credit_report/data_variables/_tmp_db"
final_path = "s3://dl-databricks/credit_report/data_variables/"
final_file = f"db_user_data_variables_{timestamp}.parquet"
# Step 1: Write Spark DataFrame to temporary path
(
result_spark_df
.coalesce(1) # Force single part file
.write
.mode("overwrite")
.parquet(tmp_path)
)
# Step 2: Find the generated part file
part_file = [
f.path for f in dbutils.fs.ls(tmp_path)
if f.name.startswith("part-")
][0]
# Step 3: Rename the file to a meaningful name
dbutils.fs.mv(part_file, f"{final_path}{final_file}")
# Step 4: Clean up temp directory
dbutils.fs.rm(tmp_path, recurse=True)
print(f"Parquet written successfully: {final_path}{final_file}")
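One fragile spot in Step 2 is the bare `[0]` index: if the write failed and no part file exists, it raises an opaque IndexError. A small helper (a sketch, not part of the original code) makes the failure explicit and also catches the unexpected case of multiple part files:

```python
def pick_part_file(entries):
    """Return the single 'part-*' name from a directory listing,
    failing loudly if zero or several are present."""
    parts = [e for e in entries if e.startswith("part-")]
    if len(parts) != 1:
        raise FileNotFoundError(
            f"expected exactly one part-* file, found {len(parts)}: {parts}")
    return parts[0]

print(pick_part_file(["_SUCCESS", "part-00000-3c4a.snappy.parquet"]))
# -> part-00000-3c4a.snappy.parquet
```

On Databricks you would pass it the names from `dbutils.fs.ls(tmp_path)`.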
Why This Works
coalesce(1) guarantees that Spark emits exactly one part-* file, and dbutils.fs.mv then moves it to the final name (on object stores like S3 this is a copy plus delete). The heavy lifting still happens in distributed Spark; only the final, already-small output is funneled into one file. The same pattern applies to CSV output.
Step-by-Step Implementation: CSV
import datetime
timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')
tmp_path = "s3://dl-databricks/credit_report/data_variables/_tmp_cb_csv"
final_path = "s3://dl-databricks/credit_report/data_variables/"
final_file = f"db_user_data_variables_{timestamp}.csv"
(
result_spark_df
.coalesce(1)
.write
.mode("overwrite")
.option("header", "true")
.csv(tmp_path)
)
part_file = [
f.path for f in dbutils.fs.ls(tmp_path)
if f.name.startswith("part-")
][0]
dbutils.fs.mv(part_file, f"{final_path}{final_file}")
dbutils.fs.rm(tmp_path, recurse=True)
print(f"CSV written successfully: {final_path}{final_file}")
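Since the Parquet and CSV versions differ only in extension and writer call, the file-naming convention is worth centralizing. A small sketch (the function name and prefix are illustrative):

```python
import datetime

def business_file_name(prefix, ext, now=None):
    """Build a business-friendly file name such as
    'db_user_data_variables_20250118_143215.parquet'."""
    now = now or datetime.datetime.now()
    return f"{prefix}_{now:%Y%m%d_%H%M%S}.{ext}"

print(business_file_name("db_user_data_variables", "csv",
                         datetime.datetime(2025, 1, 18, 14, 32, 15)))
# -> db_user_data_variables_20250118_143215.csv
```

Passing `now` explicitly also makes the naming easy to unit-test.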
This approach works for any Spark file format: Parquet, CSV, JSON, ORC, and Avro are all supported.
Use coalesce(1) only when your goal is to generate a single output file for convenience, not for performance. Avoid it in performance-critical or large-scale workloads: coalesce(1) funnels all the data through a single task on one executor, which can create a serious bottleneck or even fail the job if the dataset is too large for that task to handle.
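One lightweight safeguard is to check the row count before coalescing and fall back to normal parallel output when the data is too large. This is a sketch; the threshold is an assumption you would tune for your cluster, and `df.count()` does add an extra pass over the data:

```python
def single_file_partitions(row_count, max_rows=5_000_000):
    """Return 1 when the dataset is small enough to coalesce into a single
    output file; otherwise raise so the caller keeps parallel output."""
    if row_count > max_rows:
        raise ValueError(
            f"{row_count} rows exceeds single-file threshold {max_rows}; "
            "skip coalesce(1) and keep multiple part files")
    return 1
```

Usage would look like `df.coalesce(single_file_partitions(df.count()))`, with the ValueError routed to the parallel-write path.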
This is a simple but powerful Databricks pattern that many teams struggle with initially. Once you adopt it, output handoffs become predictable, downstream automation can rely on stable file names, and nobody has to dig through directories for part-* files.
If you found this useful, feel free to share it with your Databricks community.
Regards,
@Hari_Vignesh_R (Senior Data Engineer)