Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Trigger bad records in Databricks

bjn
New Contributor III

I use badRecordsPath while reading a CSV as follows:

df = (
    spark.read.format("csv")
    .schema(schema)
    .option("badRecordsPath", bad_records_path)
    .load(csv_path)  # csv_path: placeholder for the source CSV location
)
 
Since bad records are not written immediately, I want to know how I can trigger writing them efficiently.
 
Currently, I use df.collect() to trigger the bad-records write, which causes massive overhead and even out-of-memory problems; that is not acceptable.

5 REPLIES

Isi
Contributor III

Hi @bjn ,

If I understand you correctly:

To efficiently trigger the writing of bad records captured via .option("badRecordsPath", ...) without causing memory overhead or driver-side issues, the best option is:

df.write.format("noop").mode("overwrite").save()

This forces Spark to fully evaluate the DataFrame (including the detection and writing of bad records), but without writing any actual output data. It's more efficient than using .collect() or even .count(), since it avoids aggregations and doesn't load data into the driver.
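
If you want to confirm that the bad records were actually flushed, a quick check (a sketch: dbutils and display are available in Databricks notebooks, and the timestamped subfolder is how badRecordsPath usually lays out its output) is to list the folder right after the noop write:

# After the noop action, badRecordsPath should contain a timestamped
# subdirectory with the captured records
display(dbutils.fs.ls(bad_records_path))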


Hope this helps 🙂

Isi

bjn
New Contributor III

It helps :). Thank you.

I have two questions to clarify and possibly optimize.

1) Since I write the DataFrame to a table later, I'm wondering if there is another full evaluation of the DataFrame. Consequently, there are two full evaluations: one triggered by

df.write.format("noop").mode("overwrite").save()

and another one by writing to the table. Is that correct?

 

2) Does writing to a table trigger a full evaluation of the DataFrame?

I use the following command for writing:

df.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable("table_a")

I could rewrite the control flow to first write the DataFrame to a table and then use the written bad records.

Isi
Contributor III

Hey @bjn ,

1)

Yes, if you run both df.write.format("noop")... and df.write.format("delta").saveAsTable(...), you're triggering two separate actions, and Spark will evaluate the DataFrame twice. That includes parsing the CSV and, importantly, processing bad records each time.

So youโ€™re right to be cautious.

2)

Yes, writing to a table is a full action. It will force Spark to read, parse, and materialize all records in the DataFrame, and it will also trigger bad-record handling. So this alone is sufficient to force the write to badRecordsPath; there's no need to call the noop write beforehand.
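
For illustration, a minimal sketch of that single-action flow, using the variable names from your snippets (it assumes the usual badRecordsPath layout, where bad records land as JSON files under timestamped subfolders):

# Writing the table is the only action needed; it parses the CSV and
# flushes any bad records to bad_records_path as a side effect
df.write.format("delta") \
    .option("optimizeWrite", "true") \
    .mode("overwrite") \
    .saveAsTable("table_a")

# Then inspect what was captured (layout assumption:
# <bad_records_path>/<timestamp>/bad_records/ holds JSON files)
bad_df = spark.read.json(f"{bad_records_path}/*/bad_records/")
bad_df.show(truncate=False)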

Best Regards,

Isi

bjn
New Contributor III

The following write

    data_frame.write.format("delta") \
        .option("optimizeWrite", "true") \
        .mode("overwrite") \
        .saveAsTable(table_name)

doesn't trigger a bad-records write. How is that possible?

bjn
New Contributor III

I found out why the code didn't trigger the bad-records write: I had emptied the bad-records folder. After fixing that, it works. Thanks for the help, Isi 🙂

    data_frame.write.format("delta") \
        .option("optimizeWrite", "true") \
        .mode("overwrite") \
        .saveAsTable(table_name)

 
