<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Trigger bad records in databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118138#M45610</link>
    <description>&lt;P&gt;It helps :). Thank you.&lt;/P&gt;&lt;P&gt;I have two questions to clarify and possibly optimize.&lt;/P&gt;&lt;P&gt;1) Since I write the DataFrame to a table later, I'm wondering whether that causes another full evaluation of the DataFrame. In that case there would be two full evaluations, one triggered by&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("noop").mode("overwrite").save()&lt;/LI-CODE&gt;&lt;P&gt;and another one by writing to the table. Is that correct?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2) Does writing to a table trigger a full evaluation of a DataFrame?&lt;/P&gt;&lt;P&gt;I use the following command for writing:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable("table_a")&lt;/LI-CODE&gt;&lt;P&gt;I could restructure the control flow to first write the DataFrame to a table and then use the written bad records.&lt;/P&gt;</description>
    <pubDate>Wed, 07 May 2025 11:39:07 GMT</pubDate>
    <dc:creator>bjn</dc:creator>
    <dc:date>2025-05-07T11:39:07Z</dc:date>
    <item>
      <title>Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118083#M45601</link>
      <description>&lt;P&gt;I use bad records while reading a CSV as follows (the read is completed with a load call on the input path):&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df = (
    spark.read.format("csv")
    .schema(schema)
    .option("badRecordsPath", bad_records_path)
    .load(input_path)
)&lt;/LI-CODE&gt;&lt;DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Since bad records are not written immediately,&lt;STRONG&gt; I want to know how I can trigger the write of them efficiently.&amp;nbsp;&lt;/STRONG&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;Currently, I use df.collect() to trigger the bad records write, which causes massive overhead and even out-of-memory problems, which is not acceptable.&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 07 May 2025 08:37:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118083#M45601</guid>
      <dc:creator>bjn</dc:creator>
      <dc:date>2025-05-07T08:37:32Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118121#M45605</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/163612"&gt;@bjn&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;If I understand you correctly:&lt;BR /&gt;&lt;BR /&gt;To efficiently trigger the writing of bad records captured via &lt;SPAN class=""&gt;.option("badRecordsPath", ...)&lt;/SPAN&gt; &lt;SPAN class=""&gt;without causing memory overhead or driver-side issues&lt;/SPAN&gt;, the best option is:&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("noop").mode("overwrite").save()&lt;/LI-CODE&gt;&lt;P class=""&gt;This forces Spark to fully evaluate the DataFrame (including the detection and writing of bad records), but without writing any actual output data. It’s more efficient than using &lt;SPAN class=""&gt;.collect()&lt;/SPAN&gt; or even &lt;SPAN class=""&gt;.count()&lt;/SPAN&gt;, since it avoids aggregations and doesn’t load data into the driver.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;Hope this helps, &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 10:28:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118121#M45605</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-05-07T10:28:50Z</dc:date>
    </item>
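The "noop" pattern described in the answer above can be sketched as follows. This is a minimal illustration, not the poster's exact pipeline: `input_path`, `bad_records_path`, and the helper names are hypothetical, and a live Spark session is assumed to be available (as on a Databricks cluster).

```python
def reader_options(bad_records_path):
    """Options for a CSV read that captures malformed rows via badRecordsPath.

    Pure helper (no Spark needed) so the option set can be inspected/tested.
    """
    return {"badRecordsPath": bad_records_path, "header": "true"}


def trigger_bad_records(spark, schema, input_path, bad_records_path):
    """Read the CSV, then force a full evaluation with a noop write.

    The noop sink (built into Spark 3+) evaluates every partition, which
    flushes bad records to badRecordsPath, but writes no output data, so it
    is cheaper than df.collect() or df.count().
    """
    df = (
        spark.read.format("csv")
        .schema(schema)
        .options(**reader_options(bad_records_path))
        .load(input_path)
    )
    df.write.format("noop").mode("overwrite").save()
    return df
```

The key design point, per the answer: any full action materializes the DataFrame, but `noop` does so without driver-side collection or an aggregation pass.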
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118138#M45610</link>
      <description>&lt;P&gt;It helps :). Thank you.&lt;/P&gt;&lt;P&gt;I have two questions to clarify and possibly optimize.&lt;/P&gt;&lt;P&gt;1) Since I write the data frame to a table later, I'm wondering if there is again a full evaluation of the DataFrame. Consequently, there are two full evaluations, one triggered by&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("noop").mode("overwrite").save()&lt;/LI-CODE&gt;&lt;P&gt;and another one by writing to the table. Is that correct?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;2) Does writing to table trigger a full evaluation of a data frame?&lt;/P&gt;&lt;P&gt;I use the following command for writing:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable("table_a")&lt;/LI-CODE&gt;&lt;P&gt;I could rewrite the control flow to first write data frame to a table and then use the written bad records.&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 11:39:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118138#M45610</guid>
      <dc:creator>bjn</dc:creator>
      <dc:date>2025-05-07T11:39:07Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118145#M45612</link>
      <description>&lt;P&gt;Hey&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/163612"&gt;@bjn&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;1) &lt;/STRONG&gt;&lt;/H3&gt;&lt;P class=""&gt;Yes, if you run both &lt;SPAN class=""&gt;df.write.format("noop")...&lt;/SPAN&gt; &lt;SPAN class=""&gt;and&lt;/SPAN&gt; &lt;SPAN class=""&gt;df.write.format("delta").saveAsTable(...)&lt;/SPAN&gt;, you’re triggering &lt;STRONG&gt;&lt;SPAN class=""&gt;two separate actions&lt;/SPAN&gt;&lt;/STRONG&gt;, and Spark will &lt;SPAN class=""&gt;evaluate the DataFrame twice&lt;/SPAN&gt;. That includes parsing the CSV and, importantly, processing bad records each time.&lt;/P&gt;&lt;P class=""&gt;So you’re right to be cautious.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;2) &lt;/STRONG&gt;&lt;/H3&gt;&lt;P class=""&gt;Yes, writing to a table &lt;SPAN class=""&gt;is a full action&lt;/SPAN&gt;. It will force Spark to read, parse, and materialize all records in the DataFrame, and it will also trigger bad record handling. So &lt;SPAN class=""&gt;this alone is sufficient to force the write to badRecordsPath&lt;/SPAN&gt;; there’s no need to call &lt;SPAN class=""&gt;.write.format("noop")&lt;/SPAN&gt; beforehand.&lt;BR /&gt;&lt;BR /&gt;Best Regards,&lt;BR /&gt;&lt;BR /&gt;Isi&lt;/P&gt;</description>
      <pubDate>Wed, 07 May 2025 12:06:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/118145#M45612</guid>
      <dc:creator>Isi</dc:creator>
      <dc:date>2025-05-07T12:06:40Z</dc:date>
    </item>
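The advice above (a table write is itself a full action, so no preliminary noop pass is needed) can be sketched like this. Names are hypothetical placeholders; a Spark session with Delta support is assumed, and nothing here is run at import time.

```python
def write_table(df, table_name):
    """Write the DataFrame to a Delta table.

    Per the answer above, saveAsTable is a full Spark action: every input
    record is parsed, so badRecordsPath output is produced by this call
    alone, with no separate noop evaluation beforehand.
    """
    (
        df.write.format("delta")
        .option("optimizeWrite", "true")
        .mode("overwrite")
        .saveAsTable(table_name)
    )


def needs_noop_first(writes_to_table):
    """Tiny decision helper capturing the rule from the discussion:

    an extra noop evaluation is only useful when the pipeline does NOT
    already perform a full action such as a table write.
    """
    return not writes_to_table
```

The practical consequence, as the thread concludes: restructure the control flow so the table write happens first, then read the bad records it flushed.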
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/119758#M45965</link>
      <description>&lt;LI-CODE lang="markup"&gt;    data_frame.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable(table_name)&lt;/LI-CODE&gt;&lt;P&gt;doesn't trigger a bad records write. How is that possible?&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 14:14:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/119758#M45965</guid>
      <dc:creator>bjn</dc:creator>
      <dc:date>2025-05-20T14:14:50Z</dc:date>
    </item>
    <item>
      <title>Re: Trigger bad records in databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/119773#M45967</link>
      <description>&lt;P&gt;I found out why the code didn't trigger the bad records write: I had emptied the bad records folder. After fixing that, it works. Thanks for the help Isi &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;    data_frame.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable(table_name)&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 20 May 2025 14:25:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/trigger-bad-records-in-databricks/m-p/119773#M45967</guid>
      <dc:creator>bjn</dc:creator>
      <dc:date>2025-05-20T14:25:10Z</dc:date>
    </item>
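Once the table write has flushed bad records (the resolution reached above), they can be inspected from storage. A sketch, assuming the layout Databricks documents for badRecordsPath, where records land under a timestamped subdirectory as JSON; path names here are hypothetical.

```python
def bad_records_glob(bad_records_path):
    """Glob matching bad-record JSON files.

    Databricks writes them as <badRecordsPath>/<timestamp>/bad_records/part-*,
    so a wildcard over the timestamp level picks up every run.
    Pure string helper, testable without Spark.
    """
    return bad_records_path.rstrip("/") + "/*/bad_records/*"


def load_bad_records(spark, bad_records_path):
    """Read captured bad records back into a DataFrame.

    Each JSON record includes the source file path, the raw record text,
    and the reason parsing failed, which is useful for triage after the
    main table write has run.
    """
    return spark.read.json(bad_records_glob(bad_records_path))
```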
  </channel>
</rss>

