05-07-2025 01:37 AM
I capture bad records while reading a CSV as follows:
df = (
    spark.read.format("csv")
    .schema(schema)
    .option("badRecordsPath", bad_records_path)
    .load(csv_path)  # csv_path: path to the source CSV file
)
05-07-2025 03:28 AM
Hi @bjn ,
If I understand you correctly:
To efficiently trigger the writing of bad records captured via .option("badRecordsPath", ...) without causing memory overhead or driver-side issues, the best option is:
df.write.format("noop").mode("overwrite").save()
This forces Spark to fully evaluate the DataFrame (including the detection and writing of bad records), but without writing any actual output data. It’s more efficient than using .collect() or even .count(), since it avoids aggregations and doesn’t load data into the driver.
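For context, here is a minimal sketch of the full flow, assuming df is the DataFrame from your read with .option("badRecordsPath", bad_records_path) set. The bad-records folder layout shown (timestamped subfolders) is the usual Databricks one, so verify the exact path in your workspace:
# The noop sink is a full action: every record is read and parsed, so malformed rows
# are written under bad_records_path, but no output data is persisted anywhere.
df.write.format("noop").mode("overwrite").save()

# Bad records are typically written as JSON files into timestamped subfolders,
# e.g. <bad_records_path>/<yyyyMMddTHHmmss>/bad_records/
bad_records = spark.read.json(f"{bad_records_path}/*/bad_records/")
bad_records.show(truncate=False)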
Hope this helps, 🙂
Isi
05-07-2025 04:39 AM
It helps :). Thank you.
I have two questions to clarify and possibly optimize.
1) Since I write the DataFrame to a table later, I'm wondering whether there is another full evaluation of the DataFrame. That would mean two full evaluations: one triggered by
df.write.format("noop").mode("overwrite").save()
and another one by writing to the table. Is that correct?
2) Does writing to a table trigger a full evaluation of a DataFrame?
I use the following command for writing:
df.write.format("delta").option("optimizeWrite", "true").mode(
"overwrite"
).saveAsTable("table_a")
I could rewrite the control flow to first write the DataFrame to a table and then use the written bad records.
05-07-2025 05:06 AM
Hey @bjn ,
Yes, if you run both df.write.format("noop")... and df.write.format("delta").saveAsTable(...), you’re triggering two separate actions, and Spark will evaluate the DataFrame twice. That includes parsing the CSV and, importantly, processing bad records each time.
So you’re right to be cautious.
Yes, writing to a table is a full action. It will force Spark to read, parse, and materialize all records in the DataFrame, and it will also trigger bad record handling. So this alone is sufficient to force the write to badRecordsPath; there's no need to run the noop write beforehand.
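If it helps, a sketch of that single-pass flow (reusing the write from your own message; the bad-records read is the same pattern as in my earlier reply, and the folder layout may differ in your workspace):
# Writing the Delta table is itself a full action, so it also populates badRecordsPath;
# no separate noop write is needed first.
(
    df.write.format("delta")
    .option("optimizeWrite", "true")
    .mode("overwrite")
    .saveAsTable("table_a")
)

# Afterwards, pick up whatever was captured:
bad_records = spark.read.json(f"{bad_records_path}/*/bad_records/")
bad_records.show(truncate=False)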
Best Regards,
Isi
05-20-2025 07:14 AM
data_frame.write.format("delta").option("optimizeWrite", "true").mode(
"overwrite"
).saveAsTable(table_name)
doesn't trigger a bad record write. How is that possible?
05-20-2025 07:25 AM
I found why the code didn't seem to trigger the bad records write: I was emptying the bad records folder myself. After fixing that, it works. Thanks for the help, Isi 🙂
data_frame.write.format("delta").option("optimizeWrite", "true").mode(
"overwrite"
).saveAsTable(table_name)