05-07-2025 01:37 AM
I use badRecordsPath while reading a CSV as follows:
df = (
    spark.read.format("csv")
    .schema(schema)
    .option("badRecordsPath", bad_records_path)
)
05-07-2025 03:28 AM
Hi @bjn ,
If I understand you correctly:
To efficiently trigger the writing of bad records captured via .option("badRecordsPath", ...) without causing memory overhead or driver-side issues, the best option is:
df.write.format("noop").mode("overwrite").save()
This forces Spark to fully evaluate the DataFrame (including the detection and writing of bad records), but without writing any actual output data. It's more efficient than using .collect() or even .count(), since it avoids aggregations and doesn't load data into the driver.
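Putting it together with your read, a minimal sketch could look like this (reusing your schema and bad_records_path; the input path is just a placeholder):

# Read the CSV, diverting malformed rows to badRecordsPath
df = (
    spark.read.format("csv")
    .schema(schema)
    .option("badRecordsPath", bad_records_path)
    .load("/path/to/input")  # placeholder input path
)

# Force a full evaluation without producing any output:
# the noop sink scans every record, so bad records are detected
# and written to bad_records_path, but nothing else is stored.
df.write.format("noop").mode("overwrite").save()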
Hope this helps 🙂
Isi
05-07-2025 04:39 AM
It helps :). Thank you.
I have two questions to clarify and possibly optimize.
1) Since I write the DataFrame to a table later, I'm wondering whether there is again a full evaluation of the DataFrame. That would mean two full evaluations: one triggered by
df.write.format("noop").mode("overwrite").save()
and another one by writing to the table. Is that correct?
2) Does writing to a table trigger a full evaluation of a DataFrame?
I use the following command for writing:
df.write.format("delta").option("optimizeWrite", "true").mode(
"overwrite"
).saveAsTable("table_a")
I could rewrite the control flow to first write the DataFrame to a table and then use the written bad records.
05-07-2025 05:06 AM
Hey @bjn ,
Yes, if you run both df.write.format("noop")... and df.write.format("delta").saveAsTable(...), you're triggering two separate actions, and Spark will evaluate the DataFrame twice. That includes parsing the CSV and, importantly, processing bad records each time.
So youโre right to be cautious.
Yes, writing to a table is a full action. It will force Spark to read, parse, and materialize all records in the DataFrame, and it will also trigger bad record handling. So this alone is sufficient to force the write to badRecordsPath; there's no need to call df.write.format("noop") beforehand.
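If you go with the reordered flow you mentioned (write the table first, then use the bad records), a rough sketch could look like this. Reading the records back assumes the usual layout under badRecordsPath (JSON files in timestamped subfolders), so adjust the path pattern to what you actually see there:

# Single action: the Delta table write fully evaluates the DataFrame
# and thereby also triggers writing of bad records to bad_records_path.
df.write.format("delta").option("optimizeWrite", "true").mode(
    "overwrite"
).saveAsTable("table_a")

# Afterwards the captured bad records can be read back as JSON
# (assumed layout: <bad_records_path>/<timestamp>/bad_records/...).
bad_records_df = spark.read.json(f"{bad_records_path}/*/bad_records/")
bad_records_df.show(truncate=False)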
Best Regards,
Isi
05-20-2025 07:14 AM
data_frame.write.format("delta").option("optimizeWrite", "true").mode(
"overwrite"
).saveAsTable(table_name)
This doesn't trigger a bad record write. How is that possible?
05-20-2025 07:25 AM
I found the problem with why the code didn't trigger the bad records write: I had emptied the folder for bad records. After sorting that out, it works. Thanks for the help, Isi 🙂
data_frame.write.format("delta").option("optimizeWrite", "true").mode(
"overwrite"
).saveAsTable(table_name)
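For anyone hitting the same thing later: a quick sanity check after the table write is to list what actually landed under the bad-records path, for example with dbutils on Databricks (the folder layout can vary per run):

# List whatever Spark wrote under the bad-records path after the table write.
# Runs that hit malformed rows typically create a timestamped subfolder here.
for entry in dbutils.fs.ls(bad_records_path):
    print(entry.path, entry.size)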