Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Trigger bad records in Databricks

bjn
New Contributor III

I use badRecordsPath while reading a CSV as follows:

df = (
    spark.read.format("csv")
    .schema(schema)
    .option("badRecordsPath", bad_records_path)
    .load(csv_path)  # csv_path: placeholder for the source CSV location
)
 
Since bad records are not written immediately, I want to know how I can trigger writing them efficiently.
 
Currently, I use df.collect() to trigger the bad-records write, which causes massive overhead and even out-of-memory problems; that is not acceptable.

5 REPLIES

Isi
Contributor III

Hi @bjn ,

If I understand you correctly:

To efficiently trigger the writing of bad records captured via .option("badRecordsPath", ...) without causing memory overhead or driver-side issues, the best option is:

df.write.format("noop").mode("overwrite").save()

This forces Spark to fully evaluate the DataFrame (including the detection and writing of bad records), but without writing any actual output data. It's more efficient than using .collect() or even .count(), since it avoids aggregations and doesn't load data into the driver.
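
If you want to confirm that the bad records were actually flushed, a quick check (a sketch: dbutils and display are available in Databricks notebooks, and the timestamped subfolder is how badRecordsPath usually lays out its output) is to list the folder right after the noop write:

# After the noop action, badRecordsPath should contain a timestamped
# subdirectory with the captured records
display(dbutils.fs.ls(bad_records_path))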


Hope this helps 🙂

Isi

bjn
New Contributor III

It helps :). Thank you.

I have two questions to clarify and possibly optimize.

1) Since I write the DataFrame to a table later, I'm wondering if there is another full evaluation of the DataFrame. Consequently, there are two full evaluations: one triggered by

df.write.format("noop").mode("overwrite").save()

and another one by writing to the table. Is that correct?

 

2) Does writing to a table trigger a full evaluation of the DataFrame?

I use the following command for writing:

df.write.format("delta").option("optimizeWrite", "true").mode(
        "overwrite"
    ).saveAsTable("table_a")

I could rewrite the control flow to first write the DataFrame to a table and then use the written bad records.

Isi
Contributor III

Hey @bjn ,

1)

Yes, if you run both df.write.format("noop")... and df.write.format("delta").saveAsTable(...), you're triggering two separate actions, and Spark will evaluate the DataFrame twice. That includes parsing the CSV and, importantly, processing bad records each time.

So youโ€™re right to be cautious.

2)

Yes, writing to a table is a full action. It will force Spark to read, parse, and materialize all records in the DataFrame, and it will also trigger bad-record handling. So this alone is sufficient to force the write to badRecordsPath; there's no need to call the noop write beforehand.
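
For illustration, a minimal sketch of that single-action flow, using the variable names from your snippets (it assumes the usual badRecordsPath layout, where bad records land as JSON files under timestamped subfolders):

# Writing the table is the only action needed; it parses the CSV and
# flushes any bad records to bad_records_path as a side effect
df.write.format("delta") \
    .option("optimizeWrite", "true") \
    .mode("overwrite") \
    .saveAsTable("table_a")

# Then inspect what was captured (layout assumption:
# <bad_records_path>/<timestamp>/bad_records/ holds JSON files)
bad_df = spark.read.json(f"{bad_records_path}/*/bad_records/")
bad_df.show(truncate=False)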

Best Regards,

Isi

bjn
New Contributor III

The following write

    data_frame.write.format("delta") \
        .option("optimizeWrite", "true") \
        .mode("overwrite") \
        .saveAsTable(table_name)

doesn't trigger a bad-records write. How is that possible?

bjn
New Contributor III

I found out why the code didn't trigger the bad-records write: I had emptied the bad-records folder. After fixing that, it works. Thanks for the help, Isi 🙂

    data_frame.write.format("delta") \
        .option("optimizeWrite", "true") \
        .mode("overwrite") \
        .saveAsTable(table_name)

 
