Databricks Autoloader BadRecords path Issue

shan-databricks
New Contributor III

I have a file with 100 rows, two of which are bad data and the remaining 98 good. When I use badRecordsPath, it moves the complete file, good data included, to the bad records path. I expected only the 2 bad rows to be moved and the 98 good rows to load successfully. I also tried PERMISSIVE mode, but it seems no mode can be set when badRecordsPath is used, and I get an error. Please help me resolve this issue.


9 REPLIES

radothede
Valued Contributor II

Hi @shan-databricks ,

Have you tried DROPMALFORMED mode?

Regarding PERMISSIVE mode, could you share a code snippet?

If that doesn't resolve your issue, I would recommend custom try/except logic. A minimal DROPMALFORMED sketch is below.
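A minimal sketch of DROPMALFORMED on a batch read, with my_schema and the paths as placeholders (note that the mode option cannot be combined with badRecordsPath, so remove that option first):

df = (
    spark.read
    .format("csv")
    .option("header", "true")  # assuming the files carry a header row
    .option("mode", "DROPMALFORMED")  # rows that don't match the schema are silently dropped
    .schema(my_schema)  # placeholder StructType describing the expected columns
    .load("/mnt/data")  # placeholder input path
)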

 

szymon_dybczak
Esteemed Contributor III

Hi @shan-databricks ,

Maybe try to read it with PERMISSIVE mode and the rescuedDataColumn option?

df = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("rescuedDataColumn", "_rescued_data")
    .format("csv")
    .load("/path/to/source")  # placeholder source path
)

 

Hi @shan-databricks 

You're facing a common issue with Spark's bad records handling.


Read the CSV in PERMISSIVE mode and capture corrupt rows:

df = (
    spark.read
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .format("csv")
    .schema(schema_with_corrupt_field)  # the schema must include a _corrupt_record string column
    .load("s3://your-bucket/path/")
)

Later you can filter the good and bad records out of df, as in the sketch below.
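A hedged sketch of that filtering step (the target paths are placeholders; caching first works around Spark's restriction on queries that reference only the internal corrupt-record column):

from pyspark.sql import functions as F

df.cache()  # avoids an AnalysisException when filtering on _corrupt_record alone

good_df = df.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")
bad_df = df.filter(F.col("_corrupt_record").isNotNull())

good_df.write.mode("append").format("delta").save("/mnt/target")  # placeholder target path
bad_df.write.mode("append").format("json").save("/mnt/quarantine")  # placeholder quarantine path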

 

LR

I am using the Autoloader features spark.readStream and writeStream, and I have set the badRecordsPath option. When I also set PERMISSIVE, DROPMALFORMED, or FAILFAST mode, I get an exception like: if 'badRecordsPath' is specified, mode is not allowed to be set.


ShaileshBobay
Databricks Employee

Hi @shan-databricks, try the option below:

df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("badRecordsPath", "/mnt/my-bad-records")
    # .option("mode", "PERMISSIVE")  # Do NOT set this! mode cannot be combined with badRecordsPath
    .schema(my_schema)
    .load("/mnt/data")
)
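For completeness, a minimal sketch of the matching writeStream, assuming a Delta target; the checkpoint path and table name are placeholders:

query = (
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/my_table")  # placeholder checkpoint path
    .outputMode("append")
    .toTable("my_catalog.my_schema.my_table")  # placeholder target table
)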

I am using the same in my code, but instead of moving only the bad data to badRecordsPath, it moves the complete file, which also contains good data, into badRecordsPath.

ShaileshBobay
Databricks Employee

Why Entire Files Go to badRecordsPath

When you enable badRecordsPath in Autoloader or in Spark's file readers (with formats like CSV/JSON), here's what happens:

  • Spark expects each data file to be internally well-formed with respect to the declared schema.

  • If Spark encounters a fatal error while reading a file (for example, due to corrupt encoding, mismatched row/column structure, or an invalid file format), it cannot reliably parse any part of the file.

  • As a result, the entire file is redirected to badRecordsPath, even if most of its content is good, because Spark cannot safely guarantee the integrity of any parsed rows from that file.

  • Per-record handling in badRecordsPath only occurs if Spark can read the file but finds a few faulty rows; when the file cannot be opened or parsed at all, the whole file is marked as "bad."

Typical Root Causes

  • Schema Mismatch: The file's structure doesn't match the schema (e.g., wrong delimiter, extra/missing columns).

  • File Corruption: The file is truncated or not a valid CSV/JSON/Parquet file.

  • Encoding Errors: The file's encoding doesn't match what Spark expects (e.g., UTF-8).

  • Header/Footer Issues: The file has an unexpected header, footer, or partial content.

So please validate the data file you are facing the issue with and check whether any of the issues above apply. The sketch below shows one way to inspect what landed in badRecordsPath.
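As a diagnostic aid, here is a hedged sketch for inspecting the bad-records output (the glob path is a placeholder matching the badRecordsPath setting; Spark stores the exceptions as JSON under timestamped bad_records directories):

# Hedged sketch: adjust the glob to your badRecordsPath setting.
bad = spark.read.json("/mnt/my-bad-records/*/bad_records/")

# Each JSON record typically carries the source file path, the raw record,
# and the reason the record failed to parse.
bad.select("path", "record", "reason").show(truncate=False)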

I have already analysed the issue, and yes, the schema doesn't match in one of the rows, which moved the complete file into badRecords. I have seen the behavior now and that's fine. Thanks for the response.