Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Catch rejected data (rows) while reading with Apache Spark

sarvesh
Contributor III

I work with Spark-Scala and receive data in different formats (.csv/.xlsx/.txt, etc.). When I try to read/write this data from various sources into a database, many records get rejected due to issues like special characters or data type mismatches between the source and the target table. In such cases, my entire load fails.

What I want is a way to capture the rejected rows in a separate file and continue loading the remaining correct records into the database table.

Basically, I don't want the program to stop because of a few rows; I want to catch the rows that cause the problems.

example -

I read a .csv with 98 good rows and 2 corrupt rows; I want to read/write the 98 rows into the database and send the 2 corrupt rows back to the user as a file.

P.S. I am receiving the data from users, so I can't define a schema; I need a dynamic way to read the file and filter the corrupt data out into a file.

5 REPLIES

Hubert-Dudek
Esteemed Contributor III

You can save corrupted records to a separate path:

.option("badRecordsPath", "/tmp/badRecordsPath")

Allow Spark to keep processing when it hits corrupted rows:

.option("mode", "PERMISSIVE")

You can also capture corrupted records in a special column (the column has to be part of the schema you pass):

df = spark.read.csv('/tmp/inputFile.csv', header=True, schema=dataSchema,
                    enforceSchema=True, columnNameOfCorruptRecord='CORRUPTED')

sarvesh
Contributor III

Thank you for replying, but what I am trying to develop is a function that takes data from the user and filters out corrupt records, if any. That is, I want to do the same thing you did, but without a predefined schema.

sarvesh
Contributor III

I might get data from some external source, and I can't define a schema for the data that is read on my website/app.

-werners-
Esteemed Contributor III

Maybe Delta Live Tables?

Not sure if it is what you are looking for; I haven't used it myself. But it gives you schema evolution and expectations, so it might get you there.
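For context, a sketch of what a Delta Live Tables "expectation" looks like: rows violating the constraint are dropped (and counted in pipeline metrics) instead of failing the whole load. Table names, paths, and the constraint are hypothetical, and this only runs inside a Databricks DLT pipeline, not as a standalone script:

```python
import dlt

@dlt.table
@dlt.expect_or_drop("valid_id", "id IS NOT NULL")
def clean_uploads():
    # `spark` is provided by the pipeline runtime.
    return (spark.read
            .option("header", True)
            .csv("/tmp/uploads/"))
```

There are also `@dlt.expect` (keep but record the violation) and `@dlt.expect_or_fail` (abort) variants, depending on how strict the load should be.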

-werners-
Esteemed Contributor III

Or maybe schema evolution on Delta Lake is enough, in combination with Hubert's answer.
