Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Use of badRecordsPath in COPY INTO SQL command

bradm0
New Contributor III

I'm trying to use badRecordsPath to catch improperly formed records in a CSV file and continue loading the rest of the file. I can get the option to work using Python like this:

df = spark.read\
 .format("csv")\
 .option("header","true")\
 .option("badRecordsPath","wasbs://user@user.blob.core.windows.net/badRecords/")\
 .schema("col1 integer,col2 integer")\
 .load("wasbs://user@user.blob.core.windows.net/*.csv")

But when I try to use it in the COPY INTO SQL command, I can't get it to work. The records come in as if in PERMISSIVE mode, with nulls in place of the bad data. I've tried putting the badRecordsPath option in both FORMAT_OPTIONS and COPY_OPTIONS and get the same result:

copy into my_db.t2
from (SELECT cast(col1 AS INTEGER) AS `col1`,cast(col2 AS INTEGER) AS `col2` from 'wasbs://user@user.blob.core.windows.net/*.csv')
fileformat=csv
format_options ('header' = 'true','sep' = ',','badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
 
copy into my_db.t2
from (SELECT cast(col1 AS INTEGER) AS `col1`,cast(col2 AS INTEGER) AS `col2` from 'wasbs://user@user.blob.core.windows.net/*.csv')
fileformat=csv
format_options ('header' = 'true','sep' = ',')
copy_options ('badRecordsPath' = 'user@user.blob.core.windows.net/badRecords')

Here's the version of Databricks I'm using:

{"dbr_version": "11.2.x-scala2.12", "dbsql_version": null, "u_build_hash": "38130494fa061680c589459fb2a35b344ea1a814", "r_build_hash": "c45f53067ddc71ffe1ee06997bad04b3c7143650"}

I'd appreciate any guidance. Thanks.

1 ACCEPTED SOLUTION


Hubert-Dudek
Esteemed Contributor III

I think the problem is that the load happens inside a SELECT: the SELECT statement is evaluated first, so the badRecordsPath functionality is lost.

Try removing the SELECT; you can use inferSchema to avoid the CASTs.

copy into my_db.t2
from 'wasbs://user@user.blob.core.windows.net/*.csv'
fileformat = csv
FORMAT_OPTIONS ('header' = 'true',
                'sep' = ',',
                'inferSchema' = 'true',
                'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')

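A note on why inferSchema matters here: without it, every CSV field is read as a string, so no row is ever malformed at parse time, and the later CAST just turns unparseable values into NULLs (the PERMISSIVE behaviour described above). With inferSchema (or an enforced schema), the type mismatch is caught while parsing, which is what routes the row to badRecordsPath. A plain-Python sketch of the distinction (illustrative only, not Spark):

```python
def permissive_cast(value):
    """Mimic SQL CAST('oops' AS INTEGER): yields NULL (None) instead of failing."""
    try:
        return int(value)
    except ValueError:
        return None

rows = [["1", "2"], ["oops", "3"]]  # CSV fields arrive as strings

# Strings in, CAST later: the bad value silently becomes None
cast_rows = [[permissive_cast(v) for v in row] for row in rows]
# cast_rows == [[1, 2], [None, 3]]

# Schema enforced at parse time: the failing row is diverted,
# which is what badRecordsPath relies on
good, bad = [], []
for row in rows:
    try:
        good.append([int(v) for v in row])
    except ValueError:
        bad.append(row)
# good == [[1, 2]], bad == [["oops", "3"]]
```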

REPLIES


bradm0
New Contributor III

Thanks. It was the inferSchema setting. I tried it with and without the SELECT, and once I added inferSchema it worked both ways.

Both of these worked:

drop table my_db.t2;
create table my_db.t2 (col1 int,col2 int);
copy into my_db.t2
from (SELECT cast(col1 AS INTEGER) AS `col1`,cast(col2 AS INTEGER) AS `col2` from 'wasbs://user@user.blob.core.windows.net/*.csv')
fileformat=csv
format_options ('header' = 'true','sep' = ',','inferSchema' = 'true' ,'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
 
 
drop table my_db.t2;
create table my_db.t2 (col1 int,col2 int);
copy into my_db.t2
from  'wasbs://user@user.blob.core.windows.net/*.csv'
fileformat=csv
format_options ('header' = 'true','sep' = ',','inferSchema' = 'true' ,'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
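With rows now being diverted, the badRecordsPath output can be inspected. Per the Databricks documentation, rejected records are written as JSON-lines files under a timestamped subdirectory, each object carrying the source file path, the raw record, and a failure reason. A minimal plain-Python sketch for reading a locally synced copy of that directory (the directory layout and field names are assumptions from the docs, not confirmed in this thread):

```python
import glob
import json
import os

def load_bad_records(bad_records_dir):
    """Parse the JSON-lines files Spark writes under badRecordsPath.

    Spark writes into a timestamped subdirectory, e.g.
    <dir>/<timestamp>/bad_records/part-*; each line is a JSON object
    that (per the docs) carries "path", "record", and "reason" fields.
    """
    records = []
    pattern = os.path.join(bad_records_dir, "**", "part-*")
    for part_file in glob.glob(pattern, recursive=True):
        with open(part_file) as f:
            for line in f:
                line = line.strip()
                if line:
                    records.append(json.loads(line))
    return records
```

For example, `load_bad_records("/tmp/badRecords")` would return a list of dicts whose `reason` fields show why each row was rejected.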

Meghala
Valued Contributor II

It's very informative.
