09-20-2022 07:51 AM
I'm trying to use badRecordsPath to catch improperly formed records in a CSV file and continue loading the rest of the file. I can get the option to work in Python like this:
df = spark.read\
.format("csv")\
.option("header","true")\
.option("badRecordsPath","wasbs://user@user.blob.core.windows.net/badRecords/")\
.schema("col1 integer,col2 integer")\
.load("wasbs://user@user.blob.core.windows.net/*.csv")
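When this works, the diverted records land under badRecordsPath as JSON lines (one object per bad record). The field names below ("path", "record", "reason") are my assumption from the output I saw; inspect your own bad_records files to confirm. A minimal stdlib sketch for reading one of those lines:

```python
import json

# One line from a file under badRecordsPath/<timestamp>/bad_records/.
# Field names are assumed here -- check your own output to confirm.
line = ('{"path":"wasbs://user@user.blob.core.windows.net/data.csv",'
        '"record":"a,b",'
        '"reason":"java.lang.NumberFormatException: For input string: \\"a\\""}')

bad = json.loads(line)
print(bad["record"])  # the raw CSV line that failed the schema
print(bad["reason"])  # the parser exception message
```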
But when I try to use it in the COPY INTO SQL command, I can't get it to work. The bad record comes in under PERMISSIVE mode, with nulls in place of the bad data. I've tried putting the badRecordsPath option in both FORMAT_OPTIONS and COPY_OPTIONS and get the same result.
COPY INTO my_db.t2
FROM (SELECT cast(col1 AS INTEGER) AS `col1`, cast(col2 AS INTEGER) AS `col2` FROM 'wasbs://user@user.blob.core.windows.net/*.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'sep' = ',', 'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
COPY INTO my_db.t2
FROM (SELECT cast(col1 AS INTEGER) AS `col1`, cast(col2 AS INTEGER) AS `col2` FROM 'wasbs://user@user.blob.core.windows.net/*.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'sep' = ',')
COPY_OPTIONS ('badRecordsPath' = 'user@user.blob.core.windows.net/badRecords')
Here's the version of Databricks I'm using:
{"dbr_version": "11.2.x-scala2.12", "dbsql_version": null, "u_build_hash": "38130494fa061680c589459fb2a35b344ea1a814", "r_build_hash": "c45f53067ddc71ffe1ee06997bad04b3c7143650"}
Appreciate any guidance. Thanks.
09-20-2022 09:40 AM
I think the problem is that the option sits inside the SELECT, so the SQL functionality is lost because a SELECT statement is generated first. Try removing the SELECT; you can also use inferSchema to avoid the CAST.
COPY INTO my_db.t2
FROM 'wasbs://user@user.blob.core.windows.net/*.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true',
  'sep' = ',',
  'inferSchema' = 'true',
  'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
09-20-2022 10:10 AM
Thanks. It was the inferSchema setting. I tried it with and without the SELECT, and it worked both ways once I added inferSchema. Both of these worked:
DROP TABLE my_db.t2;
CREATE TABLE my_db.t2 (col1 INT, col2 INT);
COPY INTO my_db.t2
FROM (SELECT cast(col1 AS INTEGER) AS `col1`, cast(col2 AS INTEGER) AS `col2` FROM 'wasbs://user@user.blob.core.windows.net/*.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'sep' = ',', 'inferSchema' = 'true', 'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
DROP TABLE my_db.t2;
CREATE TABLE my_db.t2 (col1 INT, col2 INT);
COPY INTO my_db.t2
FROM 'wasbs://user@user.blob.core.windows.net/*.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'sep' = ',', 'inferSchema' = 'true', 'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
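For anyone comparing results afterwards: under PERMISSIVE mode, a value that fails the cast simply becomes NULL, which is why the bad row landed in the table earlier instead of being diverted. A quick illustration of that cast behavior in plain Python (just the idea, not Spark's actual implementation):

```python
def permissive_int(value):
    """Mimic a PERMISSIVE integer cast: bad values become None instead of raising."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

rows = [["1", "2"], ["3", "oops"]]  # second row has a malformed col2
parsed = [[permissive_int(v) for v in row] for row in rows]
print(parsed)  # [[1, 2], [3, None]] -- the bad value silently becomes None
```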
12-26-2022 08:30 AM
It's very informative.