09-20-2022 07:51 AM
I'm trying to use badRecordsPath to catch improperly formed records in a CSV file and continue loading the rest of the file. I can get the option to work in Python like this:
df = spark.read\
.format("csv")\
.option("header","true")\
.option("badRecordsPath","wasbs://user@user.blob.core.windows.net/badRecords/")\
.schema("col1 integer,col2 integer")\
.load("wasbs://user@user.blob.core.windows.net/*.csv")
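When this works, the diverted records land under badRecordsPath as JSON lines (one object per bad record). The field names below ("path", "record", "reason") are my assumption from the output I saw; inspect your own bad_records files to confirm. A minimal stdlib sketch for reading one of those lines:

```python
import json

# One line from a file under badRecordsPath/<timestamp>/bad_records/.
# Field names are assumed here -- check your own output to confirm.
line = ('{"path":"wasbs://user@user.blob.core.windows.net/data.csv",'
        '"record":"a,b",'
        '"reason":"java.lang.NumberFormatException: For input string: \\"a\\""}')

bad = json.loads(line)
print(bad["record"])  # the raw CSV line that failed the schema
print(bad["reason"])  # the parser exception message
```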
But when I try to use it in the COPY INTO SQL command, I can't get it to work. The bad record comes in under PERMISSIVE mode, with nulls in place of the bad data. I've tried putting the badRecordsPath option in both FORMAT_OPTIONS and COPY_OPTIONS and get the same result.
COPY INTO my_db.t2
FROM (SELECT cast(col1 AS INTEGER) AS `col1`, cast(col2 AS INTEGER) AS `col2` FROM 'wasbs://user@user.blob.core.windows.net/*.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'sep' = ',', 'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
COPY INTO my_db.t2
FROM (SELECT cast(col1 AS INTEGER) AS `col1`, cast(col2 AS INTEGER) AS `col2` FROM 'wasbs://user@user.blob.core.windows.net/*.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'sep' = ',')
COPY_OPTIONS ('badRecordsPath' = 'user@user.blob.core.windows.net/badRecords')
Here's the version of Databricks I'm using:
{"dbr_version": "11.2.x-scala2.12", "dbsql_version": null, "u_build_hash": "38130494fa061680c589459fb2a35b344ea1a814", "r_build_hash": "c45f53067ddc71ffe1ee06997bad04b3c7143650"}
Appreciate any guidance. Thanks.
09-20-2022 09:40 AM
I think the problem is that the option sits inside the SELECT, so the SQL functionality is lost because a SELECT statement is generated first. Try removing the SELECT; you can also use inferSchema to avoid the CAST.
COPY INTO my_db.t2
FROM 'wasbs://user@user.blob.core.windows.net/*.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true',
  'sep' = ',',
  'inferSchema' = 'true',
  'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
09-20-2022 10:10 AM
Thanks. It was the inferSchema setting. I tried it with and without the SELECT, and it worked both ways once I added inferSchema. Both of these worked:
DROP TABLE my_db.t2;
CREATE TABLE my_db.t2 (col1 INT, col2 INT);
COPY INTO my_db.t2
FROM (SELECT cast(col1 AS INTEGER) AS `col1`, cast(col2 AS INTEGER) AS `col2` FROM 'wasbs://user@user.blob.core.windows.net/*.csv')
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'sep' = ',', 'inferSchema' = 'true', 'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
DROP TABLE my_db.t2;
CREATE TABLE my_db.t2 (col1 INT, col2 INT);
COPY INTO my_db.t2
FROM 'wasbs://user@user.blob.core.windows.net/*.csv'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'sep' = ',', 'inferSchema' = 'true', 'badRecordsPath' = 'wasbs://user@user.blob.core.windows.net/badRecords')
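For anyone comparing results afterwards: under PERMISSIVE mode, a value that fails the cast simply becomes NULL, which is why the bad row landed in the table earlier instead of being diverted. A quick illustration of that cast behavior in plain Python (just the idea, not Spark's actual implementation):

```python
def permissive_int(value):
    """Mimic a PERMISSIVE integer cast: bad values become None instead of raising."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

rows = [["1", "2"], ["3", "oops"]]  # second row has a malformed col2
parsed = [[permissive_int(v) for v in row] for row in rows]
print(parsed)  # [[1, 2], [3, None]] -- the bad value silently becomes None
```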
12-26-2022 08:30 AM
It's very informative.