06-09-2022 02:39 PM
I have data like the sample below. When reading it as CSV, I don't want commas inside quotes treated as separators, even when the quotes are not adjacent to the separator (as in record #2). Records 1 and 3 parse fine with the separator, but record 2 fails.
Input:
col1, col2, col3
a, b, c
a, b1 "b2, b3" b4, c
"a1, a2", b, c
Output:
06-09-2022 04:39 PM
https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
The quote/escape options are the configs you're looking for.
06-10-2022 06:36 AM
Hi Joseph... I tried that, but the row a, b1 "b2, b3" b4, c needs to parse into 3 columns (expected output below). Instead, the "b" series data is split into 2 columns rather than kept as a single column - the requirement is to ignore the comma inside the quotes in the 2nd column.
Expected output:
1) a
2) b1 "b2, b3" b4
3) c
Actual output:
1) a
2) b1 "b2
3) b3" b4
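For what it's worth, the parsing rule you're after (commas significant only outside quotes, with quotes allowed mid-field) can be sketched in plain Python, independent of Spark. This is only a toy illustration of the desired behavior, not the Spark/univocity implementation, and the function name is made up:

```python
def parse_line(line):
    """Split on commas only when outside double quotes; quotes may
    appear mid-field (record 2) or wrap a whole field (record 3)."""
    fields, buf, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle quoted state
            buf.append(ch)
        elif ch == ',' and not in_quotes:
            fields.append(''.join(buf).strip())  # separator found
            buf = []
        else:
            buf.append(ch)
    fields.append(''.join(buf).strip())
    # Strip surrounding quotes only when the entire field is quoted
    return [f[1:-1] if f.startswith('"') and f.endswith('"') else f
            for f in fields]

print(parse_line('a, b1 "b2, b3" b4, c'))  # ['a', 'b1 "b2, b3" b4', 'c']
print(parse_line('"a1, a2", b, c'))        # ['a1, a2', 'b', 'c']
```

This matches the expected output above: record 2 keeps its quoted segment inside one column, and record 3's fully quoted field is unwrapped.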
Thanks,
Satya
06-29-2022 03:19 PM
The following approach can be taken -
07-29-2022 11:40 AM
Hi @SATYANARAYANA ALAMANDA,
Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark the best one. Otherwise, please let us know if you still need help.
08-04-2022 07:06 AM
Hi, I think you can use this option on the CSV reader:
spark.read.options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER").csv("your_file.csv")
The key option is unescapedQuoteHandling. You can look up the other options at this link:
https://spark.apache.org/docs/latest/sql-data-sources-csv.html