06-09-2022 02:39 PM
I have data like below, and when reading it as CSV I don't want a comma inside quotes to be treated as a separator, even when the quotes are not immediately next to the separator (as in record #2). Records 1 and 3 parse fine with the separator, but record 2 fails.
Input:
col1, col2, col3
a, b, c
a, b1 "b2, b3" b4, c
"a1, a2", b, c
Output:
06-09-2022 04:39 PM
https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
Escape quotes is the config you're looking for.
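For example, something along these lines (a sketch; the file name is a placeholder, and quote/escape are the related read options documented on that page):
df = spark.read.options(header=True, quote='"', escape='"').csv("your_file.csv")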
06-10-2022 06:36 AM
Hi Joseph... I tried that, but the row a, b1 "b2, b3" b4, c needs to parse into 3 columns as below (expected output). Instead, the b-series data is split across 2 columns rather than staying in a single column; the requirement is to ignore the comma inside the quotes in the 2nd column.
Expected output:
1) a
2) b1 "b2, b3" b4
3) c
Actual output:
1) a
2) b1 "b2
3) b3" b4
Thanks,
Satya
06-29-2022 03:19 PM
The following approach can be taken:
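One way to do it is to read the file as plain text and split each line only on commas that fall outside quotes, using a regex lookahead (a sketch in PySpark; the pattern, variable names, and file path are illustrative, not from the original reply):
import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Split on commas followed by an even number of quote characters,
# i.e. commas that sit outside any quoted run.
split_pattern = r',(?=(?:[^"]*"[^"]*")*[^"]*$)'

raw = spark.read.text("your_file.csv")        # one string column named "value"
header_line = raw.first()["value"]
col_names = [c.strip() for c in re.split(split_pattern, header_line)]

rows = (
    raw.rdd
    .map(lambda r: r["value"])
    .filter(lambda line: line != header_line)  # drop the header row
    .map(lambda line: [f.strip() for f in re.split(split_pattern, line)])
)

df = rows.toDF(col_names)
df.show(truncate=False)
On the sample input this keeps b1 "b2, b3" b4 together in the second column, and the quote characters are preserved in the values, which matches the expected output above.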
07-29-2022 11:40 AM
Hi @SATYANARAYANA ALAMANDA,
Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark it as the best answer. Otherwise, please let us know if you still need help.
08-04-2022 07:06 AM
Hi, I think you can use this option for the CSV reader:
spark.read.options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER").csv("your_file.csv")
especially unescapedQuoteHandling. You can find the other options at this link:
https://spark.apache.org/docs/latest/sql-data-sources-csv.html
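If it helps, you can sanity-check the result like this (a sketch; the file name is a placeholder):
df = spark.read.options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER").csv("your_file.csv")
df.show(truncate=False)  # row 2 should keep b1 "b2, b3" b4 in a single column
That page also lists the other accepted values for unescapedQuoteHandling (STOP_AT_CLOSING_QUOTE, STOP_AT_DELIMITER, SKIP_VALUE, RAISE_ERROR) in case BACK_TO_DELIMITER doesn't behave the way you need.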