06-09-2022 02:39 PM
I have data like the sample below. When reading it as CSV, I don't want commas inside quotes treated as separators, even when the quotes are not adjacent to the separator (as in record #2). Records 1 and 3 parse fine with the separator, but record 2 fails.
Input:
col1, col2, col3
a, b, c
a, b1 "b2, b3" b4, c
"a1, a2", b, c
Output:
06-09-2022 04:39 PM
https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option
The quote/escape options are the configs you're looking for.
06-10-2022 06:36 AM
Hi Joseph... I tried that, but the row a, b1 "b2, b3" b4, c needs to parse into 3 columns (expected output below). Instead, the "b" series data is split into 2 columns rather than kept as a single column - the requirement is to ignore the comma inside the quotes in the 2nd column.
Expected output:
1) a
2) b1 "b2, b3" b4
3) c
Actual output:
1) a
2) b1 "b2
3) b3" b4
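For what it's worth, the parsing rule you're after (commas significant only outside quotes, with quotes allowed mid-field) can be sketched in plain Python, independent of Spark. This is only a toy illustration of the desired behavior, not the Spark/univocity implementation, and the function name is made up:

```python
def parse_line(line):
    """Split on commas only when outside double quotes; quotes may
    appear mid-field (record 2) or wrap a whole field (record 3)."""
    fields, buf, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes  # toggle quoted state
            buf.append(ch)
        elif ch == ',' and not in_quotes:
            fields.append(''.join(buf).strip())  # separator found
            buf = []
        else:
            buf.append(ch)
    fields.append(''.join(buf).strip())
    # Strip surrounding quotes only when the entire field is quoted
    return [f[1:-1] if f.startswith('"') and f.endswith('"') else f
            for f in fields]

print(parse_line('a, b1 "b2, b3" b4, c'))  # ['a', 'b1 "b2, b3" b4', 'c']
print(parse_line('"a1, a2", b, c'))        # ['a1, a2', 'b', 'c']
```

This matches the expected output above: record 2 keeps its quoted segment inside one column, and record 3's fully quoted field is unwrapped.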
Thanks,
Satya
06-29-2022 03:19 PM
The following approach can be taken -
07-29-2022 11:40 AM
Hi @SATYANARAYANA ALAMANDA,
Just a friendly follow-up. Did any of the responses help you resolve your question? If so, please mark the best one. Otherwise, please let us know if you still need help.
08-04-2022 07:06 AM
Hi, I think you can use this option on the CSV reader:
spark.read.options(header=True, sep=",", unescapedQuoteHandling="BACK_TO_DELIMITER").csv("your_file.csv")
The key option is unescapedQuoteHandling. You can look up the other options at this link:
https://spark.apache.org/docs/latest/sql-data-sources-csv.html