Handle comma inside cell of CSV

AnandJ_Kadhi — Fri, 18 Aug 2017 12:47:44 GMT

We are using spark-csv_2.10 > version 1.5.0

and reading a CSV file in which one column contains a comma (",") as one of its characters. The problem is that everything after the comma is treated as a new column, so the data is not parsed correctly.

Can you please suggest a solution?

Re: Handle comma inside cell of CSV

osamakhn — Wed, 31 Jan 2018 07:00:02 GMT

I have been solving this with a pandas intermediary function, but a Spark solution would be helpful! I am also willing to contribute if anyone can point me in the right direction.

Re: Handle comma inside cell of CSV

User16857282152 — Fri, 01 Nov 2019 17:27:53 GMT

Take a look here for the available options:

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv

If a CSV value contains a comma, the convention is to quote the whole string that contains it.

In particular, see whether adding some of the options from that documentation helps, such as:

quote – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, ". If you would like to turn off quotations, you need to set an empty string.
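
As a rough illustration of the quoting convention and the quote option, here is a minimal sketch. It assumes Spark 2.0 or later (where DataFrameReader.csv is built in) and an existing SparkSession passed in as spark. The names write_sample and read_quoted_csv and the sample data are made up for this example, and setting escape to a double quote is an optional choice for files that double their embedded quotes.

```python
import csv
import os
import tempfile


def write_sample(path):
    """Write a small CSV whose second column contains a comma (hypothetical data)."""
    with open(path, "w", newline="") as f:
        # QUOTE_MINIMAL quotes only the fields that need it, e.g. "Smith, John".
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        writer.writerow(["id", "name", "age"])
        writer.writerow(["1", "Smith, John", "42"])


def read_quoted_csv(spark, path):
    """Read a quoted CSV with an existing SparkSession (Spark 2.0+ assumed)."""
    # quote='"' is already the default; it is spelled out here for clarity.
    # escape='"' is an assumption that embedded quotes are doubled ("").
    return spark.read.csv(path, header=True, sep=",", quote='"', escape='"')


# Build the sample file and look at the raw text that Spark would receive.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
write_sample(sample_path)
with open(sample_path, newline="") as f:
    raw_lines = f.read().splitlines()
```

With a file like this, read_quoted_csv(spark, sample_path) should keep "Smith, John" in a single column, since the comma sits inside the quote characters.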

Also, you may have poorly formatted data. In that case you might need to read each whole line as a string into a single-column DataFrame, and then use string tools to split that column into the columns of the final DataFrame.
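
One possible shape for that fallback is sketched below. It rests on assumptions that are not from this thread: the file has no header row, the number of columns is known, and only one known free-text column can contain stray unquoted commas. The helper names split_loose_line and load_loose_csv are hypothetical.

```python
def split_loose_line(line, n_cols, text_col):
    """Split on commas, folding any surplus pieces back into column text_col."""
    parts = line.rstrip("\r\n").split(",")
    extra = len(parts) - n_cols  # pieces created by stray commas
    if extra < 0:
        raise ValueError(f"expected at least {n_cols} fields, got {len(parts)}")
    # Rejoin the pieces that belong to the free-text column.
    merged = ",".join(parts[text_col:text_col + extra + 1])
    return tuple(parts[:text_col] + [merged] + parts[text_col + extra + 1:])


def load_loose_csv(spark, path, columns, text_col):
    """Read each line as one string, then split it into the named columns."""
    lines = spark.read.text(path)  # single string column named "value"
    rows = lines.rdd.map(
        lambda r: split_loose_line(r.value, len(columns), text_col)
    )
    return spark.createDataFrame(rows, schema=columns)
```

For example, split_loose_line("1,Smith, John,42", 3, 1) puts "Smith, John" back into the second column. All resulting columns are strings, so any casting would still need to be done afterwards.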