<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Handle comma inside cell of CSV in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/handle-comma-inside-cell-of-csv/m-p/29152#M20909</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Take a look at the documentation for the available options:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv" target="test_blank"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;If a value in a CSV file contains a comma, the convention is to wrap that value in quotes.&lt;/P&gt;
&lt;P&gt;In particular, try the following option from that documentation:&lt;/P&gt;
&lt;P&gt;&lt;B&gt;quote&lt;/B&gt; – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, &lt;CODE&gt;"&lt;/CODE&gt;. If you would like to turn off quotations, you need to set an empty string.&lt;/P&gt;
&lt;P&gt;Also, your data may be poorly formatted. In that case you might need to read each whole line as a string into a single-column DataFrame, then split that string yourself to build the final DataFrame.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 01 Nov 2019 17:27:53 GMT</pubDate>
    <dc:creator>User16857282152</dc:creator>
    <dc:date>2019-11-01T17:27:53Z</dc:date>
    <item>
      <title>Handle comma inside cell of CSV</title>
      <link>https://community.databricks.com/t5/data-engineering/handle-comma-inside-cell-of-csv/m-p/29150#M20907</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;We are using &lt;B&gt;spark-csv_2.10 &amp;gt; version 1.5.0 &lt;/B&gt;&lt;/P&gt;
&lt;P&gt;to read a CSV file in which one column contains a comma (",") as part of its value. The problem is that everything after the comma is treated as a new column, so the data is not parsed correctly.&lt;/P&gt;
&lt;P&gt;Can you please suggest a solution?&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 18 Aug 2017 12:47:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handle-comma-inside-cell-of-csv/m-p/29150#M20907</guid>
      <dc:creator>AnandJ_Kadhi</dc:creator>
      <dc:date>2017-08-18T12:47:44Z</dc:date>
    </item>
    <item>
      <title>Re: Handle comma inside cell of CSV</title>
      <link>https://community.databricks.com/t5/data-engineering/handle-comma-inside-cell-of-csv/m-p/29151#M20908</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I have been solving this with an intermediate pandas function, but a Spark solution would be helpful! I am also willing to contribute if anyone can point me in the right direction.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 31 Jan 2018 07:00:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handle-comma-inside-cell-of-csv/m-p/29151#M20908</guid>
      <dc:creator>osamakhn</dc:creator>
      <dc:date>2018-01-31T07:00:02Z</dc:date>
    </item>
    <item>
      <title>Re: Handle comma inside cell of CSV</title>
      <link>https://community.databricks.com/t5/data-engineering/handle-comma-inside-cell-of-csv/m-p/29152#M20909</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Take a look at the documentation for the available options:&lt;/P&gt;
&lt;P&gt;&lt;A href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv" target="test_blank"&gt;http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=dataframereader#pyspark.sql.DataFrameReader.csv&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;If a value in a CSV file contains a comma, the convention is to wrap that value in quotes.&lt;/P&gt;
&lt;P&gt;In particular, try the following option from that documentation:&lt;/P&gt;
&lt;P&gt;&lt;B&gt;quote&lt;/B&gt; – sets a single character used for escaping quoted values where the separator can be part of the value. If None is set, it uses the default value, &lt;CODE&gt;"&lt;/CODE&gt;. If you would like to turn off quotations, you need to set an empty string.&lt;/P&gt;
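&lt;P&gt;To make the quoting convention concrete, here is a minimal sketch that uses Python's standard csv module rather than Spark, so it runs without a cluster. The sample data and the helper name parse_rows are made up for illustration. In PySpark, the corresponding setting would be the quote argument of DataFrameReader.csv described above.&lt;/P&gt;

```python
import csv
import io

# Hypothetical sample: the name in the first data row contains a comma,
# so that value is wrapped in double quotes.
SAMPLE = 'id,name,city\n1,"Smith, John",Boston\n2,Jane Doe,Austin\n'


def parse_rows(text, quotechar='"'):
    """Parse CSV text, treating commas inside quoted values as data."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar=quotechar)
    return [row for row in reader]


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

&lt;P&gt;Each row comes back with three fields, and the quoted name stays in one piece instead of being split at its comma.&lt;/P&gt;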
&lt;P&gt;Also, your data may be poorly formatted. In that case you might need to read each whole line as a string into a single-column DataFrame, then split that string yourself to build the final DataFrame.&lt;/P&gt; 
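&lt;P&gt;One possible way to do that splitting is sketched below in plain Python for a single line. It assumes you know the expected number of columns and that only one free-text column can contain stray commas. The function name split_with_free_text and the sample line are hypothetical. In Spark, the raw lines could be loaded with spark.read.text and the same logic applied to each line.&lt;/P&gt;

```python
def split_with_free_text(line, n_columns, free_text_index):
    """Split one comma-separated line whose free-text column may hold unquoted commas.

    Assumes every other column is comma-free, so any extra commas
    belong to the column at free_text_index.
    """
    # Split off the columns to the left of the free-text column.
    left = line.rstrip("\n").split(",", free_text_index)
    head, rest = left[:-1], left[-1]

    # Split off the columns to the right, working from the end of the line.
    n_after = n_columns - free_text_index - 1
    right = rest.rsplit(",", n_after)

    fields = head + right
    if len(fields) != n_columns:
        raise ValueError("line has too few fields: " + repr(line))
    return fields


fields = split_with_free_text("1,Smith, John,Boston", n_columns=3, free_text_index=1)
print(fields)
```

&lt;P&gt;This only works when the position of the free-text column is known in advance. If several columns can contain commas, the line is ambiguous and the source file needs to be fixed or quoted properly.&lt;/P&gt;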
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 01 Nov 2019 17:27:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/handle-comma-inside-cell-of-csv/m-p/29152#M20909</guid>
      <dc:creator>User16857282152</dc:creator>
      <dc:date>2019-11-01T17:27:53Z</dc:date>
    </item>
  </channel>
</rss>

