<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to import data and apply multiline and charset UTF8 at the same time? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29124#M20881</link>
    <description>&lt;P&gt;You could also use the .withColumn() function on the DataFrame together with the pyspark.sql.functions.encode function to convert the character set to the one you need.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://diangermishuizen.com/convert-the-character-set-encoding-of-a-string-field-in-a-pyspark-dataframe-on-databricks/" alt="https://diangermishuizen.com/convert-the-character-set-encoding-of-a-string-field-in-a-pyspark-dataframe-on-databricks/" target="_blank"&gt;Convert the Character Set/Encoding of a String field in a PySpark DataFrame on Databricks - diangermishuizen.com&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Sat, 25 Sep 2021 11:18:12 GMT</pubDate>
    <dc:creator>DianGermishuize</dc:creator>
    <dc:date>2021-09-25T11:18:12Z</dc:date>
    <item>
      <title>How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29116#M20873</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I'm running Spark 2.2.0 at the moment. I'm facing an issue when importing data of Mexican origin, where the text can contain special characters and certain columns span multiple lines.&lt;/P&gt;
&lt;P&gt;Ideally, this is the command I'd like to run:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;T_new_exp = spark.read\
.option("charset", "ISO-8859-1")\
.option("parserLib", "univocity")\
.option("multiLine", "true")\
.schema(schema)\
.csv(file)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;However, the above gives me properly delimited rows but the wrong charset. Instead of displaying e acute, for example, I get the replacement character (U+FFFD). Only when I remove the multiLine option do I get the right charset (but then the multiline issue is not fixed).&lt;/P&gt;
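&lt;P&gt;A minimal Python sketch of this symptom, not taken from the original post: it assumes the multiLine read path decoded the ISO-8859-1 bytes as UTF-8, which the thread does not confirm. The sample string is hypothetical.&lt;/P&gt;

```python
# Hypothetical sample: "café" stored as ISO-8859-1, where é is the single byte 0xE9.
raw = "café".encode("iso-8859-1")

# Decoding with the charset the file was written in restores the text.
good = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 fails on 0xE9; errors="replace"
# substitutes the replacement character U+FFFD for the invalid byte.
bad = raw.decode("utf-8", errors="replace")

print(repr(good))
print(repr(bad))
```

&lt;P&gt;Here bad ends with U+FFFD because the lone byte 0xE9 is not valid UTF-8, while good round-trips to the original text.&lt;/P&gt;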
&lt;P&gt;The only workaround I have for now is to preprocess the data separately before it is loaded into Databricks; that is, fix the multiline records first in Unix and let Databricks handle the Unicode issues later.&lt;/P&gt;
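&lt;P&gt;The preprocessing step could also be sketched in plain Python rather than Unix tools. This is an illustration under assumptions, not the script actually used: flatten_csv and its parameters are hypothetical names, fields containing line breaks are assumed to be quoted, and replacing embedded line breaks with a space is an assumed policy.&lt;/P&gt;

```python
import csv


def flatten_csv(src, dst, src_encoding="iso-8859-1"):
    """Rewrite a CSV file as UTF-8 with one physical line per record."""
    # newline="" lets the csv module see line breaks inside quoted fields.
    with open(src, newline="", encoding=src_encoding) as fin, \
            open(dst, "w", newline="", encoding="utf-8") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            cleaned = []
            for field in row:
                # Assumed policy: turn each embedded line break into one space.
                field = field.replace("\r\n", " ")
                field = field.replace("\n", " ").replace("\r", " ")
                cleaned.append(field)
            writer.writerow(cleaned)
```

&lt;P&gt;The output is UTF-8 with no line breaks inside fields, so it should be readable without the multiLine option.&lt;/P&gt;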
&lt;P&gt;Is there a simpler way than this?&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 13 Nov 2017 09:51:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29116#M20873</guid>
      <dc:creator>HafidzZulkifli</dc:creator>
      <dc:date>2017-11-13T09:51:02Z</dc:date>
    </item>
    <item>
      <title>Re: How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29117#M20874</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Did you try the encoding option?&lt;/P&gt;
&lt;P&gt;.option("encoding", "UTF-8").csv(inputPath)&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Aug 2018 12:43:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29117#M20874</guid>
      <dc:creator>kali_tummala</dc:creator>
      <dc:date>2018-08-29T12:43:11Z</dc:date>
    </item>
    <item>
      <title>Re: How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29118#M20875</link>
      <description>&lt;P&gt;@Hafidz Zulkifli​&amp;nbsp;check my answer&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 Aug 2018 12:44:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29118#M20875</guid>
      <dc:creator>kali_tummala</dc:creator>
      <dc:date>2018-08-29T12:44:18Z</dc:date>
    </item>
    <item>
      <title>Re: How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29119#M20876</link>
      <description>&lt;P&gt;@kali.tummala@gmail.com&amp;nbsp; Tried it just now. It didn't work. There are two parts to the problem: one is handling multiline fields, the other is handling a different charset.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Aug 2018 02:58:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29119#M20876</guid>
      <dc:creator>HafidzZulkifli</dc:creator>
      <dc:date>2018-08-30T02:58:01Z</dc:date>
    </item>
    <item>
      <title>Re: How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29120#M20877</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Are you sure it's the parsing that's the issue, and not simply the display?&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 07 Sep 2018 13:58:01 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29120#M20877</guid>
      <dc:creator>sean_owen</dc:creator>
      <dc:date>2018-09-07T13:58:01Z</dc:date>
    </item>
    <item>
      <title>Re: How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29121#M20878</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;Did anyone find a solution for this?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 01 Oct 2019 11:32:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29121#M20878</guid>
      <dc:creator>Smruti</dc:creator>
      <dc:date>2019-10-01T11:32:52Z</dc:date>
    </item>
    <item>
      <title>Re: How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29122#M20879</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Please make sure you are using or enforcing Python 3. Python 2 is the default, and it will have issues with encoding.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 22 Apr 2020 17:17:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29122#M20879</guid>
      <dc:creator>nsuguru310</dc:creator>
      <dc:date>2020-04-22T17:17:53Z</dc:date>
    </item>
    <item>
      <title>Re: How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29123#M20880</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;.option("charset", "iso-8859-1")&lt;/P&gt;&lt;P&gt;&lt;/P&gt; .option("multiLine", True)&lt;P&gt;&lt;/P&gt; .option("lineSep ",'\n\r') 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 27 May 2020 13:22:17 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29123#M20880</guid>
      <dc:creator>MikeDuwee</dc:creator>
      <dc:date>2020-05-27T13:22:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to import data and apply multiline and charset UTF8 at the same time?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29124#M20881</link>
      <description>&lt;P&gt;You could also use the .withColumn() function on the DataFrame together with the pyspark.sql.functions.encode function to convert the character set to the one you need.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://diangermishuizen.com/convert-the-character-set-encoding-of-a-string-field-in-a-pyspark-dataframe-on-databricks/" alt="https://diangermishuizen.com/convert-the-character-set-encoding-of-a-string-field-in-a-pyspark-dataframe-on-databricks/" target="_blank"&gt;Convert the Character Set/Encoding of a String field in a PySpark DataFrame on Databricks - diangermishuizen.com&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 25 Sep 2021 11:18:12 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-import-data-and-apply-multiline-and-charset-utf8-at-the/m-p/29124#M20881</guid>
      <dc:creator>DianGermishuize</dc:creator>
      <dc:date>2021-09-25T11:18:12Z</dc:date>
    </item>
  </channel>
</rss>

