
How to import data and apply multiline and charset UTF8 at the same time?

HafidzZulkifli — Mon, 13 Nov 2017 09:51:02 GMT

I'm running Spark 2.2.0 at the moment. I'm having trouble importing data of Mexican origin: the text contains special (accented) characters, and certain columns contain multiline values.

Ideally, this is the command I'd like to run:

T_new_exp = spark.read\
.option("charset", "ISO-8859-1")\
.option("parserLib", "univocity")\
.option("multiLine", "true")\
.schema(schema)\
.csv(file)

However, the above gives me properly delimited rows but not the correct charset. Instead of displaying e acute (é), for example, I get the replacement character (U+FFFD). Only when I remove the multiLine option do I get the right charset (but then the multiline issue is no longer fixed).
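The symptom described above is what one would expect if the file's ISO-8859-1 bytes were being decoded as UTF-8 once multiLine is enabled. That cause is an assumption, not something confirmed in this thread, but the effect itself is easy to reproduce in plain Python (the sample word is made up):

```python
# Hypothetical sample: "café" as it would be stored in an ISO-8859-1 file.
raw = "caf\u00e9".encode("iso-8859-1")  # b'caf\xe9'

# Decoding with the charset the file was written in round-trips cleanly.
good = raw.decode("iso-8859-1")

# Decoding the same bytes as UTF-8 cannot interpret the lone 0xE9 byte,
# so with errors="replace" it turns into U+FFFD, the replacement character.
bad = raw.decode("utf-8", errors="replace")

print(repr(good))
print(repr(bad))
```

If the values read by Spark look like the second result, the bytes were most likely decoded with the wrong charset somewhere along the read path.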

The only workaround I have for now is to preprocess the data separately before loading it into Databricks; that is, fix the multiline issue first in Unix and let Databricks handle the Unicode issues afterwards.
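For reference, the preprocessing workaround described above could be done in a single pass with the Python standard library. This is only a sketch under assumptions: the function name and paths are hypothetical, the file is assumed to be comma-delimited with standard double-quote quoting, and line breaks inside fields are assumed to be disposable.

```python
import csv

def flatten_and_transcode(src_path, dst_path, src_encoding="iso-8859-1"):
    """Rewrite a CSV as UTF-8 with one record per physical line."""
    # newline="" lets the csv module see line breaks inside quoted fields.
    with open(src_path, encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            # Replace embedded line breaks with spaces so each record
            # ends up on exactly one line.
            cleaned = [
                field.replace("\r\n", " ").replace("\r", " ").replace("\n", " ")
                for field in row
            ]
            writer.writerow(cleaned)
```

The output can then be read without either the multiLine or the charset option. The trade-off is that the original line breaks inside the fields are lost; if they matter, transcoding alone (without flattening) may be the better preprocessing step.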

Is there a simpler way than this?

Re: How to import data and apply multiline and charset UTF8 at the same time?

kali_tummala — Wed, 29 Aug 2018 12:43:11 GMT

Did you try the encoding option?

.option("encoding", "UTF-8").csv(inputPath)

Re: How to import data and apply multiline and charset UTF8 at the same time?

kali_tummala — Wed, 29 Aug 2018 12:44:18 GMT

@Hafidz Zulkifli check my answer

Re: How to import data and apply multiline and charset UTF8 at the same time?

HafidzZulkifli — Thu, 30 Aug 2018 02:58:01 GMT

@kali.tummala@gmail.com Tried it just now. It didn't work. There are two parts to the problem: one is handling multiline values, the other is handling the differing charset.

Re: How to import data and apply multiline and charset UTF8 at the same time?

sean_owen — Fri, 07 Sep 2018 13:58:01 GMT

Are you sure it's the parsing that's the issue, and not simply the display?
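One way to tell the two cases apart is to look at code points rather than rendered glyphs. A minimal sketch, assuming a suspect value has already been pulled out of the DataFrame into a plain Python string (for instance via collect()); the helper names are hypothetical:

```python
REPLACEMENT_CHAR = "\ufffd"

def describe_chars(value):
    """Return (character, code point) pairs for a string value."""
    return [(ch, "U+%04X" % ord(ch)) for ch in value]

def is_corrupted(value):
    """True if the string itself contains the replacement character."""
    return REPLACEMENT_CHAR in value
```

If is_corrupted returns True, the damage happened while the bytes were being decoded, so it is a parsing problem. If the code points are correct (for example U+00E9 for é) but the output still looks wrong, the problem is more likely in the display layer.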

Re: How to import data and apply multiline and charset UTF8 at the same time?

Smruti — Tue, 01 Oct 2019 11:32:52 GMT

Hi,

Did anyone find a solution for this?

Re: How to import data and apply multiline and charset UTF8 at the same time?

nsuguru310 — Wed, 22 Apr 2020 17:17:53 GMT

Please make sure you are using or enforcing Python 3. Python 2 is the default, and it will have issues with encoding.

Re: How to import data and apply multiline and charset UTF8 at the same time?

MikeDuwee — Wed, 27 May 2020 13:22:17 GMT

.option("charset", "iso-8859-1")
.option("multiLine", True)
.option("lineSep", '\n\r')

Re: How to import data and apply multiline and charset UTF8 at the same time?

DianGermishuize — Sat, 25 Sep 2021 11:18:12 GMT

You could also potentially use the .withColumns() function on the DataFrame, together with the pyspark.sql.functions.encode function, to convert the character set to the one you need.

Convert the Character Set/Encoding of a String field in a PySpark DataFrame on Databricks - diangermishuizen.com
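As a rough illustration of what a re-encoding step can and cannot do, here is the same idea in plain Python rather than PySpark (the helper name is hypothetical). It assumes the common mojibake case, where UTF-8 bytes were decoded as ISO-8859-1. A value that already contains U+FFFD has lost the original byte and cannot be recovered this way.

```python
def repair_mojibake(text):
    """Undo a UTF-8 string that was mistakenly decoded as ISO-8859-1."""
    return text.encode("iso-8859-1").decode("utf-8")

# "é" stored as UTF-8 (0xC3 0xA9) but read as ISO-8859-1 shows up as "Ã©".
garbled = "caf\u00c3\u00a9"
fixed = repair_mojibake(garbled)

# The replacement character has no ISO-8859-1 encoding, so a value that
# already contains it cannot be repaired by re-encoding.
try:
    repair_mojibake("caf\ufffd")
    recoverable = True
except UnicodeEncodeError:
    recoverable = False
```

In PySpark, the analogous column functions are pyspark.sql.functions.encode and pyspark.sql.functions.decode. Whether they help depends on which of the two situations above the data is actually in.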