How to remove an extra ENTER (line break) in a UTF-16 CSV while reading
01-08-2023 08:20 AM
Dear Friends,
I have a csv and it looks like this
‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡
‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately
and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡
There is an extra line break (ENTER) inside the Questionnaire field: the part "and the guidance in the questionnaire was understood and followed" appears on a new line in the CSV. The source file encoding is UTF-16 LE with BOM.
Every record ends with CRLF, and every extra ENTER line break inside a field ends with LF.
I think I need to specify something like lineSep \r\n in my code, but how?
I wrote the code below to read this CSV:
from pyspark.sql.functions import regexp_replace

dff = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option('multiline', 'true') \
    .option('encoding', 'UTF-16') \
    .option("delimiter", "‡‡,‡‡") \
    .csv("/mnt/path/data.csv")

# Strip the leftover ‡‡ from the first and last column names and values
dffs_headers = dff.dtypes
for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
display(dff)
But the result is not correct: for the given Id, the Questionnaire column data breaks after "All questions have been answered accurately" and the rest is displayed in the next row. I want the entire text between the double-dagger delimiters "‡‡,‡‡" to be read as one row, even if it contains an extra ENTER line break.
Please help, friends. @Aviral Bhardwaj
@DataBricksHelp232 @Rahul@Databricks @Uma Dacharla @Uma Maheswara Rao Desula
01-08-2023 08:33 PM
This is working fine:
from pyspark.sql.functions import regexp_replace

path = "dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option('multiline', 'true') \
    .option('encoding', 'UTF-8') \
    .option("delimiter", "‡‡,‡‡") \
    .csv(path)

# Column names and types, as in the question's code
dffs_headers = dff.dtypes
for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
dff.show(truncate=False)
Please select my answer as the best answer; it would be a great help.
Thanks
Aviral Bhardwaj
01-08-2023 11:57 PM
Hi,
This is not working for me because the source file encoding is UTF-16 LE with BOM.
Every record ends with CRLF, and every extra ENTER line break inside a field ends with LF.
I think I need to specify something like lineSep \r\n in my code, but how?
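As far as I can tell from the Spark documentation, the CSV reader's lineSep option accepts at most one character, so "\r\n" probably cannot be passed there directly. One possible workaround, sketched below in plain Python, is to decode the UTF-16 bytes yourself, split records only on CRLF, and turn any lone LF inside a record into a space before splitting the fields on "‡‡,‡‡". The names parse_records and wrap, and the in-memory sample bytes, are made up for illustration; this assumes the whole file fits in memory and that "‡‡,‡‡" never occurs inside a field.

```python
import codecs

DAGGERS = "\u2021\u2021"              # the "‡‡" wrapper around every field
FIELD_SEP = DAGGERS + "," + DAGGERS   # "‡‡,‡‡" between two fields


def parse_records(raw: bytes):
    """Decode UTF-16 bytes and return a list of rows (lists of strings)."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    text = raw.decode("utf-16")
    rows = []
    # Only CRLF ends a record; a lone LF belongs to the field it sits in.
    for record in text.split("\r\n"):
        if not record:
            continue  # skip the empty piece after the final CRLF
        record = record.replace("\n", " ")
        # Remove the wrapper at the very start and very end of the record.
        if record.startswith(DAGGERS):
            record = record[len(DAGGERS):]
        if record.endswith(DAGGERS):
            record = record[:-len(DAGGERS)]
        rows.append(record.split(FIELD_SEP))
    return rows


def wrap(fields):
    """Hypothetical helper: build one line in the question's format."""
    return ",".join(DAGGERS + f + DAGGERS for f in fields)


# Sample data mirroring the question: CRLF after records, LF inside a field.
sample = (
    wrap(["Id", "Version", "Questionnaire", "Date"]) + "\r\n"
    + wrap([
        "123456",
        "Version2",
        "All questions have been answered accurately\n"
        "and the guidance in the questionnaire was understood and followed",
        "2010-12-16 00:01:48.020000000",
    ]) + "\r\n"
)
raw = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")

header, *rows = parse_records(raw)
print(header)
print(rows[0][2])
```

If the file is small enough to handle on the driver, the resulting rows could then be handed to spark.createDataFrame(rows, header); every column would be a string at that point, so types would still need to be cast afterwards.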
01-09-2023 12:12 AM
Connect with me here: https://www.linkedin.com/in/aviralb/
We will try to solve it in a live call.

