cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

How to remove extra ENTER line in csv UTF-16 while reading

shamly
New Contributor III

Dear Friends,

I have a csv and it looks like this

‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡

‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately

and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡

There is an extra ENTER line "and the guidance in the questionnaire was understood and followed" this part is coming as a new line in the csv.  Source file encoding is UTF-16 LE BOM.

At the end of every line, I have CRLF and at the end of every ENTER extra line, I have LF

I should mention in my code something like lineSep \r\n ,but how?

I wrote below code to read this csv

dff = spark.read.option("header", "true") \

.option("inferSchema", "true") \

.option('multiline', 'true') \

.option('encoding', 'UTF-16') \

.option("delimiter", "‡‡,‡‡") \

.csv("/mnt/path/data.csv")

dffs_headers = dff.dtypes

for i in dffs_headers:

  columnLabel = i[0]

  newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')

  dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$',''))

  if columnLabel != newColumnLabel:

    dff = dff.drop(columnLabel)

    display(dff)

But, in the result is not correct as for the given Id, Questionnaire column data is breaking after "All questions have been answered accurately" and displayed in the next row. I want the entire textbetween the doubledagger "‡‡,‡‡" to be read as one row, even if there is any extra ENTER line.

Please help friends @Aviral Bhardwaj​ 

@DataBricksHelp232​ @Rahul@Databricks​ @Uma Dacharla​ @Uma Maheswara Rao Desula​ 

3 REPLIES 3

Aviral-Bhardwaj
Esteemed Contributor III

This is working fine,

image 

from pyspark.sql.functions import regexp_replace
 
path="dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
 
for i in dffs_headers:
  columnLabel = i[0]
  newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
  
  dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$',''))
  
  if columnLabel != newColumnLabel:
    dff = dff.drop(columnLabel)
  dff.show(truncate=False)

Please select my answer as the best answer it will be a great help

Thanks

Aviral Bhardwaj

Hi,

This is not working for me as the source file encoding is UTF-16 LE BOM.

At the end of every line, I have CRLF and at the end of every ENTER extra line,I have LF

I should mention in my code something like lineSep \r\n ,but how?

Aviral-Bhardwaj
Esteemed Contributor III

connect with me here - https://www.linkedin.com/in/aviralb/

We will try to solve in live call

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.