cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How to remove extra ENTER line in csv UTF-16 while reading

shamly
New Contributor III

Dear Friends,

I have a csv and it looks like this

‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡

‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately

and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡

There is an extra ENTER line "and the guidance in the questionnaire was understood and followed" this part is coming as a new line in the csv.  Source file encoding is UTF-16 LE BOM.

At the end of every line, I have CRLF and at the end of every ENTER extra line, I have LF

I should mention in my code something like lineSep \r\n ,but how?

I wrote below code to read this csv

dff = spark.read.option("header", "true") \

.option("inferSchema", "true") \

.option('multiline', 'true') \

.option('encoding', 'UTF-16') \

.option("delimiter", "‡‡,‡‡") \

.csv("/mnt/path/data.csv")

dffs_headers = dff.dtypes

for i in dffs_headers:

  columnLabel = i[0]

  newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')

  dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$',''))

  if columnLabel != newColumnLabel:

    dff = dff.drop(columnLabel)

    display(dff)

But, in the result is not correct as for the given Id, Questionnaire column data is breaking after "All questions have been answered accurately" and displayed in the next row. I want the entire textbetween the doubledagger "‡‡,‡‡" to be read as one row, even if there is any extra ENTER line.

Please help friends @Aviral Bhardwaj​ 

@DataBricksHelp232​ @Rahul@Databricks​ @Uma Dacharla​ @Uma Maheswara Rao Desula​ 

3 REPLIES 3

Aviral-Bhardwaj
Esteemed Contributor III

This is working fine,

image 

from pyspark.sql.functions import regexp_replace
 
path="dbfs:/FileStore/df/test.csv"
dff = spark.read.option("header", "true").option("inferSchema", "true").option('multiline', 'true').option('encoding', 'UTF-8').option("delimiter", "‡‡,‡‡").csv(path)
 
for i in dffs_headers:
  columnLabel = i[0]
  newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')
  
  dff=dff.withColumn(newColumnLabel,regexp_replace(columnLabel,'^\\‡‡|\\‡‡$',''))
  
  if columnLabel != newColumnLabel:
    dff = dff.drop(columnLabel)
  dff.show(truncate=False)

Please select my answer as the best answer it will be a great help

Thanks

Aviral Bhardwaj

AviralBhardwaj

Hi,

This is not working for me as the source file encoding is UTF-16 LE BOM.

At the end of every line, I have CRLF and at the end of every ENTER extra line,I have LF

I should mention in my code something like lineSep \r\n ,but how?

Aviral-Bhardwaj
Esteemed Contributor III

connect with me here - https://www.linkedin.com/in/aviralb/

We will try to solve in live call

AviralBhardwaj

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group