I have tried several pieces of code and nothing has worked. An extra space or a lone LF pushes part of a field onto the next row in my output. Every row ends in CRLF, but some field values contain a bare LF, and while reading the CSV those LFs split the record, so the output is not correct. My CSV uses a double dagger (‡‡) as the delimiter.
The CSV looks like this:
‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡
‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately
and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡
I tried the code below:
dff = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("encoding", "UTF-16") \
    .option("delimiter", "‡‡,‡‡") \
    .option("multiLine", True) \
    .csv("/mnt/path/data.csv")
from pyspark.sql.functions import regexp_replace

dffs_headers = dff.dtypes
display(dff)

# Strip the leading/trailing ‡‡ markers from every value and rename the columns
for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^‡‡|‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)

display(dff)
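I also considered a more compact version of the same cleanup (my own sketch, same intent as the loop above), though I am not sure it changes the result:

from pyspark.sql.functions import col, regexp_replace

# One-pass variant: strip ‡‡ from every value and from every column name
dff = dff.select([
    regexp_replace(col(c), '^‡‡|‡‡$', '').alias(c.replace('‡‡', ''))
    for c in dff.columns
])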
Can I use regexp_replace with a pattern like (?<!\r)\n to remove the stray LF characters? If so, how and where should it be applied?
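My guess at where it would go is something like the snippet below, applied after the read, but as far as I understand this only helps if the multi-line value has already been read into a single field (the Questionnaire column name is just taken from my sample above):

from pyspark.sql.functions import regexp_replace

# Remove any LF that is not preceded by CR, i.e. line breaks inside a field value
dff = dff.withColumn("Questionnaire", regexp_replace("Questionnaire", r"(?<!\r)\n", ""))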
Please help @ArunKumar-Databricks @Gustavo Barreto @ANUJ GARG