How to replace LF with ' ' in a UTF-16 encoded CSV?
01-09-2023 11:48 AM
I have tried several pieces of code and nothing worked. An extra LF pushes part of a row onto the next row in my output. Most rows end in CRLF, but some rows end in a bare LF, and reading the CSV then gives incorrect output. My CSV uses a double dagger as the delimiter.
The CSV looks like this:
‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡
‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately
and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡
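One practical option is to pre-process the file before Spark reads it: decode the UTF-16 bytes, replace every LF that is not preceded by CR with a space, and re-encode, so the only remaining line breaks are the CRLF row endings. This is a minimal sketch (the helper name and the in-memory sample are my own; on Databricks you would read/write the actual file under `/dbfs/...` instead):

```python
import re

def strip_bare_lf(raw: bytes) -> bytes:
    """Decode UTF-16, replace any LF not preceded by CR with a space,
    and re-encode, so the only line breaks left are CRLF row endings."""
    text = raw.decode("utf-16")
    cleaned = re.sub(r"(?<!\r)\n", " ", text)  # negative lookbehind: bare LF only
    return cleaned.encode("utf-16")

# Two-row sample where the second row's field contains a bare LF.
sample = "‡‡Id‡‡,‡‡Questionnaire‡‡\r\n‡‡123456‡‡,‡‡line one\nline two‡‡\r\n".encode("utf-16")
fixed = strip_bare_lf(sample).decode("utf-16")
# The bare LF becomes a space; both CRLF row endings survive.
```

After writing the cleaned bytes back out, the existing spark.read options should see one physical line per record.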
I tried the code below:
dff = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("encoding", "UTF-16") \
    .option("delimiter", "‡‡,‡‡") \
    .option("multiLine", True) \
    .csv("/mnt/path/data.csv")
from pyspark.sql.functions import regexp_replace

dffs_headers = dff.dtypes
display(dff)
for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^‡‡|‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)
display(dff)
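The per-column cleanup in that loop boils down to stripping a leading and a trailing double dagger. The same anchored pattern passed to regexp_replace can be checked in plain Python (helper name is mine, for illustration only):

```python
import re

def strip_daggers(value: str) -> str:
    # Same pattern the loop passes to regexp_replace:
    # '^‡‡' matches a leading pair, '‡‡$' a trailing pair.
    return re.sub(r"^‡‡|‡‡$", "", value)

print(strip_daggers("‡‡123456‡‡"))   # 123456
```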
Can I use regexp_replace with the pattern '(?<!\r)\n' to drop the bare LFs? If so, how and where?
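The pattern needs its opening parenthesis: `(?<!\r)\n` is a negative lookbehind that matches an LF only when it is not preceded by CR, so CRLF row endings are untouched. In Spark you would pass that same pattern to regexp_replace on the affected string columns (Spark uses Java regex, which supports lookbehind); the sketch below just demonstrates the pattern itself with Python's re module:

```python
import re

# LF not preceded by CR (negative lookbehind) -- matches only bare LFs.
bare_lf = re.compile(r"(?<!\r)\n")

text = "row one\r\nbroken\nrow\r\n"
result = bare_lf.sub(" ", text)
# The bare LF inside "broken\nrow" becomes a space; both CRLFs remain.
```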
Please help @ArunKumar-Databricks @Gustavo Barreto @ANUJ GARG @
- Labels:
- Azure databricks
- Pyspark
- Python
01-10-2023 04:48 AM
Can you share a sample file with rows ending in CRLF and rows ending in LF?
01-10-2023 05:59 AM
Hi @shamly pt ,
Can you please share a sample file with the ***** data and the expected output, so that we can try it at our end and let you know.
Happy Learning!!
01-11-2023 09:37 AM
Hi,
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("csv")
    .option("header", "true")
    .option("delimiter", "your delimiter")
    .option("inferSchema", "true")
    .load("csv file")
Can you try this? If it does not work, then you need to read the file as an RDD, convert it to a DataFrame, and write it back to CSV:
CSV --> RDD --> DF --> FINAL_OUTPUT format
01-11-2023 09:39 AM
val df = spark.read.format("csv")
    .option("header", true)
    .option("sep", "||")
    .load("file load")
display(df)
Try this.

