I have tried several pieces of code and nothing has worked. An extra space or a lone LF pushes part of a field onto the next row in my output. Every row ends in CRLF, but some field values contain a bare LF, and while reading the CSV those LFs split the record, so the output is not correct. My CSV uses a double dagger (‡‡) as the delimiter.
The CSV looks like this:
‡‡Id‡‡,‡‡Version‡‡,‡‡Questionnaire‡‡,‡‡Date‡‡
‡‡123456‡‡,‡‡Version2‡‡,‡‡All questions have been answered accurately
and the guidance in the questionnaire was understood and followed‡‡,‡‡2010-12-16 00:01:48.020000000‡‡
I tried the code below:
dff = spark.read.option("header", "true") \
    .option("inferSchema", "true") \
    .option("encoding", "UTF-16") \
    .option("delimiter", "‡‡,‡‡") \
    .option("multiLine", True) \
    .csv("/mnt/path/data.csv")
from pyspark.sql.functions import regexp_replace

dffs_headers = dff.dtypes
display(dff)

# Strip the leading/trailing ‡‡ markers from every value and rename the columns
for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^‡‡|‡‡$', ''))
    if columnLabel != newColumnLabel:
        dff = dff.drop(columnLabel)

display(dff)
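I also considered a more compact version of the same cleanup (my own sketch, same intent as the loop above), though I am not sure it changes the result:

from pyspark.sql.functions import col, regexp_replace

# One-pass variant: strip ‡‡ from every value and from every column name
dff = dff.select([
    regexp_replace(col(c), '^‡‡|‡‡$', '').alias(c.replace('‡‡', ''))
    for c in dff.columns
])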
Can I use regexp_replace with a pattern like (?<!\r)\n to remove the stray LF characters? If so, how and where should it be applied?
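My guess at where it would go is something like the snippet below, applied after the read, but as far as I understand this only helps if the multi-line value has already been read into a single field (the Questionnaire column name is just taken from my sample above):

from pyspark.sql.functions import regexp_replace

# Remove any LF that is not preceded by CR, i.e. line breaks inside a field value
dff = dff.withColumn("Questionnaire", regexp_replace("Questionnaire", r"(?<!\r)\n", ""))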
Please help @ArunKumar-Databricks @Gustavo Barreto @ANUJ GARG