
need to remove double dagger delimiter from a csv using databricks

shamly
New Contributor III

My CSV data looks like this:

‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡

I tried this code

dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡,").csv("/mnt/data/path/datafile.csv")

But I am getting a space after each character in my results:

�� ! ! c o m p a n y I d ! ! , ! ! e m p I d ! ! , ! ! r e g i o n I d ! ! , ! !

Please help.

  

1 ACCEPTED SOLUTION

(See UmaMahesh1's reply below.)

9 REPLIES

Hubert-Dudek
Esteemed Contributor III

Including a sample of your CSV as an attachment (in CSV format) would make this easier.

Maybe it is enough to add a space to the delimiter: .option("delimiter", "‡, ").

Another option is to use an external application to purge the characters from the CSV before Spark reads it; a sketch follows.
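For example, a minimal pure-Python preprocessing sketch (the file paths are placeholders, and the UTF-16 source encoding is an assumption based on the spaced-out characters described above):

# Strip the double daggers once, up front, and re-save as plain UTF-8
with open("datafile.csv", encoding="utf-16") as src, open("datafile_clean.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.replace("‡", ""))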

We can also load each whole line into a dataframe and split it using Spark SQL string functions (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html#string-function...); a minimal sketch follows.
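A sketch of that approach, assuming the column names and path from the original post, and showing only the first two columns for brevity (the header row would still need to be filtered out afterwards):

from pyspark.sql.functions import split, regexp_replace

# Read every physical line as a single string column named "value"
raw = spark.read.text("/mnt/data/path/datafile.csv")

# Split on the comma, then strip the double daggers from each field
parts = split(raw["value"], ",")
clean = raw.select(
    regexp_replace(parts.getItem(0), "‡", "").alias("companyId"),
    regexp_replace(parts.getItem(1), "‡", "").alias("empId"),
)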

UmaMahesh1
Honored Contributor III

Hi @shamly pt

I took a slightly different approach, since I guess no one can be sure of the encoding of the data you showed.

Sample data I took:

‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡
‡‡1‡‡,‡‡121212‡‡,‡‡R‡‡,‡‡1.0A‡‡,‡‡NA‡‡

My approach:

First, I read the data as-is with the header turned off and sep set to ',', renaming _c0 to col1 purely for readability. Then I created a split column, separated the values, and regex-replaced the data. Finally, I filtered out the row containing the header, since I had already aliased the columns.

import pyspark.sql.functions
from pyspark.sql.functions import regexp_replace

# Read with no header and rename the parsed column for readability
dff = spark.read.option("header", "false").option("inferSchema", "true").option("sep", ",").csv("/FileStore/tables/Book1.csv").withColumnRenamed("_c0", "col1")

split_col = pyspark.sql.functions.split(dff['col1'], ',')

# Keep only alphanumerics, underscores and hyphens in each field
df2 = dff.select(regexp_replace(split_col.getItem(0), r"[^0-9a-zA-Z_\-]+", "").alias('companyId'),
                 regexp_replace(split_col.getItem(1), r"[^0-9a-zA-Z_\-]+", "").alias('empId'),
                 regexp_replace(split_col.getItem(2), r"[^0-9a-zA-Z_\-]+", "").alias('regionId'),
                 regexp_replace(split_col.getItem(3), r"[^0-9a-zA-Z_\-]+", "").alias('companyVersion'),
                 regexp_replace(split_col.getItem(4), r"[^0-9a-zA-Z_\-]+", "").alias('Question'))

# Filter out the header row, which was read as data
output = df2.where(df2.companyId != 'companyId')

My output:

[screenshot of the cleaned output]

Hope this helps... Cheers...

shamly
New Contributor III

Dear @Uma Maheswara Rao Desula, thanks for your help. I am able to split into columns. I have almost 90 columns, so is there any way I can automate this?

I tried the code below, but it is not giving cleanly separated columns like your code does.

dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡‡,‡‡").csv(filepath)

dffs_headers = dff.dtypes

for i in dffs_headers:
  columnLabel = i[0]
  newColumnLabel = columnLabel.replace('‡‡', '')
  dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
  if columnLabel != newColumnLabel:
    dff = dff.drop(columnLabel)

UmaMahesh1
Honored Contributor III

Hi @shamly pt

May I know what data mismatch issue you were facing with the previous code?

shamly
New Contributor III

Hi @Uma Maheswara Rao Desula, there were no data mismatch issues; it was my mistake in validating the data. The data is coming through perfectly. The problem is that I have 90 columns, so is there any way to reduce the manual effort, something like below?

for i in dffs_headers:
  columnLabel = i[0]
  newColumnLabel = columnLabel.replace('‡‡', '')
  dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
  if columnLabel != newColumnLabel:
  dff = dff.drop(columnLabel)

UmaMahesh1
Honored Contributor III

Hi @shamly pt,

An automated way, if you don't know the schema beforehand, would look something like this...

import pyspark.sql.functions
from pyspark.sql.functions import regexp_replace, col

dff_1 = spark.read.option("header", "false").option("inferSchema", "true").option("sep", ",").csv("/FileStore/tables/Book1.csv").withColumnRenamed("_c0", "col1")

split_col = pyspark.sql.functions.split(dff_1['col1'], ',')

# Building the header field names from the raw first line
header_uncleaned = (spark.read.option("header", "true").option("inferSchema", "true").option("sep", ",").csv("/FileStore/tables/Book1.csv").columns[0]).split(",")
header = []
for i in header_uncleaned:
  header.append(''.join(e for e in i if e.isalnum()))

# Looping over the column names and populating data
df1 = dff_1
for i in range(len(header)):
  df1 = df1.withColumn(header[i], regexp_replace(split_col.getItem(i), r"[^0-9a-zA-Z_\-]+", ""))

# Drop the raw line column and filter out the header row
df1 = df1.drop("col1").filter(col("companyId") != "companyId")

display(df1)

shamly
New Contributor III

Hi @Uma Maheswara Rao Desula,

I have written this code, as I have many files in many folders in the same location and everything is UTF-16. It gives me the proper result, shown below:

dff = spark.read.option("header", "true") \
  .option("inferSchema", "true") \
  .option('encoding', 'UTF-16') \
  .option('multiline', 'true') \
  .option("delimiter", "‡‡,‡‡") \
  .csv("/mnt/data/file.csv")

display(dff)

‡‡CompanyId Companyname CountryId‡‡
‡‡1234 abc cn‡‡
‡‡2345 def us‡‡
‡‡3457 ghi sy‡‡
‡‡7564 lmn uk‡‡

Now I want to remove the leading and trailing double daggers. I wrote the code below, but it is giving me an error:

from pyspark.sql.functions import regexp_replace

dffs_headers = dff.dtypes

for i in dffs_headers:
  columnLabel = i[0]
  newColumnLabel = columnLabel.replace('‡‡', '')
  dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))
  if columnLabel != newColumnLabel:
  dff = dff.drop(columnLabel)

display(dff)

Error: "IndentationError: expected an indented block"

UmaMahesh1
Honored Contributor III

Hi @shamly pt

It's an indentation error. Check that you are following proper indentation.

I guess your error is at:

if columnLabel != newColumnLabel:
dff = dff.drop(columnLabel)

Just hit Tab or add indentation spaces before dff = dff.drop(columnLabel), as in the corrected sketch below.
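For reference, the corrected loop (the same logic as your post, with the drop indented under the if):

from pyspark.sql.functions import regexp_replace

dffs_headers = dff.dtypes

for i in dffs_headers:
    columnLabel = i[0]
    newColumnLabel = columnLabel.replace('‡‡', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^‡‡|‡‡$', ''))
    if columnLabel != newColumnLabel:
        # Indented so the drop runs only inside the if block
        dff = dff.drop(columnLabel)

display(dff)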

Cheers!

shamly
New Contributor III

Thank you, it worked 🙂
