<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic need to remove doubledagger delimiter from a csv using databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19908#M13414</link>
    <description>&lt;P&gt;My csv data looks like this&lt;/P&gt;&lt;P&gt;‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried this code&lt;/P&gt;&lt;P&gt;dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡,").csv(f"/mnt/data/path/datafile.csv")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But I am getting space after each characters in between my results&lt;/P&gt;&lt;P&gt;�� ! ! c o m p a n y I d ! ! , ! ! e m p I d ! !  , ! !  r e g i o n I d ! ! , ! !&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;please help&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 29 Nov 2022 19:41:24 GMT</pubDate>
    <dc:creator>shamly</dc:creator>
    <dc:date>2022-11-29T19:41:24Z</dc:date>
    <item>
      <title>need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19908#M13414</link>
      <description>&lt;P&gt;My csv data looks like this&lt;/P&gt;&lt;P&gt;‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried this code&lt;/P&gt;&lt;P&gt;dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡,").csv(f"/mnt/data/path/datafile.csv")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;But I am getting space after each characters in between my results&lt;/P&gt;&lt;P&gt;�� ! ! c o m p a n y I d ! ! , ! ! e m p I d ! !  , ! !  r e g i o n I d ! ! , ! !&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;please help&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Nov 2022 19:41:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19908#M13414</guid>
      <dc:creator>shamly</dc:creator>
      <dc:date>2022-11-29T19:41:24Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19909#M13415</link>
      <description>&lt;P&gt;You can include a sample of your CSV as an attachment in CSV format would be easier.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Maybe it is enough to add space in delimiter .option("delimiter", "‡, "). &lt;/P&gt;&lt;P&gt;Another option is to use external applications to purge chars from CSV.&lt;/P&gt;&lt;P&gt;We can also load the whole line into a dataframe and split it using spark SQL string functions &lt;A href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html#string-functions" target="test_blank"&gt;https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html#string-functions&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Nov 2022 20:04:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19909#M13415</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-11-29T20:04:18Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19910#M13416</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;Hi @shamly pt​&amp;nbsp;&lt;/P&gt;&lt;P&gt;I took a bit another approach since I guess no one would be sure of the the encoding of the data you showed. &lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Sample data I took :&lt;/B&gt;&lt;/P&gt;&lt;P&gt;‡‡companyId‡‡,‡‡empId‡‡,‡‡regionId‡‡,‡‡companyVersion‡‡,‡‡Question‡‡&lt;/P&gt;&lt;P&gt;‡‡1‡‡,‡‡121212‡‡,‡‡R‡‡,‡‡1.0A‡‡,‡‡NA‡‡&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;My approach:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;First reading that data as is turning off header and sep as ','. Just renamed _c0 to col1 for visual purpose. Then created a split column, separated the values and regex replaced the data. Finally filtered out the row which contained header as I already aliased the dataframe.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dff = spark.read.option("header", "false").option("inferSchema", "true").option("sep",",").csv("/FileStore/tables/Book1.csv").withColumnRenamed("_c0", "col1")
&amp;nbsp;
split_col = pyspark.sql.functions.split(dff['col1'], ',')
&amp;nbsp;
df2 = dff.select(regexp_replace(split_col.getItem(0), "[^0-9a-zA-Z_\-]+", "").alias('companyId'),\
                 regexp_replace(split_col.getItem(1), "[^0-9a-zA-Z_\-]+", "").alias('empId'), \
                regexp_replace(split_col.getItem(2), "[^0-9a-zA-Z_\-]+", "").alias('regionId'), \
                regexp_replace(split_col.getItem(3), "[^0-9a-zA-Z_\-]+", "").alias('companyVersion'), \
                regexp_replace(split_col.getItem(4), "[^0-9a-zA-Z_\-]+", "").alias('Question')) \
&amp;nbsp;
output = df2.where(df2.companyId != 'companyId')&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;B&gt;My output:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="image"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/1096i7B1311AA51C87266/image-size/large?v=v2&amp;amp;px=999" role="button" title="image" alt="image" /&gt;&lt;/span&gt;Hope this helps...Cheers...&lt;/P&gt;</description>
      <pubDate>Tue, 29 Nov 2022 20:30:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19910#M13416</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-11-29T20:30:49Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19911#M13417</link>
      <description>&lt;P&gt;Dear @Uma Maheswara Rao Desula​&amp;nbsp;Thanks for your help. I am able to split into columns.I have almost 90 columns. So, is there any way that I can automate?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I tried below code, but it is not giving clean seperated columns as ur code.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dff = spark.read.option("header", "true").option("inferSchema", "true").option("delimiter", "‡‡,‡‡").csv(filepath)&lt;/P&gt;&lt;P&gt;dffs_headers = dff.dtypes&lt;/P&gt;&lt;P&gt;for i in dffs_headers:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;columnLabel = i[0]&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;if columnLabel != newColumnLabel:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;dff = dff.drop(columnLabel)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 30 Nov 2022 06:48:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19911#M13417</guid>
      <dc:creator>shamly</dc:creator>
      <dc:date>2022-11-30T06:48:23Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19912#M13418</link>
      <description>&lt;P&gt;Hi @shamly pt​&amp;nbsp;&lt;/P&gt;&lt;P&gt;May I know what was the data mismatch issue you were facing using the previous code ?&lt;/P&gt;</description>
      <pubDate>Wed, 30 Nov 2022 09:19:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19912#M13418</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-11-30T09:19:11Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19913#M13419</link>
      <description>&lt;P&gt;Hi @Uma Maheswara Rao Desula​&amp;nbsp;There was no data match issues. It was my mistake in validating data. Data is coming perfectly. The problem is I have 90 columns. So is there any way to reduce manual effort something like below  ?&lt;/P&gt;&lt;P&gt;for i in dffs_headers:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;columnLabel = i[0]&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;if columnLabel != newColumnLabel:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;dff = dff.drop(columnLabel)&lt;/P&gt;</description>
      <pubDate>Wed, 30 Nov 2022 10:23:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19913#M13419</guid>
      <dc:creator>shamly</dc:creator>
      <dc:date>2022-11-30T10:23:41Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19914#M13420</link>
      <description>&lt;P&gt;Hi @shamly pt​&amp;nbsp;,&lt;/P&gt;&lt;P&gt;As an automated way if you don't know the schema beforehand would look something like this...&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;dff_1 = spark.read.option("header", "false").option("inferSchema", "true").option("sep",",").csv("/FileStore/tables/Book1.csv").withColumnRenamed("_c0", "col1")
&amp;nbsp;
split_col = pyspark.sql.functions.split(dff_1['col1'], ',')
&amp;nbsp;
# Building the header field names
&amp;nbsp;
header_uncleaned = (spark.read.option("header", "true").option("inferSchema", "true").option("sep",",").csv("/FileStore/tables/Book1.csv").columns[0]).split(",")
header = []
for i in header_uncleaned:
  header.append(''.join(e for e in i if e.isalnum()))
&amp;nbsp;
# Looping over the column names and populating data
dff_1 = df1
for i in range(len(header)):
  print(i)
  df1 = df1.withColumn(header[i], regexp_replace(split_col.getItem(i), "[^0-9a-zA-Z_\-]+", ""))
df1 = df1.drop("col1").filter(col("companyId") != "companyId")
&amp;nbsp;
display(df1)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 30 Nov 2022 11:04:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19914#M13420</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-11-30T11:04:04Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19915#M13421</link>
      <description>&lt;P&gt;Hi @Uma Maheswara Rao Desula​&amp;nbsp;&lt;/P&gt;&lt;P&gt; I have written this code, as I have many files in many folders in same location and everything is UTF-16. This is giving me proper result as below&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dff = spark.read.option("header", "true") \&lt;/P&gt;&lt;P&gt;.option("inferSchema", "true") \&lt;/P&gt;&lt;P&gt;.option('encoding', 'UTF-16') \&lt;/P&gt;&lt;P&gt;.option('multiline', 'true') \&lt;/P&gt;&lt;P&gt;.option("delimiter", "‡‡,‡‡") \&lt;/P&gt;&lt;P&gt;.csv("/mnt/data/file.csv")&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;display(dff)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;‡‡CompanyId    Companyname     CountryId‡‡&lt;/B&gt;&lt;/P&gt;&lt;P&gt;‡‡1234                  abc                           cn‡‡&lt;/P&gt;&lt;P&gt;‡‡2345                 def                            us‡‡&lt;/P&gt;&lt;P&gt;‡‡3457                  ghi                            sy‡‡&lt;/P&gt;&lt;P&gt;‡‡7564                  lmn                           uk‡‡&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Now, I want to remove the start and end double daggers and I wrote below code and it is giving me error &lt;I&gt;"IndentationError: expected an indented block"&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;from pyspark.sql.functions import regexp_replace&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;dffs_headers = dff.dtypes&lt;/P&gt;&lt;P&gt;for i in dffs_headers:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;columnLabel = i[0]&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;newColumnLabel = columnLabel.replace('‡‡','').replace('‡‡','')&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;dff = dff.withColumn(newColumnLabel, regexp_replace(columnLabel, '^\\‡‡|\\‡‡$', ''))&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;if columnLabel != newColumnLabel:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;dff = dff.drop(columnLabel)&lt;/P&gt;&lt;P&gt;display(dff)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;error &lt;I&gt;"IndentationError: expected an indented block"&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 17:57:46 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19915#M13421</guid>
      <dc:creator>shamly</dc:creator>
      <dc:date>2022-12-01T17:57:46Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19916#M13422</link>
      <description>&lt;P&gt;Hi @shamly pt​&amp;nbsp;&lt;/P&gt;&lt;P&gt;It's a indentation error. Check if you are following proper indentation. &lt;/P&gt;&lt;P&gt;I guess your error is at &amp;nbsp;if columnLabel != newColumnLabel:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&amp;nbsp;dff = dff.drop(columnLabel).&lt;/P&gt;&lt;P&gt; Just hit a Tab or give indentation spaces before dff = dff.drop(columnLabel)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;cheers..&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 18:04:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19916#M13422</guid>
      <dc:creator>UmaMahesh1</dc:creator>
      <dc:date>2022-12-01T18:04:06Z</dc:date>
    </item>
    <item>
      <title>Re: need to remove doubledagger delimiter from a csv using databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19917#M13423</link>
      <description>&lt;P&gt;Thankyou it worked &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Dec 2022 18:56:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/need-to-remove-doubledagger-delimiter-from-a-csv-using/m-p/19917#M13423</guid>
      <dc:creator>shamly</dc:creator>
      <dc:date>2022-12-01T18:56:44Z</dc:date>
    </item>
  </channel>
</rss>

