<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to read file in pyspark with “]|[” delimiter in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29322#M21062</link>
    <description>&lt;P&gt;You might also try the below options.&lt;/P&gt;&lt;P&gt;1). Use a different file format: you can try a file format that does not depend on a single-character delimiter, such as JSON.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;2). Use a custom Row class: you can write a custom Row class to parse the multi-character delimiter yourself, then use the spark.read.text API to read the file as text and apply the custom Row class to each line to extract the values.&lt;/P&gt;</description>
    <pubDate>Wed, 01 Feb 2023 06:59:58 GMT</pubDate>
    <dc:creator>rohit199912</dc:creator>
    <dc:date>2023-02-01T06:59:58Z</dc:date>
    <item>
      <title>How to read file in pyspark with “]|[” delimiter</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29318#M21058</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The data looks like this:&lt;/P&gt;
&lt;P&gt;&lt;PRE&gt;&lt;CODE&gt;pageId]|[page]|[Position]|[sysId]|[carId
0005]|[bmw]|[south]|[AD6]|[OP4&lt;/CODE&gt;&lt;/PRE&gt;&lt;/P&gt;
&lt;P&gt;There are at least 50 columns and millions of rows.&lt;/P&gt;
&lt;P&gt;I tried the following code to read it:&lt;/P&gt;
&lt;P&gt;&lt;PRE&gt;&lt;CODE&gt;dff = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", "]|[").load(trainingdata+"part-00000")&lt;/CODE&gt;&lt;/PRE&gt;&lt;/P&gt;
&lt;P&gt;It gives me the following error:&lt;/P&gt;
&lt;P&gt;&lt;PRE&gt;&lt;CODE&gt;IllegalArgumentException: u'Delimiter cannot be more than one character: ]|['&lt;/CODE&gt;&lt;/PRE&gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jan 2017 21:14:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29318#M21058</guid>
      <dc:creator>lambarc</dc:creator>
      <dc:date>2017-01-18T21:14:50Z</dc:date>
    </item>
    <item>
      <title>Re: How to read file in pyspark with “]|[” delimiter</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29319#M21059</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Sorry if this notifies everyone again - struggling with the text editor here!&lt;/P&gt;
&lt;P&gt;I'm not sure of a workaround for directly splitting with more than one character. 2 ways around this that I can see: &lt;/P&gt;
&lt;P&gt;The first approach would be to split using "|" and then strip the leftover square brackets:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;dff = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").option("delimiter", "|").load(trainingdata+"part-00000")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt; 
&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import regexp_replace
dff = dff.withColumn('pageId', regexp_replace('[pageId]', '[][]', ''))  # ...other columns...&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Given that you have 50 columns you might want to loop through this: &lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;dffs_headers = dff.dtypes
for i in dffs_headers:
    newColumnLabel = i[0].replace('[', '').replace(']', '')
    dff = dff.withColumn(newColumnLabel, regexp_replace(i[0], '[][]', '')).drop(i[0])&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;You'll still need to build a function or add cases to this loop to correctly cast each column, though. The second way would be to go via RDD:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;dff = sc.textFile(trainingdata+"part-00000").map(lambda x: x.replace('[', '').replace(']', '').split('|')).toDF()&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;But that still leaves you with the casting problem. Hope that helps!&lt;/P&gt;
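&lt;P&gt;The bracket-stripping logic above can be sanity-checked outside Spark; a minimal sketch on an assumed sample row (values taken from the question's example), showing that splitting on the full "]|[" token gives the same result without any cleanup:&lt;/P&gt;

```python
# Sample row assumed from the question's example data.
line = "0005]|[bmw]|[south]|[AD6]|[OP4"

# Workaround 1: split on "|" and then strip the leftover square brackets.
via_cleanup = [field.replace('[', '').replace(']', '') for field in line.split('|')]

# Alternative: Python's str.split accepts the whole multi-character token.
via_token = line.split(']|[')

print(via_cleanup == via_token)  # both give ['0005', 'bmw', 'south', 'AD6', 'OP4']
```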
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Jan 2017 17:25:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29319#M21059</guid>
      <dc:creator>SamKlingner</dc:creator>
      <dc:date>2017-01-19T17:25:33Z</dc:date>
    </item>
    <item>
      <title>Re: How to read file in pyspark with “]|[” delimiter</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29320#M21060</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The code block button just isn't playing nice, sorry.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Jan 2017 17:26:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29320#M21060</guid>
      <dc:creator>SamKlingner</dc:creator>
      <dc:date>2017-01-19T17:26:21Z</dc:date>
    </item>
    <item>
      <title>Re: How to read file in pyspark with “]|[” delimiter</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29321#M21061</link>
      <description>&lt;P&gt;Try this (on Spark 3.x the CSV reader accepts a multi-character separator):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val df = spark.read.format("csv")
  .option("header", true)
  .option("sep", "]|[")
  .load("file load")

display(df)&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Wed, 11 Jan 2023 17:39:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29321#M21061</guid>
      <dc:creator>sher</dc:creator>
      <dc:date>2023-01-11T17:39:28Z</dc:date>
    </item>
    <item>
      <title>Re: How to read file in pyspark with “]|[” delimiter</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29322#M21062</link>
      <description>&lt;P&gt;You might also try the below options.&lt;/P&gt;&lt;P&gt;1). Use a different file format: you can try a file format that does not depend on a single-character delimiter, such as JSON.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;2). Use a custom Row class: you can write a custom Row class to parse the multi-character delimiter yourself, then use the spark.read.text API to read the file as text and apply the custom Row class to each line to extract the values.&lt;/P&gt;</description>
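      Option 2 above can be sketched in plain Python; the parse step is just a split on the literal delimiter (the column names and sample lines are assumed from the question, and the final Spark hookup is indicative only):

```python
# Parse one raw line on the multi-character "]|[" delimiter.
def parse_line(line, delimiter="]|["):
    return line.split(delimiter)

# Header and data row assumed from the question's example.
columns = parse_line("pageId]|[page]|[Position]|[sysId]|[carId")
values = parse_line("0005]|[bmw]|[south]|[AD6]|[OP4")
record = dict(zip(columns, values))
print(record["carId"])  # -> OP4

# In Spark the same parser would be applied line by line, e.g.:
# spark.read.text(path).rdd.map(lambda r: parse_line(r.value)).toDF(columns)
```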
      <pubDate>Wed, 01 Feb 2023 06:59:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29322#M21062</guid>
      <dc:creator>rohit199912</dc:creator>
      <dc:date>2023-02-01T06:59:58Z</dc:date>
    </item>
    <item>
      <title>Re: How to read file in pyspark with “]|[” delimiter</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29323#M21063</link>
      <description>&lt;P&gt;This one works, thanks.&lt;/P&gt;</description>
      <pubDate>Fri, 03 Feb 2023 11:39:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29323#M21063</guid>
      <dc:creator>Rajeev_Basu</dc:creator>
      <dc:date>2023-02-03T11:39:00Z</dc:date>
    </item>
    <item>
      <title>Re: How to read file in pyspark with “]|[” delimiter</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29324#M21064</link>
      <description>&lt;P&gt;Might be useful.&lt;/P&gt;</description>
      <pubDate>Sat, 04 Feb 2023 12:37:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29324#M21064</guid>
      <dc:creator>Meghala</dc:creator>
      <dc:date>2023-02-04T12:37:14Z</dc:date>
    </item>
    <item>
      <title>Re: How to read file in pyspark with “]|[” delimiter</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29325#M21065</link>
      <description>&lt;P&gt;Yes, this one is useful, but what if we need to stay with the CSV format only? Is there any other query you can share? @ROHIT AGARWAL&lt;/P&gt;</description>
      <pubDate>Sat, 04 Feb 2023 14:12:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-read-file-in-pyspark-with-delimiter/m-p/29325#M21065</guid>
      <dc:creator>Manoj12421</dc:creator>
      <dc:date>2023-02-04T14:12:59Z</dc:date>
    </item>
  </channel>
</rss>

