<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to solve column header issues in Spark SQL data frame in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29024#M20781</link>
    <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;Fair enough... let's try another solution then.&lt;/P&gt;&lt;P&gt;One option would be to use &lt;B&gt;sc.wholeTextFiles(..)&lt;/B&gt; and simply remove the first line of each file. Personally, I don't like this because each file is forced into a single partition, and if a file is really large you can run out of memory.&lt;/P&gt;&lt;P&gt;The second and third options both revolve around how to &lt;B&gt;filter()&lt;/B&gt; out row #1, which contains the header; both require some special knowledge about the data.&lt;/P&gt;&lt;P&gt;In your example above, the first column has values like E01, E02, E03, etc. If we can say definitively that the value "ID" is invalid, then we can simply exclude any record with that value. You would do this just after you split the CSV file up, or more specifically, after the call to &lt;B&gt;xxx.map()&lt;/B&gt;. The disadvantage is that you are only checking one column of data, which may not be as accurate as you need.&lt;/P&gt;&lt;P&gt;A third option, possibly easier to implement, is to &lt;B&gt;filter()&lt;/B&gt; out any line of text that matches your column headers in full. In this case, the filter accepts any line that does not equal "ID,Employee_name". You would do this just after the call to &lt;B&gt;sc.textFile()&lt;/B&gt; and before &lt;B&gt;xxx.map()&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;And if you really want to get tricky, you can read in just the first line of your file to determine what the header is, and then use it in the filter from the third option above.&lt;/P&gt;</description>
    <pubDate>Tue, 28 Nov 2017 16:43:39 GMT</pubDate>
    <dc:creator>User16857281974</dc:creator>
    <dc:date>2017-11-28T16:43:39Z</dc:date>
    <item>
      <title>How to solve column header issues in Spark SQL data frame</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29021#M20778</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;My code : &lt;/P&gt;
&lt;P&gt;val name = sc.textFile("/FileStore/tables/employeenames.csv") &lt;/P&gt;
&lt;P&gt;case class x(ID:String,Employee_name:String) &lt;/P&gt;
&lt;P&gt;val namePairRDD = name.map(_.split(",")).map(x =&amp;gt; (x(0), x(1).trim.toString)).toDF("ID", "Employee_name") &lt;/P&gt;
&lt;P&gt;namePairRDD.createOrReplaceTempView("namePairRDD") &lt;/P&gt;
&lt;P&gt;val df = sqlContext.sql("SELECT * FROM namePairRDD ") &lt;/P&gt;
&lt;P&gt;df.show()&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="0693f000007OoHrAAK"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/2534i16D7493DEF8A244B/image-size/large?v=v2&amp;amp;px=999" role="button" title="0693f000007OoHrAAK" alt="0693f000007OoHrAAK" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;I don't know why ID and Employee_name show up in the first row. &lt;/P&gt;
&lt;P&gt;I tried making the first row the column headers, but that didn't work. &lt;/P&gt;
&lt;P&gt;Please help. &lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 24 Nov 2017 13:52:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29021#M20778</guid>
      <dc:creator>AnilKumar</dc:creator>
      <dc:date>2017-11-24T13:52:44Z</dc:date>
    </item>
    <item>
      <title>Re: How to solve column header issues in Spark SQL data frame</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29022#M20779</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;First, you are bouncing between the RDD and DataFrame APIs. If you start with a SparkSession object from the DataFrame API instead, you can make the call&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.option("header", "true").csv(path)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Then you don't have to split the file or convert from an RDD to a DataFrame, and the first line will be read as a header instead of as data.&lt;/P&gt;&lt;P&gt;For more information, also take a look at the ~20 other options available on the &lt;I&gt;DataFrameReader&lt;/I&gt; (&lt;I&gt;spark.read&lt;/I&gt;).&lt;/P&gt;</description>
      <pubDate>Fri, 24 Nov 2017 16:07:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29022#M20779</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-24T16:07:31Z</dc:date>
    </item>
    <item>
      <title>Re: How to solve column header issues in Spark SQL data frame</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29023#M20780</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I understand we can solve this in multiple ways, but I am trying to solve this specific question, so I am following its flow. &lt;/P&gt;
&lt;P&gt;&lt;B&gt;Write the missing Spark SQL queries to join all the three tables, sort the table, and display the output in the given format: ID, Name, Salary, Manager Name.&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;&lt;B&gt;Instructions:&lt;/B&gt;&lt;/P&gt;
&lt;P&gt;1. Table Names are highlighted in the following queries&lt;/P&gt;
&lt;P&gt;Spark SQL&lt;/P&gt;
&lt;P&gt;val manager = sc.textFile("/user/EmployeeManagers")&lt;/P&gt;
&lt;P&gt;val managerRDD = manager.map(x =&amp;gt; (x.split(",")(0), x.split(",")(1)))&lt;/P&gt;
&lt;P&gt;val name = sc.textFile("/user/EmployeeNames")&lt;/P&gt;
&lt;P&gt;val namePairRDD = name.map(x =&amp;gt; (x.split(",")(0), x.split(",")(1)))&lt;/P&gt;
&lt;P&gt;val salary = sc.textFile("/user/EmployeeSalary")&lt;/P&gt;
&lt;P&gt;val salaryPairRDD = salary.map(x =&amp;gt; (x.split(",")(0), x.split(",")(1)))&lt;/P&gt;
&lt;P&gt;&amp;lt;Write your code&amp;gt;&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
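Since the assignment leaves the joins as an exercise, here is a minimal sketch of the join-and-sort logic using plain Scala collections in place of the three pair RDDs, so it can be run without a Spark cluster. The sample records, and the assumption that EmployeeManagers maps an employee ID to a manager ID, are illustrative guesses, not part of the original exercise.

```scala
// Minimal sketch, assuming: managers = (employeeId, managerId),
// names = (employeeId, name), salaries = (employeeId, salary).
// Plain Scala collections stand in for the Spark pair RDDs.
object JoinSketch {
  def solve(managers: Seq[(String, String)],
            names: Seq[(String, String)],
            salaries: Seq[(String, String)]): Seq[(String, String, String, String)] = {
    val nameById    = names.toMap
    val salaryById  = salaries.toMap
    val managerById = managers.toMap
    // Equivalent of joining the three pair RDDs on employee ID:
    // keep only IDs present in all three inputs, then sort by ID.
    names.collect {
      case (id, name) if salaryById.contains(id) && managerById.contains(id) =>
        (id, name, salaryById(id), nameById.getOrElse(managerById(id), managerById(id)))
    }.sortBy(_._1)
  }
}
```

In Spark itself the same shape would be a `join` of the pair RDDs followed by `sortByKey`, but the collection version makes the intended output format (ID, Name, Salary, Manager Name) easy to check.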
      <pubDate>Fri, 24 Nov 2017 16:18:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29023#M20780</guid>
      <dc:creator>AnilKumar</dc:creator>
      <dc:date>2017-11-24T16:18:55Z</dc:date>
    </item>
    <item>
      <title>Re: How to solve column header issues in Spark SQL data frame</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29024#M20781</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;P&gt;Fair enough... let's try another solution then.&lt;/P&gt;&lt;P&gt;One option would be to use &lt;B&gt;sc.wholeTextFiles(..)&lt;/B&gt; and simply remove the first line of each file. Personally, I don't like this because each file is forced into a single partition, and if a file is really large you can run out of memory.&lt;/P&gt;&lt;P&gt;The second and third options both revolve around how to &lt;B&gt;filter()&lt;/B&gt; out row #1, which contains the header; both require some special knowledge about the data.&lt;/P&gt;&lt;P&gt;In your example above, the first column has values like E01, E02, E03, etc. If we can say definitively that the value "ID" is invalid, then we can simply exclude any record with that value. You would do this just after you split the CSV file up, or more specifically, after the call to &lt;B&gt;xxx.map()&lt;/B&gt;. The disadvantage is that you are only checking one column of data, which may not be as accurate as you need.&lt;/P&gt;&lt;P&gt;A third option, possibly easier to implement, is to &lt;B&gt;filter()&lt;/B&gt; out any line of text that matches your column headers in full. In this case, the filter accepts any line that does not equal "ID,Employee_name". You would do this just after the call to &lt;B&gt;sc.textFile()&lt;/B&gt; and before &lt;B&gt;xxx.map()&lt;/B&gt;.&lt;/P&gt;&lt;P&gt;And if you really want to get tricky, you can read in just the first line of your file to determine what the header is, and then use it in the filter from the third option above.&lt;/P&gt;</description>
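The second and third options above can be sketched in a few lines of plain Scala (no Spark needed to follow the logic); the sample lines and the "ID,Employee_name" header are taken from the original question.

```scala
// Plain-Scala sketch of the two filter() placements described above.
object HeaderFilterSketch {
  val lines = Seq("ID,Employee_name", "E01,Alice", "E02,Bob")

  // Third option: drop any line that exactly matches the known header,
  // just after sc.textFile() and before the map().
  val withoutHeaderLine: Seq[String] = lines.filter(_ != "ID,Employee_name")

  // Second option: split first, then drop any record whose first field
  // is the known-invalid value "ID", i.e. filter after the map().
  val records: Seq[Array[String]] = lines.map(_.split(","))
  val withoutHeaderRecord: Seq[Array[String]] = records.filter(_(0) != "ID")
}
```

With real RDDs the calls are identical in shape (`name.filter(...)` before or after `name.map(...)`); only the collection type changes.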
      <pubDate>Tue, 28 Nov 2017 16:43:39 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29024#M20781</guid>
      <dc:creator>User16857281974</dc:creator>
      <dc:date>2017-11-28T16:43:39Z</dc:date>
    </item>
    <item>
      <title>Re: How to solve column header issues in Spark SQL data frame</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29025#M20782</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi, I have the opposite issue. When I run an SQL query through the bulk download as per the standard prc fobasx notebook, the first row of data somehow gets attached to the column headers. When I import the CSV file into R using read_csv, R thinks I have 13 columns when in fact I only have 7. The last column that should appear has the observation from the first cell of the second row attached to its header. Why does this happen, and how can I fix it? Thanks.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 03:25:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-solve-column-header-issues-in-spark-sql-data-frame/m-p/29025#M20782</guid>
      <dc:creator>evan_matthews1</dc:creator>
      <dc:date>2019-08-20T03:25:08Z</dc:date>
    </item>
  </channel>
</rss>

