<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Skip number of rows when reading CSV files in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28065#M19903</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;According to the docs for &lt;CODE&gt;spark.read.csv(...)&lt;/CODE&gt;, the &lt;CODE&gt;path&lt;/CODE&gt; argument can also be an RDD of strings:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;path : str or list
     string, or list of strings, for input path(s), or RDD of Strings storing CSV rows.
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With that, you can use &lt;CODE&gt;spark.sparkContext.textFile(...)&lt;/CODE&gt; in combination with &lt;CODE&gt;zipWithIndex()&lt;/CODE&gt; to filter out the unwanted rows. Putting it together, this might look as follows:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
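&lt;PRE&gt;&lt;CODE&gt;n_skip_rows = 4  # placeholder: number of leading lines to skip (4 in the original question)
# Pair each line with its position, keep the lines at or after n_skip_rows,
# then drop the position again.
row_rdd = spark.sparkContext \
    .textFile(your_csv_file) \
    .zipWithIndex() \
    .filter(lambda row: row[1] &amp;gt;= n_skip_rows) \
    .map(lambda row: row[0])
df = spark.read.csv(row_rdd, ...)  # replace ... with further options, e.g. header=True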
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt; Hope that helps.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 11 May 2021 12:23:47 GMT</pubDate>
    <dc:creator>mstuder</dc:creator>
    <dc:date>2021-05-11T12:23:47Z</dc:date>
    <item>
      <title>Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28059#M19897</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;staticDataFrame = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/FileStore/tables/Consumption_2019/*.csv")&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;When reading the files as above, I need an option to skip, say, the first 4 lines of each CSV file. How do I do that?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 16 May 2019 08:49:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28059#M19897</guid>
      <dc:creator>THIAM_HUATTAN</dc:creator>
      <dc:date>2019-05-16T08:49:40Z</dc:date>
    </item>
    <item>
      <title>Re: Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28060#M19898</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Hi @THIAM HUAT TAN&lt;/P&gt;
&lt;P&gt;I don't think there is a way to specify that when reading the file. However, after reading it, you can add a new column holding a monotonically increasing id and then keep only the rows whose id is greater than 4.&lt;/P&gt;
&lt;P&gt;Alternatively, you can call take(4), create an RDD from the result, and then apply the subtract transformation between the original RDD and that small RDD.&lt;/P&gt;
&lt;P&gt;Please let us know whether this works for you.&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 16 May 2019 12:15:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28060#M19898</guid>
      <dc:creator>mathan_pillai</dc:creator>
      <dc:date>2019-05-16T12:15:48Z</dc:date>
    </item>
    <item>
      <title>Re: Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28061#M19899</link>
      <description>&lt;P&gt;&lt;A href="https://community.databricks.com/s/contentdocument/0693f000007PPcrAAG" alt="https://community.databricks.com/s/contentdocument/0693f000007PPcrAAG" target="_blank"&gt;databricks-data.png&lt;/A&gt;&lt;/P&gt;&lt;P&gt;My sample data is as above, and I need the data from Row 6 onwards, with Row 6 as the header. Row 1 to Row 5 are redundant. Not sure how to implement your suggestion. Thanks.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 17 May 2019 03:47:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28061#M19899</guid>
      <dc:creator>THIAM_HUATTAN</dc:creator>
      <dc:date>2019-05-17T03:47:26Z</dc:date>
    </item>
    <item>
      <title>Re: Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28062#M19900</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I also have the same issue. Has it been resolved? What is the resolution?&lt;/P&gt;&lt;P&gt;Please advise. Thanks.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 20 Apr 2020 11:39:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28062#M19900</guid>
      <dc:creator>AnkitDwivedi</dc:creator>
      <dc:date>2020-04-20T11:39:25Z</dc:date>
    </item>
    <item>
      <title>Re: Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28063#M19901</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Any resolution?&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 03 Jun 2020 16:26:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28063#M19901</guid>
      <dc:creator>tony</dc:creator>
      <dc:date>2020-06-03T16:26:21Z</dc:date>
    </item>
    <item>
      <title>Re: Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28064#M19902</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I resolved it using the function monotonically_increasing_id and a little logic to set the column names.&lt;/P&gt;
&lt;P&gt;This requires Java 1.8, because collect() raises an error under Java 11.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql import functions as F

# Tag every row with an increasing id so that a specific row can be picked out.
df = df.withColumn('index', F.monotonically_increasing_id())
cols = df.columns
values = df.filter('index = 0').collect()  # adjust this filter to select the row that holds the column names
# Rename each column to the value found in the selected row.
for i in range(len(cols)):
    if cols[i] != 'index':
        df = df.withColumnRenamed(cols[i], values[0][i])
&lt;/CODE&gt;&lt;/PRE&gt; 
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 21 Jun 2020 15:59:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28064#M19902</guid>
      <dc:creator>FabioKfouri</dc:creator>
      <dc:date>2020-06-21T15:59:21Z</dc:date>
    </item>
    <item>
      <title>Re: Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28065#M19903</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;According to the docs for &lt;CODE&gt;spark.read.csv(...)&lt;/CODE&gt;, the &lt;CODE&gt;path&lt;/CODE&gt; argument can also be an RDD of strings:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;path : str or list
     string, or list of strings, for input path(s), or RDD of Strings storing CSV rows.
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With that, you can use &lt;CODE&gt;spark.sparkContext.textFile(...)&lt;/CODE&gt; in combination with &lt;CODE&gt;zipWithIndex()&lt;/CODE&gt; to filter out the unwanted rows. Putting it together, this might look as follows:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
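&lt;PRE&gt;&lt;CODE&gt;n_skip_rows = 4  # placeholder: number of leading lines to skip (4 in the original question)
# Pair each line with its position, keep the lines at or after n_skip_rows,
# then drop the position again.
row_rdd = spark.sparkContext \
    .textFile(your_csv_file) \
    .zipWithIndex() \
    .filter(lambda row: row[1] &amp;gt;= n_skip_rows) \
    .map(lambda row: row[0])
df = spark.read.csv(row_rdd, ...)  # replace ... with further options, e.g. header=True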
&lt;/CODE&gt;&lt;/PRE&gt;
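&lt;P&gt;Note that the index produced by &lt;CODE&gt;zipWithIndex()&lt;/CODE&gt; runs across the whole RDD, so with a glob such as &lt;CODE&gt;*.csv&lt;/CODE&gt; only the leading lines of the first file are dropped. Below is a minimal sketch of a per-file variant, assuming every file is small enough to be held in memory as a single string (which &lt;CODE&gt;wholeTextFiles&lt;/CODE&gt; requires). The helper names &lt;CODE&gt;drop_leading_lines&lt;/CODE&gt; and &lt;CODE&gt;read_csvs_skipping_rows&lt;/CODE&gt; are hypothetical and not part of Spark:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;def drop_leading_lines(content, n_skip_rows):
    """Return the lines of one file's text without its first n_skip_rows lines."""
    return content.splitlines()[n_skip_rows:]


def read_csvs_skipping_rows(spark, path_glob, n_skip_rows, **csv_options):
    """Read every file matching path_glob, skipping n_skip_rows lines in each file."""
    row_rdd = (
        spark.sparkContext
        .wholeTextFiles(path_glob)  # RDD of (file path, file content) pairs
        .flatMap(lambda pair: drop_leading_lines(pair[1], n_skip_rows))
    )
    # Remaining keyword arguments are forwarded unchanged to spark.read.csv.
    return spark.read.csv(row_rdd, **csv_options)
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;For the files from the original question, this might be called as &lt;CODE&gt;read_csvs_skipping_rows(spark, "/FileStore/tables/Consumption_2019/*.csv", 4, header=True, inferSchema=True)&lt;/CODE&gt;.&lt;/P&gt;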
&lt;P&gt; Hope that helps.&lt;/P&gt; 
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 11 May 2021 12:23:47 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28065#M19903</guid>
      <dc:creator>mstuder</dc:creator>
      <dc:date>2021-05-11T12:23:47Z</dc:date>
    </item>
    <item>
      <title>Re: Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28066#M19904</link>
      <description>&lt;P&gt;You can provide the &lt;CODE&gt;skipRows&lt;/CODE&gt; option while reading.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.read.format("csv").option("skipRows", 4).load("&amp;lt;filepath&amp;gt;")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 28 Nov 2022 15:54:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/28066#M19904</guid>
      <dc:creator>User16844409535</dc:creator>
      <dc:date>2022-11-28T15:54:48Z</dc:date>
    </item>
    <item>
      <title>Re: Skip number of rows when reading CSV files</title>
      <link>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/45496#M27898</link>
      <description>&lt;P&gt;The option&lt;/P&gt;&lt;PRE&gt;.option("skipRows", &amp;lt;number of rows to skip&amp;gt;)&lt;/PRE&gt;&lt;P&gt;works for me as well. However, I am surprised that the official Spark documentation does not list it as a CSV data source option: &lt;A href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option" target="_blank" rel="noopener"&gt;https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/30030"&gt;@User16844409535&lt;/a&gt; Did you find documentation on that somewhere else?&lt;/P&gt;</description>
      <pubDate>Thu, 21 Sep 2023 07:48:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/skip-number-of-rows-when-reading-csv-files/m-p/45496#M27898</guid>
      <dc:creator>Michael_Appiah</dc:creator>
      <dc:date>2023-09-21T07:48:58Z</dc:date>
    </item>
  </channel>
</rss>

