<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Read large volume of parquet files in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62033#M31880</link>
    <description>&lt;P&gt;I tried that already and got an error like&lt;EM&gt;&lt;STRONG&gt; [CANNOT_MERGE_SCHEMA] failed merging schemas:&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;additional error info: &lt;EM&gt;&lt;STRONG&gt;Schema that cannot be merged with the initial schema&lt;/STRONG&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 26 Feb 2024 20:02:00 GMT</pubDate>
    <dc:creator>Shan1</dc:creator>
    <dc:date>2024-02-26T20:02:00Z</dc:date>
    <item>
      <title>Read large volume of parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62022#M31874</link>
      <description>&lt;P&gt;I have 50k+ parquet files in Azure Data Lake, and I have a mount point as well. I need to read all the files and load them into a DataFrame. There are around 2 billion records in total, and not every file has every column; the column order may differ and the column data types may differ. I have tried mergeSchema, inferSchema, and a custom schema with all columns typed as string, but nothing worked. Finally, I decided to read all the file paths into a list and iterate over them, reading the files one by one. Is this fine, or is there a better solution?&lt;/P&gt;&lt;P&gt;from pyspark.sql.types import StructType, StructField, StringType&lt;BR /&gt;from functools import reduce&lt;/P&gt;&lt;P&gt;schema = StructType([&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;StructField("COL1", StringType(), nullable=True),&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;StructField("COL2", StringType(), nullable=True),&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;StructField("COL3", StringType(), nullable=True),&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;StructField("COL4", StringType(), nullable=True)&lt;BR /&gt;])&lt;BR /&gt;files = [file.path for file in dbutils.fs.ls("datalake_path_here")]&lt;/P&gt;&lt;P&gt;dfs = []&lt;BR /&gt;def load_data(file_path):&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;return spark.read.format("parquet").schema(schema).load(file_path)&lt;BR /&gt;for file_path in files:&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;df = load_data(file_path)&lt;BR /&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;dfs.append(df)&lt;BR /&gt;final_df = reduce(lambda df1, df2: df1.union(df2), dfs)&lt;/P&gt;</description>
      <pubDate>Mon, 26 Feb 2024 19:37:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62022#M31874</guid>
      <dc:creator>Shan1</dc:creator>
      <dc:date>2024-02-26T19:37:25Z</dc:date>
    </item>
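The per-file loop in the question can silently misalign data when files disagree on columns or column order, because DataFrame.union matches by position. One way to make the per-file read robust is to align every file to a single master column list before the union. A minimal pure-Python sketch of that alignment step (the column names are illustrative; the resulting expressions would be passed to each file's df.selectExpr(...) before unioning):

```python
def select_exprs(file_cols, master_cols):
    """Build per-file SELECT expressions that align a file to one
    master column list: present columns pass through, missing ones
    become NULLs, extra columns are dropped, and everything is cast
    to string to sidestep per-file type conflicts."""
    present = set(file_cols)
    return [
        f"CAST({c} AS STRING) AS {c}" if c in present
        else f"CAST(NULL AS STRING) AS {c}"
        for c in master_cols
    ]
```

On Spark 3.1+, unionByName(allowMissingColumns=True) covers the missing-column part of this directly; the explicit CAST additionally neutralizes type mismatches between files.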
    <item>
      <title>Re: Read large volume of parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62025#M31876</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100914"&gt;@Shan1&lt;/a&gt;&amp;nbsp;- could you please let us know whether you need to add a file path column to the DataFrame?&lt;/P&gt;</description>
      <pubDate>Mon, 26 Feb 2024 19:52:19 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62025#M31876</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-02-26T19:52:19Z</dc:date>
    </item>
    <item>
      <title>Re: Read large volume of parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62026#M31877</link>
      <description>&lt;P&gt;No, it's not required to add the file path column to the DataFrame.&lt;/P&gt;</description>
      <pubDate>Mon, 26 Feb 2024 19:53:48 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62026#M31877</guid>
      <dc:creator>Shan1</dc:creator>
      <dc:date>2024-02-26T19:53:48Z</dc:date>
    </item>
    <item>
      <title>Re: Read large volume of parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62031#M31879</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100914"&gt;@Shan1&lt;/a&gt;&amp;nbsp;- Thanks for the response. Can you please try the below and let us know if it works?&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;spark.read.option("mergeSchema", "true").parquet("/path/to/parquet/files")&lt;/LI-CODE&gt;</description>
      <pubDate>Mon, 26 Feb 2024 19:58:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62031#M31879</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-02-26T19:58:40Z</dc:date>
    </item>
    <item>
      <title>Re: Read large volume of parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62033#M31880</link>
      <description>&lt;P&gt;I tried that already and got an error like&lt;EM&gt;&lt;STRONG&gt; [CANNOT_MERGE_SCHEMA] failed merging schemas:&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;additional error info: &lt;EM&gt;&lt;STRONG&gt;Schema that cannot be merged with the initial schema&lt;/STRONG&gt;&lt;/EM&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 26 Feb 2024 20:02:00 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62033#M31880</guid>
      <dc:creator>Shan1</dc:creator>
      <dc:date>2024-02-26T20:02:00Z</dc:date>
    </item>
    <item>
      <title>Re: Read large volume of parquet files</title>
      <link>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62124#M31902</link>
      <description>&lt;P&gt;&lt;SPAN&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/100914"&gt;@Shan1&lt;/a&gt;&amp;nbsp;- This could be due to the files having columns that differ by data type, e.g. integer vs. long, or boolean vs. integer. This can be worked around with mergeSchema=False. Please refer to this code:&lt;/SPAN&gt;&lt;BR /&gt;&lt;SPAN&gt;&lt;A href="https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31ed3ad998995b8/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L978" target="_blank" rel="noopener"&gt;https://github.com/apache/spark/blob/418bba5ad6053449a141f3c9c31ed3ad998995b8/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala#L978&lt;/A&gt;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 27 Feb 2024 16:37:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/read-large-volume-of-parquet-files/m-p/62124#M31902</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-02-27T16:37:57Z</dc:date>
    </item>
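The type conflicts described in the last reply (integer vs. long, boolean vs. integer) can also be reconciled up front by choosing one common type per column across all files and casting each file to it before the union. A rough pure-Python sketch of such a widening rule (the type names and the ladder are an illustrative assumption, not Spark's actual schema-merge semantics):

```python
# Illustrative widening ladder for reconciling per-file Parquet column
# types; anything outside the ladder falls back to string.
_WIDTH = {"boolean": 0, "int": 1, "bigint": 2, "double": 3, "string": 4}

def common_type(a: str, b: str) -> str:
    """Pick a type both a and b can be cast to without data loss,
    per the ladder above; unknown or mixed kinds fall back to string."""
    if a == b:
        return a
    if a in _WIDTH and b in _WIDTH:
        return max(a, b, key=_WIDTH.get)
    return "string"
```

In practice, each file's schema (from spark.read.parquet(path).schema) would be folded through common_type column by column, and each DataFrame cast to the resulting schema before unionByName.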
  </channel>
</rss>

