<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Running SQL queries against a parquet folder in S3 in Warehousing &amp; Analytics</title>
    <link>https://community.databricks.com/t5/warehousing-analytics/running-sql-queries-against-a-parquet-folder-in-s3/m-p/74078#M1388</link>
    <description>&lt;P&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/108540"&gt;@Shaimaa&lt;/a&gt;&amp;nbsp; -&amp;nbsp;you can divide the query into a nested query to first select all the fields from the s3 by enforcing the schema and build a nested query on top of the below example query (not syntax verified)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;SELECT *
  FROM STREAM read_files(
      's3://bucket/path',
      format =&amp;gt; 'parquet',
      schema =&amp;gt; 'id int, ts timestamp, event string')&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 14 Jun 2024 20:12:57 GMT</pubDate>
    <dc:creator>shan_chandra</dc:creator>
    <dc:date>2024-06-14T20:12:57Z</dc:date>
    <item>
      <title>Running SQL queries against a parquet folder in S3</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/running-sql-queries-against-a-parquet-folder-in-s3/m-p/74051#M1387</link>
      <description>&lt;P&gt;I need to run sql queries against a parquet folder in S3. I am trying to use "read_files" but sometimes my queries fail due to errors while inferring the schema and sometimes without a specified reason.&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sample query:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;SELECT 
SUM(CASE WHEN match_result.names IS NOT NULL AND ARRAY_SIZE(match_result.names) !=0 THEN 1 ELSE 0 END)
FROM read_files('s3://folder_path')&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;How can I enforce the schema successfully and run my query without errors?&lt;/P&gt;</description>
      <pubDate>Fri, 14 Jun 2024 17:21:34 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/running-sql-queries-against-a-parquet-folder-in-s3/m-p/74051#M1387</guid>
      <dc:creator>Shaimaa</dc:creator>
      <dc:date>2024-06-14T17:21:34Z</dc:date>
    </item>
    <item>
      <title>Re: Running SQL queries against a parquet folder in S3</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/running-sql-queries-against-a-parquet-folder-in-s3/m-p/74078#M1388</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/108540"&gt;@Shaimaa&lt;/a&gt;&amp;nbsp; -&amp;nbsp;you can divide the query into a nested query to first select all the fields from the s3 by enforcing the schema and build a nested query on top of the below example query (not syntax verified)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;SELECT *
  FROM STREAM read_files(
      's3://bucket/path',
      format =&amp;gt; 'parquet',
      schema =&amp;gt; 'id int, ts timestamp, event string')&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Jun 2024 20:12:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/running-sql-queries-against-a-parquet-folder-in-s3/m-p/74078#M1388</guid>
      <dc:creator>shan_chandra</dc:creator>
      <dc:date>2024-06-14T20:12:57Z</dc:date>
    </item>
    <item>
      <title>Re: Running SQL queries against a parquet folder in S3</title>
      <link>https://community.databricks.com/t5/warehousing-analytics/running-sql-queries-against-a-parquet-folder-in-s3/m-p/75210#M1397</link>
      <description>&lt;P&gt;There's a few alternatives for you.&lt;/P&gt;
&lt;P&gt;1. a switch in syntax - I doubt this will make much difference, but worth a shot&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;SELECT ... FROM parquet.`s3://folder_path` &lt;/LI-CODE&gt;
&lt;P&gt;2. Create a view first then query against it. You should get better errors this way.&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class="language-sql" data-lang="sql"&gt;&lt;SPAN class="k"&gt;CREATE&lt;/SPAN&gt; &lt;SPAN class="k"&gt;TEMPORARY&lt;/SPAN&gt; &lt;SPAN class="k"&gt;VIEW&lt;/SPAN&gt; &lt;SPAN class="n"&gt;parquetTable&lt;/SPAN&gt;
&lt;SPAN class="k"&gt;USING&lt;/SPAN&gt; &lt;SPAN class="n"&gt;parquet&lt;/SPAN&gt;
&lt;SPAN class="k"&gt;OPTIONS&lt;/SPAN&gt; &lt;SPAN class="p"&gt;(&lt;/SPAN&gt;
  &lt;SPAN class="n"&gt;path&lt;/SPAN&gt; &lt;SPAN class="nv"&gt;"s3://bucket/path",&lt;/SPAN&gt;
&lt;SPAN class="p"&gt;)&lt;/SPAN&gt;

&lt;SPAN class="k"&gt;SELECT&lt;/SPAN&gt; &lt;SPAN class="o"&gt;*&lt;/SPAN&gt; &lt;SPAN class="k"&gt;FROM&lt;/SPAN&gt; &lt;SPAN class="n"&gt;parquetTable&lt;/SPAN&gt;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;3. The clunkiest but most bulletproof. Create an empty Delta table with defined syntax upfront then insert data into it.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;CREATE TABLE tableName(
&amp;lt;&amp;lt;your schema here&amp;gt;&amp;gt;
)

INSERT INTO tableName SELECT col_names FROM PARQUET.`s3://folder_path`&lt;/LI-CODE&gt;
&lt;P&gt;Schema inference only infers using the first 1000 rows, if you have more than this, it could explain the failures&lt;/P&gt;
&lt;P&gt;Keep in mind that fundamentally parquet doesn't enforce schema on write. You can have anything going into the data and parquet will accept it.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If this becomes an enormous headache, you could build an autoloader pipeline to turn it into Delta files, but if it's a minor pain that happens once a week the syntax above should be enough.&lt;/P&gt;</description>
      <pubDate>Thu, 20 Jun 2024 14:37:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/warehousing-analytics/running-sql-queries-against-a-parquet-folder-in-s3/m-p/75210#M1397</guid>
      <dc:creator>holly</dc:creator>
      <dc:date>2024-06-20T14:37:05Z</dc:date>
    </item>
  </channel>
</rss>

