Running SQL queries against a parquet folder in S3

Shaimaa — Fri, 14 Jun 2024 17:21:34 GMT

I need to run SQL queries against a Parquet folder in S3. I am trying to use "read_files", but my queries sometimes fail with schema inference errors and sometimes fail without any stated reason.

Sample query:

SELECT SUM(CASE WHEN match_result.names IS NOT NULL AND ARRAY_SIZE(match_result.names) !=0 THEN 1 ELSE 0 END) FROM read_files('s3://folder_path')

How can I enforce the schema successfully and run my query without errors?

Re: Running SQL queries against a parquet folder in S3

shan_chandra — Fri, 14 Jun 2024 20:12:57 GMT

@Shaimaa - you can split this into a nested query: first select all the fields from S3 while enforcing the schema, then build your aggregate on top of that. Example inner query below (syntax not verified):

SELECT * FROM read_files( 's3://bucket/path', format => 'parquet', schema => 'id int, ts timestamp, event string')
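
Applied to the sample query from the original post, the nested approach might look like the sketch below. The schema string is an assumption: it guesses that match_result is a struct containing an array of strings called names, so adjust it to the real layout of the files. The CTE name typed and the alias rows_with_names are only illustrative.

```sql
-- Inner query: read the Parquet folder with an explicit schema instead of inference.
-- ASSUMPTION: match_result is STRUCT<names: ARRAY<STRING>>; replace with the real schema.
WITH typed AS (
  SELECT *
  FROM read_files(
    's3://folder_path',
    format => 'parquet',
    schema => 'match_result STRUCT<names: ARRAY<STRING>>'
  )
)
-- Outer query: the original aggregate, now running on typed columns.
SELECT SUM(CASE WHEN match_result.names IS NOT NULL
                 AND ARRAY_SIZE(match_result.names) != 0
            THEN 1 ELSE 0 END) AS rows_with_names
FROM typed;
```

Only the columns named in the schema string are read, so list every field the outer query needs.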

Re: Running SQL queries against a parquet folder in S3

holly — Thu, 20 Jun 2024 14:37:05 GMT

There are a few alternatives for you.

1. A switch in syntax. I doubt this will make much difference, but it's worth a shot.

SELECT ... FROM parquet.`s3://folder_path`

2. Create a view first then query against it. You should get better errors this way.

CREATE TEMPORARY VIEW parquetTable
USING parquet
OPTIONS (
  path "s3://bucket/path"
);

SELECT * FROM parquetTable;

3. The clunkiest but most bulletproof. Create an empty Delta table with the schema defined upfront, then insert the data into it.

CREATE TABLE tableName ( <<your schema here>> );

INSERT INTO tableName SELECT col_names FROM parquet.`s3://folder_path`;
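
As a rough, filled-in version of option 3, here is a sketch using the one column from the sample query. The table name match_events and the column type are assumptions, not something taken from the real data. The explicit CAST is there so that a type mismatch shows up as a clear error at insert time rather than somewhere inside schema inference.

```sql
-- ASSUMPTION: hypothetical table name and schema; adjust to the real data.
CREATE TABLE IF NOT EXISTS match_events (
  match_result STRUCT<names: ARRAY<STRING>>
);

-- Rebuild the struct explicitly so only the expected field, with the expected type, is loaded.
INSERT INTO match_events
SELECT named_struct('names', CAST(match_result.names AS ARRAY<STRING>)) AS match_result
FROM parquet.`s3://folder_path`;

-- The original aggregate then runs against the Delta table.
SELECT SUM(CASE WHEN match_result.names IS NOT NULL
                 AND ARRAY_SIZE(match_result.names) != 0
            THEN 1 ELSE 0 END) AS rows_with_names
FROM match_events;
```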

Schema inference only uses the first 1000 rows. If you have more than this, it could explain the failures.

Keep in mind that fundamentally parquet doesn't enforce schema on write. You can have anything going into the data and parquet will accept it.

If this becomes an enormous headache, you could build an Auto Loader pipeline to convert the data to Delta, but if it's a minor pain that happens once a week, the syntax above should be enough.
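
For the Auto Loader route, one possible shape in SQL is a streaming table that incrementally loads the folder into Delta. This is a sketch only: it assumes an environment that supports streaming tables (for example a Databricks SQL warehouse or a pipeline), and the table name match_events_bronze and the schema string are hypothetical.

```sql
-- ASSUMPTION: hypothetical table name and schema; requires streaming table support.
CREATE OR REFRESH STREAMING TABLE match_events_bronze
AS SELECT *
FROM STREAM read_files(
  's3://folder_path',
  format => 'parquet',
  schema => 'match_result STRUCT<names: ARRAY<STRING>>'
);
```

Queries would then target match_events_bronze instead of the raw Parquet folder.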