Warehousing & Analytics

Running SQL queries against a parquet folder in S3

Shaimaa
New Contributor II

I need to run SQL queries against a parquet folder in S3. I am trying to use read_files, but sometimes my queries fail due to errors while inferring the schema, and sometimes without a specified reason.

Sample query:

SELECT
  SUM(CASE WHEN match_result.names IS NOT NULL AND ARRAY_SIZE(match_result.names) != 0 THEN 1 ELSE 0 END)
FROM read_files('s3://folder_path')

How can I enforce the schema successfully and run my query without errors?

2 REPLIES

shan_chandra
Esteemed Contributor

@Shaimaa - you can split this into a nested query: first select all the fields from S3 while enforcing the schema, then build your aggregation on top of that inner query. Example below (syntax not verified):

SELECT *
  FROM STREAM read_files(
      's3://bucket/path',
      format => 'parquet',
      schema => 'id int, ts timestamp, event string')
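
Building the aggregation on top of that inner query (and dropping STREAM, since the original question is an ad-hoc batch query) might look like the sketch below. The match_result struct and its names array in the schema string are assumptions taken from the question's query, not verified against the actual data:

-- Sketch only: the schema string is an assumption based on the columns in the question.
SELECT
  SUM(CASE WHEN match_result.names IS NOT NULL
           AND ARRAY_SIZE(match_result.names) != 0
      THEN 1 ELSE 0 END) AS non_empty_matches
FROM (
  SELECT *
  FROM read_files(
    's3://folder_path',
    format => 'parquet',
    schema => 'match_result STRUCT<names: ARRAY<STRING>>')
)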

 

holly
Valued Contributor III

There are a few alternatives for you.

1. A switch in syntax - I doubt this will make much difference, but it's worth a shot:

SELECT ... FROM parquet.`s3://folder_path` 

2. Create a view first, then query against it. You should get better error messages this way.

CREATE TEMPORARY VIEW parquetTable
USING parquet
OPTIONS (
  path "s3://bucket/path"
);

SELECT * FROM parquetTable
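
If schema inference is still what fails, the view can also be defined on top of read_files with an explicit schema, combining this approach with the earlier reply. A sketch, reusing the same assumed match_result struct (adjust to your real columns):

-- Sketch: the schema string is hypothetical, based on the columns in the question.
CREATE OR REPLACE TEMPORARY VIEW parquetTable AS
SELECT *
FROM read_files(
  's3://folder_path',
  format => 'parquet',
  schema => 'match_result STRUCT<names: ARRAY<STRING>>')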

3. The clunkiest but most bulletproof: create an empty Delta table with the schema defined upfront, then insert data into it.

CREATE TABLE tableName (
  <<your schema here>>
);

INSERT INTO tableName SELECT col_names FROM parquet.`s3://folder_path`
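
Filled in with the same assumed columns as the sketches above (purely illustrative, swap in your real schema and table name), that could look like:

-- Hypothetical schema based on the question's columns; replace with the real one.
CREATE TABLE match_events (
  match_result STRUCT<names: ARRAY<STRING>>
);

INSERT INTO match_events
SELECT match_result FROM parquet.`s3://folder_path`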

Schema inference only uses the first 1000 rows; if you have more than this, it could explain the failures.

Keep in mind that fundamentally parquet doesn't enforce schema on write. You can have anything going into the data and parquet will accept it. 

If this becomes an enormous headache, you could build an Auto Loader pipeline to turn it into Delta tables, but if it's a minor pain that happens once a week, the syntax above should be enough.
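
For the Auto Loader route, a minimal sketch in SQL, assuming a Databricks streaming table (or Delta Live Tables) context; the table name and schema string here are hypothetical:

-- Hypothetical streaming table that incrementally loads the parquet folder into Delta.
CREATE OR REFRESH STREAMING TABLE parquet_folder_bronze AS
SELECT *
FROM STREAM read_files(
  's3://folder_path',
  format => 'parquet',
  schema => 'match_result STRUCT<names: ARRAY<STRING>>')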
