Databricks Community

Shaimaa · ‎06-14-2024

I need to run SQL queries against a Parquet folder in S3. I am trying to use "read_files", but my queries sometimes fail with errors while inferring the schema and sometimes fail without a specified reason.

Sample query:

SELECT 
SUM(CASE WHEN match_result.names IS NOT NULL AND ARRAY_SIZE(match_result.names) !=0 THEN 1 ELSE 0 END)
FROM read_files('s3://folder_path')

How can I enforce the schema successfully and run my query without errors?

shan_chandra · ‎06-14-2024

@Shaimaa - you can split this into a nested query: first select all the fields from S3 with the schema enforced, then build your aggregation on top of that inner query. An example of the inner query (syntax not verified):

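-- Batch read with an explicit schema. The STREAM keyword is left out here
-- because it is intended for streaming table definitions, not ad hoc queries.
SELECT *
  FROM read_files(
      's3://bucket/path',
      format => 'parquet',
      schema => 'id int, ts timestamp, event string')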
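To make the nesting concrete, here is a rough sketch that wraps a schema-enforced read_files call around the aggregation from the original question. The thread does not state the element type of match_result.names, so ARRAY<STRING> is an assumption, and the alias rows_with_names is made up for illustration. Adjust the schema string to match the real files.

```sql
-- Sketch only: the schema string below is an assumption, not taken from the thread.
SELECT
  SUM(CASE WHEN match_result.names IS NOT NULL
            AND ARRAY_SIZE(match_result.names) != 0
           THEN 1 ELSE 0 END) AS rows_with_names  -- hypothetical alias
FROM (
  -- Inner query: read the Parquet folder with an explicit schema
  -- instead of relying on schema inference.
  SELECT *
  FROM read_files(
    's3://folder_path',
    format => 'parquet',
    schema => 'match_result STRUCT<names: ARRAY<STRING>>')
) AS src;
```

If the query still fails with the schema supplied, the error should at least point at a type mismatch rather than at inference.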

holly · ‎06-20-2024

There are a few alternatives for you.

1. A switch in syntax. I doubt this will make much difference, but it's worth a shot.

SELECT ... FROM parquet.`s3://folder_path`

2. Create a view first, then query against it. You should get better errors this way.

-- No trailing comma after the last option; it would be a syntax error.
CREATE TEMPORARY VIEW parquetTable
USING parquet
OPTIONS (
  path "s3://bucket/path"
);

SELECT * FROM parquetTable;

3. The clunkiest but most bulletproof option: create an empty Delta table with the schema defined upfront, then insert data into it.

CREATE TABLE tableName(
<<your schema here>>
);

INSERT INTO tableName SELECT col_names FROM parquet.`s3://folder_path`;
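As a rough illustration of option 3 applied to the query in the original question, the sketch below defines the table with only the column that query needs. The table name match_results and the ARRAY<STRING> element type are assumptions made for the example; the thread does not give the real schema.

```sql
-- Hypothetical table name and column type; replace with the real schema.
CREATE TABLE IF NOT EXISTS match_results (
  match_result STRUCT<names: ARRAY<STRING>>
);

-- Load only the needed column from the Parquet folder.
-- A type mismatch should surface here, at insert time.
INSERT INTO match_results
SELECT match_result
FROM parquet.`s3://folder_path`;

-- The original aggregation then runs against the table with a fixed schema.
SELECT
  SUM(CASE WHEN match_result.names IS NOT NULL
            AND ARRAY_SIZE(match_result.names) != 0
           THEN 1 ELSE 0 END) AS rows_with_names
FROM match_results;
```

Running DESCRIBE TABLE on the view from option 2 beforehand is one way to see which types were inferred and to decide what to put in the CREATE TABLE statement.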

Schema inference only uses the first 1000 rows. If you have more than this, it could explain the failures.

Keep in mind that, fundamentally, Parquet doesn't enforce schema on write. You can have anything going into the data and Parquet will accept it.

If this becomes an enormous headache, you could build an Auto Loader pipeline to turn it into Delta files, but if it's a minor pain that happens once a week, the syntax above should be enough.
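For the pipeline route, one way to express it in SQL is a streaming table fed by read_files with the STREAM keyword. This is a minimal sketch, not a full pipeline: the table name match_results_bronze and the schema string are assumptions, and streaming tables have workspace requirements, so check that they are available in your environment first.

```sql
-- Sketch only: hypothetical table name and assumed schema.
-- STREAM makes read_files ingest new files incrementally as they arrive.
CREATE OR REFRESH STREAMING TABLE match_results_bronze
AS SELECT *
FROM STREAM read_files(
  's3://folder_path',
  format => 'parquet',
  schema => 'match_result STRUCT<names: ARRAY<STRING>>');
```

Queries then run against the resulting table instead of against the raw Parquet folder.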

Running SQL queries against a parquet folder in S3
