
Spark persistent view on a partition parquet file

sage5616 — Fri, 08 Jul 2022 15:39:55 GMT

In Spark, is it possible to create a persistent view on a partitioned parquet file in Azure Blob Storage? The view must still be available after the cluster is restarted, without having to re-create it, so it cannot be a temp view.

I can create a temp view, but not a persistent view. The following code throws an exception:

spark.sql("CREATE VIEW test USING parquet OPTIONS (path \"/mnt/folder/file.c000.snappy.parquet\")")

ParseException: 
mismatched input 'USING' expecting {'(', 'UP_TO_DATE', 'AS', 'COMMENT', 'PARTITIONED', 'TBLPROPERTIES'}(line 1, pos 23)

Big thank you for taking a look 🙂

Re: Spark persistent view on a partition parquet file

tomasz — Fri, 08 Jul 2022 15:56:25 GMT

Have you tried creating an external table on top of the existing parquet data? Views are built on top of existing tables registered in the metastore (not directly on files).

You can do this with the external table functionality, by specifying LOCATION in your CREATE TABLE statement (https://docs.databricks.com/data-governance/unity-catalog/create-tables.html#create-an-external-table).

Keep in mind that the path specified should be to a directory, not a specific parquet file.
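
A minimal sketch of this suggestion, assuming an existing SparkSession (`spark`, as in a notebook) and the mount directory from the question's example. The table name `test_ext` and both helper functions are hypothetical, made up for illustration. As far as I know, Spark infers the schema of a data source table from the files at LOCATION when no column list is given, so nothing has to be hard-coded. For a partitioned directory layout the partition metadata may also need to be recovered, which is left as an opt-in flag here.

```python
def external_table_sql(table_name: str, directory: str) -> str:
    """Build DDL for an external Parquet table without an explicit column list."""
    return (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        f"USING PARQUET LOCATION '{directory}'"
    )


def register_external_table(spark, table_name: str, directory: str,
                            recover_partitions: bool = False) -> None:
    """Run the DDL on an existing SparkSession passed in by the caller."""
    spark.sql(external_table_sql(table_name, directory))
    if recover_partitions:
        # Assumption: only meaningful when the directory is partitioned;
        # Spark rejects this statement on non-partitioned tables.
        spark.sql(f"MSCK REPAIR TABLE {table_name}")
```

In a notebook this would be called as `register_external_table(spark, "test_ext", "/mnt/folder/", recover_partitions=True)`, after which a regular `CREATE VIEW ... AS SELECT` can be defined on top of `test_ext`.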

Re: Spark persistent view on a partition parquet file

Hubert-Dudek — Fri, 08 Jul 2022 16:51:32 GMT

A VIEW is just a stored SELECT statement, so it cannot be defined with USING. Please register the parquet data as an external TABLE.

Re: Spark persistent view on a partition parquet file

sage5616 — Fri, 08 Jul 2022 17:06:20 GMT

Here is what worked for me. Hope this helps someone else: https://stackoverflow.com/questions/72913913/spark-persistent-view-on-a-partition-parquet-file/72914245#72914245

CREATE VIEW test as select * from parquet.`/mnt/folder-with-parquet-file(s)/`

@Hubert Dudek & @Tomasz Bacewicz unfortunately your answers are not useful.

P.S. I cannot hard-code the columns or dynamically generate table DDL in order to create the external table. I need the schema to be inferred from the parquet file at creation time, without explicitly specifying it ahead of time.
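
For anyone calling this from a notebook or job, here is a small sketch that wraps the statement above in Python. It assumes an existing SparkSession named `spark`; the helper names are hypothetical and the path is the placeholder from this thread, not a real location. The view is persisted in the metastore, so it should not need to be re-created after a cluster restart.

```python
def persistent_view_sql(view_name: str, directory: str) -> str:
    """Build a CREATE VIEW statement that selects straight from a Parquet directory."""
    if "`" in directory:
        # A backtick would end the quoted path early, so reject it up front.
        raise ValueError("directory must not contain a backtick")
    return f"CREATE VIEW {view_name} AS SELECT * FROM parquet.`{directory}`"


def create_persistent_view(spark, view_name: str, directory: str) -> None:
    """Create the view on an existing SparkSession passed in by the caller."""
    spark.sql(persistent_view_sql(view_name, directory))
```

Usage would look like `create_persistent_view(spark, "test", "/mnt/folder-with-parquet-file(s)/")`, followed by `spark.table("test")` to read through the view.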