10-14-2021 07:18 AM
Hi,
When creating a Spark view using Spark SQL ("CREATE VIEW AS SELECT ..."), the view is non-temporary by default: the view definition survives both the Spark session and the Spark cluster.
In PySpark I can use DataFrame.createOrReplaceTempView or DataFrame.createOrReplaceGlobalTempView to create a temporary view for a DataFrame.
Is there a way to create a non-temporary Spark view for a DataFrame programmatically in PySpark?
spark.sql('CREATE VIEW AS SELECT ...') doesn't count 😉
I did not find a DataFrame method to do so...
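To make the gap concrete, here is a minimal sketch of the methods that do exist today, plus the kind of method I am looking for (the last name is hypothetical and does not exist in Spark):

# Existing PySpark APIs -- both create *temporary* views:
df.createOrReplaceTempView("my_view")            # session-scoped
df.createOrReplaceGlobalTempView("my_view")      # lives in the global_temp database

# What I am looking for (hypothetical, no such method exists):
# df.createOrReplacePersistentView("my_view")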
10-14-2021 08:07 AM
Just create a table instead.
10-14-2021 09:45 AM
Creating a table would imply data persistence, wouldn't it?
I don't want that.
10-14-2021 10:18 AM
hi @Martin B. ,
There are 2 types of views. TEMPORARY views are session-scoped and are dropped when the session ends, because their definition is not persisted in the underlying metastore, if any. GLOBAL TEMPORARY views are tied to a system-preserved temporary database called global_temp. If you would like to know more about it, please refer to the docs.
If none of these two options work for you, then the other option will be to create a physical table like @Hubert Dudek mentioned.
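For illustration, a minimal sketch of those two temporary options (view names are placeholders):

# Session-scoped temporary view, visible only in the current SparkSession
df.createOrReplaceTempView("events_tmp")
spark.sql("SELECT * FROM events_tmp").show()

# Global temporary view, registered in the system database global_temp
df.createOrReplaceGlobalTempView("events_gtmp")
spark.sql("SELECT * FROM global_temp.events_gtmp").show()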
Thanks
10-16-2021 01:37 AM
Hi @Jose Gonzalez ,
I would argue there are 3 types of views: temporary, global temporary, and persisted (non-temporary).
Please see the example here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-aux-show-views.htm...
I want to create a non-temporary view (isTemporary = false) - not with Spark SQL but with PySpark.
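For illustration, SHOW VIEWS reports exactly that flag (a minimal sketch; the output also lists the namespace and view name):

# The isTemporary column distinguishes persisted views from temporary ones
spark.sql("SHOW VIEWS").show()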
11-10-2021 05:20 AM
@Jose Gonzalez or @Piper Wilson any ideas?
11-10-2021 05:29 AM
Why not create a managed table?
dataframe.write.mode("overwrite").saveAsTable("<example-table>")
# later, when we need the data
resultDf = spark.read.table("<example-table>")
11-10-2021 08:24 AM
@Hubert Dudek, creating a managed table means persisting the data frame (writing the content of the DataFrame to storage).
Imagine you have a Spark table (in Delta Lake format) containing your raw data, with new data appended to it every 5 minutes.
I want to refine the raw data: filter it, select just specific columns and convert some columns to a better data type (e.g. string to date). With VIEWs you can apply these transformations virtually. Every time you access the view, the data is transformed at access time - but you always get the current data from the underlying table.
When creating a managed table, you basically create a copy of the content of the raw table with the transformations applied. Persisting data comes at performance and storage costs.
Moreover, every time I want to access my "clean" version of the data, I would have to specify the transformation logic again. VIEWs allow me to just access my transformed raw data without the need for manual refresh or persistence.
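For example, the refinement I have in mind looks roughly like this (a sketch; table and column names are made up):

spark.sql("""
    CREATE OR REPLACE VIEW clean_raw_data AS
    SELECT CAST(event_time AS DATE) AS event_date,
           device_id,
           value
    FROM raw_data
    WHERE value IS NOT NULL
""")

# Each read sees the *current* raw data, transformed on access
spark.read.table("clean_raw_data").show()

But that is exactly the SQL route I would like to express through the DataFrame API instead.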
11-10-2021 11:17 AM
OK, now I finally got it 🙂 So the whole question is just how to issue CREATE VIEW AS SELECT via the PySpark API.
In dataframe.py we can see that all views are temporary; that PySpark API routes to Dataset.scala.
In Dataset.scala we can see the condition that would need to be rebuilt, since it does not allow PersistedView:
val viewType = if (global) GlobalTempView else LocalTempView
It should be something like the corresponding logic in SparkSqlParser.scala:
val viewType = if (ctx.TEMPORARY == null) {
PersistedView
} else if (ctx.GLOBAL != null) {
GlobalTempView
} else {
LocalTempView
}
So:
1) private def createTempViewCommand in Dataset.scala needs an additional viewType parameter and should probably be renamed (the name is already misleading since Global views were added).
2) Then functions like createGlobalPersistentCommand etc. could be added in Dataset.scala.
3) Then def createGlobalPersistentView etc. could be added to dataframe.py.
Once that is done in some future Spark version (? :-), it will be possible.
Maybe someone wants to contribute and create the commits 🙂
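For the record, the dataframe.py side of such a change might look roughly like this (purely hypothetical sketch; neither the method name nor the underlying JVM call exists in Spark today):

def createGlobalPersistentView(self, name):
    """Hypothetical: would register a persisted (non-temporary) view for this
    DataFrame, mirroring createOrReplaceGlobalTempView. It would require a
    PersistedView-aware view command on the Scala side first."""
    self._jdf.createGlobalPersistentView(name)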