Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Is there a way to create a non-temporary Spark View with PySpark?

MartinB
Contributor III

Hi,

When creating a Spark view using Spark SQL ("CREATE VIEW AS SELECT ..."), the view is non-temporary by default: the view definition survives the Spark session as well as the Spark cluster.

In PySpark I can use DataFrame.createOrReplaceTempView or DataFrame.createOrReplaceGlobalTempView to create a temporary view for a DataFrame.

Is there a way to create a non-temporary Spark view for a DataFrame programmatically with PySpark?

spark.sql('CREATE VIEW AS SELECT ...') doesn't count 😉

I did not find a DataFrame method to do so...
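For context, a minimal sketch of what is available today (the DataFrame, view, and table names are made up for illustration):

df = spark.range(10)

# Temporary views via the DataFrame API -- both disappear with the session/cluster:
df.createOrReplaceTempView("my_temp_view")           # session-scoped
df.createOrReplaceGlobalTempView("my_global_view")   # registered under global_temp

# The only persistent option is SQL text -- exactly what the question wants to avoid:
spark.sql("CREATE OR REPLACE VIEW my_view AS SELECT id FROM some_table")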

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

OK, now I finally got it 🙂 So the whole question is just how to do CREATE VIEW AS SELECT via the PySpark API.

In dataframe.py we can see that all views are temporary:

https://github.com/apache/spark/blob/7d9a4fab7957086a12ba3e9e2856e20566531e3a/python/pyspark/sql/dat...

This PySpark API routes to Dataset.scala:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.sca...

In Dataset.scala we can see the condition that would need to be rebuilt; it does not allow PersistedView:

val viewType = if (global) GlobalTempView else LocalTempView

It should be something like the logic in SparkSqlParser.scala:

      val viewType = if (ctx.TEMPORARY == null) {
        PersistedView
      } else if (ctx.GLOBAL != null) {
        GlobalTempView
      } else {
        LocalTempView
      }

So:

1) private def createTempViewCommand in Dataset.scala needs an additional viewType param, and should probably be renamed (the current name is already misleading since Global was added)

2) then functions like createGlobalPersistentCommand etc. could be added in Dataset.scala

3) then def createGlobalPersistentView etc. could be added to dataframe.py

Once that is done in a future Spark release, it will be possible.

Maybe someone wants to contribute and create the commits 🙂
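Until such an API exists, the only route to a persistent view from PySpark is going through SQL text; a minimal sketch (view, table, and column names are made up):

view_query = """
    SELECT id, CAST(ts AS DATE) AS day
    FROM raw_table
    WHERE id IS NOT NULL
"""
# Persistent: survives the session and the cluster (isTemporary=false).
# Note: a persistent view cannot reference temporary views.
spark.sql(f"CREATE OR REPLACE VIEW clean_view AS {view_query}")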


8 REPLIES

Hubert-Dudek
Esteemed Contributor III

Just create a table instead.

MartinB
Contributor III

Creating a table would imply data persistence, wouldn't it?

I don't want that.

jose_gonzalez
Databricks Employee

Hi @Martin B.,

There are 2 types of views. TEMPORARY views are session-scoped and are dropped when the session ends, because their definition is not persisted in the underlying metastore, if any. GLOBAL TEMPORARY views are tied to a system-preserved temporary database, global_temp. If you would like to know more, please refer to the docs.
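A quick sketch of how the two behave (the DataFrame and view names are illustrative):

df = spark.range(5)
df.createOrReplaceTempView("t")          # session-scoped; gone when this session ends
df.createOrReplaceGlobalTempView("g")    # cluster-scoped; registered under global_temp

spark.sql("SELECT * FROM t").show()              # resolvable only in the creating session
spark.sql("SELECT * FROM global_temp.g").show()  # resolvable from any session on the cluster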

If neither of these two options works for you, then the other option would be to create a physical table, as @Hubert Dudek mentioned.

Thanks

MartinB
Contributor III

Hi @Jose Gonzalezโ€‹ ,

I would argue there are 3 types of views:

  • TEMPORARY VIEWS
    • CREATE TEMPORARY VIEW sam AS SELECT * FROM ...
  • GLOBAL TEMPORARY VIEWS
    • CREATE GLOBAL TEMPORARY VIEW sam AS SELECT * FROM ...
  • NON-TEMPORARY VIEWS
    • CREATE VIEW sam AS SELECT * FROM ...

Please see the example here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-aux-show-views.htm...

(Screenshot: SHOW VIEWS output, showing the isTemporary column.)

I want to create a non-temporary view (isTemporary=false), not with Spark SQL but with PySpark.
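For reference, a small sketch of how that flag shows up, assuming Spark 3.x where SHOW VIEWS is available (names made up):

spark.sql("CREATE VIEW persisted_v AS SELECT 1 AS x")
spark.range(1).createOrReplaceTempView("temp_v")

spark.sql("SHOW VIEWS").show()
# +---------+-----------+-----------+
# |namespace|   viewName|isTemporary|
# +---------+-----------+-----------+
# |  default|persisted_v|      false|
# |         |     temp_v|       true|
# +---------+-----------+-----------+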

MartinB
Contributor III

@Jose Gonzalez or @Piper Wilson, any ideas?

Hubert-Dudek
Esteemed Contributor III

Why not create a managed table?

dataframe.write.mode("overwrite").saveAsTable("<example-table>")

# later, when we need the data
resultDf = spark.read.table("<example-table>")

MartinB
Contributor III

@Hubert Dudek, creating a managed table means persisting the DataFrame (writing its content to storage).

Imagine you have a Spark table (in Delta Lake format) containing your raw data. Every 5 minutes, new data is appended to that raw table.

I want to refine the raw data: filter it, select just specific columns, and convert some columns to a better data type (e.g. string to date). With VIEWs you can apply these transformations virtually. Every time you access the view, the data is transformed at access time, but you always get the current data from the underlying table.

When creating a managed table, you basically create a copy of the content of the raw table with the transformations applied. Persisting data comes at performance and storage costs.

Moreover, every time I want to access my "clean" version of the data, I would have to specify the transformation logic again. VIEWs allow me to access my transformed raw data without the need for manual refreshes or persistence.
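To illustrate the difference (table, view, and column names are made up):

# View: the transformation is re-evaluated on every read, so rows newly
# appended to raw_table show up automatically.
spark.sql("""
    CREATE OR REPLACE VIEW clean_view AS
    SELECT CAST(ts AS DATE) AS day, sensor_id
    FROM raw_table
    WHERE sensor_id IS NOT NULL
""")

# Managed table: a snapshot at write time; it must be rewritten every time
# raw_table changes, and it occupies additional storage.
(spark.table("raw_table")
    .where("sensor_id IS NOT NULL")
    .selectExpr("CAST(ts AS DATE) AS day", "sensor_id")
    .write.mode("overwrite").saveAsTable("clean_snapshot"))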

Hubert-Dudek
Esteemed Contributor III

(Accepted solution; see above.)
