10-14-2021 07:18 AM
Hi,
When creating a Spark view using Spark SQL ("CREATE VIEW AS SELECT ..."), the view is non-temporary by default: the view definition survives both the Spark session and the Spark cluster.
In PySpark I can use DataFrame.createOrReplaceTempView or DataFrame.createOrReplaceGlobalTempView to create a temporary view for a DataFrame.
Is there a way to create a non-temporary Spark view for a DataFrame programmatically in PySpark?
spark.sql('CREATE VIEW AS SELECT ...') doesn't count 😉
I did not find a DataFrame method to do so...
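To make the gap concrete, here is a minimal sketch of the methods that do exist today, plus the kind of method I am looking for (the last name is hypothetical and does not exist in Spark):

# Existing PySpark APIs -- both create *temporary* views:
df.createOrReplaceTempView("my_view")            # session-scoped
df.createOrReplaceGlobalTempView("my_view")      # lives in the global_temp database

# What I am looking for (hypothetical, no such method exists):
# df.createOrReplacePersistentView("my_view")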
10-14-2021 08:07 AM
Just create a table instead.
10-14-2021 09:45 AM
Creating a table would imply data persistence, wouldn't it?
I don't want that.
10-14-2021 10:18 AM
hi @Martin B. ,
There are 2 types of views. TEMPORARY views are session-scoped and are dropped when the session ends, because their definition is not persisted in the underlying metastore, if any. GLOBAL TEMPORARY views are tied to a system-preserved temporary database called global_temp. If you would like to know more about it, please refer to the docs.
If none of these two options work for you, then the other option will be to create a physical table like @Hubert Dudek mentioned.
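For illustration, a minimal sketch of those two temporary options (view names are placeholders):

# Session-scoped temporary view, visible only in the current SparkSession
df.createOrReplaceTempView("events_tmp")
spark.sql("SELECT * FROM events_tmp").show()

# Global temporary view, registered in the system database global_temp
df.createOrReplaceGlobalTempView("events_gtmp")
spark.sql("SELECT * FROM global_temp.events_gtmp").show()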
Thanks
10-16-2021 01:37 AM
Hi @Jose Gonzalez ,
I would argue there are 3 types of views: temporary, global temporary, and persisted (non-temporary).
Please see the example here: https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-aux-show-views.htm...
I want to create a non-temporary view (isTemporary = false) - not with Spark SQL but with PySpark.
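For illustration, SHOW VIEWS reports exactly that flag (a minimal sketch; the output also lists the namespace and view name):

# The isTemporary column distinguishes persisted views from temporary ones
spark.sql("SHOW VIEWS").show()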
11-10-2021 05:20 AM
@Jose Gonzalez or @Piper Wilson any ideas?
11-10-2021 05:29 AM
Why not create a managed table?
dataframe.write.mode("overwrite").saveAsTable("<example-table>")
# later, when we need the data
resultDf = spark.read.table("<example-table>")
11-10-2021 08:24 AM
@Hubert Dudek, creating a managed table means persisting the data frame (writing the content of the DataFrame to storage).
Imagine you have a Spark table (in Delta Lake format) containing your raw data, with new data appended to it every 5 minutes.
I want to refine the raw data: filter it, select just specific columns and convert some columns to a better data type (e.g. string to date). With VIEWs you can apply these transformations virtually. Every time you access the view, the data is transformed at access time - but you always get the current data from the underlying table.
When creating a managed table, you basically create a copy of the content of the raw table with the transformations applied. Persisting data comes at performance and storage costs.
Moreover, every time I want to access my "clean" version of the data, I would have to specify the transformation logic again. VIEWs allow me to just access my transformed raw data without the need for manual refresh or persistence.
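For example, the refinement I have in mind looks roughly like this (a sketch; table and column names are made up):

spark.sql("""
    CREATE OR REPLACE VIEW clean_raw_data AS
    SELECT CAST(event_time AS DATE) AS event_date,
           device_id,
           value
    FROM raw_data
    WHERE value IS NOT NULL
""")

# Each read sees the *current* raw data, transformed on access
spark.read.table("clean_raw_data").show()

But that is exactly the SQL route I would like to express through the DataFrame API instead.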
11-10-2021 11:17 AM
OK, now I finally got it 🙂 So the whole question is just how to issue CREATE VIEW AS SELECT via the PySpark API.
In dataframe.py we can see that all views are temporary; that PySpark API routes to Dataset.scala.
In Dataset.scala we can see the condition that would need to be rebuilt, since it does not allow PersistedView:
val viewType = if (global) GlobalTempView else LocalTempView
It should be something like the corresponding logic in SparkSqlParser.scala:
val viewType = if (ctx.TEMPORARY == null) {
PersistedView
} else if (ctx.GLOBAL != null) {
GlobalTempView
} else {
LocalTempView
}
So:
1) private def createTempViewCommand in Dataset.scala needs an additional viewType parameter and should probably be renamed (the name is already misleading since Global views were added).
2) Then functions like createGlobalPersistentCommand etc. could be added in Dataset.scala.
3) Then def createGlobalPersistentView etc. could be added to dataframe.py.
Once that is done in some future Spark version (? :-), it will be possible.
Maybe someone wants to contribute and create the commits 🙂
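For the record, the dataframe.py side of such a change might look roughly like this (purely hypothetical sketch; neither the method name nor the underlying JVM call exists in Spark today):

def createGlobalPersistentView(self, name):
    """Hypothetical: would register a persisted (non-temporary) view for this
    DataFrame, mirroring createOrReplaceGlobalTempView. It would require a
    PersistedView-aware view command on the Scala side first."""
    self._jdf.createGlobalPersistentView(name)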