cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

How do temp views actually work?

gbalboa
New Contributor

So I'm querying data from parquet files that have a couple of billions records (table 1 or t1), and then have to filter and then join with other parquet files with another couple of billions records (t2).

This takes quite a long time to run (like 10hs or so for each query), and I'm seeing that after saving the results of filtering t1 into a temp view, every time I run a query using the results from the temp view, it scans the parquet files again and filters again.

I ended up creating a table in the databricks dbfs and inserting the results of the filtering of t1 and then querying that table for the joins, which runs much faster. So I'm wondering, does the temp view only saves the query but actually runs it every time it's used? Is there a more efficient way to do this without inserting into a table?

1 ACCEPTED SOLUTION

Accepted Solutions

PeteStern
New Contributor III
New Contributor III

Your intuition about views is correct. Views are not materialized, so they are basically just a saved query. Every time you access a view it will have to be recomputed. This is certainly not ideal if it take a long time (like 10hrs) to materialize a view. As you mentioned, the best way of handling this problem is to create a table instead of a view. Use a CREATE TABLE AS (CTAS) statement.

For example:

CREATE TABLE my_db.saved_view AS
( SELECT * FROM my_view WHERE x = z)

Drop the table when you're done with it, and it will all be cleaned up. There's currently not a more efficient way, but I would argue that this is actually pretty efficient. Unlike with the normal file systems you may be used to on your laptop or server, Databricks SQL is backed by a cloud storage system. These operate very fast in parallel with many many disks. So this would be my recommended approach. It also scales basically infinitely.

Another option would be to use CACHE TABLE.

CACHE TABLE my_view

This would work as well, and you would just use the view as if it's a table. The results will be saved now and will not have to be recomputed, however the result is stored in the memory of your cluster and therefore does not persist. So if you spend 10 hours building this view and your cluster shuts down, you're going to have to do it all over again. Also, if your materialized view is too large for the cluster's memory, you're going to have trouble, and your cluster may become unstable. Save yourself the hassle and create the table. You'll preserve your sanity, and it often performs better as well (due to compression, smarter caching, and data skipping, which are all available for tables but not explicitly cached data).

View solution in original post

1 REPLY 1

PeteStern
New Contributor III
New Contributor III

Your intuition about views is correct. Views are not materialized, so they are basically just a saved query. Every time you access a view it will have to be recomputed. This is certainly not ideal if it take a long time (like 10hrs) to materialize a view. As you mentioned, the best way of handling this problem is to create a table instead of a view. Use a CREATE TABLE AS (CTAS) statement.

For example:

CREATE TABLE my_db.saved_view AS
( SELECT * FROM my_view WHERE x = z)

Drop the table when you're done with it, and it will all be cleaned up. There's currently not a more efficient way, but I would argue that this is actually pretty efficient. Unlike with the normal file systems you may be used to on your laptop or server, Databricks SQL is backed by a cloud storage system. These operate very fast in parallel with many many disks. So this would be my recommended approach. It also scales basically infinitely.

Another option would be to use CACHE TABLE.

CACHE TABLE my_view

This would work as well, and you would just use the view as if it's a table. The results will be saved now and will not have to be recomputed, however the result is stored in the memory of your cluster and therefore does not persist. So if you spend 10 hours building this view and your cluster shuts down, you're going to have to do it all over again. Also, if your materialized view is too large for the cluster's memory, you're going to have trouble, and your cluster may become unstable. Save yourself the hassle and create the table. You'll preserve your sanity, and it often performs better as well (due to compression, smarter caching, and data skipping, which are all available for tables but not explicitly cached data).

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.