Databricks

gbalboa · ‎05-19-2022

So I'm querying data from parquet files that have a couple of billions records (table 1 or t1), and then have to filter and then join with other parquet files with another couple of billions records (t2).

This takes quite a long time to run (like 10hs or so for each query), and I'm seeing that after saving the results of filtering t1 into a temp view, every time I run a query using the results from the temp view, it scans the parquet files again and filters again.

I ended up creating a table in the databricks dbfs and inserting the results of the filtering of t1 and then querying that table for the joins, which runs much faster. So I'm wondering, does the temp view only saves the query but actually runs it every time it's used? Is there a more efficient way to do this without inserting into a table?

PeteStern · ‎05-20-2022

Your intuition about views is correct. Views are not materialized, so they are basically just a saved query. Every time you access a view it will have to be recomputed. This is certainly not ideal if it take a long time (like 10hrs) to materialize a view. As you mentioned, the best way of handling this problem is to create a table instead of a view. Use a CREATE TABLE AS (CTAS) statement.

For example:

CREATE TABLE my_db.saved_view AS
( SELECT * FROM my_view WHERE x = z)

Drop the table when you're done with it, and it will all be cleaned up. There's currently not a more efficient way, but I would argue that this is actually pretty efficient. Unlike with the normal file systems you may be used to on your laptop or server, Databricks SQL is backed by a cloud storage system. These operate very fast in parallel with many many disks. So this would be my recommended approach. It also scales basically infinitely.

Another option would be to use CACHE TABLE.

CACHE TABLE my_view

This would work as well, and you would just use the view as if it's a table. The results will be saved now and will not have to be recomputed, however the result is stored in the memory of your cluster and therefore does not persist. So if you spend 10 hours building this view and your cluster shuts down, you're going to have to do it all over again. Also, if your materialized view is too large for the cluster's memory, you're going to have trouble, and your cluster may become unstable. Save yourself the hassle and create the table. You'll preserve your sanity, and it often performs better as well (due to compression, smarter caching, and data skipping, which are all available for tables but not explicitly cached data).

View solution in original post

PeteStern · ‎05-20-2022

Your intuition about views is correct. Views are not materialized, so they are basically just a saved query. Every time you access a view it will have to be recomputed. This is certainly not ideal if it take a long time (like 10hrs) to materialize a view. As you mentioned, the best way of handling this problem is to create a table instead of a view. Use a CREATE TABLE AS (CTAS) statement.

For example:

CREATE TABLE my_db.saved_view AS
( SELECT * FROM my_view WHERE x = z)

Drop the table when you're done with it, and it will all be cleaned up. There's currently not a more efficient way, but I would argue that this is actually pretty efficient. Unlike with the normal file systems you may be used to on your laptop or server, Databricks SQL is backed by a cloud storage system. These operate very fast in parallel with many many disks. So this would be my recommended approach. It also scales basically infinitely.

Another option would be to use CACHE TABLE.

CACHE TABLE my_view

This would work as well, and you would just use the view as if it's a table. The results will be saved now and will not have to be recomputed, however the result is stored in the memory of your cluster and therefore does not persist. So if you spend 10 hours building this view and your cluster shuts down, you're going to have to do it all over again. Also, if your materialized view is too large for the cluster's memory, you're going to have trouble, and your cluster may become unstable. Save yourself the hassle and create the table. You'll preserve your sanity, and it often performs better as well (due to compression, smarter caching, and data skipping, which are all available for tables but not explicitly cached data).

Databricks

How do temp views actually work?

Unity Catalog Lakeguard: Industry-first and only data governance for multi-user Apache™ Spark cluste

Announcing the General Availability of Databricks Asset Bundles

Register now and save 50% on training at Data + AI Summit!

How to successfully build GenAI applications

Meet DBRX, the New Standard for High-Quality LLMs