How to cache on 500 billion rows

JoseMacedo — Tue, 26 Mar 2024 13:05:58 GMT

Hello!

I'm using a serverless SQL warehouse on Databricks, and I have a dataset in a Delta table with 500 billion rows. I'm trying to filter it down to around 7 billion rows and then cache that dataset so I can reuse it in other queries and make them run faster.

When I cache the table, it takes 1s and gives no error or warning.

When I select from the cached table, I get an error saying the table cannot be found.

This is what I'm doing:

CACHE TABLE table_filtered_cache AS select * from prod_datalake.table a WHERE a.year >= 2023 etc

and then

select count(*) from table_filtered_cache

What am I doing wrong, and what would you advise me to do?

Re: How to cache on 500 billion rows

-werners- — Tue, 26 Mar 2024 13:20:12 GMT

Can you try creating a global temp view of the cache?

Re: How to cache on 500 billion rows

JoseMacedo — Tue, 26 Mar 2024 13:37:31 GMT

I tried this:

CREATE GLOBAL TEMPORARY VIEW table_filtered_cache AS select

and got this error:

"GLOBAL TEMPORARY VIEW is not supported on a SQL warehouse."

Re: How to cache on 500 billion rows

-werners- — Tue, 26 Mar 2024 13:44:37 GMT

I missed the 'serverless SQL' part. CACHE is for Spark; I don't think it works on serverless SQL.
Here is how caching works on DBSQL.
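
If the goal is simply a reusable 7-billion-row subset, one alternative to CACHE TABLE is to materialize the filtered rows as a regular Delta table once and point the later queries at it. This is a different technique (a persisted table, not a cache), sketched under assumptions: the target name prod_datalake.table_filtered is hypothetical, and the filter is copied from the original post with its "etc" left elided.

```sql
-- Sketch only: persist the filtered subset once as a Delta table.
-- prod_datalake.table_filtered is a hypothetical name, not from the thread.
CREATE OR REPLACE TABLE prod_datalake.table_filtered AS
SELECT *
FROM prod_datalake.table a
WHERE a.year >= 2023;  -- add the remaining filters that the post abbreviates as "etc"

-- Later queries read the smaller table instead of scanning 500 billion rows.
SELECT count(*) FROM prod_datalake.table_filtered;
```

The trade-off is extra storage and the need to rebuild or refresh the table when the source changes, so it fits best when the same subset is queried many times.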