<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic How does Spark do lazy evaluation? in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35125#M25798</link>
    <description>&lt;P&gt;For context, I am running Spark on the Databricks platform and using Delta tables stored on S3.&lt;/P&gt;&lt;P&gt;Let's assume we have a table called &lt;B&gt;&lt;I&gt;table_one&lt;/I&gt;&lt;/B&gt;. I create a view called &lt;I&gt;view_one&lt;/I&gt; from the table and then query &lt;I&gt;view_one&lt;/I&gt;. Next, I create another view, called &lt;I&gt;view_two&lt;/I&gt;, based on view_one, and then query view_two. Will all the calculations for &lt;I&gt;view_one&lt;/I&gt; be done again?&lt;/P&gt;&lt;P&gt;Example commands are below, i.e. when cmd4 is called, will cmd1 be re-executed to compute cmd4?&lt;/P&gt;&lt;P&gt;Cmd1:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;CREATE OR REPLACE VIEW VIEW_ONE AS
SELECT
    ....
FROM
    table_one
WHERE
    .....;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Cmd2:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT * FROM VIEW_ONE;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Cmd3:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;CREATE OR REPLACE VIEW VIEW_TWO AS
SELECT
    ....
FROM
    VIEW_ONE
WHERE
    .....;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Cmd4:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT * FROM VIEW_TWO;&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Sat, 13 Nov 2021 17:40:42 GMT</pubDate>
    <dc:creator>Constantine</dc:creator>
    <dc:date>2021-11-13T17:40:42Z</dc:date>
    <item>
      <title>How does Spark do lazy evaluation?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35125#M25798</link>
      <description>&lt;P&gt;For context, I am running Spark on the Databricks platform and using Delta tables stored on S3.&lt;/P&gt;&lt;P&gt;Let's assume we have a table called &lt;B&gt;&lt;I&gt;table_one&lt;/I&gt;&lt;/B&gt;. I create a view called &lt;I&gt;view_one&lt;/I&gt; from the table and then query &lt;I&gt;view_one&lt;/I&gt;. Next, I create another view, called &lt;I&gt;view_two&lt;/I&gt;, based on view_one, and then query view_two. Will all the calculations for &lt;I&gt;view_one&lt;/I&gt; be done again?&lt;/P&gt;&lt;P&gt;Example commands are below, i.e. when cmd4 is called, will cmd1 be re-executed to compute cmd4?&lt;/P&gt;&lt;P&gt;Cmd1:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;CREATE OR REPLACE VIEW VIEW_ONE AS
SELECT
    ....
FROM
    table_one
WHERE
    .....;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Cmd2:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT * FROM VIEW_ONE;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Cmd3:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;CREATE OR REPLACE VIEW VIEW_TWO AS
SELECT
    ....
FROM
    VIEW_ONE
WHERE
    .....;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Cmd4:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;SELECT * FROM VIEW_TWO;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sat, 13 Nov 2021 17:40:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35125#M25798</guid>
      <dc:creator>Constantine</dc:creator>
      <dc:date>2021-11-13T17:40:42Z</dc:date>
    </item>
    <item>
      <title>Re: How does Spark do lazy evaluation?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35126#M25799</link>
      <description>&lt;P&gt;Hello @John Constantine! My name is Piper and I'm a community moderator for Databricks. Welcome to the community and thank you for your question! Let's give it a while to see what other members have to say. &lt;span class="lia-unicode-emoji" title=":slightly_smiling_face:"&gt;🙂&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 13 Nov 2021 20:14:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35126#M25799</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2021-11-13T20:14:08Z</dc:date>
    </item>
    <item>
      <title>Re: How does Spark do lazy evaluation?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35127#M25800</link>
      <description>&lt;P&gt;Short answer: yes, Spark will run the query behind view_one twice.&lt;/P&gt;&lt;P&gt;Unless you cache it (using the Delta cache or persist()/cache()).&lt;/P&gt;</description>
      <pubDate>Sun, 14 Nov 2021 06:57:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35127#M25800</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2021-11-14T06:57:26Z</dc:date>
    </item>
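The recompute-unless-cached behavior described in the reply above can be sketched with standard Spark SQL caching statements, using the view names from the thread's example (a minimal sketch; in Spark 3.x, CACHE TABLE is eager by default and the LAZY keyword defers the fill):

```sql
-- Without caching, every query against VIEW_TWO re-runs the full plan,
-- including the SELECT that defines VIEW_ONE: views store no data.

-- Materialize VIEW_ONE's result in the cluster cache.
CACHE TABLE VIEW_ONE;

SELECT * FROM VIEW_ONE;   -- Cmd2: served from the cached result
SELECT * FROM VIEW_TWO;   -- Cmd4: reads VIEW_ONE's cached result
                          -- instead of rescanning table_one

-- Release the cached data when it is no longer needed.
UNCACHE TABLE VIEW_ONE;
```

Note that this trades freshness for speed: the cached result will not reflect later writes to table_one until the cache is refreshed or dropped.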
    <item>
      <title>Re: How does Spark do lazy evaluation?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35128#M25801</link>
      <description>&lt;P&gt;Hi @John Constantine, for Delta caching you can refer to the doc link below.&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.databricks.com/delta/optimizations/delta-cache.html" alt="https://docs.databricks.com/delta/optimizations/delta-cache.html" target="_blank"&gt;https://docs.databricks.com/delta/optimizations/delta-cache.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 17:33:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35128#M25801</guid>
      <dc:creator>Prabakar</dc:creator>
      <dc:date>2021-11-15T17:33:23Z</dc:date>
    </item>
    <item>
      <title>Re: How does Spark do lazy evaluation?</title>
      <link>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35129#M25802</link>
      <description>&lt;P&gt;Hi @John Constantine,&lt;/P&gt;&lt;P&gt;The following notebook &lt;A href="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2168141618055043/484361/latest.html" alt="https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2168141618055043/484361/latest.html" target="_blank"&gt;url&lt;/A&gt; will help you understand the difference between lazy transformations and actions in Spark. You will be able to compare the physical query plans and better understand what is going on when you execute your SQL statements.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Nov 2021 19:08:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-does-spark-do-lazy-evaluation/m-p/35129#M25802</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-11-15T19:08:43Z</dc:date>
    </item>
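The plan comparison suggested in the reply above can also be done directly in Spark SQL with EXPLAIN, which prints the query plan without executing it (a sketch using the thread's view names):

```sql
-- Show the physical plan Spark would execute for Cmd4.
-- The scan of table_one from VIEW_ONE's definition appears inline,
-- because views are expanded into the plan rather than read as stored data.
EXPLAIN SELECT * FROM VIEW_TWO;

-- EXPLAIN EXTENDED additionally shows the parsed, analyzed, and
-- optimized logical plans that lead to the physical plan.
EXPLAIN EXTENDED SELECT * FROM VIEW_TWO;
```

If VIEW_ONE has been cached beforehand, the plan will show an in-memory or cached scan in place of the table_one scan, which makes the recompute-vs-reuse question answerable by inspection.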
  </channel>
</rss>

