PySpark Lazy Evaluation - Why does my logging function seem to execute without an explicit action in Databricks?
Hello everyone,
I was scrolling and came across a Medium post on PySpark (https://medium.com/@sudeepwrites/pyspark-secrets-no-one-talks-about-but-every-data-engineer-should-k...) and have a question about lazy evaluation. The article says that transformations are lazy and will not execute until an action is called (which I already knew).
The article includes some code that I tried to run. According to the article, it should not print 'Logging something...'. However, it does:
from pyspark.sql.functions import col

def log_step(df):
    print("Logging something...")
    return df

df = spark.sql("select * from delta.`s3://abc` limit 10")
log_step(df.filter(col("flag_source_system_delete") == "yes"))  # no .show()/.count(), yet the print appears
Even without an explicit action like .show() or .count() on the final line, the print() statement inside log_step executes and I see the output in my notebook. My understanding is that filter is a transformation, so that last line should not trigger any execution.
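For reference, here is a minimal version of what I'm seeing, split into separate steps so I could watch exactly when the print fires (same table and column names as above):

filtered = df.filter(col("flag_source_system_delete") == "yes")  # lazy: no output, no Spark job
logged = log_step(filtered)  # "Logging something..." prints immediately here
# logged.show()  # as far as I can tell, only this line would actually run the query

The print fires on the second line, at the moment log_step is called, and I don't see any Spark job start in the UI at that point.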
Can someone please explain why this is happening? Is there an implicit action being triggered by the Databricks notebook environment, or am I fundamentally misunderstanding something about lazy evaluation with functions?
Thank you!