05-18-2024 03:30 AM - edited 05-18-2024 08:21 AM
Hi All,
We have a table with an id column generated by uuid(). For ETL we use Databricks/Spark SQL temporary views. We observed strange behavior that differs between a Databricks SQL temp view (CREATE OR REPLACE TEMPORARY VIEW) and a Spark SQL temp view (df.createOrReplaceTempView()).
Spark SQL: uuid() was evaluated every time, and after joining with another table the result was weird: the uuid generated for one primary key was associated with another, somehow resulting in duplicate uuid() values.
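# spark is the notebook's SparkSession
df = spark.sql("SELECT *, uuid() AS id FROM source_table")
df.createOrReplaceTempView("readData")
df = spark.sql("SELECT * FROM readData JOIN target_table ON readData.primary_key = target_table.primary_key")
df.createOrReplaceTempView("mergePrep")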
Databricks SQL: with the same process, the uuid() values stayed fixed once generated, and everything was fine after the join as well.
CREATE OR REPLACE TEMP VIEW readData AS SELECT *, uuid() AS id FROM source_table;
CREATE OR REPLACE TEMP VIEW mergePrep AS SELECT * FROM readData JOIN target_table ON readData.primary_key = target_table.primary_key;
Using Databricks SQL resolves my issue. However, I want to know how the two approaches differ when they perform the same operations. From my research, I found that a Spark SQL DataFrame is evaluated every time we select from it. Does that mean it is still re-evaluated after the temp view is created (including underlying nondeterministic functions like uuid()), and that the same does not happen with the Databricks SQL method?
I'd appreciate your support on this. Please point me to the right resources. Thanks
05-20-2024 03:09 AM
Hi @shadowinc,
Creation of Temporary Views:
The createOrReplaceTempView method creates a temporary view scoped to the SparkSession. It gets dropped when the session closes.
Evaluation of Expressions:
A Databricks SQL temp view evaluates expressions (including uuid()) once during view creation and retains the results. Subsequent queries against the view reuse these precomputed values.
With df.createOrReplaceTempView, expressions (including nondeterministic functions like uuid()) are re-evaluated each time you query the view. This dynamic evaluation can lead to different results, especially if the underlying data changes between queries.
Feel free to explore the provided resources for deeper insights!
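To make the "evaluated once" versus "re-evaluated on every query" distinction above concrete, here is a small plain-Python analogy. It is not Spark code and makes no claim about how either kind of view is implemented; the names source_rows, lazy_view, stored_rows, stored_view, and ids are made up for illustration, and Python's uuid.uuid4() stands in for the SQL uuid() function.

```python
import uuid

# Hypothetical stand-in for source_table: three rows keyed by primary_key.
source_rows = [{"primary_key": k} for k in (1, 2, 3)]


def lazy_view():
    """Like a view that stores only the query: ids are generated on every read."""
    return [{**row, "id": str(uuid.uuid4())} for row in source_rows]


# Like a result computed once and kept: ids are generated a single time.
stored_rows = lazy_view()


def stored_view():
    """Every read returns the same stored rows."""
    return stored_rows


def ids(rows):
    """Collect the id column from a list of row dicts."""
    return [row["id"] for row in rows]


# Two reads of the lazy view give different ids for the same primary keys.
lazy_ids_change = ids(lazy_view()) != ids(lazy_view())

# Two reads of the stored rows give identical ids.
stored_ids_stable = ids(stored_view()) == ids(stored_view())

print("lazy ids change between reads:", lazy_ids_change)
print("stored ids stay the same:", stored_ids_stable)
```

If a pipeline reads the lazy version more than once, for example once on each side of a join, the ids it sees for a given primary key need not agree. Writing the generated ids out to a table before joining is one way to pin them.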
05-20-2024 05:03 AM
Thanks, @Kaniz_Fatma. I suspected that, but could not find any links confirming it.