Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

spark/databricks temporary views and uuid

shadowinc
New Contributor III

Hi All,

We have a table with an id column generated by uuid(). For ETL we use Databricks/Spark SQL temporary views. We observed strange behavior between the Databricks SQL temp view (CREATE OR REPLACE TEMPORARY VIEW) and the Spark SQL temp view (df.createOrReplaceTempView()).

Spark SQL: uuid() was evaluated every time, and when joined with another table the results were wrong; the uuid generated for one primary-key value got associated with another, somehow resulting in duplicate uuids.

df = spark.sql("SELECT *, uuid() AS id FROM source_table")

df.createOrReplaceTempView("readData")

df = spark.sql("SELECT * FROM readData JOIN target_table USING (primary_key)")

df.createOrReplaceTempView("mergePrep")

Databricks SQL: when using this with the same process, the uuids, once generated, stayed fixed, and after the join everything was fine as well.

CREATE OR REPLACE TEMP VIEW readData AS SELECT *, uuid() AS id FROM source_table;
CREATE OR REPLACE TEMP VIEW mergePrep AS SELECT * FROM readData JOIN target_table USING (primary_key);

Using Databricks SQL resolves my issue; however, I want to know how the two approaches differ while performing the same operations. From my research I found that a Spark SQL DataFrame is re-evaluated every time you run a select against it. Does that mean that even after creating the temp view, the underlying nondeterministic functions like uuid() are re-evaluated on every query, and that the same does not happen with the Databricks SQL method?
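To illustrate the behavior I suspect, here is a minimal pure-Python analogy (not Spark code; it uses the standard uuid module) of a lazily re-evaluated view over a nondeterministic expression:

```python
import uuid

# Plain-Python analogy of a lazy temp view: each "scan" re-runs the
# generating expression, so the nondeterministic id is re-evaluated.
source_rows = [{"pk": 1}, {"pk": 2}]

def read_data_view():
    # Re-evaluated on every read, like uuid() behind a Spark temp view.
    return [{**row, "id": str(uuid.uuid4())} for row in source_rows]

first_scan = {r["pk"]: r["id"] for r in read_data_view()}
second_scan = {r["pk"]: r["id"] for r in read_data_view()}

# The same primary key gets a different id on each scan.
assert first_scan[1] != second_scan[1]
```

This matches what I observed: each query against the Spark SQL temp view behaved like a fresh scan.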

Appreciate your support on this. Please point me to the right resources. Thanks!

1 ACCEPTED SOLUTION

Accepted Solutions

Kaniz_Fatma
Community Manager

Hi @shadowinc,

  1. Creation of Temporary Views:

  2. Evaluation of Expressions:

    • Databricks SQL: Databricks SQL evaluates expressions (such as uuid()) once during view creation and retains the results. Subsequent queries against the view reuse these precomputed values.
    • Spark SQL: When using df.createOrReplaceTempView, expressions (including nondeterministic functions like uuid()) are re-evaluated each time you query the view. This dynamic evaluation can lead to different results, especially if the underlying data changes between queries.
  3. Use Cases:

    • Databricks SQL: Ideal for scenarios where you want to create views from files (e.g., CSV, Parquet) or share views across notebooks within the same cluster.
    • Spark SQL: Useful when you need to work with DataFrames and want to export them as tables for data processing. It’s more flexible but may involve the recomputation of expressions.
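If re-evaluation is the concern, a common Spark-side pattern (a sketch, not verified against your environment) is to materialize the DataFrame once, for example with df.cache() followed by an action, or by writing it to a table, before joining. A minimal pure-Python analogy of "materialize once, reuse many times":

```python
import uuid

source_rows = [{"pk": 1}, {"pk": 2}]

def read_data_view():
    # Lazy: re-evaluates the nondeterministic id on each read.
    return [{**row, "id": str(uuid.uuid4())} for row in source_rows]

# Materialize once (analogous to df.cache() plus an action, or
# writing the DataFrame to a table) and reuse the concrete rows.
materialized = read_data_view()

first_read = {r["pk"]: r["id"] for r in materialized}
second_read = {r["pk"]: r["id"] for r in materialized}
assert first_read == second_read  # ids are now stable across reads
```

Note that caching in Spark is best-effort (a cached partition can be recomputed if it is evicted), so writing the generated ids to a table is the most robust way to freeze them.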

Feel free to explore the provided resources for deeper insights! 😊🚀

 


2 REPLIES 2


Thanks, @Kaniz_Fatma. I suspected that but could not find any links confirming it.
