Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Does Delta Live Tables support identity columns?

karthik_p
Esteemed Contributor

We are able to test identity columns using SQL/Python, but when we try the same thing using DLT, we are not seeing values under the identity column. It is always empty for the column we created: "id BIGINT GENERATED ALWAYS AS IDENTITY".
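For reference, a minimal non-DLT sketch of the baseline that works (the table name is hypothetical; the identity syntax is standard Delta Lake DDL):

# Plain Delta (non-DLT) baseline: identity values are assigned on insert.
# 'demo_identity' is a hypothetical table name.
spark.sql("""
    CREATE TABLE demo_identity (
        id BIGINT GENERATED ALWAYS AS IDENTITY,
        value STRING
    ) USING DELTA
""")
spark.sql("INSERT INTO demo_identity (value) VALUES ('a'), ('b')")
spark.table("demo_identity").show()  # the 'id' column is populated automatically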

 


6 REPLIES

Kaniz_Fatma
Community Manager

Hi @karthik_p, When working with Delta Live Tables (DLT), implementing an identity column (similar to an auto-incrementing primary key) can be achieved using Python syntax.

Letโ€™s dive into how you can set up an identity column in DLT using Python:

 

  1. First, ensure you have the necessary DLT library imported in your Python environment.
  2. Define your table schema, including the identity column. For example, letโ€™s say we have a sales table with the following columns:
    • customer_id (STRING)
    • customer_name (STRING)
    • number_of_line_items (STRING)
    • order_datetime (STRING)
    • order_number (LONG)
    • order_day_of_week (STRING) โ€“ This column will be generated as the day of the week from the order_datetime.

You can find more details and examples in the official Databricks documentation.
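Since the question is specifically about identity columns, here is a rough sketch of that case on a streaming table (a minimal sketch: the table and source names are hypothetical, and it assumes the DLT Python schema argument accepts Delta identity-column syntax):

import dlt

@dlt.table(
    comment="Streaming table with an auto-incrementing surrogate key",
    schema="""
        id BIGINT GENERATED ALWAYS AS IDENTITY,
        customer_id STRING,
        order_number BIGINT
    """
)
def orders_with_id():
    # Do not supply 'id' in the source data; Delta assigns it on ingestion.
    return spark.readStream.table("raw_orders")  # hypothetical source table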

Remember to consider whether you’re dealing with a streaming table or not, as this can impact how the identity values are generated. Happy DLT-ing! 🚀

karthik_p
Esteemed Contributor

Hi @Kaniz_Fatma, the above example is for a materialized view, and it also uses the STRING type. As per the DLT documentation, identity columns only support:

1. streaming use cases

2. SQL syntax; streaming is not supported

I am not seeing any Python example to add a column on the fly or update one. Also, streaming uses schema inference; I tried to use mergeSchema but still hit the same issue. If we are able to change the schema on the fly in Python, that should confirm identity column support.

Hi @karthik_p, I apologize for any confusion. Let's address your concerns regarding Delta Lake and identity columns in the context of Python.

Identity Columns in Delta Lake:

  • Identity columns are a powerful feature for generating surrogate keys in Delta Lake.
  • However, it's essential to note that identity columns are primarily designed for batch processing scenarios, not streaming use cases.
  • Streaming scenarios have complexities, and identity columns may not directly apply there.

Python Example for Identity Columns in Delta Lake:

  • Let's create a simple example to demonstrate column generation in Delta Lake using Python.
  • Suppose we have a sales table with columns like customer_id, customer_name, number_of_line_items, order_datetime, and order_number.
  • We'll add a generated column called order_day_of_week, computed as the day of the week from order_datetime. (Strictly speaking, this is a generated column rather than an auto-incrementing identity column, though both use the GENERATED ALWAYS AS syntax.)

# Example schema definition with a generated column
import dlt

@dlt.table(
    comment="Raw data on sales",
    schema="""
        customer_id STRING,
        customer_name STRING,
        number_of_line_items STRING,
        order_datetime STRING,
        order_number LONG,
        order_day_of_week INT GENERATED ALWAYS AS (dayofweek(order_datetime))
    """,
    partition_cols=["order_day_of_week"]
)
def sales():
    return spark.read.table("raw_sales")  # hypothetical source; replace with your data loading logic

# New rows inserted into 'sales' get 'order_day_of_week' computed automatically.
# Note: a generated column is derived from other columns; it does not auto-increment.
Important Considerations:

  • Be cautious about how identity columns are generated, especially when dealing with streaming tables.
  • Refer to the Delta Live Tables Python reference for guidance on schema definitions and identity columns in streaming scenarios.

Remember that identity columns provide a convenient way to generate surrogate keys, but their usage should align with the specific requirements of your data processing workflows.

Feel free to ask if you encounter any further issues or need additional assistance! ๐Ÿš€

Hi Kaniz,

I tried the solution above with streaming tables. I defined sales as returning a streaming table, but the identity columns are not populated. They always show as null. Any help is much appreciated.

karthik_p
Esteemed Contributor

@Kaniz_Fatma thank you for the quick response. We are able to generate values for both streaming tables and materialized views. The only confusion I am seeing is with the limitations mentioned in the DLT documentation:

Identity columns are not supported with tables that are the target of APPLY CHANGES INTO and might be recomputed during updates for materialized views. For this reason, Databricks recommends only using identity columns with streaming tables in Delta Live Tables. See Use identity columns in Delta Lake.

Based on the above limitation, does this mean we need to use identity columns only for streaming use cases? When we tested creating materialized views, it also worked (data from data frames is passed into the MV).

Hi @karthik_p, Letโ€™s dive into the nuances of Delta Live Tables (DLT) and its limitations regarding identity columns.

 

Identity Columns and DLT:

  • Identity columns are unique, auto-incrementing columns that assign a new value to each record inserted into a table. However, there are some considerations when using identity columns with DLT:
    • Supported Use Case: DLT recommends using identity columns primarily with streaming tables.
    • Not Supported for APPLY CHANGES INTO: Identity columns are not supported with tables that are the target of APPLY CHANGES INTO operations. This means that if you intend to use DLT for incremental updates (via APPLY CHANGES INTO), avoid using identity columns in those target tables.
    • Materialized Views (MV): While you mentioned that you were able to create materialized views using data from data frames, itโ€™s essential to be aware of the potential recomputation issue. Identity columns in materialized views might be recomputed during updates, which could impact the correctness of your MVs.

Materialized Views and Identity Columns:

  • Despite the limitation mentioned, you were able to create materialized views successfully. This is because DLT allows you to define materialized views using data frames, even though the underlying tables may have identity columns.
  • However, keep in mind that the recomputation behavior might affect the consistency of your materialized views. If the identity columns are updated, it could lead to unexpected results in the MVs.

Considerations:

  • Streaming Use Cases: If your primary use case involves streaming data, DLT with identity columns can work well.
  • Materialized Views: While itโ€™s possible to create MVs, be cautious about the recomputation behavior. Ensure that the identity columnsโ€™ updates wonโ€™t compromise the correctness of your MVs.

In summary, DLT’s recommendation to use identity columns primarily with streaming tables doesn’t mean you can’t create materialized views. However, be aware of the limitations and potential implications when using identity columns in materialized views. Always test thoroughly to ensure the desired behavior in your specific scenarios. 🚀
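To make the APPLY CHANGES INTO limitation concrete, here is a rough sketch (all table, source, and column names are hypothetical): the CDC target below should not declare an identity column, while an ordinary streaming table, as in the earlier example, can.

import dlt

# CDC target: leave identity columns out of this table's schema.
dlt.create_streaming_table(
    name="customers_scd",
    comment="Target of APPLY CHANGES INTO; identity columns are unsupported here"
)

# Apply incremental changes from a hypothetical CDC feed.
dlt.apply_changes(
    target="customers_scd",
    source="customers_cdc_feed",
    keys=["customer_id"],
    sequence_by="event_ts",
    stored_as_scd_type=1
)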
