Data Engineering
The identity column of a Databricks Delta table does not start at 0 and increment by 1. It always starts at something like 1 or 2 and increases by 2. Below is the sample code; any logical input here is appreciated.

SDas1
New Contributor

spark.sql("CREATE TABLE integrated.TrailingWeeks(ID bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 0 increment by 1) ,Week_ID int NOT NULL) USING delta OPTIONS (path 'dbfs:/<Path in Azure datalake>/delta')")

1 REPLY

Kaniz
Community Manager

Hi @Shubhendu Das, thank you for contacting us about your concern with the identity column values in your Databricks Delta table. I understand the values are not starting at 0 or incrementing by 1 as expected.

Databricks Delta Lake does not guarantee consecutive identity values because of its distributed nature. When you create an identity column in a Delta table, each writer task reserves its own block of identity values, so the generated values are unique across the table but may contain gaps or step by more than the configured increment.
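For example, a plain insert into the table from your question can already surface this (a minimal sketch; the Week_ID values are made up, and the exact IDs you see will vary by cluster):

# Insert a few rows; Delta fills in the ID column automatically.
spark.sql("""
INSERT INTO integrated.TrailingWeeks (Week_ID)
VALUES (202301), (202302), (202303)""")

# The generated IDs are unique but may come back as e.g. 2, 4, 6
# rather than 0, 1, 2, because each task reserves its own block of values.
spark.sql("SELECT ID, Week_ID FROM integrated.TrailingWeeks ORDER BY ID").show()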

However, if you require consecutive identity values and can tolerate a performance trade-off, you may consider using window functions to generate the identity column values.

Here's an example of how you could achieve this:

  • First, create the table without the identity column:
spark.sql("""
CREATE TABLE integrated.TrailingWeeks (
    Week_ID int NOT NULL
) USING delta OPTIONS (path 'dbfs:/<Path in Azure Data Lake>/delta')""")
  • Next, insert your data into the table. Replace <your_data> with the appropriate data source:
spark.sql("""
INSERT INTO integrated.TrailingWeeks (Week_ID)
SELECT Week_ID
FROM <your_data>""")
  • Finally, use a window function to generate the identity column values (a sketch for persisting the result back into the table follows this list):
spark.sql("""
SELECT ROW_NUMBER() OVER (ORDER BY Week_ID) - 1 as ID, Week_ID
FROM integrated.TrailingWeeks""")

This method has performance implications, especially for large data sets, because ordering the rows for ROW_NUMBER may involve shuffling data across the cluster. If your use case can tolerate non-consecutive identity values, it's recommended to stick with the default behaviour provided by Delta Lake.
