Hi @Shubhendu Das, thank you for contacting us about the identity column values in your Databricks Delta table. I understand the values are not starting at 0 or incrementing by 1 as expected.
Databricks Delta Lake does not guarantee consecutive identity values because of its distributed nature: each writer task reserves a block of values, so the identity values generated are unique across the table but may contain gaps.
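For reference, this is the standard identity-column declaration in Delta Lake; the table name here is illustrative, not taken from your setup:

```sql
-- Illustrative table name; adjust to your own schema.
CREATE TABLE integrated.TrailingWeeks_demo (
  -- GENERATED ALWAYS AS IDENTITY guarantees uniqueness, not consecutiveness:
  -- concurrent writer tasks each reserve a range, so gaps can appear.
  ID BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 0 INCREMENT BY 1),
  Week_ID INT NOT NULL
) USING delta;
```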
However, if you require consecutive identity values and can tolerate a performance trade-off, you may consider using window functions to generate the identity column values.
Here's an example of how you could achieve this:
spark.sql(""" CREATE TABLE integrated.TrailingWeeks ( Week_ID int NOT NULL ) USING delta OPTIONS (path 'dbfs:/<Path in Azure Data Lake>/delta')""")
spark.sql(""" INSERT INTO integrated.TrailingWeeks (Week_ID) SELECT Week_ID FROM <your_data>""")
spark.sql(""" SELECT ROW_NUMBER() OVER (ORDER BY Week_ID) - 1 as ID, Week_ID FROM integrated.TrailingWeeks""")
This method has performance implications, especially for large datasets: a window ordered over the whole table without a PARTITION BY clause forces the data to be shuffled into a single partition. If your use case can tolerate non-consecutive identity values, it's recommended to stick with the default behaviour provided by Delta Lake.