Hi @Shubhendu Das, Thank you for reaching out about the identity column values in your Databricks Delta table. I understand the values are not starting at 0 or incrementing by 1 as expected.
Databricks Delta Lake does not guarantee consecutive identity values because of its distributed nature. When you create an identity column in a Delta table, the generated values are guaranteed to be unique across the table, but they may contain gaps and need not be consecutive.
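To build intuition for why gaps appear, here is a simplified plain-Python sketch (not Delta's actual internals) of block-based ID allocation: each writer task reserves its own block of values, so IDs stay unique across tasks, but any task that writes fewer rows than its block size leaves a gap. The block size and partition data below are made up for illustration.

```python
# Hypothetical illustration of block-based identity allocation.
# Each task reserves a fixed-size block of values; unused values in a
# block become gaps in the final sequence.
BLOCK_SIZE = 10  # assumed block size, chosen only for this sketch

def assign_ids(partitions):
    """Assign identity values to rows, one reserved block per task."""
    ids = []
    for task_index, rows in enumerate(partitions):
        start = task_index * BLOCK_SIZE  # this task's reserved block
        ids.extend(range(start, start + len(rows)))
    return ids

# Three tasks writing 3, 2, and 4 rows respectively.
partitions = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
ids = assign_ids(partitions)
print(ids)  # [0, 1, 2, 10, 11, 20, 21, 22, 23] -- unique, but with gaps
```

The values are all distinct, yet 3-9 and 12-19 never appear, which mirrors the behaviour you are seeing in the table.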
However, if you require consecutive identity values and can tolerate a performance trade-off, you may consider using a window function to generate the identity column values.
Here's an example of how you could achieve this:
- First, create the table without the identity column:
spark.sql("""
CREATE TABLE integrated.TrailingWeeks (
Week_ID int NOT NULL
) USING delta OPTIONS (path 'dbfs:/<Path in Azure Data Lake>/delta')""")
- Next, insert your data into the table. Replace <your_data> with the appropriate data source:
spark.sql("""
INSERT INTO integrated.TrailingWeeks (Week_ID)
SELECT Week_ID
FROM <your_data>""")
- Finally, use a Window function to generate the identity column values:
spark.sql("""
SELECT ROW_NUMBER() OVER (ORDER BY Week_ID) - 1 as ID, Week_ID
FROM integrated.TrailingWeeks""")
This method has performance implications, especially for large data sets, because the global ORDER BY forces data to be shuffled across the cluster to establish a total ordering. If your use case can tolerate non-consecutive identity values, I recommend sticking with the default behaviour provided by Delta Lake.