04-23-2023 01:24 PM
spark.sql("CREATE TABLE integrated.TrailingWeeks(ID bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 0 increment by 1) ,Week_ID int NOT NULL) USING delta OPTIONS (path 'dbfs:/<Path in Azure datalake>/delta')")
04-24-2023 11:42 AM
Hi @Shubhendu Das, Thank you for contacting us about the identity column values in your Databricks Delta table. I understand the IDs are not starting at 0 or incrementing by 1 as expected.
Databricks Delta Lake does not guarantee consecutive identity values because of its distributed nature. When creating an identity column in a Delta table, the values generated are unique across the table but may not be consecutive.
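To see why gaps appear, here is a toy plain-Python illustration (not Delta's actual allocation algorithm): each parallel writer task reserves its own block of identity values, so the IDs written are unique but leave gaps whenever a task uses fewer IDs than it reserved. The block size and task layout below are hypothetical.

```python
BLOCK_SIZE = 10  # hypothetical number of IDs reserved per task

def assign_ids(rows_per_task):
    """Give each task a reserved ID block; return all assigned IDs."""
    ids = []
    for task_index, n_rows in enumerate(rows_per_task):
        start = task_index * BLOCK_SIZE  # start of this task's reserved block
        ids.extend(range(start, start + n_rows))
    return ids

ids = assign_ids([3, 2])  # two tasks writing 3 and 2 rows
print(ids)  # [0, 1, 2, 10, 11] — unique, but not consecutive
```

Every ID is unique, but the jump from 2 to 10 shows why you cannot rely on consecutive values.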
However, if you require consecutive identity values and can tolerate a performance trade-off, you may consider using window functions to generate the identity column values.
Here's an example of how you could achieve this:
# Create the table without an identity column
spark.sql("""
CREATE TABLE integrated.TrailingWeeks (
  Week_ID int NOT NULL
) USING delta OPTIONS (path 'dbfs:/<Path in Azure Data Lake>/delta')""")

# Load the data from your source
spark.sql("""
INSERT INTO integrated.TrailingWeeks (Week_ID)
SELECT Week_ID
FROM <your_data>""")

# Generate consecutive IDs starting at 0 with a window function
spark.sql("""
SELECT ROW_NUMBER() OVER (ORDER BY Week_ID) - 1 AS ID, Week_ID
FROM integrated.TrailingWeeks""")
This method will have performance implications, especially for large data sets, as it may involve shuffling data across the cluster. If your use case can tolerate non-consecutive identity values, it's recommended to stick with the default behaviour provided by Delta Lake.
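The numbering logic of that `ROW_NUMBER() OVER (ORDER BY Week_ID) - 1` expression can be sketched in plain Python: sort the rows by `Week_ID` and number them consecutively from 0. The sample `Week_ID` values below are hypothetical.

```python
week_ids = [202315, 202301, 202308]  # sample data (hypothetical values)

# Sort by Week_ID, then enumerate from 0 — same result as
# ROW_NUMBER() OVER (ORDER BY Week_ID) - 1 in the SQL above.
numbered = [(i, w) for i, w in enumerate(sorted(week_ids))]
print(numbered)  # [(0, 202301), (1, 202308), (2, 202315)]
```

The global sort implied by the `ORDER BY` is exactly what forces the cross-cluster shuffle mentioned above.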