Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

The identity column of a Databricks Delta table does not start at 0 and increase by 1. It always starts at something like 1 or 2 and increases by 2. Below is the sample code; any logical input here is appreciated.

SDas1
New Contributor

spark.sql("""
CREATE TABLE integrated.TrailingWeeks (
    ID bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 0 INCREMENT BY 1),
    Week_ID int NOT NULL
) USING delta OPTIONS (path 'dbfs:/<Path in Azure datalake>/delta')
""")

1 REPLY

Kaniz_Fatma
Community Manager

Hi @Shubhendu Das​, thank you for reaching out about the identity column values in your Databricks Delta table. I understand the values are not starting at 0 or incrementing by 1 as expected.

Databricks Delta Lake does not guarantee consecutive identity values because of its distributed nature. When creating an identity column in a Delta table, the values generated are unique across the table but may not be consecutive.
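To build intuition for why gaps appear, here is a rough, hypothetical model in plain Python: each writer task reserves its own block of identity values, so the committed IDs are unique but not consecutive. The block size and function names here are illustrative only, not Delta Lake's actual internals.

```python
BLOCK = 10  # hypothetical reservation size per writer task

def assign_ids(partitions, start=0, step=1):
    """Toy model: give each partition its own reserved block of identity values."""
    out = []
    for p, rows in enumerate(partitions):
        base = start + p * BLOCK * step   # each task starts at its own block
        out.extend(base + i * step for i, _ in enumerate(rows))
    return out

# Two partitions with 3 and 2 rows: IDs are unique, but 3..9 are skipped.
ids = assign_ids([["a", "b", "c"], ["d", "e"]])
print(ids)  # → [0, 1, 2, 10, 11]
```

If a task reserves a block and then writes fewer rows than the block holds (or its write fails and is retried), the unused values are simply never issued, which is why real tables can start above 0 or skip numbers.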

However, if you require consecutive identity values and can tolerate a performance trade-off, you may consider using window functions to generate the identity column values.

Here's an example of how you could achieve this:

  • First, create the table without the identity column:

spark.sql("""
CREATE TABLE integrated.TrailingWeeks (
    Week_ID int NOT NULL
) USING delta OPTIONS (path 'dbfs:/<Path in Azure Data Lake>/delta')
""")

  • Next, insert your data into the table. Replace <your_data> with the appropriate data source:

spark.sql("""
INSERT INTO integrated.TrailingWeeks (Week_ID)
SELECT Week_ID
FROM <your_data>
""")

  • Finally, use a window function to generate 0-based, consecutive identity values:

spark.sql("""
SELECT ROW_NUMBER() OVER (ORDER BY Week_ID) - 1 AS ID, Week_ID
FROM integrated.TrailingWeeks
""")
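For intuition, the numbering the window query produces is equivalent to this plain-Python sketch (toy data, no Spark required): sort by the ordering column, then number rows starting at 0.

```python
# Mirrors ROW_NUMBER() OVER (ORDER BY Week_ID) - 1 on a small sample.
week_ids = [202310, 202308, 202309]

# Sort by Week_ID, then pair each row with its 0-based position.
numbered = [(i, w) for i, w in enumerate(sorted(week_ids))]
print(numbered)  # → [(0, 202308), (1, 202309), (2, 202310)]
```

Unlike the identity column, this numbering is computed over the whole ordered dataset at query time, which is exactly why it is consecutive and also why it requires a global sort.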

This method will have performance implications, especially for large data sets, as it may involve shuffling data across the cluster. If your use case can tolerate non-consecutive identity values, it's recommended to stick with the default behaviour provided by Delta Lake.
