Identity column value of Databricks delta table is not started with 0 and increaed by 1. It always started with something like 1 or 2 and increased by 2. Below is the sample code and any logical input here is appreciated

SDas1 — Sun, 23 Apr 2023 20:24:57 GMT

spark.sql("CREATE TABLE integrated.TrailingWeeks(ID bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 0 increment by 1) ,Week_ID int NOT NULL) USING delta OPTIONS (path 'dbfs:/<Path in Azure datalake>/delta')")

Re: Identity column value of Databricks delta table is not started with 0 and increaed by 1. It alwa

loki_9191 — Tue, 29 Oct 2024 12:10:32 GMT

hi,

I’m experiencing the same issue. Have you found a solution for it?

Re: Identity column value of Databricks delta table is not started with 0 and increaed by 1. It alwa

agallard — Tue, 29 Oct 2024 13:20:07 GMT

Hi,

When you define an identity column in Databricks with GENERATED BY DEFAULT AS IDENTITY (START WITH 0 INCREMENT BY 1), it is expected to start at 0 and increment by 1. However, due to Databricks' distributed architecture, the values may not be strictly sequential (especially when parallel tasks are writing to the table). This is because identity columns are managed at the Spark partition level rather than globally across the entire dataset, leading to increments that may be greater than 1.

Here's how you can address this:

Confirm Increment Across Partitions: If you want truly sequential IDs, it's often challenging in distributed environments. Instead, you might consider using a different mechanism to ensure unique IDs if strict sequencing isn’t essential.
Use Monotonically Increasing ID: If you need a sequential ID within each partition, you can generate IDs with monotonically_increasing_id() from PySpark. This won’t provide strict sequencing across all rows but will ensure uniqueness.
Create a Single-Threaded Job for ID Assignment: If you need a strict sequence, another approach is to run a single-threaded job or use batch processing that guarantees ID continuity. This could be achieved by collecting the data on a single machine, assigning the IDs, and writing it back to Delta Lake.
Example Modification for Sequential IDs: Here’s an example using monotonically_increasing_id:

from pyspark.sql.functions import monotonically_increasing_id df = (spark.read.format("delta").load("dbfs:/<Path in Azure datalake>/delta") .withColumn("ID", monotonically_increasing_id())) df.write.format("delta").mode("overwrite").save("dbfs:/<Path in Azure datalake>/delta")

Consider Global Sequence Management: For larger pipelines, Databricks does not support sequences as in RDBMS systems. If a strict, auto-incrementing ID sequence is required, you might need an external service or table that maintains a global sequence counter.

These are practical alternatives try it and let us know how it goes.

Regards

topic Identity column value of Databricks delta table is not started with 0 and increaed by 1. It always started with something like 1 or 2 and increased by 2. Below is the sample code and any logical input here is appreciated in Data Engineering

Identity column value of Databricks delta table is not started with 0 and increaed by 1. It always started with something like 1 or 2 and increased by 2. Below is the sample code and any logical input here is appreciated

Re: Identity column value of Databricks delta table is not started with 0 and increaed by 1. It alwa

Re: Identity column value of Databricks delta table is not started with 0 and increaed by 1. It alwa