Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

The identity column of a Databricks Delta table does not start with 0 and increment by 1. It always starts with something like 1 or 2 and increases by 2. Below is the sample code; any logical input here is appreciated.

SDas1
New Contributor

spark.sql("CREATE TABLE integrated.TrailingWeeks (ID BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 0 INCREMENT BY 1), Week_ID INT NOT NULL) USING delta OPTIONS (path 'dbfs:/<Path in Azure datalake>/delta')")

2 REPLIES

loki_9191
New Contributor II

Hi,

I’m experiencing the same issue. Have you found a solution for it?

agallard
Contributor

Hi,

When you define an identity column in Databricks with GENERATED BY DEFAULT AS IDENTITY (START WITH 0 INCREMENT BY 1), you might expect it to start at 0 and increment by 1. However, because of Databricks' distributed architecture, the values are not strictly sequential, especially when parallel tasks are writing to the table. Identity values are allocated in blocks at the Spark partition level rather than globally across the entire dataset, so the column guarantees unique, increasing values but not consecutive ones; gaps and steps greater than 1 are expected behavior.
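
You can see this for yourself with a small demo (the demo.identity_gaps table name below is hypothetical, just for illustration):

# Hypothetical demo table; the name is just for illustration.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.identity_gaps (
      ID BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 0 INCREMENT BY 1),
      val STRING
    ) USING delta
""")

# Each INSERT may be handled by a different task, so the generated IDs
# can start above 0 and skip values between batches.
spark.sql("INSERT INTO demo.identity_gaps (val) VALUES ('a'), ('b')")
spark.sql("INSERT INTO demo.identity_gaps (val) VALUES ('c'), ('d')")

# IDs come back unique and increasing, but not necessarily 0, 1, 2, 3.
spark.sql("SELECT ID, val FROM demo.identity_gaps ORDER BY ID").show()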

Here's how you can address this:

  1. Accept Gaps When Strict Sequencing Isn't Essential: Truly sequential IDs are hard to guarantee in a distributed environment. If you mainly need uniqueness, keep the identity column and treat its values as unique keys rather than row numbers.

  2. Use monotonically_increasing_id(): If you need unique IDs generated in parallel, you can use monotonically_increasing_id() from PySpark. Values are sequential within each partition and unique across all rows, but there will be large jumps between partitions.

  3. Create a Single-Threaded Job for ID Assignment: If you need a strict sequence, another approach is to run a single-threaded job or batch process that guarantees ID continuity, for example by collecting the data onto a single task, assigning the IDs, and writing the result back to Delta Lake (see the row_number() sketch after the example below).

  4. Example Using monotonically_increasing_id(): Here's how you could rewrite the table with a unique ID column:

from pyspark.sql.functions import monotonically_increasing_id

# Read the existing Delta table and add a unique (but not consecutive) ID.
df = (spark.read.format("delta").load("dbfs:/<Path in Azure datalake>/delta")
      .withColumn("ID", monotonically_increasing_id()))

# overwriteSchema is needed because adding the ID column changes the schema.
(df.write.format("delta")
   .mode("overwrite")
   .option("overwriteSchema", "true")
   .save("dbfs:/<Path in Azure datalake>/delta"))
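
For point 3 above, if a strict gap-free sequence is really required, a common pattern is row_number() over an unpartitioned window, which forces all rows through a single task. This is only a sketch under that assumption; it reuses the path from the question and suits tables small enough for single-task processing:

from pyspark.sql.functions import row_number
from pyspark.sql.window import Window

df = spark.read.format("delta").load("dbfs:/<Path in Azure datalake>/delta")

# An unpartitioned window pulls every row into one task; that is what
# guarantees a gap-free 0, 1, 2, ... sequence, at the cost of parallelism.
w = Window.orderBy("Week_ID")
df_seq = df.withColumn("ID", row_number().over(w) - 1)  # row_number() starts at 1

# overwriteSchema in case the ID column is new to the table's schema.
(df_seq.write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .save("dbfs:/<Path in Azure datalake>/delta"))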

  5. Consider Global Sequence Management: Databricks does not support sequence objects the way traditional RDBMS systems do. If a strict, auto-incrementing ID sequence is required across runs or pipelines, you may need an external service or a dedicated table that maintains a global sequence counter.
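
A minimal sketch of that counter idea, assuming a single writer at a time (concurrent writers would need external coordination) and a hypothetical new_batch DataFrame of incoming rows with a Week_ID column:

from pyspark.sql.functions import row_number, max as spark_max
from pyspark.sql.window import Window

target_path = "dbfs:/<Path in Azure datalake>/delta"

# High-water mark from the existing table; fall back to -1 when it is empty.
current_max = (spark.read.format("delta").load(target_path)
               .agg(spark_max("ID")).collect()[0][0])
offset = (current_max if current_max is not None else -1) + 1

# new_batch is a hypothetical DataFrame of rows to append.
w = Window.orderBy("Week_ID")
with_ids = new_batch.withColumn("ID", row_number().over(w) + offset - 1)

with_ids.write.format("delta").mode("append").save(target_path)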

These are practical alternatives; try one and let us know how it goes.

Regards

 

Alfonso Gallardo
-------------------
 I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark
