Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

The identity column of a Databricks Delta table does not start with 0 and increase by 1. It always starts with something like 1 or 2 and increases by 2. Below is the sample code; any logical input here is appreciated.

SDas1
New Contributor

spark.sql("CREATE TABLE integrated.TrailingWeeks(ID bigint GENERATED BY DEFAULT AS IDENTITY (START WITH 0 increment by 1) ,Week_ID int NOT NULL) USING delta OPTIONS (path 'dbfs:/<Path in Azure datalake>/delta')")

2 REPLIES

loki_9191
New Contributor II

Hi,

I’m experiencing the same issue. Have you found a solution for it?

agallardrivilla
New Contributor II

Hi,

When you define an identity column in Databricks with GENERATED BY DEFAULT AS IDENTITY (START WITH 0 INCREMENT BY 1), you might expect values to start at 0 and increase by exactly 1. However, because of Databricks' distributed architecture, identity values are allocated to the tasks writing the table in blocks rather than handed out one at a time from a single global counter. Delta guarantees that the values are unique and increase in the direction of the step, but not that they are consecutive, which is why you see gaps and a starting value other than 0.

Here's how you can address this:

  1. Accept Non-Sequential IDs Where Possible: Truly sequential IDs are hard to guarantee in a distributed environment. If strict sequencing isn't essential, keep the identity column for uniqueness and tolerate the gaps, or use a different mechanism that only guarantees unique IDs.

  2. Use Monotonically Increasing ID: If you need a sequential ID within each partition, you can generate IDs with monotonically_increasing_id() from PySpark. This won’t provide strict sequencing across all rows but will ensure uniqueness.

  3. Create a Single-Threaded Job for ID Assignment: If you need a strict sequence, another approach is to run a single-threaded job or batch step that guarantees ID continuity, for example by bringing the data onto a single task, assigning the IDs there, and writing the result back to Delta Lake (see the row_number() sketch after the example below).

  4. Example Modification for Sequential IDs: Here's an example using monotonically_increasing_id():
 

from pyspark.sql.functions import monotonically_increasing_id

# Read the existing Delta table, add an ID column that is unique and increasing
# (but not consecutive across partitions), and overwrite the table in place.
df = (spark.read.format("delta").load("dbfs:/<Path in Azure datalake>/delta")
      .withColumn("ID", monotonically_increasing_id()))
df.write.format("delta").mode("overwrite").save("dbfs:/<Path in Azure datalake>/delta")

 

  5. Consider Global Sequence Management: For larger pipelines, note that Databricks does not support sequences the way traditional RDBMS systems do. If a strict, auto-incrementing ID sequence is required, you might need an external service or a small table that maintains a global sequence counter (a rough sketch follows).
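
Here is one rough sketch of such a counter. All names (integrated.id_counter, the next_id column, the staging path) are hypothetical, and it assumes only one job writes at a time:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One-time setup: a one-row Delta table holds the next ID to hand out.
spark.sql("CREATE TABLE IF NOT EXISTS integrated.id_counter (next_id BIGINT) USING delta")
if spark.table("integrated.id_counter").count() == 0:
    spark.sql("INSERT INTO integrated.id_counter VALUES (0)")

# Per batch: read the counter, assign consecutive IDs starting from it, append, then advance.
start = spark.table("integrated.id_counter").first()["next_id"]
batch = spark.read.format("delta").load("dbfs:/<Path in Azure datalake>/staging")
batch_count = batch.count()

batch_with_ids = batch.withColumn(
    "ID", F.row_number().over(Window.orderBy("Week_ID")) - 1 + start
)
batch_with_ids.write.format("delta").mode("append").save("dbfs:/<Path in Azure datalake>/delta")

# Advance the counter; concurrent writers would need MERGE/locking logic not shown here.
spark.sql(f"UPDATE integrated.id_counter SET next_id = {start + batch_count}")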

These are practical alternatives; try them and let us know how it goes.

Regards

 

Alfonso Gallardo
-------------------
 I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark
