Hi,
When you define an identity column in Databricks with GENERATED BY DEFAULT AS IDENTITY (START WITH 0 INCREMENT BY 1), the automatically generated values start at 0 and are unique, but they are not guaranteed to be consecutive. Because of Databricks' distributed architecture, each task writing to the table reserves its own range of identity values rather than drawing from a single global counter, so gaps larger than the declared increment can appear, especially when parallel tasks write to the table.
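For reference, here is a minimal sketch of the kind of table definition being discussed (the schema, table, and column names are illustrative, not taken from your setup; this assumes it runs in a Databricks notebook where spark is already available):
spark.sql("""
    CREATE TABLE IF NOT EXISTS my_schema.events (
        ID BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 0 INCREMENT BY 1),
        payload STRING
    )
    USING DELTA
""")
Even with this definition, the generated ID values can contain gaps under parallel writes, for the reason described above.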
Here's how you can address this:
Decide Whether Strict Sequencing Is Essential: Truly consecutive IDs are hard to guarantee in a distributed environment. If uniqueness is all you actually need, the identity column (or another unique-ID mechanism) is usually sufficient and the simplest option.
Use monotonically_increasing_id(): PySpark's monotonically_increasing_id() generates IDs that are unique and increasing, and consecutive within each partition, but it does not provide strict sequencing across all rows.
Create a Single-Threaded Job for ID Assignment: If you need a strict sequence, run the ID assignment as a single-task step: funnel the rows through one task (or collect them on the driver), assign the IDs there, and write the result back to Delta Lake (see the sketch after this list).
Example Using monotonically_increasing_id: Here's how to add a unique (but not strictly sequential) ID column:
from pyspark.sql.functions import monotonically_increasing_id

# IDs are unique and increasing; consecutive within a partition, but not across partitions.
df = (spark.read.format("delta").load("dbfs:/<Path in Azure datalake>/delta")
      .withColumn("ID", monotonically_increasing_id()))

# Overwrite the table in place with the new ID column.
df.write.format("delta").mode("overwrite").save("dbfs:/<Path in Azure datalake>/delta")
Consider Global Sequence Management: Databricks does not support sequence objects the way an RDBMS does. If a strict, auto-incrementing sequence is required across a larger pipeline, you may need an external service or a small counter table that tracks the last ID handed out (a sketch follows below).
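For the single-task approach in point 3, here is a minimal sketch, assuming the data volume is small enough that routing every row through one task is acceptable (the _order column is just a temporary helper introduced for this example):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

path = "dbfs:/<Path in Azure datalake>/delta"

# Tag each row first, then number the rows with a window that has no partitionBy.
# Spark will warn that all data moves to a single partition; that single task is
# exactly what makes the numbering strictly sequential.
df = (spark.read.format("delta").load(path)
      .withColumn("_order", F.monotonically_increasing_id()))

w = Window.orderBy("_order")
df_with_ids = (df.withColumn("ID", F.row_number().over(w) - 1)  # row_number starts at 1; shift to start at 0
               .drop("_order"))

df_with_ids.write.format("delta").mode("overwrite").save(path)
Keep in mind this trades away parallelism, so it only fits modest data volumes or an occasional batch step.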
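And for the counter-table idea in point 5, a rough sketch, assuming a small Delta table at a hypothetical path that stores the last ID handed out and a new_df DataFrame holding the incoming batch (both names are made up for illustration; this is not a built-in Databricks feature):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

counter_path = "dbfs:/<Path in Azure datalake>/id_counter"   # hypothetical counter table
target_path = "dbfs:/<Path in Azure datalake>/delta"

# 1. Read the last ID handed out so far (the counter table holds a single row: last_id BIGINT).
last_id = spark.read.format("delta").load(counter_path).collect()[0]["last_id"]

# 2. Number the incoming batch on top of that offset, again via a single-partition window.
batch = new_df.withColumn("_order", F.monotonically_increasing_id())
w = Window.orderBy("_order")
numbered = (batch.withColumn("ID", F.row_number().over(w) + last_id)
            .drop("_order"))

# 3. Append the batch and advance the counter.
numbered.write.format("delta").mode("append").save(target_path)
new_last = numbered.agg(F.max("ID").alias("m")).collect()[0]["m"]
spark.createDataFrame([(new_last,)], "last_id BIGINT") \
    .write.format("delta").mode("overwrite").save(counter_path)
Note that this read-then-update pattern is not safe if several jobs write concurrently; for that you would need some form of locking or a true external sequence service.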
These are practical alternatives; try one and let us know how it goes.
Regards
Alfonso Gallardo
-------------------
I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark.