Hi @Shubhendu Das, Thank you for reaching out about the identity column values in your Databricks Delta table. I understand the values are not starting at 0 or incrementing by 1 as expected.
Databricks Delta Lake does not guarantee consecutive identity values because of its distributed nature. When you create an identity column in a Delta table, the generated values are guaranteed to be unique across the table, but they may contain gaps and need not be consecutive.
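To build intuition for why gaps appear, here is a simplified plain-Python sketch (not Delta's actual internals) of block-based ID allocation: each writer task reserves its own block of values, so IDs stay unique across tasks, but any task that writes fewer rows than its block size leaves a gap. The block size and partition data below are made up for illustration.

```python
# Hypothetical illustration of block-based identity allocation.
# Each task reserves a fixed-size block of values; unused values in a
# block become gaps in the final sequence.
BLOCK_SIZE = 10  # assumed block size, chosen only for this sketch

def assign_ids(partitions):
    """Assign identity values to rows, one reserved block per task."""
    ids = []
    for task_index, rows in enumerate(partitions):
        start = task_index * BLOCK_SIZE  # this task's reserved block
        ids.extend(range(start, start + len(rows)))
    return ids

# Three tasks writing 3, 2, and 4 rows respectively.
partitions = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]
ids = assign_ids(partitions)
print(ids)  # [0, 1, 2, 10, 11, 20, 21, 22, 23] -- unique, but with gaps
```

The values are all distinct, yet 3-9 and 12-19 never appear, which mirrors the behaviour you are seeing in the table.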
However, if you require consecutive identity values and can tolerate a performance trade-off, you may consider using a window function to generate the identity column values.
Here's an example of how you could achieve this:
- First, create the table without the identity column:
spark.sql("""
CREATE TABLE integrated.TrailingWeeks (
Week_ID int NOT NULL
) USING delta OPTIONS (path 'dbfs:/<Path in Azure Data Lake>/delta')""")
- Next, insert your data into the table. Replace <your_data> with the appropriate data source:
spark.sql("""
INSERT INTO integrated.TrailingWeeks (Week_ID)
SELECT Week_ID
FROM <your_data>""")
- Finally, use a Window function to generate the identity column values:
spark.sql("""
SELECT ROW_NUMBER() OVER (ORDER BY Week_ID) - 1 as ID, Week_ID
FROM integrated.TrailingWeeks""")
This method has performance implications, especially for large data sets, because the global ORDER BY forces data to be shuffled across the cluster to establish a total ordering. If your use case can tolerate non-consecutive identity values, I recommend sticking with the default behaviour provided by Delta Lake.