
Delta table definition - Identity column

oleprince
New Contributor

Hello,

Would anyone know if it is possible to create a Delta table using Python that includes a column that is generated by default as identity (an identity column for which the inserted value can be manually overridden)?

There seems to be a way to create an identity column using Python (through the Delta Lake API, using the DeltaTableBuilder) that is generated always as identity (using the "generatedAlwaysAs" option), but I haven't found an equivalent "generatedByDefaultAs" option.
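For reference, here is roughly what the builder API supports today (a sketch; the table and column names are made up):

```python
from delta.tables import DeltaTable

# generatedAlwaysAs takes a SQL expression, not an identity specification
DeltaTable.create(spark) \
    .tableName("events") \
    .addColumn("eventTime", "TIMESTAMP") \
    .addColumn("eventDate", "DATE", generatedAlwaysAs="CAST(eventTime AS DATE)") \
    .execute()
```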

So far, the only way I've found to create a column that is generated by default as identity is through Spark SQL.

thanks 

2 REPLIES

Kaniz_Fatma
Community Manager

Hi @oleprince,

It is currently not possible to use Python's DeltaTableBuilder to create a Delta table with an identity column that is generated by default, as you can in Spark SQL. The DeltaTableBuilder API in Delta Lake does not support a "generatedByDefaultAs" option; this functionality can only be achieved using Spark SQL.

As you mentioned, you can create an identity column in Delta Lake using the generatedAlwaysAs option. This option generates a unique ID value for each row inserted into the table.

If you need to override the generated identity value with a specific value, you can insert the values into your table using Spark SQL and an explicit column list.
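To illustrate the difference (a sketch with made-up table names), a GENERATED ALWAYS AS IDENTITY column rejects explicit values, while a GENERATED BY DEFAULT AS IDENTITY column accepts them:

```python
# GENERATED ALWAYS: Delta always generates the value, so naming the
# column in an INSERT raises an analysis error
spark.sql("""
CREATE TABLE always_ids (
    id BIGINT GENERATED ALWAYS AS IDENTITY,
    name STRING
) USING DELTA
""")
# spark.sql("INSERT INTO always_ids (id, name) VALUES (7, 'x')")  # fails

# GENERATED BY DEFAULT: explicit values are accepted; omitted ones are generated
spark.sql("""
CREATE TABLE default_ids (
    id BIGINT GENERATED BY DEFAULT AS IDENTITY,
    name STRING
) USING DELTA
""")
spark.sql("INSERT INTO default_ids (id, name) VALUES (7, 'x')")  # overrides
spark.sql("INSERT INTO default_ids (name) VALUES ('y')")         # generated
```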

As you mentioned, Delta provides a way to add an identity column to a Delta table. However, the Python API does not provide an option to specify an identity column that allows the inserted value to be manually overridden.

To create an identity column that allows the inserted value to be overridden, you need to use the Spark SQL CREATE TABLE statement and specify the identity column using the GENERATED BY DEFAULT AS IDENTITY syntax. Here is an example of how to create a Delta table using Spark SQL with such an identity column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Create Delta Table with Identity Column").getOrCreate()

# example data to be added
data = [("apple", 2.50), ("banana", 1.25), ("orange", 0.75)]

# create a DataFrame with the sample data
df = spark.createDataFrame(data, ["fruit", "price"])

# create a Delta table using Spark SQL with an identity column
# whose generated values can be overridden on insert
spark.sql("""
CREATE TABLE fruits (
   id BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1),
   fruit STRING,
   price DOUBLE
) USING DELTA
""")

# insert data into the Delta table; the id values are generated automatically
df.createOrReplaceTempView("staged_fruits")
spark.sql("INSERT INTO fruits (fruit, price) SELECT fruit, price FROM staged_fruits")

# Alternatively, name the id column in an INSERT INTO statement
# to supply explicit identity values:
# spark.sql("""
# INSERT INTO fruits (id, fruit, price) VALUES
#     (100, 'lemon', 1.50),
#     (101, 'lime', 1.25),
#     (102, 'grapefruit', 2.00)
# """)
```

You can also create a Delta table with an identity-style column using PySpark by defining a user-defined function (UDF) that generates the identity values and using it in a `withColumn` expression on the DataFrame before writing to Delta.

Here is an example of how this can be done:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, monotonically_increasing_id
from pyspark.sql.types import LongType
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Define the UDF to generate identity values (offset by a constant)
my_udf = udf(lambda x: x + 1000, LongType())

# Load the source data into a DataFrame
source_df = spark.read.format("csv").load("/path/to/source/data.csv")

# Add an identity column to the DataFrame using the UDF
delta_df = source_df.withColumn("id", my_udf(monotonically_increasing_id()))

# Write the DataFrame to Delta
delta_path = "/path/to/delta/table"
delta_df.write.format("delta") \
    .option("path", delta_path) \
    .mode("overwrite") \
    .saveAsTable("delta_table_name")

# Register the Delta table as a temporary view
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.toDF().createOrReplaceTempView("delta_table_name")
```

In this example, the monotonically_increasing_id() function generates a unique, increasing ID for each row in the DataFrame, and the UDF my_udf adds a constant offset to produce the identity values. You cannot control the values that monotonically_increasing_id() itself produces, but because the result is an ordinary DataFrame column, you can set or overwrite identity values for specific rows before writing if needed.

Note that this approach requires a UDF for identity column generation and relies on monotonically_increasing_id(), which has limitations: the generated IDs are unique and monotonically increasing but not consecutive, and they depend on how the DataFrame is partitioned.
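If consecutive, gap-free values matter more than scalability, a common alternative (a sketch, not part of the approach above) is row_number() over a window, which funnels all rows through a single partition:

```python
from pyspark.sql.functions import lit, row_number
from pyspark.sql.window import Window

# Ordering by a constant yields consecutive IDs 1, 2, 3, ...
# but collapses the data into one partition, so it only suits small datasets
w = Window.orderBy(lit(1))
df_with_sequential_id = source_df.withColumn("id", row_number().over(w))
```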

Kaniz_Fatma
Community Manager

Hi @oleprince, Yes, it is possible to create a Delta table using Python that includes a column that is generated by default as identity. However, the generatedByDefaultAs option is not available in the DeltaTableBuilder API.

 

As a workaround, you can create a table with an identity column using Spark SQL and then convert it ....
