Hi @oleprince ,
It is currently not possible to use Python to create a Delta table with an identity column generated by default, as you can in Spark SQL. The `DeltaTableBuilder` API in Delta Lake does not support creating an identity column with the "generated by default" option. This functionality can only be achieved using Spark SQL.
As you mentioned, you can create an identity column in Delta Lake using the `generatedAlwaysAs` option, which generates a unique ID value for each row inserted into the table.
If you need to override the generated identity value with a specific value, you can insert the values into your table using Spark SQL and an explicit column list.
However, Delta does not provide an option in the Python API to specify an identity column whose inserted value can be manually overridden. To create such a column, you need to use a Spark SQL `CREATE TABLE` statement and define the identity column with the `GENERATED BY DEFAULT AS IDENTITY` syntax. Here is an example of how to create a Delta table using Spark SQL with an identity column:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Create Delta Table with Identity Column").getOrCreate()

# example data to be added
data = [("apple", 2.50), ("banana", 1.25), ("orange", 0.75)]

# create a DataFrame with the sample data
df = spark.createDataFrame(data, ["fruit", "price"])

# create a Delta table using Spark SQL with an identity column
spark.sql("""
CREATE TABLE fruits (
  id BIGINT GENERATED BY DEFAULT AS IDENTITY (START WITH 1 INCREMENT BY 1),
  fruit STRING,
  price DOUBLE
) USING DELTA
""")

# insert data into the Delta table; the id column is generated automatically
df.createOrReplaceTempView("staged_fruits")
spark.sql("INSERT INTO fruits (fruit, price) SELECT fruit, price FROM staged_fruits")

# Because the column is GENERATED BY DEFAULT, you can also override the
# identity value by listing the id column explicitly:
# spark.sql("""
# INSERT INTO fruits (id, fruit, price) VALUES
#   (100, 'lemon', 1.50),
#   (101, 'lime', 1.25),
#   (102, 'grapefruit', 2.00)
# """)
```

Alternatively, you can generate the identity values yourself in PySpark by defining a user-defined function (UDF) and applying it in a `withColumn` expression on the DataFrame before writing to Delta.
Here is an example of how this can be done:
```python
from pyspark.sql.functions import udf, monotonically_increasing_id
from pyspark.sql.types import LongType
from delta.tables import DeltaTable

# Define the UDF to generate identity values (the offset of 1000 is arbitrary)
my_udf = udf(lambda x: x + 1000, LongType())

# Load the source data into a DataFrame
source_df = spark.read.format("csv").load("/path/to/source/data.csv")

# Add an identity column to the DataFrame using the UDF
delta_df = source_df.withColumn("id", my_udf(monotonically_increasing_id()))

# Write the DataFrame to Delta
delta_path = "/path/to/delta/table"
delta_df.write.format("delta") \
    .option("path", delta_path) \
    .mode("overwrite") \
    .saveAsTable("delta_table_name")

# Register a temporary view over the Delta table
delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.toDF().createOrReplaceTempView("delta_table_name")
```
In this example, the `monotonically_increasing_id()` function generates a unique, increasing ID value for each row in the DataFrame, and the UDF `my_udf` adds a constant offset to produce the identity values. Because `id` here is an ordinary data column rather than a true identity column, you can manually set or override its value before writing if needed.

Note that this approach requires a UDF for identity column generation and relies on `monotonically_increasing_id()`, which has limitations: the generated IDs are guaranteed to be unique and monotonically increasing but not consecutive (there are large gaps between partitions), and they are only assigned within a single computation, so appending data later will not continue the sequence.
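The gaps come from how `monotonically_increasing_id()` constructs its values: the current Spark implementation puts the partition index in the upper 31 bits of a 64-bit long and the record number within the partition in the lower 33 bits. A minimal pure-Python sketch of that documented layout (the `mono_id` helper is illustrative, not part of Spark):

```python
# Sketch of the documented bit layout used by monotonically_increasing_id():
# partition index in the upper 31 bits, per-partition record number in the
# lower 33 bits of a 64-bit long.
def mono_id(partition_index: int, record_number: int) -> int:
    return (partition_index << 33) | record_number

# Rows in partition 0 get consecutive ids starting at 0,
# but the first row of partition 1 jumps to 2**33:
print(mono_id(0, 0))  # 0
print(mono_id(0, 1))  # 1
print(mono_id(1, 0))  # 8589934592
```

This is why IDs from a multi-partition DataFrame are increasing but not consecutive, and why a fixed offset like the `+ 1000` in the UDF above does not make them consecutive either.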