
Databricks: Change the existing schema of columns to non-nullable for a Delta table using PySpark?

Red_blue_green
New Contributor III

Hello,

I currently have a Delta folder as a table with several columns that are nullable. I want to migrate data to the table, overwrite its content using PySpark, add several new columns, and make them non-nullable. I have found a way to make the columns in the PySpark DataFrame non-nullable:

from pyspark.sql.types import StructType, StructField, StringType

# Define the target schema with non-nullable columns
non_nullable_schema = StructType([
    StructField("column1", StringType(), nullable=False),
    StructField("column2", StringType(), nullable=False),
])

# Apply the new schema to the DataFrame
non_nullable_df = spark.createDataFrame(df.rdd, non_nullable_schema)

But after I write to the existing Delta destination folder and load it again, printing the schema shows that the columns are nullable again:

# Overwrite the existing Delta location with the new schema
non_nullable_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/path/to/delta/files")

# Reload from the Delta location and inspect the schema
df_read = spark.read.format("delta").load("/path/to/delta/files")
df_read.printSchema()
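
For reference, a quick way to confirm the per-column nullability after the round trip (a minimal check, assuming the same df_read as above):

# Print each field's name alongside its nullable flag
for field in df_read.schema.fields:
    print(field.name, field.nullable)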

Is there any way to change an existing schema in Delta to non-nullable using PySpark without creating a new Delta table?

3 REPLIES

Husky
New Contributor III

You could save the DataFrame as a table instead of a plain Delta file and then alter the table to set the columns to NOT NULL:

table = "<your_table_name>"
column_name = "<column_name>"

# Save the DataFrame as a managed table
non_nullable_df.write.saveAsTable(table, mode="overwrite")

# Add a NOT NULL constraint to the column
spark.sql(f"ALTER TABLE {table} ALTER COLUMN {column_name} SET NOT NULL")

Make sure that there are no null values in the column you want to make non-nullable; otherwise you will get an error.
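
If you want to guard against that up front, a pre-check might look like this (a sketch; table and column_name are the same placeholders as above):

# Count NULLs in the target column before adding the constraint
null_count = spark.table(table).filter(f"{column_name} IS NULL").count()
if null_count > 0:
    print(f"{null_count} rows have NULL in {column_name}; clean them up before altering")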

Red_blue_green
New Contributor III

Thank you for the suggestion, but my current constraint is unfortunately to work with Delta files, so saving as a table would not be enough.

kanjinghat
New Contributor II

Not sure if you found a solution, but you can also try the approach below. In this case you pass the full path to the Delta location rather than the table name itself.

spark.sql(f"ALTER TABLE delta.`{full_delta_path}` ALTER column {column_name} SET NOT NULL")
