Hello,
I currently have a Delta folder as a table with several nullable columns. I want to migrate data to the table and overwrite its content using PySpark, add several new columns, and make them not nullable. I have found a way to mark the columns in the PySpark DataFrame as non-nullable:
from pyspark.sql.types import StructType, StructField, StringType

non_nullable_schema = StructType([
StructField("column1", StringType(), nullable=False),
StructField("column2", StringType(), nullable=False),
])
# Apply the new schema to the DataFrame
non_nullable_df = spark.createDataFrame(df.rdd, non_nullable_schema)
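For reference, printing the schema of this DataFrame at this point does show the fields as non-nullable (output roughly like this):
non_nullable_df.printSchema()
# root
#  |-- column1: string (nullable = false)
#  |-- column2: string (nullable = false)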
But after I write to the existing Delta destination folder and load it again, printing the schema shows that the columns are nullable again:
non_nullable_df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save("/path/to/delta/files")
df_read=spark.read.format("delta").load("/path/to/delta/files")
df_read.printSchema()
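The output looks roughly like this, with the columns back to nullable:
root
 |-- column1: string (nullable = true)
 |-- column2: string (nullable = true)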
Is there any way to change the schema of an existing Delta table to not nullable using PySpark, without creating a new Delta table?