Recently I discovered an issue when creating a PARQUET table that contains a column of type STRUCT with more than 350 string subfields. Such a table can be successfully created via a standard DDL script nevertheless each subsequent attempt to work with the table via spark.sql ends up with an Illegal Argument Exception hinting malformed definition of the column. You cannot even drop it unless you use the approach posted here: https://kb.databricks.com/metastore/drop-table-corruptedmetadata
I have searched the internet for leads but was not successful. Is there some hidden limit for STRUCT type definition for PARQUET and Metastore? BTW, changing the format of the table to Delta helps.
The issue can be replicable with the following python script:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
def generate_unique_strings(count, size=5):
"""
Returns generator of unique 'count' strings of 'size' length
"""
seen_strings = set()
while len(seen_strings) < count:
new_string = ''.join(random.choice(string.ascii_letters) for _ in range(size)).upper()
if new_string not in seen_strings:
seen_strings.add(new_string)
yield new_string
def get_ddl_for_single_struct_table(struct_field_size, tbl_name, db_name="temp", ):
"""
Returns string create statement for a table with one struct with 'struct_field_size' STRING fields.
"""
struct_fields_names = generate_unique_strings(count=struct_field_size)
struct_fields = ','.join([f"`{field}`: STRING" for field in struct_fields_names])
return f"CREATE TABLE {db_name}.{tbl_name} (`test` STRUCT<{struct_fields}>) USING PARQUET;"
if __name__ == "__main__":
# Create a testing DB
spark.sql("CREATE DATABASE IF NOT EXISTS temp")
# This will succeed - create a table with test struct column with 300 string fields and try to show its DDL
TEST_TABLE_NAME_1 = "struct_size_limit_test_300"
STRUCT_SIZE_1 = 300
spark.sql(get_ddl_for_single_struct_table(struct_field_size = STRUCT_SIZE_1, tbl_name = TEST_TABLE_NAME_1))
spark.sql(f"show create table temp.{TEST_TABLE_NAME_1}")
# This will produce an exception - create a table with test struct column with 500 string fields and try to show its DDL.
TEST_TABLE_NAME_2 = "struct_size_limit_test_500"
STRUCT_SIZE_2 = 500
spark.sql(get_ddl_for_single_struct_table(struct_field_size = STRUCT_SIZE_2,tbl_name = TEST_TABLE_NAME_2))
spark.sql(f"show create table temp.{TEST_TABLE_NAME_2}")
I