
My Journey with Schema Management in Databricks

Brahmareddy
Honored Contributor II

When I first started handling schema management in Databricks, I realized that a little bit of planning could save me a lot of headaches down the road. Here's what I've learned and some simple tips that helped me manage schema changes effectively.

One of the first things I did was move away from relying on automatic schema inference. It seemed convenient at first, but I quickly found that it led to unexpected issues as my data evolved. Instead, I started defining schemas explicitly using StructType and StructField. Here's a snippet of how I set up a schema:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
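
Applying it is then just a matter of passing the schema to the reader instead of letting Spark infer it. A minimal example (the path here is just a placeholder):

df = (spark.read
    .schema(schema)           # enforce the explicit schema instead of inferring it
    .option("header", "true")
    .csv("/mnt/raw/people/"))  # placeholder path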

Switching to Delta Lake for managing schema evolution was another game-changer. Delta Lake lets me easily adapt my data structure as my needs change. For example, adding a new column is as simple as running:

ALTER TABLE my_table ADD COLUMNS (new_column STRING);
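
If you'd rather evolve the schema at write time, Delta also supports the mergeSchema option. A minimal sketch, assuming a DataFrame new_df that carries the extra column (names are just examples):

(new_df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # let Delta add any new columns present in new_df
    .saveAsTable("my_table"))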

This approach has saved me a lot of time, especially when dealing with larger datasets. Automating schema validation was another big win. By incorporating validation checks into my ETL process with tools like Databricks Autoloader, I could detect schema changes early and avoid bigger issues later on. Trust me, catching these issues upfront makes everything run smoother.
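
As a rough sketch of what that looks like with Auto Loader (paths are placeholders), you point it at a schema location so it tracks the schema over time and rescues anything unexpected instead of silently changing types:

df_stream = (spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Auto Loader tracks the schema here and notices when incoming data drifts from it
    .option("cloudFiles.schemaLocation", "/mnt/schemas/customers/")
    # unexpected new columns land in _rescued_data instead of silently changing the schema
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/mnt/raw/customers/"))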

Versioning schemas became a must for me after a few close calls. Whether it's through Delta Lake or Git, keeping a version history of my schemas has been incredibly helpful. If something goes wrong, I know I can always roll back to a previous version without too much fuss. As my data needs grew, I learned the importance of designing my schemas with future expansion in mind. Simple things like using nullable fields and struct types have made it easier to adapt to new data sources. Regularly reviewing my schema setup has kept it flexible and scalable.
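
With Delta, the rollback side of this is largely built in. Something along these lines works (table name and version number are just examples):

# See every change, including schema changes, the table has gone through
spark.sql("DESCRIBE HISTORY my_table").show(truncate=False)

# Roll back to an earlier version if a schema change went wrong
spark.sql("RESTORE TABLE my_table TO VERSION AS OF 3")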

Finally, clear documentation has been a lifesaver. By thoroughly documenting my schema definitions and any changes, I've made sure everyone on my team is on the same page. Sharing this through Databricks notebooks has made collaboration much easier. In short, managing schemas in Databricks doesn't have to be complicated. With a bit of planning, some automation, and clear documentation, you can keep things running smoothly and be ready for whatever comes next.


joao_augusto
New Contributor II

Hi Brah! Thanks for your post 

Currently, I'm trying to move away from schema inference. How do you store the DDL, schema, and data types for each table? I'm guessing you don't write them out manually for each one.

Do you store them in YAML, JSON, or SQL tables?

It would help me a lot!

Brahmareddy
Honored Contributor II

Hi Joao,

How are you doing today? Glad you found the post helpful! Yeah, manually writing out schemas for every table isn't practical, so I usually store them in JSON or YAML for flexibility. JSON works well if you want to programmatically manage schemas in Python, while YAML is easier to read and maintain.

Another good approach is to store schema definitions in a Delta table within Unity Catalog or a metadata table in a relational database. This way, you can query it anytime to track schema versions. If you're working with Databricks Autoloader, you can also store expected schema definitions in a separate config file and validate incoming data against it before loading.
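
A bare-bones version of that metadata table could look something like this, where each row holds one version of one table's schema as JSON (the main.metadata.schema_registry name is just a placeholder):

spark.sql("""
    CREATE TABLE IF NOT EXISTS main.metadata.schema_registry (
        table_name  STRING,
        version     INT,
        schema_json STRING,
        updated_at  TIMESTAMP
    )
""")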

If you're dealing with many tables, you could even automate schema generation using Spark's .schema.json() method to extract existing schema structures and store them for reference. Let me know if you need a sample setup!
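
For example, something along these lines (table name and path are just placeholders):

import json
from pyspark.sql.types import StructType

# Grab the current schema of an existing table as a JSON string
schema_json = spark.table("bronze.customers").schema.json()

# Store it somewhere versioned so it can be used for validation later
dbutils.fs.put("/mnt/schemas/customers_schema.json", schema_json, overwrite=True)

# Rebuild the StructType from the stored JSON when needed
schema = StructType.fromJson(json.loads(schema_json))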

Regards,

Brahma

joao_augusto
New Contributor II

That's interesting, but I think I'm missing something.

To define the schema as JSON, you would probably read the data from a bucket (for example), let Spark/Autoloader infer the schema, and save it in a file for validation. In this case, you are still relying on schema inference, so why not just use it directly in your ingestion?


I've seen a lot of discussions about schema enforcement. Some people just store raw data as strings in Bronze and only do the data type conversion in Silver.

At the moment, at my job, we just read everything from the source and write with the mergeSchema option. However, we don't define the data types, and that usually leads to some strange errors. There are a lot of tables to manage, and sometimes I feel I don't have much control over them.

If it doesn't get you in trouble, I would really appreciate a sample of your setup.

Again, thanks for the attention Brahma!

 

Brahmareddy
Honored Contributor II

Hi Joao,

You're right! If we let Spark or Autoloader define the schema once and save it for validation, we're still relying on inference initially. But the key difference is that we use inference just once, store the schema explicitly (as JSON or in a Delta table), and then enforce it going forward instead of letting Spark decide every time. This prevents unexpected type changes and keeps things under control.

Storing everything as strings in Bronze works for flexibility, but it pushes complexity to Silver, making transformations trickier. Instead, I prefer defining schemas early, so I catch issues before they cause downstream problems.

In my setup, I store schemas as JSON, load them dynamically, and apply them while reading data; this way, I don't have to manually write schemas for every table. I also validate the schema before writing to Delta to catch mismatches early. Below is a simple code snippet that shows how I do this.

Store Schema in JSON

{
  "name": "customers",
  "columns": [
    {"name": "id", "type": "integer", "nullable": false},
    {"name": "name", "type": "string", "nullable": false},
    {"name": "email", "type": "string", "nullable": true},
    {"name": "signup_date", "type": "date", "nullable": true}
  ]
}

Read Schema from JSON and Convert to StructType

import json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

# Define a mapping from JSON data types to Spark data types
type_mapping = {
    "string": StringType(),
    "integer": IntegerType(),
    "date": DateType()
}

# Load the schema JSON file from storage (S3, ADLS, or DBFS)
schema_path = "s3://my-bucket/schema/customers_schema.json"
schema_json = spark.read.text(schema_path, wholetext=True).collect()[0][0]  # read the whole file as one string
schema_dict = json.loads(schema_json)  # convert the JSON string to a Python dictionary

# Convert the JSON schema to a PySpark StructType
def json_to_spark_schema(schema_dict):
    return StructType([
        StructField(col["name"], type_mapping[col["type"]], col["nullable"])
        for col in schema_dict["columns"]
    ])

schema = json_to_spark_schema(schema_dict)

Read Data with the Enforced Schema

df = spark.read.schema(schema).json("s3://my-bucket/raw/customers/")

df.show()
df.printSchema()

Validate Schema Before Writing to Delta

# Extract expected schema fields
expected_schema = set([f"{field.name}:{field.dataType}" for field in schema.fields])
incoming_schema = set([f"{field.name}:{field.dataType}" for field in df.schema.fields])

# Check for mismatches
if expected_schema != incoming_schema:
    raise ValueError("Schema mismatch detected! Check your source data.")

# Write to Delta if the schema is correct
df.write.format("delta").mode("append").saveAsTable("gold.customers")

Give it a try and let me know if you have any additional questions.

Regards,

Brahma

joao_augusto
New Contributor II

Hmmmmmmmmm, that makes sense! I will study this possibility! You are amazing, and we are best friends from now on.

Thank you for this discussion Brahma.

Regards

Brahmareddy
Honored Contributor II

Haha, glad it made sense, Joao!

Try it out, and if you run into any issues, just let me know. Always happy to help! And best friends? You got it! 😄