Data contract implementation best practices

jar
New Contributor III

Hi all.

We've written some .yml files for our data products in a UC-enabled workspace (dev and prod). We've constructed a directory structure mirroring the one that contains the scripts which ultimately create these products and placed the .yml files there, initially for governance, but we plan to use them programmatically as well, e.g., in our workflows.
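For reference, the layout looks roughly like this (directory and product names below are simplified and purely illustrative):

repo/
├── data_products/          # scripts that build the products
│   ├── customers/build_customers.py
│   └── orders/build_orders.py
└── data_contracts/         # mirrors data_products/
    ├── customers/customers.yml
    └── orders/orders.yml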

I've tried to do some reading on whether this is best practice, or what best practice even is for the implementation of data contracts. If anyone would care to share their experience and/or knowledge on the matter, it would be much appreciated!

Best,

Johan.

4 REPLIES

VZLA
Databricks Employee

Implementing data contracts using .yml files in your Unity Catalog-enabled workspace is a sound practice, especially as it allows for programmatic use in workflows and aids in governance. Best practices for this approach include:

  • Catalog Organization: Segregate data using catalogs based on environment (development, production), teams, or business units. This helps in managing access and maintaining clarity.

  • Governance and Access Control: Assign permissions to groups rather than individual users to simplify management. Centralized governance ensures consistency across teams while allowing them to focus on data production and insights.

  • Data Contract Contents: Your data contracts should include key attributes like data descriptions, schemas, usage policies, data quality metrics, security guidelines, and service-level agreements (SLAs). This ensures that data consumers have all the necessary information.

  • Consumer-Centric Design: Design data contracts with the consumer in mind. Providing supporting assets like notebooks, dashboards, or sample code can enhance understanding and usability.

Your current strategy of storing .yml files alongside the scripts that generate your data products aligns well with these best practices. It facilitates both governance and programmatic access, ensuring that your data products are well-documented and easily consumable by various stakeholders.
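
For illustration only, a minimal contract for a hypothetical customers product could look like the snippet below, loaded here with PyYAML so it can also be consumed programmatically in workflows. The keys, field names, and values are assumptions rather than a mandated format; adapt them to your own standard.

import yaml

# Illustrative contract for a hypothetical "customers" data product.
# There is no single required format; adjust the keys to your own conventions.
contract_yaml = """
name: customers
owner: data-platform-team
description: Curated customer dimension for analytics consumers.
schema:
  customer_id: string
  signup_date: date
  email: string
quality:
  - column: customer_id
    check: not_null
sla:
  freshness_hours: 24
"""

data_contract = yaml.safe_load(contract_yaml)
print(data_contract["schema"])  # {'customer_id': 'string', 'signup_date': 'date', 'email': 'string'}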

jar
New Contributor III

Hi. 

Thank you for your response and apologies for my delay in replying. 

Glad to know we are following best practices. Could you give an example of how a data contract stored in this manner can be used programmatically? I've heard you can wrap them and use pytest to test for alignment between the data product and the data contract?

Lorenzo
New Contributor II

Would also like to see some more information/examples on this.

VZLA
Databricks Employee

Thank you for your follow-up question.

Yes, if it helps, this would be a good starting point/demo:

import yaml
import pytest

# Load the data contract
with open('data_contract.yml', 'r') as file:
    data_contract = yaml.safe_load(file)

# Example data product schema (hard-coded here for the demo; in practice, derive it from the actual table)
data_product_schema = {
    'name': 'string',
    'age': 'integer',
    'email': 'string'
}

# Test to check alignment between data product and data contract
def test_data_product_alignment():
    for field, field_type in data_contract['schema'].items():
        assert field in data_product_schema, f"Field '{field}' is missing in the data product schema."
        assert data_product_schema[field] == field_type, (
            f"Field '{field}' type mismatch: expected '{field_type}', got '{data_product_schema[field]}'."
        )

# Additional example: check for unexpected fields in the data product
def test_no_unexpected_fields():
    for field in data_product_schema:
        assert field in data_contract['schema'], f"Unexpected field '{field}' found in the data product schema."

if __name__ == '__main__':
    pytest.main([__file__])
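
As a possible next step, the hard-coded data_product_schema could be replaced by the schema of the live table registered in Unity Catalog. This is only a rough sketch, assuming a table named main.sales.customers and that the contract uses the same type names as Spark's simpleString() output:

import yaml
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Load the contract as before
with open('data_contract.yml', 'r') as file:
    data_contract = yaml.safe_load(file)

# Build the "actual" schema from the governed table itself (the table name is an assumption)
live_schema = {
    field.name: field.dataType.simpleString()
    for field in spark.table("main.sales.customers").schema.fields
}

def test_live_table_matches_contract():
    assert live_schema == data_contract['schema'], (
        f"Live table schema {live_schema} does not match contract schema {data_contract['schema']}"
    )

Running such a test as a workflow task or CI job fails the run whenever the table drifts from its contract.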

 
