Data contract implementation best practices
10-29-2024 08:18 PM
Hi all.
We've written some .yml files for our data products in a UC-enabled workspace (dev and prod). We've placed them in a directory that mirrors the one containing the scripts that ultimately create these products. Initially this is for governance, but we also plan to use the files programmatically, e.g. in our workflows.
I've tried to do some reading on whether this is best practice, or what best practice even is for the implementation of data contracts. If anyone would care to share their experience and/or knowledge on the matter, it would be much appreciated!
Best,
Johan.
10-31-2024 09:44 AM - edited 10-31-2024 09:48 AM
Implementing data contracts using .yml files in your Unity Catalog-enabled workspace is a sound practice, especially as it allows for programmatic use in workflows and aids in governance. Best practices for this approach include:
- Catalog Organization: Segregate data using catalogs based on environment (development, production), teams, or business units. This helps in managing access and maintaining clarity.
- Governance and Access Control: Assign permissions to groups rather than individual users to simplify management. Centralized governance ensures consistency across teams while allowing them to focus on data production and insights.
- Data Contract Contents: Your data contracts should include key attributes like data descriptions, schemas, usage policies, data quality metrics, security guidelines, and service-level agreements (SLAs). This ensures that data consumers have all the necessary information.
- Consumer-Centric Design: Design data contracts with the consumer in mind. Providing supporting assets like notebooks, dashboards, or sample code can enhance understanding and usability.
Your current strategy of storing .yml files alongside the scripts that generate your data products aligns well with these best practices. It facilitates both governance and programmatic access, ensuring that your data products are well-documented and easily consumable by various stakeholders.
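As a rough illustration of the "Data Contract Contents" point, the sketch below checks that a parsed contract contains the sections listed above. The section names (`description`, `schema`, `usage_policy`, `quality`, `security`, `sla`) and the helper `missing_sections` are hypothetical, chosen only for this example; the dict stands in for what `yaml.safe_load` would return for one of your .yml files.

```python
# Hypothetical section names, loosely following the attributes listed above.
REQUIRED_SECTIONS = (
    "description",
    "schema",
    "usage_policy",
    "quality",
    "security",
    "sla",
)


def missing_sections(contract: dict) -> list:
    """Return the required top-level sections that the contract lacks."""
    return [section for section in REQUIRED_SECTIONS if section not in contract]


# Stand-in for a contract parsed from a .yml file (illustrative values only).
contract = {
    "description": "Customer master data",
    "schema": {"name": "string", "age": "integer", "email": "string"},
    "sla": {"freshness_hours": 24},
}

missing = missing_sections(contract)
print(missing)
```

A check like this could run in CI for every contract file, so that an incomplete contract is flagged before anyone relies on it.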
11-11-2024 07:35 PM
Hi.
Thank you for your response and apologies for my delay in replying.
Glad to know we are following best practices. Could you give an example of how a data contract stored in this manner can be used programmatically? I heard you can wrap them and use pytest to test for alignment between data product and data contract?
12-12-2024 02:36 AM
Would also like to see some more information/examples on this.
12-12-2024 06:45 AM
Thank you for your follow-up question.
Yes, if it helps, this would be a good starting point/demo:
import yaml
import pytest

# Load the data contract (expects a top-level 'schema' mapping of field -> type)
with open('data_contract.yml', 'r') as file:
    data_contract = yaml.safe_load(file)

# Example data product schema
data_product_schema = {
    'name': 'string',
    'age': 'integer',
    'email': 'string'
}

# Test to check alignment between data product and data contract
def test_data_product_alignment():
    for field, field_type in data_contract['schema'].items():
        assert field in data_product_schema, f"Field '{field}' is missing in the data product schema."
        assert data_product_schema[field] == field_type, (
            f"Field '{field}' type mismatch: expected '{field_type}', got '{data_product_schema[field]}'."
        )

# Additional example: check for unexpected fields in the data product
def test_no_unexpected_fields():
    for field in data_product_schema:
        assert field in data_contract['schema'], f"Unexpected field '{field}' found in the data product schema."

if __name__ == '__main__':
    # Run only the tests in this file rather than everything pytest can discover
    pytest.main([__file__])
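One possible variation on the demo, offered as a sketch rather than a definitive approach: instead of stopping at the first failed assertion, collect every mismatch between contract and data product and report them together. The helper name `schema_violations` and the sample schemas are hypothetical. In a workspace, the observed schema would presumably be built from the table itself; that step is left out here.

```python
from typing import Dict, List


def schema_violations(contract_schema: Dict[str, str],
                      observed_schema: Dict[str, str]) -> List[str]:
    """Return every mismatch between a contract schema and an observed schema."""
    problems = []
    # Fields the contract promises: must exist and have the agreed type.
    for field, expected in contract_schema.items():
        if field not in observed_schema:
            problems.append(f"missing field: {field}")
        elif observed_schema[field] != expected:
            problems.append(
                f"type mismatch for {field}: "
                f"expected {expected}, got {observed_schema[field]}"
            )
    # Fields the data product exposes that the contract does not mention.
    for field in observed_schema:
        if field not in contract_schema:
            problems.append(f"unexpected field: {field}")
    return problems


# Illustrative inputs; the contract side would come from data_contract['schema'].
contract_schema = {"name": "string", "age": "integer", "email": "string"}
observed_schema = {"name": "string", "age": "string", "signup_date": "date"}

violations = schema_violations(contract_schema, observed_schema)
for line in violations:
    print(line)
```

Inside a pytest test this reduces to a single `assert not violations, violations`, which shows the full list of problems in the failure message.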

