Authors: Malay Panigrahi and Nikhil Chandna
Organisations increasingly rely on data to fuel their decisions and gain a competitive edge. However, as the volume and complexity of data grow, so do the challenges associated with ensuring its quality. Poor data quality isn’t just an inconvenience; it’s a liability that can lead to misguided strategies, lost revenue, regulatory fines, and reputational damage.
A data quality system is not just a set of tools or processes; it's a strategic asset that underpins every aspect of your business operations. It ensures that the data you rely on is accurate, consistent, and fit for purpose, enabling you to make informed decisions confidently. Traditionally, data quality monitoring frameworks have focused primarily on the ETL processes and pipelines, emphasising data processing. However, monitoring data quality is equally essential on the analyst side, where tools like Databricks SQL are used. Analysts are often the final gatekeepers of data insights, and ensuring data quality at this stage is crucial for maintaining the integrity of business decisions.
This blog will explore how Databricks’ capabilities can be harnessed to build a solid and effective data quality system, extending beyond traditional boundaries to include the analyst's perspective.
Data Quality Management on Databricks SQL
Databricks offers several key features for effective data quality management: schema enforcement (with schema evolution), constraints, Lakehouse Monitoring with custom metrics, Databricks SQL alerts, and data lineage through Unity Catalog.
Let’s look at each of these techniques individually from the viewpoint of a fictional retailer.
Consider a national retailer that manages an extensive loyalty program to reward frequent customers. The retailer collects customer data, including purchase history, preferences, and demographic information, from various sources, such as in-store transactions, online purchases, and third-party marketing platforms.
The retailer faces a challenge when integrating customer data from a new third-party marketing platform. This platform provides customer data with additional fields such as "Social Media Handles" and "Influencer Status," which are not part of the retailer's existing schema. Additionally, some fields, like "Date of Birth," are provided in different formats than expected.
Schema Enforcement
Schema enforcement, also known as schema validation, ensures data quality by rejecting writes to a table that do not match the table's schema. Here’s how schema enforcement works in Delta Lake: to be written, the incoming data cannot contain columns that are absent from the target table's schema, and its column data types must be compatible with the target table's types; columns missing from the incoming data are simply populated with nulls.
For instance, take a look at what happens when a DataFrame with a mismatched schema is written to an existing table.
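Below is a minimal sketch of this behaviour (assuming a Databricks notebook where spark is available; the customers table and its columns are illustrative and not from the original post):

# Illustrative target table with a fixed schema.
spark.sql("CREATE TABLE IF NOT EXISTS customers (customer_id BIGINT, email STRING) USING DELTA")

# Incoming data carries an extra column that is not in the target schema.
new_data = spark.createDataFrame(
    [(1, "jane@example.com", "@jane_doe")],
    ["customer_id", "email", "social_media_handle"],
)

# With schema enforcement, this append is rejected and nothing is written.
try:
    new_data.write.format("delta").mode("append").saveAsTable("customers")
except Exception as e:
    print(e)  # AnalysisException reporting a schema mismatch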
Please note that Databricks attempts to safely cast column data types to match the target table.
By enforcing strict schema control, you’re encouraged to be intentional about the structure of your data, maintaining high standards and ensuring data quality. This keeps your tables clean and purposeful, allowing them to serve their function effectively.
However, if you decide after careful consideration that adding a new column is necessary, you can easily accomplish this using the schema evolution feature.
Schema evolution is activated by adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.
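For illustration, reusing the hypothetical customers table and new_data DataFrame from the sketch above, the same append goes through once schema evolution is enabled:

# With mergeSchema enabled, the new column is added to the target table's schema,
# and existing rows receive NULL for it.
(new_data.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("customers"))

spark.table("customers").printSchema()  # now includes social_media_handle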
By using schema enforcement in Databricks, our example retailer can ensure that all incoming customer data adheres to the predefined schema. This means any data with unexpected columns or incorrect formats is automatically flagged or rejected before being written.
Now consider that the same retailer encounters problems with customer data entries, particularly missing or incorrect email addresses. Some entries have null values or invalid formats, leading to failed communications and ineffective marketing campaigns.
Constraints
Databricks supports standard SQL constraint management clauses. When constraints are enforced, they automatically verify the quality and integrity of data added to a table. Two types of enforced constraints are supported on Delta tables: NOT NULL constraints, which reject null values in a column, and CHECK constraints, which require a specified Boolean expression to evaluate to true for every row.
When a constraint is violated for any row, the entire transaction will fail, raise an error and roll back.
The examples below show how constraints can be added, displayed, and removed.
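Here is a sketch of both constraint types; the table name, column name, constraint name, and regular expression are illustrative rather than taken from the original post:

# Add a NOT NULL constraint on the email column
# (this fails if existing rows already contain nulls).
spark.sql("ALTER TABLE customers ALTER COLUMN email SET NOT NULL")

# Add a CHECK constraint that validates the email format with a simple regex.
spark.sql("""
    ALTER TABLE customers
    ADD CONSTRAINT valid_email
    CHECK (email RLIKE '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+[.][A-Za-z]{2,}$')
""")

# Display constraints: CHECK constraints are stored as table properties,
# and NOT NULL shows up in the column definitions.
spark.sql("SHOW TBLPROPERTIES customers").show(truncate=False)
spark.sql("DESCRIBE TABLE EXTENDED customers").show(truncate=False)

# Remove the constraints again.
spark.sql("ALTER TABLE customers DROP CONSTRAINT valid_email")
spark.sql("ALTER TABLE customers ALTER COLUMN email DROP NOT NULL")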
By implementing constraints in Databricks, the retailer can set a NOT NULL constraint on the "Email" column to ensure that every customer record includes an email address. Additionally, a CHECK constraint can be applied to validate the format of email addresses using a regular expression. This ensures that only valid and complete email addresses are entered into the system, improving the effectiveness of marketing efforts and enhancing customer engagement.
Our retailer now faces challenges with data quality over time, particularly with the consistency and accuracy of customer transaction data. As data is collected from multiple channels, there can be discrepancies in the number of transactions recorded or anomalies in spending patterns that could indicate data quality issues such as missing or duplicated records.
Lakehouse Monitoring
Lakehouse Monitoring allows you to monitor the quality of all your tables registered in Unity Catalog (UC). To use Lakehouse Monitoring, your workspace must be Unity Catalog-enabled, and you must have access to Databricks SQL.
Users can configure a table to be monitored using the Databricks UI or the API. Lakehouse Monitoring provides three types of analysis: time series, snapshot, and inference.
A table monitor automatically produces two metric tables, a profile metrics table and a drift metrics table, along with a dashboard.
Refer to the official documentation (AWS | Azure) to learn how to use Lakehouse Monitoring.
You can also configure custom metrics, which is useful when you want measures that capture specific aspects of your business logic. Three types of custom metrics are supported: aggregate metrics, derived metrics, and drift metrics.
For example, suppose you want to track the average price of orders in your orders table. As another example, say you want to track delayed orders, that is, orders that are still in an open state 30 days after their creation date. Both can be defined as custom metrics, as sketched below.
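One way to express both measures is as aggregate custom metrics via the Databricks SDK for Python. The sketch below is an assumption-laden illustration: the o_totalprice, o_orderstatus, and o_orderdate column names come from the TPC-H orders schema, and the MonitorMetric fields should be checked against the current Lakehouse Monitoring documentation.

from pyspark.sql import types as T
from databricks.sdk.service.catalog import MonitorMetric, MonitorMetricType

# Aggregate custom metric: average order price in each monitored window.
avg_price = MonitorMetric(
    type=MonitorMetricType.CUSTOM_METRIC_TYPE_AGGREGATE,
    name="avg_price",
    input_columns=[":table"],
    definition="avg(o_totalprice)",
    output_data_type=T.StructField("avg_price", T.DoubleType()).json(),
)

# Aggregate custom metric: orders still open ('O') more than 30 days after creation.
delayed_orders = MonitorMetric(
    type=MonitorMetricType.CUSTOM_METRIC_TYPE_AGGREGATE,
    name="delayed_orders",
    input_columns=[":table"],
    definition="count_if(o_orderstatus = 'O' AND datediff(current_date(), o_orderdate) > 30)",
    output_data_type=T.StructField("delayed_orders", T.LongType()).json(),
)

These metric objects are then attached to the monitor through its custom metrics configuration, either in the UI or via the API (see the sketch after the note below).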
Once the metrics are saved, they appear in the Lakehouse Monitoring tab for the table.
In addition to various other metrics, the custom metrics are stored in the Profile metrics table.
Note: If you have a large number of tables to monitor, you can use the API to efficiently enable monitoring across multiple tables programmatically.
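As a rough sketch of that approach, assuming the Databricks SDK for Python is available, and noting that the exact method names and arguments may differ between SDK versions (the table list, output schema, and assets directory below are placeholders):

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()

# Placeholder list of Unity Catalog tables to monitor.
tables_to_monitor = [
    "malay_demo.tpch.orders",
    "malay_demo.tpch.customer",
]

for table_name in tables_to_monitor:
    w.quality_monitors.create(
        table_name=table_name,
        # Schema where the profile and drift metric tables are written.
        output_schema_name="malay_demo.tpch",
        # Workspace folder for the generated monitoring assets (placeholder path).
        assets_dir=f"/Workspace/Users/someone@example.com/lakehouse_monitoring/{table_name}",
        # Snapshot analysis profiles the whole table on each refresh.
        snapshot=MonitorSnapshot(),
        # Custom metrics such as avg_price and delayed_orders could be passed here:
        # custom_metrics=[avg_price, delayed_orders],
    )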
By implementing Lakehouse Monitoring in Databricks, the retailer can continuously track the quality of their customer transaction data. Lakehouse Monitoring allows the retailer to set up automated profiling and drift analysis on their transaction tables. This setup helps detect changes in data distribution, such as unexpected spikes or drops in transaction volumes or inconsistencies in spending patterns.
Our retailer also experiences occasional data quality issues, such as unexpectedly low transaction numbers recorded in their Gold layer tables due to upstream system errors or data ingestion failures. These issues can lead to incomplete customer profiles and inaccurate reward calculations, affecting customer trust and engagement.
Setting up Alerts
Now that your metrics table is prepared, you can configure alerts to notify you of potential errors, helping to mitigate downstream risks. For instance, you can configure an alert that fires if the percentage of nulls or zeros surpasses a defined threshold or changes significantly over time.
Let's look at an example of configuring alerts for our custom metrics, such as avg_price and delayed_orders.
SELECT COUNT(delayed_orders) FROM malay_demo.tpch.orders_profile_metrics;
In the same way, you can configure alerts for other key metrics to take proactive steps in maintaining the overall health and reliability of your data. To visualise these metrics, Lakehouse Monitoring also offers a customisable, out-of-the-box dashboard.
By using Databricks SQL alerts, the retailer can set up automated notifications to monitor key metrics related to transaction data quality.
Unfortunately, the retailer discovers discrepancies in customer purchase data, leading to incorrect reward calculations and customer dissatisfaction. These discrepancies arise from errors during data integration from multiple sources, making it difficult to pinpoint where the issue originated.
Data Lineage
Data lineage is integral to data governance, providing valuable insights that indirectly support data quality management. Through the Unity Catalog, Databricks offers detailed data lineage tracking, capturing data flow across queries, notebooks, jobs, and dashboards. This comprehensive view extends to the column level, allowing users to see how data is transformed and utilised across various processes within the organisation.
While data lineage does not directly monitor or ensure data quality, it is instrumental in identifying and resolving issues that may affect it. By tracing the path of data from its origin to its final destination, lineage helps pinpoint where errors or inconsistencies may have been introduced. This capability enables data teams to quickly diagnose problems and implement corrective actions, thereby maintaining the integrity and reliability of the data.
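For instance, here is a hedged sketch that uses the Unity Catalog lineage system tables (assuming system tables are enabled in the account; the Gold table name is a placeholder). It lists the upstream tables and the jobs or notebooks that recently wrote into a downstream table, which helps narrow down where a bad batch may have entered:

# Recent writes into a downstream table, with the upstream source table
# and the entity (job, notebook, pipeline, ...) that performed the write.
spark.sql("""
    SELECT source_table_full_name,
           entity_type,
           entity_id,
           event_time
    FROM system.access.table_lineage
    WHERE target_table_full_name = 'malay_demo.loyalty.customer_purchases_gold'
      AND event_date >= date_sub(current_date(), 7)
    ORDER BY event_time DESC
""").show(truncate=False)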
Additionally, understanding data lineage aids in impact analysis when changes are made to data sources or processing workflows. It ensures that modifications do not negatively affect downstream applications or analyses, thus preserving data quality. Moreover, lineage provides transparency and accountability, crucial for compliance with regulatory requirements and internal standards.
By utilising data lineage through Databricks' Unity Catalog, our example retailer can trace the entire journey of customer data from its source to its final destination. This capability allows them to easily and accurately identify where errors or inconsistencies were introduced during data transformations or integrations.
Conclusion
Ensuring robust data quality is critical for maintaining your data ecosystem's integrity, reliability, and accuracy. Leveraging capabilities such as schema enforcement, constraints, Lakehouse Monitoring, customisable dashboards, alerts, and data lineage empowers organisations to proactively detect, address, and prevent data issues before they impact downstream processes. Adopting a comprehensive approach to data quality management can build trust in your data, improve decision-making, and enhance overall business outcomes.