Authors: Malay Panigrahi and Nikhil Chandna

Organisations increasingly rely on data to fuel their decisions and gain a competitive edge. However, as the volume and complexity of data grow, so do the challenges associated with ensuring its quality. Poor data quality isn’t just an inconvenience; it’s a liability that can lead to misguided strategies, lost revenue, regulatory fines, and reputational damage.

A data quality system is not just a set of tools or processes; it's a strategic asset that underpins every aspect of your business operations. It ensures that the data you rely on is accurate, consistent, and fit for purpose, enabling you to make informed decisions confidently. Traditionally, data quality monitoring frameworks have focused primarily on the ETL processes and pipelines, emphasising data processing. However, monitoring data quality is equally essential on the analyst side, where tools like Databricks SQL are used. Analysts are often the final gatekeepers of data insights, and ensuring data quality at this stage is crucial for maintaining the integrity of business decisions.

This blog will explore how Databricks’ capabilities can be harnessed to build a solid and effective data quality system, extending beyond traditional boundaries to include the analyst's perspective.

Data Quality Management on Databricks SQL

Databricks offers several key features for effective data quality management:

  • Schema enforcement
  • Constraints
  • Lakehouse Monitoring
  • Databricks SQL alerts
  • Data lineage

Let’s look into each of the above techniques individually from the viewpoint of a fictional Retailer.

Consider a national Retailer that manages an extensive loyalty program to reward frequent customers. The Retailer collects customer data, including purchase history, preferences, and demographic information, from various sources, such as in-store transactions, online purchases, and third-party marketing platforms.

The Retailer faces a challenge when integrating customer data from a new third-party marketing platform. This platform provides customer data with additional fields such as "Social Media Handles" and "Influencer Status," which are not part of the retailer's existing schema. Additionally, some fields like "Date of Birth" are provided in different formats than expected.

Schema Enforcement

Schema enforcement, also known as schema validation, ensures data quality by rejecting writes to a table that do not match the table's schema. Here’s how schema enforcement works in Delta Lake:

  • Schema Validation on Write: During write operations, Delta Lake checks new data for compatibility with the target table's schema. If there's a mismatch, the transaction is cancelled, no data is written, and an exception is raised to inform the user.
  • Column Compatibility:
    • The DataFrame being written must not contain columns that are missing from the target table's schema, and its column data types must match those of the target table.
    • Regarding case sensitivity, Delta Lake preserves case but treats schemas as case-insensitive. This allows you to query using variations like 'Order_Number' or 'order_number', and Delta Lake will recognize them as the same. However, you cannot store two columns with names differing only by case, such as 'Order_Number' and 'order_number'; Delta Lake will raise an error to prevent this.

For instance, take a look at what happens when a DataFrame with a mismatched schema is written to an existing table.

[Screenshot: error raised when a DataFrame with a mismatched schema is written to an existing table]

Please note that Databricks attempts to safely cast column data types to match the target table.
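
To see this behaviour end to end, here is a minimal PySpark sketch you can run in a Databricks notebook. The table and column names (main.loyalty.customers, influencer_status) are hypothetical stand-ins for the Retailer's loyalty data.

from pyspark.sql.functions import lit

# Existing Delta table for the loyalty programme (hypothetical name).
customers = spark.table("main.loyalty.customers")

# Simulate a batch from the third-party platform carrying a column the table does not have.
new_batch = customers.limit(10).withColumn("influencer_status", lit("nano"))

try:
    # Append without schema evolution: Delta Lake validates the schema on write.
    new_batch.write.format("delta").mode("append").saveAsTable("main.loyalty.customers")
except Exception as e:
    # The transaction is cancelled, nothing is written, and the error names the offending column.
    print(f"Write rejected by schema enforcement: {e}")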

By enforcing strict schema control, you’re encouraged to be intentional about the structure of your data, maintaining high standards and ensuring data quality. This keeps your tables clean and purposeful, allowing them to serve their function effectively.

However, if you decide after careful consideration that adding a new column is necessary, you can easily accomplish this using the schema evolution feature.

Schema evolution is activated by adding .option('mergeSchema', 'true') to your .write or .writeStream Spark command.
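
As a minimal sketch under the same hypothetical table and column names as above, the rejected append from the previous example succeeds once schema evolution is enabled, and the new column is merged into the table schema:

from pyspark.sql.functions import lit

# The same batch with a column that is not yet part of the target schema.
new_batch = (
    spark.table("main.loyalty.customers")
    .limit(10)
    .withColumn("influencer_status", lit("nano"))
)

# Append with schema evolution enabled; the extra column is added to the table schema.
(
    new_batch.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("main.loyalty.customers")
)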

By using schema enforcement in Databricks, our example Retailer can ensure that all incoming customer data adheres to the predefined schema. This means any data with unexpected columns or incorrect formats is automatically flagged or rejected before being written.

Now consider that the same retailer encounters problems with customer data entries, particularly with missing or incorrect email addresses. Some entries have null values or invalid formats, leading to failed communications and ineffective marketing campaigns.

Constraints

Databricks supports standard SQL constraint management clauses. When constraints are enforced, they ensure that the quality and integrity of data added to a table are automatically verified. Two types of constraints are supported with Delta tables:

  • NOT NULL: prevents any NULL values from being inserted into the column
  • CHECK: requires that the specified Boolean expression must be valid for each input row

When a constraint is violated for any row, the entire transaction fails, raises an error, and rolls back.

--Add Constraint
ALTER TABLE customer ALTER COLUMN Cust_id SET NOT NULL;
ALTER TABLE customer ADD CONSTRAINT validIds CHECK (Cust_id > 1 AND Cust_id < 99999999);

--Drop Constraint
ALTER TABLE customer ALTER COLUMN Cust_id DROP NOT NULL;
ALTER TABLE customer DROP CONSTRAINT validIds;

--Display Constraint
DESCRIBE DETAIL customer;
SHOW TBLPROPERTIES customer; -- displays the results in a relatively easier-to-read format

The statements above show how constraints can be added, dropped, and displayed.

By implementing constraints in Databricks, the retailer can set a NOT NULL constraint on the "Email" column to ensure that every customer record includes an email address. Additionally, a CHECK constraint can be applied to validate the format of email addresses using a regular expression. This ensures that only valid and complete email addresses are entered into the system, improving the effectiveness of marketing efforts and enhancing customer engagement.
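
As an illustrative sketch, the retailer could add these two constraints from a notebook. The table name, column name, and regular expression below are assumptions; adjust the pattern to your own validation rules, and note that existing rows must already satisfy a CHECK constraint before it can be added.

# Require every customer record to carry an email address.
spark.sql("ALTER TABLE main.loyalty.customers ALTER COLUMN Email SET NOT NULL")

# Reject rows whose email address does not match a simple illustrative pattern.
spark.sql("""
  ALTER TABLE main.loyalty.customers
  ADD CONSTRAINT valid_email CHECK (Email RLIKE '^[^@ ]+@[^@ ]+[.][^@ ]+$')
""")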

Our retailer now faces challenges with data quality over time, particularly with the consistency and accuracy of customer transaction data. As data is collected from multiple channels, there can be discrepancies in the number of transactions recorded or anomalies in spending patterns that could indicate data quality issues such as missing or duplicated records.

Lakehouse Monitoring

Lakehouse Monitoring allows you to monitor the quality of all the tables registered in Unity Catalog (UC). To use Lakehouse Monitoring, your workspace must be UC-enabled, and you must have access to Databricks SQL.

Users can configure a table to be monitored using the Databricks UI or the API. Lakehouse Monitoring provides the following types of analysis: 

  • Time series profile: Use for tables that contain a time series dataset based on a timestamp column. Monitoring computes data quality metrics across time-based windows of the time series.
  • Snapshot profile: Use for any managed or external Delta table. Monitoring computes data quality metrics over all data in the table.
  • Inference profile: Use for a table containing predicted values output by a machine learning classification or regression model. This table includes a timestamp, a model ID, model inputs (features), a column containing model predictions, and optional columns containing unique observation IDs and ground truth labels. It can also contain metadata, such as demographic information, which is not used as input to the model but might be helpful for fairness and bias investigations or other monitoring.

A table monitor produces two tables and a dashboard automatically.

  • The profile metrics table contains summary statistics.
  • The drift metrics table includes statistics about the data’s drift over time.

Refer to the official documentation (AWS | Azure) to learn how to use Lakehouse Monitoring.
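
Monitors can also be created programmatically. The sketch below uses the Databricks SDK for Python and assumes a recent SDK version that exposes the quality_monitors API; the catalogue, schema, table, and directory names are placeholders, and exact class and argument names may differ between SDK releases.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorTimeSeries

w = WorkspaceClient()  # picks up notebook or environment authentication

# Create a time series profile monitor on the (hypothetical) transactions table.
w.quality_monitors.create(
    table_name="main.loyalty.transactions",
    assets_dir="/Workspace/Shared/lakehouse_monitoring/transactions",
    output_schema_name="main.loyalty",  # schema where the metrics tables are written
    time_series=MonitorTimeSeries(
        timestamp_col="transaction_ts",
        granularities=["1 day"],
    ),
)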

You can also configure custom metrics, which is useful when you want to create measures that capture specific aspects of your business logic. Three types of custom metrics are supported:

  • Aggregate metrics: These are calculated based on columns in the primary table and are stored in the profile metrics table.
  • Derived metrics: These are calculated based on previously computed aggregate metrics and do not directly use data from the primary table. They are also stored in the profile metrics table.
  • Drift metrics: These compare previously computed aggregate or derived metrics from two different time windows or between the primary and baseline tables. They are stored in the drift metrics table.

For example, suppose you want to track the average price of orders in your orders table.

[Screenshot: defining the avg_price custom metric for the average order price]

Let’s take another example: suppose you want to track delayed orders (orders that are still in an open state 30 days after their creation date):

[Screenshot: defining the delayed_orders custom metric]
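
Both custom metrics above can also be expressed programmatically when a monitor is created or updated through the SDK. The sketch below is an assumption-laden illustration: it uses the TPC-H style column names o_totalprice, o_orderstatus, and o_orderdate, treats status 'O' as an open order, and relies on the MonitorMetric class of the Databricks SDK, whose field names may vary between releases.

from pyspark.sql import types as T
from databricks.sdk.service.catalog import MonitorMetric, MonitorMetricType

custom_metrics = [
    # Aggregate metric: average order price per monitoring window.
    MonitorMetric(
        type=MonitorMetricType.CUSTOM_METRIC_TYPE_AGGREGATE,
        name="avg_price",
        input_columns=[":table"],
        definition="avg(o_totalprice)",
        output_data_type=T.StructField("avg_price", T.DoubleType()).json(),
    ),
    # Aggregate metric: orders still open 30 days after their creation date.
    MonitorMetric(
        type=MonitorMetricType.CUSTOM_METRIC_TYPE_AGGREGATE,
        name="delayed_orders",
        input_columns=[":table"],
        definition="count_if(o_orderstatus = 'O' AND o_orderdate < current_date() - INTERVAL 30 DAYS)",
        output_data_type=T.StructField("delayed_orders", T.LongType()).json(),
    ),
]

The list would then be passed to the monitor via the custom_metrics argument of quality_monitors.create or quality_monitors.update.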

Once the metrics are saved, they will appear in the Lakehouse Monitoring tab as shown below:

[Screenshot: the saved custom metrics listed in the Lakehouse Monitoring tab]

In addition to various other metrics, the custom metrics are stored in the Profile metrics table.

[Screenshot: the custom metric columns in the profile metrics table]
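
Once a refresh has run, the profile metrics table can be queried like any other Delta table. A hedged example, assuming the window struct and the custom metric columns shown above:

# Inspect the custom metrics per monitoring window (most recent first).
display(
    spark.sql("""
      SELECT window, avg_price, delayed_orders
      FROM malay_demo.tpch.orders_profile_metrics
      WHERE avg_price IS NOT NULL
      ORDER BY window.start DESC
    """)
)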

Note: If you have a large number of tables to monitor, you can use the API to efficiently enable monitoring across multiple tables programmatically.
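
As a sketch of that approach, the same SDK call can simply be looped over a list of tables; the snapshot profile, table names, and paths below are placeholders.

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.catalog import MonitorSnapshot

w = WorkspaceClient()

tables_to_monitor = [
    "main.loyalty.customers",
    "main.loyalty.transactions",
    "main.loyalty.rewards",
]

# Enable a snapshot profile monitor on every table in the list.
for table_name in tables_to_monitor:
    w.quality_monitors.create(
        table_name=table_name,
        assets_dir=f"/Workspace/Shared/lakehouse_monitoring/{table_name}",
        output_schema_name="main.loyalty",
        snapshot=MonitorSnapshot(),
    )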

By implementing Lakehouse Monitoring in Databricks, the retailer can continuously track the quality of their customer transaction data. Lakehouse Monitoring allows the retailer to set up automated profiling and drift analysis on their transaction tables. This setup helps detect changes in data distribution, such as unexpected spikes or drops in transaction volumes or inconsistencies in spending patterns.

Our retailer also experiences occasional data quality issues, such as unexpectedly low transaction numbers recorded in their Gold layer tables due to upstream system errors or data ingestion failures. These issues can lead to incomplete customer profiles and inaccurate reward calculations, affecting customer trust and engagement.

Setting up Alerts

Now that your metrics table is prepared, you can configure alerts to notify you of potential errors, helping to mitigate downstream risks. For instance, you can configure an alert to receive notifications if the percentage of nulls or zeros surpasses a defined threshold or changes over time.

Let's look at an example of configuring alerts for our custom metrics, such as avg_price and delayed_orders.

  • Create a SQL query and name it “delayed_order_count.”
    SELECT COUNT(delayed_orders) FROM malay_demo.tpch.orders_profile_metrics;
  • Set up the alert with the following configuration:
     

[Screenshot: alert configuration for the delayed_order_count query]

In the same way, you can configure alerts for key metrics to take proactive steps in maintaining the overall health and reliability of your data. To visualise these metrics, Lakehouse Monitoring offers a customisable, out-of-the-box dashboard.

By using Databricks SQL alerts, the retailer can set up automated notifications to monitor key metrics related to transaction data quality.

Unfortunately, the Retailer discovers discrepancies in customer purchase data, leading to incorrect reward calculations and customer dissatisfaction. These discrepancies arise from errors during data integration from multiple sources, making it difficult to pinpoint where the issue originated.

Data Lineage

Data lineage is integral to data governance, providing valuable insights that indirectly support data quality management. Through the Unity Catalog, Databricks offers detailed data lineage tracking, capturing data flow across queries, notebooks, jobs, and dashboards. This comprehensive view extends to the column level, allowing users to see how data is transformed and utilised across various processes within the organisation.

  • Data lineage offers a transparent view of the data's journey from its origin to its final destination, detailing its transformations, the systems it flows through, and the business processes it supports. This traceability helps pinpoint where errors may have been introduced into the data.
  • It enables you to identify downstream objects affected by the quality issue.

While data lineage does not directly monitor or ensure data quality, it is instrumental in identifying and resolving issues that may affect it. By tracing the path of data from its origin to its final destination, lineage helps pinpoint where errors or inconsistencies may have been introduced. This capability enables data teams to quickly diagnose problems and implement corrective actions, thereby maintaining the integrity and reliability of the data.

Additionally, understanding data lineage aids in impact analysis when changes are made to data sources or processing workflows. It ensures that modifications do not negatively affect downstream applications or analyses, thus preserving data quality. Moreover, lineage provides transparency and accountability, crucial for compliance with regulatory requirements and internal standards.

[Screenshot: data lineage graph in Unity Catalog]

By utilising data lineage through Databricks' Unity Catalog, our example retailer can trace the entire journey of customer data from its source to its final destination. This capability allows them to easily and accurately identify where errors or inconsistencies were introduced during data transformations or integrations.
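
Beyond the lineage graph in Catalog Explorer, lineage can also be queried programmatically from the lineage system tables. The sketch below assumes the system.access.table_lineage system table is enabled for your account and uses a placeholder table name.

# Upstream and downstream tables recorded for the customer table, most recent events first.
lineage = spark.sql("""
  SELECT source_table_full_name, target_table_full_name, entity_type, event_time
  FROM system.access.table_lineage
  WHERE source_table_full_name = 'main.loyalty.customers'
     OR target_table_full_name = 'main.loyalty.customers'
  ORDER BY event_time DESC
""")
display(lineage)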

Conclusion

In conclusion, ensuring robust data quality is critical for maintaining your data ecosystem's integrity, reliability, and accuracy. Leveraging capabilities such as schema enforcement, constraints, Lakehouse Monitoring, customisable dashboards, alerts, and data lineage empowers organisations to proactively detect, address, and prevent data issues before they impact downstream processes. Adopting a comprehensive approach to data quality management can build trust in your data, improve decision-making, and enhance overall business outcomes.