alan_mazan
Databricks Employee

Data cleaning is an essential preprocessing step in preparing data for machine learning. Data quality directly impacts model performance, and cleaning ensures that the data is accurate, consistent, complete, free of outliers, and in a format suitable for the algorithms. Below, we explore critical data cleaning techniques, each followed by a practical example in Pandas, to give data practitioners a quick hands-on reference.


Before we begin

  • Run the code below in a notebook to follow along. You will need Pandas, NumPy, scikit-learn and SciPy. On Databricks you can simply use a Databricks Runtime for Machine Learning.
  • To build scalable data pipelines on large datasets, I recommend implementing these techniques with PySpark. You can also use the pandas API on Spark, which keeps the familiar Pandas syntax; see the short sketch after this list.
  • Some techniques involve data transformations. Transformed variables lose their original meaning, so results such as linear regression coefficients can no longer be interpreted directly. When a transformed variable is used as the target of a regression, applying the inverse transformation to the predictions makes them interpretable again.
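
A minimal sketch of the distributed route with the pandas API on Spark, assuming a Spark session is available (as it is by default in Databricks notebooks); the column names are purely illustrative:

import pyspark.pandas as ps

# Build a pandas-on-Spark DataFrame; the familiar Pandas-style API runs distributed on Spark
psdf = ps.DataFrame({'A': [1.0, None, 4.0], 'B': ['a', None, 'c']})

# The same cleaning calls as in Pandas, e.g. imputing missing values
psdf_filled = psdf.fillna({'A': 0, 'B': 'NA'})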

 

Handling Missing Data

Several machine learning models assume that there is no missing data. Some strategies to cope with this include:

  • Dropping rows with missing values: Removing any row that contains missing data. When many values are missing, this can significantly reduce the number of available rows.
  • Removing sparse columns: Columns with excessive missing data (e.g. more than 50% missing) are often removed.
  • Imputation: Filling in missing values with appropriate substitutes. For numeric columns this is often the mean, median or 0, while for categorical columns it is often the mode (most common value) or a placeholder category such as "NA".

Example: Handling Missing Data

import pandas as pd
import numpy as np

# Sample data with missing values
data = {'A': [1, 2, np.nan, 4],
        'B': [5, np.nan, np.nan, 8],
        'C': ['a', 'b', 'c', np.nan],
        'D': [1, np.nan, np.nan, np.nan]}

df = pd.DataFrame(data)

# Drop rows with missing values
df_row_dropped = df.dropna()

# Impute missing values with 0
df_filled = df.fillna(0)

# Identify columns with more than 50% missing values (sparse columns)
threshold = 0.5 * len(df)
sparse_columns = df.columns[df.isnull().sum() > threshold]

# Drop sparse columns
df_col_dropped = df.drop(columns=sparse_columns)
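
The example above imputes every missing value with 0. As noted in the list, mean, median or mode imputation is often preferable; a minimal sketch on the same df:

# Impute numeric columns with the column mean or median
df_imputed = df.copy()
df_imputed['A'] = df_imputed['A'].fillna(df_imputed['A'].mean())
df_imputed['B'] = df_imputed['B'].fillna(df_imputed['B'].median())

# Impute the categorical column with its mode (most common value)
df_imputed['C'] = df_imputed['C'].fillna(df_imputed['C'].mode()[0])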

 

Handling Outliers

Outliers can distort model performance, leading to skewed predictions, especially in models sensitive to the data distribution (e.g., linear regression). Outliers can be single-dimensional or multi-dimensional: a multi-dimensional outlier is a data point that may not be an outlier in any single dimension but is an outlier when combinations of features are considered. Common strategies to address outliers:

  • Removing outliers: Using statistical methods like Z-scores, the interquartile range (IQR), or the Mahalanobis distance (multi-dimensional) to identify outliers for removal.
  • Transforming the dataset: Rescaling the data, e.g. to a logarithmic scale, to reduce the overall impact of outliers.

Example 1: Handling Outliers with Z-Scores

Z-scores are best used when the data follows a normal distribution and when outliers are expected to be few and not too extreme. They are good at identifying subtle outliers.

from scipy import stats
import pandas as pd
import numpy as np

# Sample data
data = {'TotalCharges': [100, 200, 3000, 400, 5000, 600, 7000, 800, 100000]}
df = pd.DataFrame(data)

# Calculate Z-scores
df['Z_Score'] = np.abs(stats.zscore(df['TotalCharges']))

# Filter out outliers: keep rows with |Z| below the threshold (3 is a common cut-off; 2 is used here for this small sample)
df_no_outliers = df[df['Z_Score'] < 2]

Example 2: Handling Outliers with Interquartile Range (IQR)

The IQR method is robust, so it handles non-normal data and many extreme outliers well. It is less suited to identifying less extreme values that are still anomalous.

import pandas as pd

# Sample data
data = {'TotalCharges': [100, 200, 3000, 400, 5000, 600, 7000, 800, 100000]}

df = pd.DataFrame(data)

# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['TotalCharges'].quantile(0.25)
Q3 = df['TotalCharges'].quantile(0.75)

# Calculate IQR
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Filter out outliers
df_no_outliers = df[(df['TotalCharges'] >= lower_bound) & (df['TotalCharges'] <= upper_bound)]

Example 3: Handling Multi-Dimensional Outliers Using Mahalanobis Distance

For multivariate data, the Mahalanobis distance can be a suitable detection technique. It is more complex and relies on inverting the covariance matrix, which can make it computationally expensive.

import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis
from numpy.linalg import inv

# Sample data with two features
data = {'Feature1': [1, 2, 3, 4, 5, 6],
        'Feature2': [5, 4, 3, 2, 1, 6]}

df = pd.DataFrame(data)

# Calculate the mean of the features
mean_vector = df.mean()

# Calculate the covariance matrix of the features
cov_matrix = np.cov(df.T)

# Invert the covariance matrix
inv_cov_matrix = inv(cov_matrix)

# Function to calculate Mahalanobis distance
def calculate_mahalanobis(row, mean_vector, inv_cov_matrix):
    diff = row - mean_vector
    return mahalanobis(diff, np.zeros_like(diff), inv_cov_matrix)

# Apply the Mahalanobis distance function to each row
df['Mahalanobis_Distance'] = df.apply(lambda row: calculate_mahalanobis(row[:2], mean_vector, inv_cov_matrix), axis=1)

# Set a threshold for identifying outliers
threshold = 2  # A fixed value of 2 keeps this small example simple

# Flag points with a Mahalanobis distance greater than the threshold as outliers
df['Outlier'] = df['Mahalanobis_Distance'] > threshold

Example 4: Transforming Outliers with Log Transformation

This technique does not remove outliers but transforms the entire variable to make the distribution more normal-like. It allows keeping outliers while reducing their negative impact, which makes it suitable for heavily skewed data. Since the log is undefined for negative values and tends to minus infinity at 0, a constant may need to be added to make the variable strictly positive.

import pandas as pd
import numpy as np

# Sample data
data = {'TotalCharges': [-10, 100, 200, 3000, 400, 5000, 600, 7000, 800, 100000]}

df = pd.DataFrame(data)

# Log transform 'TotalCharges' to reduce the effect of outliers
# (adding 11 shifts the minimum value of -10 up to 1 so the log is defined)
df['Log_TotalCharges'] = np.log(df['TotalCharges'] + 11)
df['Original_TotalCharges'] = round(np.exp(df['Log_TotalCharges'])) - 11  # Invert the transformation to recover the original values

 

Handling Duplicates

Duplicate records can introduce bias or errors in model training. Duplicates often arise from repeated entries in data collection. This is usually addressed by deduplicating data.

Example: Handling Duplicates

import pandas as pd

# Sample data with duplicates
data = {'A': [1, 2, 2, 4],
        'B': [5, 6, 6, 8],
        'C': ['a', 'b', 'b', 'd']}

df = pd.DataFrame(data)

# Identify duplicate rows
duplicates = df.duplicated()  # Pass subset=['A', 'B'] to check for duplicates on specific columns only

# Remove duplicate rows, keeping the first occurrence
df_no_duplicates = df.drop_duplicates()

 

Handling Data Inconsistency

Inconsistent data occurs when the same information is represented differently across records. This can include inconsistent capitalisation, abbreviations, or formatting (e.g., 'NY' vs. 'New York'). To ensure consistency:

  • Standardise formats: Ensure each value uses a consistent representation (e.g. consistent casing and spelling).
  • Correct inconsistencies: Map variants and abbreviations to a single canonical form, using methods like string normalisation.

Example: Handling Data Inconsistency

import pandas as pd

# Sample data with inconsistent representations
data = {'City': ['New York', 'new york', 'NY', 'nyc']}
df = pd.DataFrame(data)

# Standardize the 'City' column by converting all values to lowercase
df['City_Standardized'] = df['City'].str.lower()

# Replace abbreviations with full names
df['City_Standardized'] = df['City_Standardized'].replace({'ny': 'new york', 'nyc': 'new york'})

In data engineering it is good practice to define data quality expectations on raw data to test for inconsistencies, as sketched below. One should also monitor quality statistics on feature tables to identify more subtle data quality issues. For a deep dive on data quality, I recommend reading Data Quality Management on Databricks.
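
As an illustration, a minimal expectation check in Pandas (the allowed set of values is hypothetical); production pipelines are better served by dedicated tooling such as Delta Live Tables expectations:

# Simple expectation: every standardised city must come from an allowed set
allowed_cities = {'new york'}
violations = ~df['City_Standardized'].isin(allowed_cities)
assert violations.sum() == 0, f'{violations.sum()} rows violate the city expectation'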

 

Correcting Data Types

Data often gets stored in incorrect formats, which can hinder model training. Correcting data types is essential for both memory optimisation and ensuring compatibility with algorithms. Typical corrections involve casting string dates and times to timestamps, string numbers to numerics or binary integers to booleans.

Example: Correcting Data Types

import pandas as pd

# Sample data with incorrect data types
data = {'Age': ['26', '32', '22'], 'DateOfBirth': ['1999-05-21', '1992-02-14', '2002-07-30']}

df = pd.DataFrame(data)

# Convert 'Age' from string to numeric
df['Age'] = pd.to_numeric(df['Age'])

# Convert 'DateOfBirth' from string to datetime
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])
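
The typical corrections mentioned above also include casting binary integers to booleans; a minimal sketch with a hypothetical SeniorCitizen flag (not part of the sample data):

# Hypothetical 0/1 flag loaded as integers
df['SeniorCitizen'] = [0, 1, 0]

# Convert the binary integer flag to boolean type
df['SeniorCitizen'] = df['SeniorCitizen'].astype(bool)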

 

Encoding Categorical Features

Machine learning algorithms typically require numerical inputs, so categorical features must be encoded. Two common encoding methods include:

  • One-hot encoding: Converts each category into a new binary column. It is typically used when there is no inherent ordering between the categories and when there is a small number of categories. With many categories the dataset can quickly suffer from high-dimensionality.
  • Label encoding: Assigns a unique integer to each category. This is more suitable when there are many categories. However, it imposes a numerical relationship onto the categories which can affect interpretability and predictive performance.

Example: One-Hot and Label Encoding

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sample data
data = {'Category': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)

# One-hot encode categorical features
df_one_hot = pd.get_dummies(df, columns=['Category'], drop_first=True)

# Label encode categorical features
label_encoder = LabelEncoder()
df['Category_LabelEncoded'] = label_encoder.fit_transform(df['Category'])

 

Standardisation and Normalisation

Standardisation and normalisation are techniques used to adjust the scale of features, ensuring that all features contribute equally to the model.

  • Standardisation scales data so that it has a mean of 0 and a standard deviation of 1. It's often used in algorithms that assume normally distributed data (e.g., linear regression, SVMs).
  • Normalisation rescales the data to a fixed range, typically between 0 and 1. It's commonly used when the goal is to ensure that all features are on the same scale but not necessarily normally distributed.

Example: Standardisation and Normalisation

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Sample data
data = {'Feature1': [100, 200, 300], 'Feature2': [4000, 5000, 6000]}
df = pd.DataFrame(data)

# Standardization (mean = 0, standard deviation = 1)
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Normalization (range 0 to 1)
normalizer = MinMaxScaler()
df_normalized = pd.DataFrame(normalizer.fit_transform(df), columns=df.columns)

 

Further learning

The methods presented cover the most common data cleaning scenarios. I recommend diving deeper into these topics as they become relevant to your work. Data cleaning is the first step of robust data preprocessing, which as a whole encompasses additional topics such as:

  • Feature engineering
  • Data splitting, cross-validation and bootstrapping
  • Feature selection
  • Dimension reduction
  • Testing the assumptions of a model, e.g. the absence of collinearity in linear regression
  • Handling imbalanced data

After extensive data preprocessing it is good practice to save the resulting clean machine learning features in a feature store. A feature store provides a centralised repository for managing, sharing, and serving machine learning features, ensuring consistency between training and inference while enabling feature reuse and governance across teams and models.

Some good resources for further learning that I can recommend are: