Data cleaning is an essential preprocessing step when preparing data for machine learning. Data quality directly impacts model performance: cleaning ensures that the data is accurate, consistent, complete, free of problematic outliers, and in a format suitable for algorithms. Below, we explore critical data cleaning techniques, each followed by a practical example in Pandas, to give data practitioners a quick practical reference.
Several machine learning models assume that there is no missing data. Common strategies to cope with missing values include dropping affected rows, imputing values, and dropping sparse columns, as shown below:
import pandas as pd
import numpy as np
# Sample data with missing values
data = {'A': [1, 2, np.nan, 4],
'B': [5, np.nan, np.nan, 8],
'C': ['a', 'b', 'c', np.nan],
'D': [1, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data)
# Drop rows with missing values
df_row_dropped = df.dropna()
# Impute missing values with 0
df_filled = df.fillna(0)
# Identify columns with more than 50% missing values (sparse columns)
threshold = 0.5 * len(df)
sparse_columns = df.columns[df.isnull().sum() > threshold]
# Drop sparse columns
df_col_dropped = df.drop(columns=sparse_columns)
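A further common strategy, not shown above, is to impute numeric columns with a statistic such as the median; the following is a minimal sketch on the same df:
# Impute numeric columns with their median (more robust to outliers than the mean)
numeric_cols = df.select_dtypes(include='number').columns
df_median_filled = df.copy()
df_median_filled[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())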
Outliers can distort model performance, leading to skewed predictions, especially in models sensitive to the data distribution (e.g., linear regression). Outliers can be single-dimensional or multi-dimensional. A multi-dimensional outlier is a data point that may not be an outlier in any single dimension but is an outlier when considering combinations of features. Common strategies to address outliers:
Z-scores are best used when the data follows a normal distribution and when outliers are expected to be neither too numerous nor too extreme. They are good at identifying subtle outliers.
from scipy import stats
import pandas as pd
import numpy as np
# Sample data
data = {'TotalCharges': [100, 200, 3000, 400, 5000, 600, 7000, 800, 100000]}
df = pd.DataFrame(data)
# Calculate Z-scores
df['Z_Score'] = np.abs(stats.zscore(df['TotalCharges']))
# Filter out outliers
df_no_outliers = df[df['Z_Score'] < 2]
The IQR method is robust, so it handles non-normal data and many extreme outliers well. It is less suitable for identifying less extreme values that are still anomalous.
import pandas as pd
# Sample data
data = {'TotalCharges': [100, 200, 3000, 400, 5000, 600, 7000, 800, 100000]}
df = pd.DataFrame(data)
# Calculate Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['TotalCharges'].quantile(0.25)
Q3 = df['TotalCharges'].quantile(0.75)
# Calculate IQR
IQR = Q3 - Q1
# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Filter out outliers
df_no_outliers = df[(df['TotalCharges'] >= lower_bound) & (df['TotalCharges'] <= upper_bound)]
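Instead of dropping rows, another common option is to cap (winsorise) values at the IQR bounds computed above; a brief sketch:
# Cap values at the IQR bounds instead of removing the rows
df['TotalCharges_Capped'] = df['TotalCharges'].clip(lower=lower_bound, upper=upper_bound)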
For multivariate data, the Mahalanobis distance can be a suitable detection technique. It is more complex and relies on inverting a covariance matrix, which can make it computationally expensive.
import numpy as np
import pandas as pd
from scipy.spatial.distance import mahalanobis
from numpy.linalg import inv
# Sample data with two features
data = {'Feature1': [1, 2, 3, 4, 5, 6],
'Feature2': [5, 4, 3, 2, 1, 6]}
df = pd.DataFrame(data)
# Calculate the mean of the features
mean_vector = df.mean()
# Calculate the covariance matrix of the features
cov_matrix = np.cov(df.T)
# Invert the covariance matrix
inv_cov_matrix = inv(cov_matrix)
# Function to calculate Mahalanobis distance
def calculate_mahalanobis(row, mean_vector, inv_cov_matrix):
    diff = row - mean_vector
    return mahalanobis(diff, np.zeros_like(diff), inv_cov_matrix)
# Apply the Mahalanobis distance function to each row
df['Mahalanobis_Distance'] = df.apply(lambda row: calculate_mahalanobis(row[:2], mean_vector, inv_cov_matrix), axis=1)
# Set a threshold for identifying outliers
threshold = 2  # chosen here for simplicity; see the chi-squared-based alternative below
# Flag points with a Mahalanobis distance greater than the threshold as outliers
df['Outlier'] = df['Mahalanobis_Distance'] > threshold
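A more principled threshold, not used above, can be derived from the chi-squared distribution, since squared Mahalanobis distances of roughly multivariate-normal data follow a chi-squared distribution with degrees of freedom equal to the number of features; a sketch reusing the DataFrame from above:
from scipy.stats import chi2
# 97.5th-percentile cutoff for 2 features; compare against the (non-squared) distance
chi2_threshold = np.sqrt(chi2.ppf(0.975, 2))
df['Outlier_Chi2'] = df['Mahalanobis_Distance'] > chi2_threshold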
This technique does not remove outliers but transforms the entire variable to make its distribution more normal-like. It keeps the outliers while reducing their negative impact, which makes it suitable for heavily skewed data. Since the logarithm is undefined for negative values and tends to negative infinity at 0, a constant may need to be added to make the variable strictly positive.
import pandas as pd
import numpy as np
# Sample data
data = {'TotalCharges': [-10, 100, 200, 3000, 400, 5000, 600, 7000, 800, 100000]}
df = pd.DataFrame(data)
# Log transform 'TotalCharges'; the shift of 11 maps the minimum value (-10) to 1 so the log is defined
df['Log_TotalCharges'] = np.log(df['TotalCharges'] + 11)
# Invert the transform to recover the original values
df['Original_TotalCharges'] = round(np.exp(df['Log_TotalCharges'])) - 11
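NumPy's log1p/expm1 pair, not used above, offers a convenient alternative that handles zero and is numerically stable for small values; negative values still need a shift, as in this sketch:
# Shift so the smallest value (-10) becomes 0, then apply log1p (log(1 + x))
shift = abs(df['TotalCharges'].min())
df['Log1p_TotalCharges'] = np.log1p(df['TotalCharges'] + shift)
# Invert the transform to recover the original values
df['Recovered_TotalCharges'] = np.expm1(df['Log1p_TotalCharges']) - shift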
Duplicate records can introduce bias or errors in model training. Duplicates often arise from repeated entries in data collection. This is usually addressed by deduplicating data.
import pandas as pd
# Sample data with duplicates
data = {'A': [1, 2, 2, 4],
'B': [5, 6, 6, 8],
'C': ['a', 'b', 'b', 'd']}
df = pd.DataFrame(data)
# Identify duplicate rows
duplicates = df.duplicated() # df.duplicated(subset=['A', 'B'])
# Remove duplicate rows, keeping the first occurrence
df_no_duplicates = df.drop_duplicates()
Inconsistent data occurs when the same information is represented differently across records. This can include inconsistent capitalisation, abbreviations, or formatting (e.g., 'NY' vs. 'New York'). To ensure consistency:
import pandas as pd
# Sample data with inconsistent representations
data = {'City': ['New York', 'new york', 'NY', 'nyc']}
df = pd.DataFrame(data)
# Standardize the 'City' column by converting all values to lowercase
df['City_Standardized'] = df['City'].str.lower()
# Replace abbreviations with full names
df['City_Standardized'] = df['City_Standardized'].replace({'ny': 'new york', 'nyc': 'new york'})
In data engineering it is good practice to define data quality expectations for raw data and test it for inconsistencies. One should also monitor quality statistics for feature tables to identify more subtle data quality issues. For a deep dive on data quality I recommend reading Data Quality Management on Databricks.
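As a lightweight illustration (a sketch expressed directly in Pandas rather than a dedicated framework; the column names are hypothetical), such expectations can be written as simple assertions:
import pandas as pd
# Hypothetical raw data
raw = pd.DataFrame({'customer_id': [1, 2, 3], 'TotalCharges': [100.0, 250.5, 80.0]})
# Expectation 1: the key column must be non-null and unique
assert raw['customer_id'].notna().all(), "customer_id contains nulls"
assert raw['customer_id'].is_unique, "customer_id contains duplicates"
# Expectation 2: charges must be non-negative
assert (raw['TotalCharges'] >= 0).all(), "TotalCharges contains negative values"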
Data often gets stored in incorrect formats, which can hinder model training. Correcting data types is essential both for memory optimisation and for ensuring compatibility with algorithms. Typical corrections involve casting string dates and times to timestamps, string numbers to numeric types, or binary integer flags to booleans.
import pandas as pd
# Sample data with incorrect data types
data = {'Age': ['26', '32', '22'], 'DateOfBirth': ['1999-05-21', '1992-02-14', '2002-07-30']}
df = pd.DataFrame(data)
# Convert 'Age' from string to numeric
df['Age'] = pd.to_numeric(df['Age'])
# Convert 'DateOfBirth' from string to datetime
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'])
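Correct types also unlock the memory savings mentioned above; the sketch below downcasts the numeric column and casts a hypothetical low-cardinality string column to the 'category' dtype:
# Downcast 'Age' to the smallest integer subtype that fits the values
df['Age'] = pd.to_numeric(df['Age'], downcast='integer')
# Hypothetical low-cardinality string column; the 'category' dtype stores each distinct value only once
df['Contract'] = ['Month-to-month', 'One year', 'Month-to-month']
df['Contract'] = df['Contract'].astype('category')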
Machine learning algorithms typically require numerical inputs, so categorical features must be encoded. Two common encoding methods are one-hot encoding and label encoding:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Sample data
data = {'Category': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)
# One-hot encode categorical features
df_one_hot = pd.get_dummies(df, columns=['Category'], drop_first=True)
# Label encode categorical features
label_encoder = LabelEncoder()
df['Category_LabelEncoded'] = label_encoder.fit_transform(df['Category'])
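For production pipelines, scikit-learn's OneHotEncoder is often preferred over pd.get_dummies because it can be fitted on training data and ignore categories unseen at inference time; a brief sketch (assuming scikit-learn >= 1.2 for the sparse_output argument):
from sklearn.preprocessing import OneHotEncoder
# Encoder that silently ignores categories not seen during fitting
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoded = ohe.fit_transform(df[['Category']])
df_one_hot_sklearn = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Category']))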
Standardisation and normalisation are techniques used to adjust the scale of features, ensuring that all features contribute equally to the model.
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Sample data
data = {'Feature1': [100, 200, 300], 'Feature2': [4000, 5000, 6000]}
df = pd.DataFrame(data)
# Standardization (mean = 0, standard deviation = 1)
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Normalization (range 0 to 1)
normalizer = MinMaxScaler()
df_normalized = pd.DataFrame(normalizer.fit_transform(df), columns=df.columns)
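One caveat worth adding: scalers should be fitted on the training split only and then applied to the test split to avoid data leakage; a short sketch with a hypothetical split:
from sklearn.model_selection import train_test_split
# Hypothetical train/test split; the scaler learns its statistics from the training rows only
train, test = train_test_split(df, test_size=0.33, random_state=42)
scaler = StandardScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)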
The methods presented cover the most common data cleaning scenarios. I recommend diving deeper into these topics as they become relevant to your work. Data cleaning is the first step in robust data preprocessing, which encompasses additional topics beyond the scope of this article.
After extensive data preprocessing it is good practice to save the resulting clean machine learning features in a feature store. A feature store provides a centralised repository for managing, sharing, and serving machine learning features, ensuring consistency between training and inference while enabling feature reuse and governance across teams and models.