
Build a machine learning model to detect fraudulent transactions using PySpark's MLlib library

MichTalebzadeh
Contributor III

Introduction

Financial fraud is a significant concern for businesses and consumers alike. I have written about this concern a few times in LinkedIn articles. Machine learning offers powerful tools to combat this issue by automatically identifying suspicious transactions. This article explores how a Random Forest classifier can be implemented to detect fraudulent activities within transactional data.

I will go into the steps involved in building this fraud detection model, including data pre-processing, model training, and performance evaluation. By analysing features like transaction amount, merchant, and category, the model learns to distinguish between legitimate and fraudulent transactions. The effectiveness of the model will be measured using Area Under the ROC Curve (AUC), a metric that indicates the model's ability to differentiate between fraudulent and valid transactions.

The Approach: Building the Fraud Detection Model

In this section we discuss the specific tools and techniques used to construct the fraud detection model and explore how each artefact contributes to the overall process.

1) Data Generation with Faker:

Function: We used Faker, a Python library that generates realistic synthetic data, to create 1000 synthetic transactions for this fraud detection example.

Role in Fraud Detection: In this scenario, Faker can be used to create a large dataset of simulated transactions with varying characteristics. This dataset can be helpful for training and evaluating the machine learning model without relying on real, potentially sensitive financial data.

Benefits: Using Faker allows for controlled data generation, enabling the creation of specific fraud scenarios and exploration of diverse transaction patterns.
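As a minimal sketch of what Faker produces, the snippet below generates a single synthetic transaction. The field choices (a company name as the merchant, a random word as the category) are illustrative; the full generation script used for this article is in the Appendix.

# Minimal sketch: one synthetic transaction generated with Faker.
# Field choices are illustrative; the full script is in the Appendix.
from faker import Faker

fake = Faker()
transaction = (
    1,                                                  # transaction_id
    fake.random_number(digits=4, fix_len=True) / 100,   # amount with two decimal places
    fake.company(),                                     # merchant name
    fake.word(),                                        # category
    fake.boolean()                                      # is_fraud flag
)
print(transaction)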

2) Data Processing with Apache Spark:

Function: Apache Spark is a powerful open-source framework for large-scale data processing.

Role in Fraud Detection: Spark is well-suited for handling the potentially vast amount of transaction data often encountered in fraud detection systems. It efficiently processes, transforms, and analyzes the data to prepare it for machine learning tasks.

Spark DataFrame: This distributed data structure within Spark stores and manages the transactional data, facilitating manipulation and analysis.

Benefits: Spark's distributed processing capabilities enable efficient handling of large datasets, making it a valuable tool for real-world fraud detection systems.
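To make this concrete, here is a small, hedged sketch of working with a Spark DataFrame of transactions. The rows are made up and the column names simply mirror those used later in this article; the point is that filters and aggregations run as distributed operations regardless of data volume.

# Illustrative sketch only: a tiny in-memory DataFrame of transactions.
# Rows are invented; column names mirror those used later in this article.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataFrame sketch").getOrCreate()

rows = [
    (1, 35.15, "Sullivan Inc", "better", False),
    (2, 99.22, "Farrell-Cruz", "author", True),
]
df = spark.createDataFrame(rows, ["transaction_id", "amount", "merchant", "category", "is_fraud"])

# Simple distributed transformations: filter and aggregate
df.filter(F.col("amount") > 50).show()
df.groupBy("is_fraud").agg(F.avg("amount").alias("avg_amount")).show()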

3) Machine Learning with Spark MLlib:

Function: Spark MLlib is Spark's machine learning library, providing tools for building and deploying various machine learning models.

Role in Fraud Detection: In this application, Spark MLlib is used to implement the Random Forest classification algorithm. The model analyses transaction features like amount, merchant, and category, learning to distinguish fraudulent activities from legitimate transactions.

StringIndexer and OneHotEncoder: These tools pre-process categorical data (e.g., merchant names, categories) by converting them into numerical representations suitable for machine learning algorithms (a short worked sketch follows this list).

Random Forest Classifier: This supervised learning algorithm is trained on labelled data (transactions identified as fraudulent or legitimate) to learn patterns that differentiate fraudulent activities.

Benefits: Spark MLlib offers a robust and scalable environment for training machine learning models on large datasets, making it ideal for building fraud detection systems.
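Before moving on, here is a short, hedged sketch of what StringIndexer and OneHotEncoder actually do on a toy DataFrame (the merchant names are invented). The same pattern is applied to the full dataset, inside a Pipeline, in the code later in this article.

# Hedged sketch: how StringIndexer and OneHotEncoder turn categorical text
# into numeric features. Merchant names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.appName("Encoding sketch").getOrCreate()
df = spark.createDataFrame(
    [("Sullivan Inc",), ("Hayes Group",), ("Sullivan Inc",)], ["merchant"]
)

# StringIndexer assigns an index per distinct value (most frequent gets 0)
indexed = StringIndexer(inputCol="merchant", outputCol="merchant_index").fit(df).transform(df)

# OneHotEncoder turns the index into a sparse 0/1 vector
encoded = OneHotEncoder(inputCol="merchant_index", outputCol="merchant_encoded").fit(indexed).transform(indexed)
encoded.show(truncate=False)
# e.g. "Sullivan Inc" -> index 0.0 -> (1,[0],[1.0]); "Hayes Group" -> index 1.0 -> (1,[],[])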

By combining these artifacts, we will construct a comprehensive fraud detection system that leverages synthetic data generation, large-scale data processing, and machine learning for effective fraud identification.

Key Steps: Building the Fraud Detection Pipeline

This section outlines the essential steps involved in constructing the fraud detection model using machine learning. Each step plays a crucial role in preparing, training, and evaluating the model for effective fraud identification.

Now, let us add inline comments to the code for better understanding of each step.

# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer, OneHotEncoder
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.functions import when

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Fraud Detection") \
    .getOrCreate()

# Set the log level to ERROR to reduce verbosity
sc = spark.sparkContext
sc.setLogLevel("ERROR")

# Read text data into a Spark DataFrame
DIRECTORY="/d4T/hduser/genai"
transaction_file=f"file://{DIRECTORY}/transactions.txt"
# Define the schema of the DataFrame
schema = "transaction_id INT, amount DOUBLE, merchant STRING, category STRING, is_fraud BOOLEAN"

# Read the text file using the defined schema and tab as the delimiter
data = spark.read.csv(transaction_file, schema=schema, sep="\t", header=True)
data.printSchema()
print(f"{data.count()} rows read in from {transaction_file}")
# Show the DataFrame
#data.show(truncate=False)
# Preprocess data
# Convert is_fraud column from boolean to numeric
data = data.withColumn("is_fraud_numeric", when(data["is_fraud"] == True, 1).otherwise(0))

# String Indexing for categorical columns
merchant_indexer = StringIndexer(inputCol="merchant", outputCol="merchant_index")
category_indexer = StringIndexer(inputCol="category", outputCol="category_index")

# One-Hot Encoding for indexed categorical columns
merchant_encoder = OneHotEncoder(inputCol="merchant_index", outputCol="merchant_encoded")
category_encoder = OneHotEncoder(inputCol="category_index", outputCol="category_encoded")

# Assemble feature vector
assembler = VectorAssembler(inputCols=['transaction_id', 'amount', 'merchant_encoded', 'category_encoded'], outputCol='features')

# Pipeline for preprocessing steps
pipeline = Pipeline(stages=[merchant_indexer, category_indexer, merchant_encoder, category_encoder, assembler])

# Fit pipeline on data
pipeline_model = pipeline.fit(data)

# Transform data
data = pipeline_model.transform(data)

# Split data into training and test sets
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

# Train a Random Forest classifier
rf = RandomForestClassifier(labelCol='is_fraud_numeric', featuresCol='features', numTrees=100)
model = rf.fit(train_data)

# Evaluate model performance on the training data
predictions = model.transform(train_data)
# Make predictions on test data
predictions_test = model.transform(test_data)

# Print a sample of the test and training datasets
print("Sample Test Dataset:")
test_data.show(2, truncate=False)
print("Sample Training Dataset:")
train_data.show(2, truncate=False)

# Evaluate model performance
evaluator = BinaryClassificationEvaluator(labelCol='is_fraud_numeric', metricName='areaUnderROC')
auc_train = evaluator.evaluate(predictions)
print("Area Under ROC (Training):", auc_train)
# Evaluate model performance on the test data
auc_test = evaluator.evaluate(predictions_test)
print("Area Under ROC (Test):", auc_test)

# Stop SparkSession
spark.stop()

In this code:

1) Import Necessary Libraries:

At the beginning, we import the necessary modules for working with Spark (pyspark.sql.SparkSession), machine learning algorithms (from pyspark.ml.classification import RandomForestClassifier, from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler), and evaluation metrics (from pyspark.ml.evaluation import BinaryClassificationEvaluator).

2) Initialize SparkSession:

A SparkSession object is created to interact with data using Spark's distributed processing capabilities. We can optionally set the logging level to "ERROR" for a cleaner console output during execution.

3) Load Synthetic Data (Replace with Real Data if Available):

A DataFrame is created to hold the transaction data. This data typically includes columns for transaction_id, amount, merchant, category, and a boolean flag is_fraud indicating fraudulent transactions.

Note: In a real-world scenario, you would likely use actual historical transaction data instead of synthetic data.

4) Preprocess Data:

  • The is_fraud column is converted from boolean to numeric format (e.g., 0 for non-fraudulent, 1 for fraudulent) to be compatible with machine learning algorithms.

  • Categorical columns like merchant and category are handled using:

  • StringIndexer: This assigns unique numerical indices to each distinct value (e.g., "merchant_A" -> 0, "merchant_B" -> 1).

  • OneHotEncoder: This transforms the indexed categorical columns into sparse feature vectors suitable for machine learning models.

  • Finally, a VectorAssembler combines the numerical features (e.g., amount) with the encoded categorical features into a single feature vector for the model.

These preprocessing steps can be chained together into a pipeline for efficient execution.

5) Fit and Transform Data:

  • The constructed preprocessing pipeline is then fitted on the DataFrame, essentially training the transformations on the data.

  • The fitted pipeline is subsequently applied to the original DataFrame, transforming it into the preprocessed format required for the machine learning model.

6) Split Data:

  • The preprocessed DataFrame is divided into two sets:

  1. Training Set (80%): This larger portion is used to train the machine learning model.

  2. Testing Set (20%): This smaller portion is used to evaluate the model's performance on unseen data.

7) Train Random Forest Classifier:

  • A Random Forest classifier model is created, typically specifying the number of trees to be grown (e.g., 100 trees in this example).

  • The model is then trained using the training set, allowing it to learn patterns that differentiate fraudulent transactions from legitimate ones.

8) Evaluate Model Performance:

  • The trained model is used to make predictions on both the training and testing sets.

  • The Area Under the ROC Curve (AUC) metric is calculated using a BinaryClassificationEvaluator. AUC measures the model's ability to distinguish between fraudulent and valid transactions (higher AUC indicates better performance).

  • The AUC values for both training and testing sets are printed to assess the model's effectiveness.
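Because real-world fraud labels are usually highly imbalanced (unlike the roughly balanced synthetic data used here), it can also be informative to report the area under the precision-recall curve. A minimal, hedged sketch, reusing the predictions_test DataFrame from the code above:

# Hedged sketch: report area under the precision-recall curve alongside AUC.
# Assumes predictions_test from the code above.
from pyspark.ml.evaluation import BinaryClassificationEvaluator

pr_evaluator = BinaryClassificationEvaluator(labelCol="is_fraud_numeric", metricName="areaUnderPR")
print("Area Under PR (Test):", pr_evaluator.evaluate(predictions_test))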

9) Display Data (Optional):

  • Optionally, a small sample of the training and testing sets can be printed to visualize the preprocessed data format.

10) Stop SparkSession:

  • Finally, the SparkSession is terminated to release resources and avoid memory leaks.

By following these steps, we can construct a robust fraud detection pipeline that leverages machine learning to identify and prevent fraudulent activities within transactional data.

Fraud Detection Model Performance

root
 |-- transaction_id: integer (nullable = true)
 |-- amount: double (nullable = true)
 |-- merchant: string (nullable = true)
 |-- category: string (nullable = true)
 |-- is_fraud: boolean (nullable = true)

1000 rows read in from file:///d4T/hduser/genai/transactions.txt
Sample Test Dataset:
+--------------+------+------------+--------+--------+----------------+--------------+--------------+-----------------+-----------------+-------------------------------------------+
|transaction_id|amount|merchant    |category|is_fraud|is_fraud_numeric|merchant_index|category_index|merchant_encoded |category_encoded |features                                   |
+--------------+------+------------+--------+--------+----------------+--------------+--------------+-----------------+-----------------+-------------------------------------------+
|253           |35.15 |Sullivan Inc|better  |false   |0               |833.0         |104.0         |(971,[833],[1.0])|(619,[104],[1.0])|(1592,[0,1,835,1077],[253.0,35.15,1.0,1.0])|
|257           |41.8  |Hayes Group |wish    |false   |0               |362.0         |269.0         |(971,[362],[1.0])|(619,[269],[1.0])|(1592,[0,1,364,1242],[257.0,41.8,1.0,1.0]) |
+--------------+------+------------+--------+--------+----------------+--------------+--------------+-----------------+-----------------+-------------------------------------------+
only showing top 2 rows

Sample Training Dataset:
+--------------+------+-----------------+--------+--------+----------------+--------------+--------------+-----------------+-----------------+-------------------------------------------+
|transaction_id|amount|merchant         |category|is_fraud|is_fraud_numeric|merchant_index|category_index|merchant_encoded |category_encoded |features                                   |
+--------------+------+-----------------+--------+--------+----------------+--------------+--------------+-----------------+-----------------+-------------------------------------------+
|251           |81.51 |Mckenzie-Mitchell|dreame  |true    |1               |561.0         |134.0         |(971,[561],[1.0])|(619,[134],[1.0])|(1592,[0,1,563,1107],[251.0,81.51,1.0,1.0])|
|252           |99.22 |Farrell-Cruz     |author  |false   |0               |241.0         |6.0           |(971,[241],[1.0])|(619,[6],[1.0])  |(1592,[0,1,243,979],[252.0,99.22,1.0,1.0]) |
+--------------+------+-----------------+--------+--------+----------------+--------------+--------------+-----------------+-----------------+-------------------------------------------+
only showing top 2 rows

Area Under ROC (Training): 0.8304320646293669
Area Under ROC (Test): 0.4486829615567157

1) Data Schema:

  • Outlines the structure of the dataset, including columns and their data types.

  • Key columns for fraud detection: transaction_id, amount, merchant, category, and is_fraud.

2) Data Loading:

  • Indicates that 1000 rows were read from a specific file.

3) Data Preprocessing:

  • Shows a sample of the preprocessed data, including:

  • Numerical features: transaction_id and amount.

  • Encoded categorical features: merchant_encoded and category_encoded.

  • Combined features vector: features (the sparse-vector notation is unpacked in the short sketch below).
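The features column uses Spark's sparse-vector notation (size, [indices], [values]). As a hedged illustration of how to read the first sample test row shown above:

# Hedged illustration of the sparse-vector notation (size, [indices], [values]),
# using the values from the first sample test row above.
from pyspark.ml.linalg import Vectors

v = Vectors.sparse(1592, [0, 1, 835, 1077], [253.0, 35.15, 1.0, 1.0])
# Positions 0 and 1 hold transaction_id and amount; positions 835 and 1077 are the
# one-hot slots for this row's merchant and category; all other entries are 0.
print(v[0], v[1], v[835], v[2])   # 253.0 35.15 1.0 0.0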

4) Model Training and Evaluation:

  • Samples of both training and test datasets are displayed.

  • Area Under ROC (AUC) scores for both datasets are reported:

  • AUC (Training): 0.83 (suggesting good model performance on training data).

  • AUC (Test): 0.45 (indicating potential overfitting or issues with generalization).

Summary

  • Data Structure: Understanding the relationships between features is crucial for fraud detection.

  • Preprocessing: Categorical features need encoding for compatibility with machine learning models.

  • Evaluation: AUC is a good metric for assessing model performance in fraud detection.

  • Overfitting Concern: The difference between training and test AUC scores suggests potential model overfitting.
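One common way to tackle the overfitting gap is hyperparameter tuning with cross-validation. The sketch below is a minimal, illustrative outline using MLlib's CrossValidator; the grid values are assumptions rather than tested recommendations, and train_data is assumed to already contain the assembled features column and the is_fraud_numeric label, as produced by the pipeline above.

# Minimal, illustrative sketch: cross-validated tuning of the Random Forest.
# Grid values are assumptions, not tested recommendations.
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="is_fraud_numeric", featuresCol="features")
evaluator = BinaryClassificationEvaluator(labelCol="is_fraud_numeric", metricName="areaUnderROC")

grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])
        .addGrid(rf.maxDepth, [5, 10])
        .build())

cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3, seed=42)
cv_model = cv.fit(train_data)
print("Best cross-validated AUC:", max(cv_model.avgMetrics))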

Disclaimer: I am more of an architect and less of a data scientist. The information provided is correct to the best of my knowledge. As with any advice, it is essential to note that "one test result is worth one thousand expert opinions" (Wernher von Braun).

Appendix

https://github.com/joke2k/faker

Faker code to generate 1000 synthesised transactions

# Install Faker first if needed: pip install Faker
from pyspark.sql import SparkSession
from faker import Faker

fake = Faker()
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Fraud Detection") \
    .getOrCreate()

# Set the log level to ERROR to reduce verbosity
sc = spark.sparkContext
sc.setLogLevel("ERROR")

# Generate synthetic transaction data
num_transactions = 1000  # Number of transactions you want to generate

transactions = [
    (
        idx + 1,  # Transaction ID starts from 1
        fake.random_number(digits=4, fix_len=True) / 100,  # Random amount (e.g., 4 digits with 2 decimal places)
        fake.company(),  # Random merchant name
        fake.word(),  # Random category
        fake.boolean()  # Random boolean indicating whether it's fraudulent or not
    )
    for idx in range(num_transactions)
]

# Create DataFrame
data = spark.createDataFrame(transactions, ["transaction_id", "amount", "merchant", "category", "is_fraud"])

# Show the generated data
data.show(truncate=False)

try:
    DIRECTORY = "/d4T/hduser/genai"
    transactions_file = f"file://{DIRECTORY}/transactions.txt"

    # Use DataFrameWriter with 'mode("overwrite")'
    data.write.mode("overwrite").csv(transactions_file, sep="\t", header=True)
    print(f"Data saved successfully with {data.count()} rows!")
except Exception as e:
    print(f"Error saving data: {e}")

Adjust these two settings to match your own environment:

DIRECTORY = "/d4T/hduser/genai"
transactions_file = f"file://{DIRECTORY}/transactions.txt"



 

Mich Talebzadeh | Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom

View my LinkedIn profile



https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my knowledge but cannot be guaranteed. As with any advice, it is essential to note that "one test result is worth one thousand expert opinions" (Wernher von Braun).

deborah621
New Contributor II

Looking to build a machine learning model for detecting fraudulent transactions using PySpark’s MLlib. Generate synthetic transaction data. Provides a dataset for model training without using sensitive real-world data. Enables the creation of diverse transaction patterns for robust model training. 

Thanks for the feedback.
Reading through your comment, I believe this is what was meant, restated for further clarity:

"To build a machine learning model for detecting fraudulent transactions using PySpark's MLlib, synthetic transaction data can be generated. This provides a safe and ethical dataset for model training without compromising sensitive real-world information. By generating diverse transaction patterns, synthetic data helps create a robust model capable of identifying fraudulent activity"
