Fraud Detection Project Prototype

General Solution Description

Our solution will be a mobile app that automatically monitors ongoing calls, emails, and texts. When it detects fraudulent behavior, it will send an alert to the user. The alert will differ depending on how likely the call, email, or text is to be a real fraud. The alert output is therefore a multi-class classification of the overall fraud detection confidence score:

Most likely a scam (75~100)
Likely a scam (50~74)
Likely not a scam (25~49)
Most likely not a scam (0~24)
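
As a minimal sketch of how a score could be mapped to these four classes, the Python function below applies the band boundaries listed above. The function name classify_fraud_score is hypothetical, and placing fractional scores such as 74.5 in the lower band is an assumption, since the bands are given only as whole numbers.

```python
def classify_fraud_score(score: float) -> str:
    """Map a 0-100 fraud confidence score to one of the four alert classes."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 75:
        return "Most likely a scam"
    if score >= 50:
        return "Likely a scam"
    if score >= 25:
        return "Likely not a scam"
    return "Most likely not a scam"
```

For example, classify_fraud_score(74) returns "Likely a scam" and classify_fraud_score(75) returns "Most likely a scam".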

If a call, text message, or email is most likely a scam, we will send alerts to both the user and their family members. This warning will override any Do Not Disturb settings, so the user will still see the alert even while on the call. If it is likely a scam, we will send the user a notification to read the fraud analysis in our app. For the “likely not a scam” and “most likely not a scam” classes, the user will not receive any notification. However, our app will generate a fraud analysis for every call, email, and text, regardless of the likelihood of a scam. We will also regularly post educational content on fraud prevention and explain new types of scams to improve our users' ability to identify them. In addition, we will collect well-known scam area codes and other data and share them with our users, so that if our system fails to warn them, they still have some background knowledge. For example, a call from a Montréal area code could be a scam if the user lives in a different province.
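
The notification rules above could be expressed as a small lookup table, sketched below in Python. The AlertPlan fields and the plan_alert function are hypothetical names chosen for illustration. Actual alert delivery (push notifications, overriding Do Not Disturb) is platform-specific and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertPlan:
    alert_user: bool             # urgent alert shown to the user
    alert_family: bool           # urgent alert also sent to family members
    bypass_do_not_disturb: bool  # alert ignores Do Not Disturb settings
    notify_read_analysis: bool   # notification inviting the user to read the analysis
    generate_analysis: bool      # a fraud analysis is produced for every item


# One plan per class, following the rules described in the text.
_PLANS = {
    "Most likely a scam": AlertPlan(
        alert_user=True, alert_family=True, bypass_do_not_disturb=True,
        notify_read_analysis=False, generate_analysis=True),
    "Likely a scam": AlertPlan(
        alert_user=False, alert_family=False, bypass_do_not_disturb=False,
        notify_read_analysis=True, generate_analysis=True),
    "Likely not a scam": AlertPlan(
        alert_user=False, alert_family=False, bypass_do_not_disturb=False,
        notify_read_analysis=False, generate_analysis=True),
    "Most likely not a scam": AlertPlan(
        alert_user=False, alert_family=False, bypass_do_not_disturb=False,
        notify_read_analysis=False, generate_analysis=True),
}


def plan_alert(label: str) -> AlertPlan:
    """Return the notification plan for a fraud class label."""
    try:
        return _PLANS[label]
    except KeyError:
        raise ValueError(f"unknown fraud class: {label!r}") from None
```

Keeping the rules in one table makes it easy to check that every class still produces a fraud analysis, even when no notification is sent.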

Pipeline Sequence Flow Chart

Pipeline Sequence Description

Our pipeline sequence, which shows how we train the solution, consists of six steps:

1. Data Collection
2. Feature Engineering
3. Model Training
4. Validation
5. Testing and Deployment
6. Monitoring and Maintenance

(Steps 3 and 4 are repeated iteratively.)

We start by collecting as much data as possible in three formats: calls, text messages, and emails with PDF documents. We first convert these to text, using speech recognition for calls and document intelligence for emails. We then handle missing values, remove stop words like “um” and “the”, and correct errors in the text.

After adding labels that indicate how likely each file is to be a scam, we perform feature engineering. We have five features, each a numeric confidence score from 0 to 1. The first feature is professional tone, that is, how professional the text sounds. Some scammers present themselves in an unprofessional manner, which is why we decided to include it. To develop this feature, we customize the sentiment analysis code in a Jupyter Notebook so that the algorithm outputs a professional tone score. Features 2, 3, and 4 are gathered by using named entity recognition to extract suspicious links, text referring to money, and text referring to personal information. The last feature is whether the text describes an emergency or something too good to be true, such as winning a lottery. We use default sentiment analysis to get a confidence score for how positive or negative the text is, and key phrase extraction to summarize the text in a few words.

We then standardize the features using the z-score and split the data into a training dataset and a test dataset. We feed this data into the algorithm selected by Automated ML from Azure AI to train the model. The next step is a grid search loop: we re-train the model based on the evaluation metrics until the evaluation results are more than satisfactory.
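
The standardization and split steps can be illustrated with a short Python sketch that uses only the standard library. All function names and the feature order are hypothetical. The sketch also splits first and computes the mean and standard deviation from the training rows only, which is an assumption made here so that test data does not influence the scaling; the description above does not specify this detail.

```python
import random
import statistics

# Assumed order of the five 0-1 features described in the text.
FEATURE_NAMES = [
    "professional_tone",
    "suspicious_links",
    "money_references",
    "personal_info_references",
    "emergency_or_too_good",
]


def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and return (train, test)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to split")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_zscore(feature_rows):
    """Return a (mean, std) pair for each feature column."""
    stats = []
    for column in zip(*feature_rows):
        mean = statistics.fmean(column)
        std = statistics.pstdev(column)
        # A constant column has std 0; use 1.0 to avoid dividing by zero.
        stats.append((mean, std if std > 0 else 1.0))
    return stats


def apply_zscore(feature_rows, stats):
    """Standardize each value as (x - mean) / std for its column."""
    return [
        [(x - mean) / std for x, (mean, std) in zip(row, stats)]
        for row in feature_rows
    ]
```

In use, each sample would be a (features, label) pair: the training features are passed to fit_zscore, and the resulting statistics are then applied to both the training and the test features with apply_zscore.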
Finally, after the final evaluation, we develop APIs to integrate the model into the production environment. There, we take additional steps to monitor the model's behavior by collecting statistics and user feedback, and we periodically retrain the model to incorporate that feedback and keep up with evolving fraud methods.
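
One way the monitoring step could decide when to retrain is sketched below in Python. The FeedbackMonitor class, its rolling window, and its thresholds are all hypothetical choices made for illustration. The sketch assumes user feedback arrives as a simple confirmation of whether an item really was a scam.

```python
from collections import deque


class FeedbackMonitor:
    """Track recent prediction outcomes and flag when retraining looks due."""

    def __init__(self, window=100, max_error_rate=0.1, min_samples=20):
        # Only the most recent `window` outcomes are kept.
        self.outcomes = deque(maxlen=window)
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, predicted_scam: bool, user_confirmed_scam: bool) -> None:
        """Store whether the model's prediction matched the user's feedback."""
        self.outcomes.append(predicted_scam == user_confirmed_scam)

    def error_rate(self) -> float:
        """Fraction of recent predictions that users disagreed with."""
        if not self.outcomes:
            return 0.0
        return 1 - sum(self.outcomes) / len(self.outcomes)

    def should_retrain(self) -> bool:
        """True once enough feedback exists and the error rate is too high."""
        return (len(self.outcomes) >= self.min_samples
                and self.error_rate() > self.max_error_rate)
```

A rolling window is used so that a recent shift in fraud methods is not hidden by a long history of correct predictions.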