
Historical transaction data was used to predict fraudulent credit card transactions, followed by calibration of the predicted probabilities to better approximate real-world values. The motivation is to use calibrated probabilities to make classification decisions that minimize the loss incurred by fraudulent purchases.

**Analysis:** The models used in analysis include Gaussian Naive Bayes (NB), feed-forward neural networks (NN), and random forests (RF). NB was used as a baseline, while the other models were used in actual prediction and calibration. Performance was measured using precision, recall, F1, and precision-recall area under curve (PR-AUC).

**Results and Conclusions:** For the data used in this analysis, the best RF model shows a better calibrated reliability curve and achieves a higher PR-AUC value — 0.88 vs. 0.78 for the NN. The performance differences between the two models were small when evaluated using risk minimization.

Cost analysis shows that the choice of model depends on the internal cost in dollars (\(C_{in}\)) of acting on potential fraud. Except for intermediate values of \(C_{in}\), the two models incur very similar losses.

**Recommendations:**

- A larger dataset with more fraud instances would enable better analysis and calibration.
- Determine internal cost of acting on predicted fraud.
- Select a model based on the results of cost analysis, using the determined internal cost.

This report outlines the process of modeling historical transaction data (approximately 285,000 transactions) to predict credit card fraud, using Naive Bayes (NB), neural networks (NN), and random forest (RF) classifiers. Modeling was followed by calibration of predicted probabilities to better approximate real-world values. The motivation behind the calibration is to make classification decisions that minimize the loss incurred by fraudulent purchases; in order to compute minimized risk correctly, it is necessary to have predicted probabilities that are as close as possible to real-world values.

The questions that motivated this study are:

- Can fraudulent credit card transactions be predicted from the available data with reasonable performance?
- How well can fraud be predicted using different models?
- Will there be a large trade off between precision and recall?
- How well do the top models minimize cost for different \(C_{in}\) values?
- What can be done to further minimize cost?

The dataset used in this analysis was obtained from Kaggle datasets and is structured as follows:

- 284,807 transactions with 492 classified as fraudulent (positive)
  - positives account for only 0.17% of the set (extreme imbalance)
- 31 columns (Time, 28 anonymized numerical features, Amount, and Class)

Table 1: The first five rows of the raw transaction data.

|   | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |

Time is a generic integer value that conveys little more than the order in which the transactions occur. The anonymized features are mean-centered numerical floats, each denoted by *Vi*, where *i* is the feature number (1 through 28). Amount refers to the amount of the transaction in dollars. Class has two possible values, zero or one, where a value of one indicates fraud.

This step consisted of splitting the data into three sets (training, validation, and test), standardizing all features in the training set to ~N(0, 1), and then transforming the validation and test sets using the same fitted parameters. This split was the only method of cross-validation used for this project.
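The split-and-standardize step can be sketched as follows. This is a minimal illustration using a small synthetic stand-in for the transaction data (the `df` construction and split proportions are assumptions, not the report's exact values); the key point is that the scaler is fit on the training set only and then applied to the validation and test sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({                       # synthetic stand-in for the real data
    "V1": rng.normal(size=1000),
    "Amount": rng.exponential(50, size=1000),
    "Class": (rng.random(1000) < 0.02).astype(int),
})

X, y = df.drop(columns="Class"), df["Class"]
# Stratified split preserves the tiny fraud proportion in each subset.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0)

# Fit the scaler on the training set only, then transform val/test with it.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```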

The anonymized features were grouped by class, and individually plotted as scatter and kernel density estimates (see figures 1 and 2), such that each plot contained one feature with both classes for comparison. The motivation behind this was to gain insights for feature selection: features with some separation between class distributions are likely to perform better for classification, while those with little or no separation are likely to add noise to the models. The latter is less of a problem for NN and RF, but eliminating features can cut down on training time. Also, distributions that are highly irregular should typically be excluded from NB models.

Many features are approximately normal, though some look like Skellam distributions and some range from slightly to extremely irregular. Those that would likely be excluded from NB models due to irregularity have a lot of overlap between classes, and will already be removed when selecting by separation of distributions.

Based on the findings in the last section, only one method of feature selection was used: filter out features with too much overlap between classes. This was done by computing summary statistics for each feature by class, and selecting features where the class medians were separated by at least \(\max(\sigma_{class0}, \sigma_{class1})\). The features selected using this method were V4, V11, V12, and V14.
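The median-separation filter can be sketched as below. The helper name and the synthetic two-feature data are illustrative assumptions; the rule itself (keep a feature when the distance between class medians exceeds the larger class standard deviation) follows the description above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
y = (rng.random(2000) < 0.1).astype(int)
df = pd.DataFrame({
    "V_sep": rng.normal(0, 1, 2000) + 4 * y,   # well separated by class
    "V_noise": rng.normal(0, 1, 2000),         # no separation
    "Class": y,
})

def select_by_median_separation(df, label="Class"):
    """Keep features where |median_0 - median_1| > max(std_0, std_1)."""
    g0 = df[df[label] == 0].drop(columns=label)
    g1 = df[df[label] == 1].drop(columns=label)
    sep = (g0.median() - g1.median()).abs()
    spread = pd.concat([g0.std(), g1.std()], axis=1).max(axis=1)
    return list(sep[sep > spread].index)

selected = select_by_median_separation(df)
```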

Because the data is extremely unbalanced (only a very small proportion of transactions are labeled as fraud), a few resampling methods were used to help offset this, and these methods were tested for each classifier. The methods used were random undersampling and generative oversampling.

For random undersampling, the set was separated by class, and a random sample of 50,000 was selected from the portion labeled as legitimate transactions. This sample was recombined with the fraudulent transactions to form a complete, resampled training set. Reducing the number of legitimate transactions below 50,000 begins to cause large decreases in precision, so smaller values were not used.
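A minimal sketch of this undersampling step, assuming a training DataFrame `train` with a binary `Class` column (here a synthetic stand-in rather than the actual transaction data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
train = pd.DataFrame({                    # synthetic stand-in training set
    "Amount": rng.exponential(50, 60000),
    "Class": (rng.random(60000) < 0.002).astype(int),
})

# Draw 50,000 legitimate transactions, keep all fraud, recombine and shuffle.
legit = train[train["Class"] == 0].sample(n=50_000, random_state=0)
fraud = train[train["Class"] == 1]
resampled = pd.concat([legit, fraud]).sample(frac=1, random_state=0)
```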

The generative oversampling selected only the fraudulent transactions from the original training set, and computed estimated distributions for all features. Fraudulent transactions were then generated by randomly sampling from these estimates, and class labels were added. This synthesized sample was recombined with the legitimate transactions. Four sets were generated using this method: 500, 1,000, 2,000, and 5,000 generated transactions.
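The generative step can be sketched as below. The report does not name the estimated distribution, so a per-feature Gaussian fit is assumed here; the helper name and synthetic data are likewise illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
train = pd.DataFrame({                    # synthetic stand-in training set
    "V4": rng.normal(0, 1, 5000),
    "Class": (rng.random(5000) < 0.01).astype(int),
})

def generate_fraud(train, n_new, label="Class", seed=0):
    """Fit a Gaussian to each feature of the fraud class and sample new rows."""
    rng = np.random.default_rng(seed)
    fraud = train[train[label] == 1].drop(columns=label)
    synth = pd.DataFrame({
        col: rng.normal(fraud[col].mean(), fraud[col].std(), n_new)
        for col in fraud.columns
    })
    synth[label] = 1                      # label generated rows as fraud
    return pd.concat([train, synth], ignore_index=True)

augmented = generate_fraud(train, n_new=1000)
```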

For all models, including NB, performance was measured using precision, recall, and F1. For the final models, precision-recall area under curve (PR-AUC) was also measured. Accuracy was recorded, but because of the imbalance in the data, this value has no importance: predicting everything as being legitimate transactions will get only a fraction of one percent of the predictions wrong, giving a minimum accuracy greater than 99%.
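The point about accuracy can be demonstrated directly: a classifier that predicts "legitimate" for everything scores above 99% accuracy on data with this imbalance while catching no fraud at all. The data below is synthetic with the same ~0.17% positive rate.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

rng = np.random.default_rng(4)
y_true = (rng.random(10_000) < 0.0017).astype(int)   # ~0.17% positives
y_pred = np.zeros_like(y_true)                        # predict all "legitimate"

acc = accuracy_score(y_true, y_pred)                  # > 0.99 despite doing nothing
prec = precision_score(y_true, y_pred, zero_division=0)
rec = recall_score(y_true, y_pred, zero_division=0)   # no fraud caught
```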

To get a baseline for comparison against the NN and RF models, a Gaussian Naive Bayes classifier was chosen. Using both the raw data (all features) and only the features selected above, NB models were fitted to the original and resampled training sets, and scored on the validation set. The scores are tabulated in table 2 below.
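The baseline fit can be sketched in a few lines. The synthetic data and split below are stand-ins for the real training and validation sets:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import precision_score, recall_score, f1_score

rng = np.random.default_rng(5)
y = (rng.random(5000) < 0.05).astype(int)
X = rng.normal(0, 1, (5000, 4)) + 3 * y[:, None]      # separated classes

# Fit on a "training" slice, score on a held-out "validation" slice.
nb = GaussianNB().fit(X[:4000], y[:4000])
pred = nb.predict(X[4000:])
p = precision_score(y[4000:], pred)
r = recall_score(y[4000:], pred)
f1 = f1_score(y[4000:], pred)
```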

Table 2: The scores achieved in the baseline Naive Bayes models. The first two rows use no resampling, with the only difference being without and with feature selection. Row rand_under_50K was from the undersampled set, and the remaining rows were from generative resampling. Note the sharp decrease in precision from generative 1K to generative 5K.

|   | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| raw | 0.996357 | 0.210526 | 0.410256 | 0.278260 |
| median | 0.998947 | 0.670455 | 0.756410 | 0.710843 |
| rand_under_50K | 0.998903 | 0.642857 | 0.807692 | 0.715909 |
| gen_over_500 | 0.998903 | 0.642857 | 0.807692 | 0.715909 |
| gen_over_1K | 0.998903 | 0.640000 | 0.820513 | 0.719101 |
| gen_over_5K | 0.998420 | 0.523810 | 0.846154 | 0.647058 |

The table above shows that the chosen feature selection method greatly improves performance over the raw data for the NB classifier, and the generative resampling method works well for smaller resampling sizes, but loses precision as the number of samples generated increases. The generative resampling with 1,000 additional samples was chosen as the resampling method to be tested with the NN and RF models.

To select the best models among NN and RF classifiers, the algorithms for each were run numerous times — with different hyper-parameters — on four training sets (raw, feature selected, and both of these resampled), and evaluated using the validation set. This was done with automated processes described in the next few paragraphs, where the details of modeling are discussed for each classifier.

The program used for constructing, training, and evaluating neural networks can be found on my GitHub. Three network architectures were tested on the raw set: one with no hidden layer, and two with a single hidden layer of 14 and 29 nodes, respectively. Two architectures were tested on the feature-selected training set without resampling: one with no hidden layer, and one with a single hidden layer of 4 nodes. The output layer for each was a softmax layer with two nodes that output probabilities for each class.

These models were evaluated using an automated hyper-parameter search algorithm built into the program, based on a random search approach. It iteratively trained 150 hyper-parameter combinations for each model, at 40 epochs per combination. Printouts and plots of per-epoch scores were used to choose the final model architecture.
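The custom training program is not reproduced here, but a comparable search can be sketched with scikit-learn. This stand-in uses `MLPClassifier` instead of the author's network code, a small grid in place of the 150 random combinations, and synthetic data; the hidden-layer sizes mirror two of the architectures described above, while the learning-rate axis is an assumed search dimension.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(6)
y = (rng.random(1500) < 0.1).astype(int)
X = rng.normal(0, 1, (1500, 4)) + 2.5 * y[:, None]
X_tr, y_tr, X_val, y_val = X[:1000], y[:1000], X[1000:], y[1000:]

results = {}
for hidden in [(4,), (14,)]:             # architectures from the text
    for lr in [1e-3, 1e-2]:              # assumed search dimension
        clf = MLPClassifier(hidden_layer_sizes=hidden,
                            learning_rate_init=lr,
                            max_iter=200, random_state=0).fit(X_tr, y_tr)
        # Score each combination on the held-out validation slice (PR-AUC proxy).
        ap = average_precision_score(y_val, clf.predict_proba(X_val)[:, 1])
        results[(hidden, lr)] = ap

best = max(results, key=results.get)
```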

The two best performers both used a single hidden layer with 14 nodes (no feature selection), one trained on the resampled (gen 1K) set and one on the original sample. The resampled set gave slightly better recall on the validation set (0.778 vs. 0.741) at the expense of precision (0.764 vs. 0.842), which appears to be an overall reduction in performance for the gen 1K set. However, the no-resample model achieved its approximate 0.03 bump in precision only after what appeared to be some over-training. Because of this, and a slight preference for higher recall, the final model used the gen 1K resampled data.

The final NN model used the concatenation of the resampled training set and validation set. Performance was tracked against this combined data for fine-tuning, and not the test data (see table 3); scoring predictions on the test set were saved until after tuning.

The scikit-learn random forest classifier was used for these models, using the standard Gini index. Several random forests were evaluated for each training set (no-resample and gen 1K), using different numbers of estimators (forest size). The models were evaluated by computing PR-AUC. The best model resulted from the no-resample set, using 1,000 estimators. The difference in scores between the two sets was marginal: a PR-AUC of 0.70 vs. 0.69 using 1,000 estimators.
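A minimal sketch of this evaluation, using `average_precision_score` as the PR-AUC measure and synthetic stand-in data in place of the real training and validation sets:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(7)
y = (rng.random(2000) < 0.05).astype(int)
X = rng.normal(0, 1, (2000, 4)) + 2 * y[:, None]
X_tr, y_tr, X_val, y_val = X[:1500], y[:1500], X[1500:], y[1500:]

for n in [100, 1000]:                    # forest sizes; Gini is the default criterion
    rf = RandomForestClassifier(n_estimators=n, random_state=0).fit(X_tr, y_tr)
    ap = average_precision_score(y_val, rf.predict_proba(X_val)[:, 1])
```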

The final RF model was trained on the concatenated version of the original training and validation sets, and scored using precision, recall, and F1. See Results for details.

Because the predicted probabilities from different machine learning models are not always true to real-world values, a property that is important for risk minimization, predicted probabilities should be analyzed and, where needed, calibrated.

Analysis of the probabilities had two parts: comparing the mean predicted value from each model to the proportion of positives in the test set, and visual inspection of reliability curves. The former is done simply by computing the mean of the predicted values from each model, while a detailed explanation of reliability curves is beyond the scope of this report. The comparisons can be found in table 3 and figures 4 and 5.

Predicted probabilities were calibrated using Platt’s scaling, which trains a logistic regression (LR) model on probabilities predicted from the training set. Test set predictions from the models of interest (NN and RF) were then used as inputs to the LR models, producing outputs as calibrated probabilities. Outputs from neural networks are typically well calibrated on their own, but this particular dataset poses some challenges in this regard (see Limitations). Mean predicted values (MPV) and reliability diagrams from before and after calibration are shown below. For the reliability diagrams, only 5 bins were used because of the limited number of positives.
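Platt's scaling can be sketched as below: a logistic regression is fit on the model's (uncalibrated) training-set scores against the true labels, and new scores are then mapped through it. The score distribution here is a synthetic stand-in, deliberately miscalibrated for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
y_train = (rng.random(3000) < 0.1).astype(int)
# Stand-in for uncalibrated model scores (systematically too high):
raw_train = np.clip(0.5 * y_train + rng.normal(0.15, 0.1, 3000), 0, 1)

# Fit logistic regression on (score, label) pairs from the training set.
platt = LogisticRegression().fit(raw_train.reshape(-1, 1), y_train)

def calibrate(scores):
    """Map raw scores through the fitted sigmoid to calibrated probabilities."""
    return platt.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]

cal = calibrate(raw_train)   # mean should now track the true positive rate
```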

Table 3: Mean predicted values before and after calibration for NN and RF models.

|   | NN before | NN after | RF before | RF after | Truth |
|---|---|---|---|---|---|
| MPV | 0.00344 | 0.00215 | 0.00185 | 0.00164 | 0.00165 |

The RF shows a notable improvement in the diagrams and MPV. The NN almost appears worse, but the large dip in the diagram is due to a lack of predicted probabilities in that range; the shape is otherwise better aligned with the diagonal, and there was a marked improvement in MPV.

While the trained classifiers can do a good job of correctly predicting fraudulent transactions, they are not perfect. Instead of making a decision based solely on the class with the highest predicted probability, risk minimization uses Bayesian decision making — involving conditional probabilities and the losses associated with possible decisions to minimize total loss. This can be described more completely and efficiently with a mathematical definition of risk and decision making:

$$R(\omega_i | x) = \sum_{j=0}^{c-1}\lambda(\omega_i | \omega_j)P(\omega_j | x)$$

where \(\lambda(\omega_i | \omega_j)\) is the loss incurred from deciding \(\omega_i\) when the true state is \(\omega_j\). More compactly, \(\lambda(\omega_i | \omega_j)\) will be written as \(\lambda_{ij}\).

Now, in the case of predicting fraud, we have two class-conditional risks: $$R(\omega_0 | x) = \lambda_{00}P(\omega_0 | x) + \lambda_{01}P(\omega_1 | x)$$ $$R(\omega_1 | x) = \lambda_{10}P(\omega_0 | x) + \lambda_{11}P(\omega_1 | x)$$ Decision making: if \(R(\omega_0 | x) < R(\omega_1 | x)\) choose \(\omega_0\), else choose \(\omega_1\).

The contextual definitions for \(\lambda_{ij}\) are:

\(\lambda_{00}\) = zero ($): the cost of deciding the transaction is legitimate when it is.

\(\lambda_{01}\) = Amount of transaction ($): the cost of deciding a transaction is legitimate when it is not.

\(\lambda_{10}\) = \(C_{in}\) ($): the cost of deciding the transaction is fraudulent when it is not.

\(\lambda_{11}\) = \(C_{in}\) ($): the cost of acting on a transaction that is in fact fraudulent (the internal cost is incurred whenever fraud is decided).

The value \(C_{in}\) is some internal (administrative) cost associated with acting on the decision that a transaction is fraudulent. Because the internal cost is both unknown and may not always be the same, the first goal of cost analysis was to compare the losses of the top two models as functions of \(C_{in}\).
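The decision rule can be worked through numerically. This is an illustrative sketch under the losses above, with the assumption that the internal cost \(C_{in}\) is incurred whenever a transaction is flagged, whether or not it is actually fraud; the three example transactions are made up.

```python
import numpy as np

def total_loss(p_fraud, amounts, is_fraud, c_in):
    """Total dollar loss from minimum-risk decisions at internal cost c_in."""
    # With lambda_00 = 0, lambda_01 = amount, lambda_10 = lambda_11 = c_in:
    r0 = amounts * p_fraud            # risk of deciding "legitimate"
    r1 = c_in * np.ones_like(r0)      # risk of deciding "fraud" (act either way)
    flag = r1 < r0                    # decide fraud when its risk is lower
    # Realized loss: acting costs c_in; missed fraud costs the full amount.
    return np.where(flag, c_in, np.where(is_fraud, amounts, 0.0)).sum()

amounts = np.array([100.0, 10.0, 500.0])   # transaction amounts ($)
p = np.array([0.9, 0.05, 0.3])             # calibrated fraud probabilities
truth = np.array([1, 0, 1])                # actual labels
loss_cheap = total_loss(p, amounts, truth, c_in=5.0)     # cheap to act: flag often
loss_costly = total_loss(p, amounts, truth, c_in=400.0)  # costly to act: rarely flag
```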

This analysis was conducted in multiple parts:

- Using risk minimization
  - with calibration
  - without calibration
- Using only model predictions (with calibration only)

In pure predictive performance, the RF outperformed the NN when using 1,000 estimators. The test scores are tabulated in table 4, and the PR curves with AUC can be seen in figure 5:

Figure 5: Precision-Recall curves for the final NN and RF models. AUC for each is listed in the legend.

Table 4: Uncalibrated scores for NN and RF models using the test set.

| Test scores | Precision | Recall | F1 |
|---|---|---|---|
| NN | 0.712 | 0.894 | 0.792 |
| RF | 0.933 | 0.894 | 0.913 |

While the two models can achieve the same level of recall, there is a large difference in precision, meaning the NN is generating many false-positives in order to achieve high recall. The performance difference is also reflected in the AUC values. The PR-AUC was unchanged after calibrating probabilities.

Two figures are presented for evaluating the performance of the calibrated models. Figure 6 shows total loss in dollars (left y-axis) and the decision/truth ratio (right y-axis) as functions of \(C_{in}\). The decision/truth ratio is the number of transactions decided to be fraudulent divided by the number of actual fraudulent transactions; this is a simplified way of seeing how minimum-risk decision making behaves as \(C_{in}\) changes.

A comparison of losses incurred using risk minimization with calibrated vs. uncalibrated probabilities is presented in figures 8 and 9 (below). Making minimum-risk decisions without probability calibration resulted in greater losses and many more false positives over the range of \(C_{in}\) plotted.

A fraud detection system that uses Bayesian minimum risk can lead to significant savings when compared to using model predictions alone, and it is important to have properly calibrated probabilities when doing so. Both the NN and RF models have very similar loss for most \(C_{in}\) values when using this method, but the RF has slightly lower loss for much of the range. Regardless, a model should be chosen only after the precise value of \(C_{in}\) is known.

The dataset was relatively small with very few instances of fraud, which means the performance of models could have high variance when predicting new data. The small number of positive instances also imposes some limitations on probability calibrations, possibly hindering the performance of risk minimization.