A dataset from UC Irvine with attributes describing breast mass cell nuclei characteristics was used to classify masses as benign or malignant. The dataset is a sample of 569 observations, with 357 benign and 212 malignant diagnoses. Many features in this set are highly correlated.
Analysis: A Gaussian Naive Bayes (NB) classifier was used to make diagnoses on two versions of the data: untransformed and PCA-whitened (for decorrelation). For the former, a random forest classifier with 300 estimators was used to rank features by importance, and repeated ten-fold cross-validation (CV) scoring was used to determine the optimum number of top features. The latter set was scored similarly, but the top components were selected by their eigenvalues. The final model was fitted to 67% of the data, and the remaining 33% was used as the test set. Scores evaluated were accuracy, precision, recall, and F1.
Results and Conclusions: The untransformed set performed better than the PCA set, after averaging many ten-fold cross validations (0.912 and 0.863 F1 scores, respectively). The final model had 92.3% accuracy and an F1 score of 0.909 on the test data.
Using features without any transformation is more successful than PCA whitening, despite high correlation between features. The most successful features for making diagnoses with NB were concave points_worst, radius_worst, area_worst, perimeter_worst, and concave points_mean.
Limitations: Only Naive Bayes was used in this project, and it is possible that other classification methods have better diagnostic accuracy.
A great deal of research has gone into breast cancer, and for good reason: it is a major health burden, with approximately 12% of women in the U.S. developing invasive breast cancer in their lifetimes. The purpose of this project was to evaluate the efficacy of Naive Bayes classification for diagnosing malignant breast cancer, and to determine which cell characteristics are most useful for producing diagnoses.
The questions asked going into this study were:
UC Irvine hosts a variety of datasets, including one of breast cancer diagnostic data, which is the set used in this project. The dataset has 569 observations, with 357 benign and 212 malignant diagnoses. There are 30 attributes to be used as predictors. They describe breast mass cell nuclei characteristics (continuous numerical values) and were computed from digitized images of fine needle aspirations. Along with these predictors, there are values for patient ID and a diagnosis (benign or malignant). Many features in this set are highly correlated, which is typically not ideal for a classifier like NB, where statistical independence between features is a foundational assumption.
There are ten base metrics used for the predictors, with three variations of each (mean, standard error, and worst):
In order to get a feel for the data, and how to proceed, kernel density estimates were plotted for the two diagnosis values. This was done to determine if the data was roughly normally distributed, and to gain some initial insight into which features were more likely to make good predictors. The resulting plots are in the figure below:
The distributions all appear reasonably normal, despite some skewness. Upon visual inspection, the most promising base features were radius, perimeter, area, concavity and concave points. To make a final decision for feature importances in the raw data, a random forest classifier was used, and the results are shown in descending order in the following chart:
These rankings were used as a part of feature selection, the details of which are discussed in the section on modeling.
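The ranking step can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn and its bundled copy of the dataset (`load_breast_cancer`), whose feature names are spelled differently (for example `worst concave points` rather than `concave points_worst`). The `random_state` value is arbitrary, and only the 300 estimators come from the description above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: scikit-learn's bundled copy of the dataset stands in for the raw file.
data = load_breast_cancer()
X, y = data.data, data.target  # in this copy, 0 = malignant and 1 = benign

# 300 estimators as described in the text; the seed is an arbitrary choice.
forest = RandomForestClassifier(n_estimators=300, random_state=0)
forest.fit(X, y)

# Sort feature indices from most to least important.
order = np.argsort(forest.feature_importances_)[::-1]
ranked_features = [str(data.feature_names[i]) for i in order]
ranked_scores = forest.feature_importances_[order]

for name, score in zip(ranked_features[:5], ranked_scores[:5]):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances tend to split credit among correlated features, so the exact order of near-duplicates such as radius, perimeter, and area can shift from one seed to another.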
Since feature names suggested correlations were likely high (e.g., radius and area), a correlation matrix was computed to check the extent of correlations before modeling. As expected, they were large for many of the features (see figure 2). Because features are ideally independent when modeling with NB, a transformed version of the dataset was created by decorrelating and normalizing the features through principal component analysis (PCA) whitening. For modeling, the whitening transform was fitted on the training sets only, and the resulting eigenvectors and eigenvalues were then used to PCA whiten the test data. The correlation matrices before and after decorrelation are displayed as heat maps in figure 3:
As opposed to feature selection using importances, component selection with the decorrelated data was performed by choosing those with greater eigenvalues. The eigenvalues were computed using NumPy singular value decomposition, which returns values in descending order. Since principal components have no particular meaning in terms of the original features, results will not be presented as they were in figure 2.
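A minimal sketch of PCA whitening with NumPy's SVD is shown here, under stated assumptions. The function names `fit_whitener` and `whiten` are hypothetical, the data are synthetic, and the small `eps` guard is an addition for numerical safety. The key points from the text are preserved: the transform is fitted on training data only, the eigenvalues come out in descending order, and the same mean, eigenvectors, and eigenvalues are reused for the test data.

```python
import numpy as np

def fit_whitener(X_train):
    """Learn the mean, principal axes, and eigenvalues from training data only."""
    mean = X_train.mean(axis=0)
    centered = X_train - mean
    # SVD of the centered data; singular values are returned in descending order.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    eigenvalues = s**2 / (X_train.shape[0] - 1)  # eigenvalues of the covariance matrix
    return mean, vt, eigenvalues

def whiten(X, mean, vt, eigenvalues, n_components=None, eps=1e-12):
    """Project onto the top components and rescale each to unit variance."""
    k = n_components or len(eigenvalues)
    projected = (X - mean) @ vt[:k].T
    return projected / np.sqrt(eigenvalues[:k] + eps)

# Synthetic correlated data standing in for the real features (assumption).
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.8, 0.5],
                   [0.0, 0.6, 0.4],
                   [0.0, 0.0, 0.3]])
X = rng.normal(size=(200, 3)) @ mixing
X_train, X_test = X[:150], X[150:]

mean, vt, eigenvalues = fit_whitener(X_train)
Z_train = whiten(X_train, mean, vt, eigenvalues)
Z_test = whiten(X_test, mean, vt, eigenvalues)  # reuses the training transform

corr_before = np.corrcoef(X_train, rowvar=False)
corr_after = np.corrcoef(Z_train, rowvar=False)
```

On the training data the whitened correlation matrix is the identity up to rounding. The test data are only approximately decorrelated, because the transform was not fitted on them.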
A Gaussian Naive Bayes classifier (found here) was used to model both sets, using forward selection of the top ten features. The average of 100 iterations of k-fold cross validation (k=10) for each step of forward selection was recorded. The 100 iterations of k-fold CV were not strictly necessary, but because NB is computationally inexpensive, this was done to reduce the variance of estimated performance. The resulting scores are shown in figures 4 and 5 in the next section, and they show that the raw data gave superior performance when compared to the decorrelated set.
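The scoring loop for the raw data might look like the sketch below. It assumes scikit-learn's `GaussianNB`, `RepeatedKFold`, and `cross_val_score`, which may differ from the tools actually used. The helper name `score_top_k` is hypothetical. The default of 5 repeats keeps the example quick; the text describes 100. Labels are flipped so that malignant is the positive class for F1, which is also an assumption about how the scores were defined.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
X = data.data
y = 1 - data.target  # flip labels so that malignant = 1 (positive class)

# Rank features by random forest importance, most important first.
forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]

def score_top_k(X, y, ranking, max_k=10, n_repeats=5):
    """Mean F1 over repeated 10-fold CV when using the top 1..max_k features."""
    cv = RepeatedKFold(n_splits=10, n_repeats=n_repeats, random_state=0)
    mean_f1 = {}
    for k in range(1, max_k + 1):
        scores = cross_val_score(GaussianNB(), X[:, ranking[:k]], y,
                                 cv=cv, scoring="f1")
        mean_f1[k] = float(scores.mean())
    return mean_f1

mean_f1 = score_top_k(X, y, ranking)
best_k = max(mean_f1, key=mean_f1.get)
print(f"best number of features: {best_k} (mean F1 = {mean_f1[best_k]:.3f})")
```

Passing `n_repeats=100` matches the number of iterations described above. Precision, recall, and accuracy can be collected the same way by changing the `scoring` argument.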
The final step to the modeling process was to fit 67% of the raw data, and score the remaining 33%. These results will also be shown in the next section.
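As an illustration of this hold-out step, the sketch below fits Gaussian NB on a 67/33 split using the five features named in the summary. It assumes scikit-learn and its bundled dataset, where those features are named `worst concave points`, `worst radius`, `worst area`, `worst perimeter`, and `mean concave points`. The seed and the stratified split are arbitrary choices made for the example, so the scores will not necessarily match the reported ones.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
y = 1 - data.target  # flip labels so that malignant = 1 (positive class)

# The five top-ranked features, in scikit-learn's naming (assumption).
top_five = ["worst concave points", "worst radius", "worst area",
            "worst perimeter", "mean concave points"]
names = list(data.feature_names)
X = data.data[:, [names.index(n) for n in top_five]]

# 67% train, 33% test; seed and stratification are arbitrary choices here.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

model = GaussianNB().fit(X_train, y_train)
pred = model.predict(X_test)

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "precision": precision_score(y_test, pred),
    "recall": recall_score(y_test, pred),
    "f1": f1_score(y_test, pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```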
For the raw data, modeling the top five features gave the best results, and the decorrelated data performed best when using only the component with the greatest eigenvalue.
With the raw data, every metric peaks when using the top five features, except for recall, which peaks with the top two. For this classification problem, recall is a metric of high importance, but the decline from its peak is very small (delta = -0.0015) compared to the gain in precision (delta = 0.026).
For the decorrelated set, the interpretation is less clear: recall improves noticeably when using the top nine or ten components, but precision drops significantly. While false negative diagnoses are dangerous, false positives are also costly (and frightening to patients). With the top ten components, false positives would occur nearly 21% of the time, compared to 4.5% when using only the most significant component. This makes the choice of components for modeling more difficult.
Modeling the raw data gives superior results. To that end, the final model used the top five features according to estimated importance.
The final model was fitted to the top 5 features of 67% of the raw data, and scored on the remaining 33%. The results are tabulated below:
The scores in table 1 above are promising for such a simple model. The model is certainly not perfect, however, and other classifiers could likely achieve better results. The scores can be interpreted as follows:
The performance achieved with raw data and a simple Naive Bayes model was reasonably high, though it is likely that other statistical models or machine learning techniques could improve on these results, and the prediction threshold could also be adjusted to increase the sensitivity (recall).
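The threshold adjustment mentioned above can be sketched as follows. This is an illustration only, assuming scikit-learn's `GaussianNB.predict_proba`. The helper `recall_precision_at` and the threshold of 0.2 are hypothetical choices, not values from this project.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

data = load_breast_cancer()
y = 1 - data.target  # flip labels so that malignant = 1 (positive class)
names = list(data.feature_names)
top_five = ["worst concave points", "worst radius", "worst area",
            "worst perimeter", "mean concave points"]
X = data.data[:, [names.index(n) for n in top_five]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)
model = GaussianNB().fit(X_train, y_train)

# Posterior probability of the malignant class (column 1, since classes are [0, 1]).
proba_malignant = model.predict_proba(X_test)[:, 1]

def recall_precision_at(threshold):
    """Classify as malignant when the posterior meets the threshold."""
    pred = (proba_malignant >= threshold).astype(int)
    return (recall_score(y_test, pred),
            precision_score(y_test, pred, zero_division=0))

recall_default, precision_default = recall_precision_at(0.5)
recall_low, precision_low = recall_precision_at(0.2)  # hypothetical lower threshold
print(f"threshold 0.5: recall={recall_default:.3f}, precision={precision_default:.3f}")
print(f"threshold 0.2: recall={recall_low:.3f}, precision={precision_low:.3f}")
```

Lowering the threshold can only add malignant predictions, so recall cannot decrease, while precision typically falls. Because Naive Bayes posteriors are often pushed toward 0 or 1, the practical effect of a threshold change may be smaller than the numbers suggest.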
Only one type of statistical model was used to produce diagnoses, and the feature selection process was limited to only what a random forest classifier determined were the most important features. Model ensembles and domain knowledge could help improve the modeling process.