Using false predictions to your benefit
Trigger: One of my family members had cancer some time ago, and I was researching that type of cancer and how people have been dealing with it, especially from a diagnostic perspective. I found that many statistical and machine learning methods are used to detect that type of cancer, but none of them is very accurate. One of the articles, which I discuss below, caught my eye though. It reported that the rate of false predictions of the model the authors designed was high; however, they did not abandon the model. Instead, they analyzed why and when the false predictions occur, which I found pretty insightful. This is what I am going to share in this post.
It is important to notice how each of the confusion-matrix measures discussed below affects the outcome when the model's predictions are used for decision making. To put them into perspective, this post considers a model that detects whether a tumor is cancerous given biopsy results.
Introduction
It is pretty well established that false predictions count against machine learning (or any prediction) methods. I have faced many data science problems in my career for which I could not design a model and feature set that solved them to a satisfactory level of performance. However, is there still a way to extract information even from a "bad" model? It turns out there is. This discussion focuses on binary classification problems.
Confusion matrix
Let's start with the confusion matrix for a binary classification problem, where the prediction is either "yes" or "no". A confusion matrix relates what the method says ("yes" or "no") to whether it is "correct" or "incorrect" about it. This forms 4 combinations, from saying yes and being correct about it all the way to saying no and being incorrect about it. They are laid out in a 2 by 2 matrix, called the confusion matrix, which can be extended to multi-class problems too. In a yes/no classification, the combinations are:
True positive: The number of instances for which the algorithm says "yes" and it is "correct" about it (it says "positive" and that is the "truth")
True negative: The number of instances for which the algorithm says "no" and it is "correct" about it (it says "negative" and that is the "truth")
False positive: The number of instances for which the algorithm says "yes" and it is "incorrect" about it (it says "positive" and that is "false")
False negative: The number of instances for which the algorithm says "no" and it is "incorrect" about it (it says "negative" and that is "false")
Sometimes, for comparison purposes, these values are combined to form one single value to demonstrate the "performance" of the method under investigation (e.g., F1 score).
(Figure taken from Wikipedia.)
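As a minimal sketch of the four counts and of one way to combine them, the snippet below tallies them from two label lists and computes the F1 score. The function names and the tiny label lists are made up for illustration; they do not come from any of the articles discussed in this post.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 = "yes" and 0 = "no"."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels: 3 actual "yes" and 5 actual "no" instances.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)          # 2 4 1 1
print(f1_score(tp, fp, fn))    # 4 / 6, about 0.667
```

Note that F1 ignores the true negatives, so it is only one of several possible single-number summaries.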
Let's look at an example.
Examples of the confusion matrix and interpretation
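Assume there is a binary classification problem with 100 instances in one class (class "yes") and 9900 instances in the other class (class "no"). Also, assume that several machine learning algorithms model this problem, each with a different confusion matrix. In the tables below, the rows say whether the prediction was true (T) or false (F), and the columns say whether the prediction was positive (P) or negative (N):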
The first model: Says yes (1) to everything. Thus, this model correctly identifies 100% of the actual “yes” instances and 0% of the actual “no” instances. Here is the confusion matrix for this model and for the given instances.
| | P | N |
| T | 100 | 0 |
| F | 9900 | 0 |
The second model: Says no (0) to everything. Thus, this model correctly identifies 100% of the actual “no” instances and 0% of the actual “yes” instances.
| | P | N |
| T | 0 | 9900 |
| F | 0 | 100 |
The third model: Has the following confusion matrix. Hence, it is correct on 20% of the actual “yes” instances and 91% of the actual “no” instances. Here is the confusion matrix for this model and for the given instances.
| | P | N |
| T | 20 | 9000 |
| F | 900 | 80 |
The fourth model: Has the following confusion matrix. Hence, it is correct on 80% of the actual “yes” instances and 80% of the actual “no” instances. Here is the confusion matrix for this model and for the given instances.
| | P | N |
| T | 80 | 7920 |
| F | 1980 | 20 |
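The percentages quoted for the first three models can be recomputed from the counts. The sketch below does this in plain Python; the helper name `rates` and the dictionary layout are my own choices for illustration. Here the true positive rate is TP / (TP + FN) and the true negative rate is TN / (TN + FP).

```python
def rates(tp, tn, fp, fn):
    """Return (true positive rate, true negative rate) from the four counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0   # share of actual "yes" found
    tnr = tn / (tn + fp) if tn + fp else 0.0   # share of actual "no" found
    return tpr, tnr


# Counts for 100 "yes" and 9900 "no" instances, as in the tables above.
models = {
    "first (always yes)": dict(tp=100, tn=0, fp=9900, fn=0),
    "second (always no)": dict(tp=0, tn=9900, fp=0, fn=100),
    "third": dict(tp=20, tn=9000, fp=900, fn=80),
}

for name, counts in models.items():
    tpr, tnr = rates(**counts)
    print(f"{name}: TPR={tpr:.0%}, TNR={tnr:.0%}")
# first (always yes): TPR=100%, TNR=0%
# second (always no): TPR=0%, TNR=100%
# third: TPR=20%, TNR=91%
```

The raw count of 9000 true negatives for the third model looks impressive next to its 20 true positives, but the rates show that it misses 80% of the "yes" class.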
Notice how misleading it can be to consider the raw numbers in the matrix rather than the rates, especially when the data set is imbalanced.
For the tumor model, the four rates read as follows:
- If the true positive rate is high: When it says yes (the model believes the tumor is cancerous), I should take it seriously
- If the true negative rate is high: When it says no (the model believes the tumor is not cancerous), I can relax
- If the false positive rate is high: When it says yes (the model believes the tumor is cancerous), there is still a good chance that it is not
- If the false negative rate is high: When it says no (the model believes the tumor is not cancerous), I should still be worried
For another example, consider a model that predicts “down_time” for equipment:
- A high false positive rate leads to unnecessary maintenance
- A high false negative rate leads to sudden, unplanned downtime
- A high true positive rate saves us from sudden downtime
- A high true negative rate leads to effective use of the system
How to use false predictions to our benefit
Now let's go back to the main topic: how to use the false predictions of a model to add to our insights. False predictions take place in a subspace of the feature space. Finding that subspace is insightful because we would then know (by inference) that a prediction is more likely to be false if the given test instance falls in it. In particular, one can determine in which subspace the predictions are mostly true (the model can be trusted) and in which they are mostly false (the model cannot be trusted). The features that define these subspaces can be viewed as the "causes" of the model's errors.
Let's, for example, assume we want to determine what causes our predictions to be false positives. Also assume we have a dataset with variables X (m by n, m instances and n features) and response Y (m by 1, m instances), where the elements of Y are binary (0 or 1). The procedure is:
- Train our main model, model M, on <X, Y>
- Then test the model to find when it generates a false positive. This is preferably done with cross-validation, recording the false positives on each test fold.
- Now create another response, Y' (m by 1), where element i of Y' is 1 if the main model gives a false positive on instance i, and 0 otherwise.
- Then fit an interpretable model, model M', to <X, Y'>
- Then find which variables are important in predicting Y' using M'. Those variables are the ones most associated with M making false-positive predictions.
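The steps above can be sketched in plain Python. Everything in this sketch is an assumption made for illustration: the data is synthetic, M is a toy threshold classifier on the first feature, the cross-validation is a simple 5-fold split by index, and M' is a univariate comparison (a standardized mean difference per feature) rather than a fitted model. In practice one would likely use a library such as scikit-learn for both M and M'.

```python
from statistics import mean, pstdev


def fit_m(xs, ys):
    """Toy main model M: threshold feature 0 at the midpoint of the class means."""
    pos = mean(x[0] for x, y in zip(xs, ys) if y == 1)
    neg = mean(x[0] for x, y in zip(xs, ys) if y == 0)
    return (pos + neg) / 2


def false_positive_labels(xs, ys, k=5):
    """Build Y': 1 where M, trained on the other folds, gives a false positive."""
    y_fp = [0] * len(ys)
    for fold in range(k):
        train = [i for i in range(len(ys)) if i % k != fold]
        threshold = fit_m([xs[i] for i in train], [ys[i] for i in train])
        for i in range(fold, len(ys), k):          # the held-out fold
            predicted = 1 if xs[i][0] > threshold else 0
            if predicted == 1 and ys[i] == 0:
                y_fp[i] = 1
    return y_fp


def univariate_effects(xs, y_fp):
    """Toy M': |mean(feature | Y'=1) - mean(feature | Y'=0)| / std(feature).

    Assumes at least one false positive and one other instance exist.
    """
    effects = []
    for j in range(len(xs[0])):
        column = [x[j] for x in xs]
        fp_mean = mean(v for v, f in zip(column, y_fp) if f == 1)
        rest_mean = mean(v for v, f in zip(column, y_fp) if f == 0)
        effects.append(abs(fp_mean - rest_mean) / pstdev(column))
    return effects


# Synthetic data: feature 0 is the signal M uses, feature 1 is a hypothetical
# "context" variable. Negatives with a high context value look like positives.
xs, ys = [], []
for i in range(200):
    y = i % 2
    level = i // 2 % 10                      # context level, 0..9
    hard = (y == 0 and level >= 7)           # negatives that mimic positives
    xs.append([0.8 if hard else float(y), level / 10])
    ys.append(y)

y_fp = false_positive_labels(xs, ys)
effects = univariate_effects(xs, y_fp)
print("false positives:", sum(y_fp))
print("effect per feature:", [round(e, 2) for e in effects])
```

On this toy data the context feature receives the larger score, which flags the region where M should not be trusted when it says "yes".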
M' can then be used to predict whether M is going to generate a false positive. Note that this is the very essence of ensemble methods that use a classifier to choose the best classifier (stacked generalization, see also this).
The cancer paper, this paper, and this one used this technique to find the conditions under which their models generate false predictions. The method for M' in these articles, however, was a simple univariate statistical analysis, chosen for its reliability. The articles found something very interesting (which actually made me a bit nervous about the outcome of the biopsy): the size of the tumor and the age of the patient are important factors for false negatives. Thus, if the tumor is large or the patient is old, the outcome of the biopsy can be wrong (the tumor could be malignant while the model says it is benign). Very disturbing indeed! All is good now though, as that family member had surgery and everything cleared up.
As always, comments are welcome.
By R.B.