Machine learning (ML) algorithms have had a considerable impact on the technology industry over the past several years. Their applications continue to increase within diverse industry verticals, including robotics, healthcare, fraud detection, personalized marketing, recommendation engines, and autonomous vehicles. While the innovation potential of this technology is undisputed, the wide variety of emergent applications raises new challenges and important considerations. One of those challenges arises when training an algorithm on an unbalanced set of data.
Unbalanced datasets are those in which the classes of the target variable appear in highly disproportionate ratios across the observations. It is common to find this kind of dataset in cases examining fraudulent vs. non-fraudulent bank transactions, malignant vs. benign tumors, or predicting whether a product from a production run is defective.
Why is unbalanced data a problem?
To answer this question, it is important to start by defining a guideline for rating the performance of a machine learning algorithm on a given task. To an inexperienced eye, a good performance criterion for our algorithm could be accuracy; that is, the percentage of observations that our algorithm manages to classify in the appropriate class. However, this criterion may be inappropriate when we are faced with unbalanced data.

Assume a fictitious dataset of 1,000 mammogram observations collected from people in a community through an early cancer screening campaign. In this fictitious data, we find 10 people with malignant tumors, while the remaining 990 are healthy. Our goal is to train a machine learning algorithm on this dataset and use it to predict cases in a neighboring community, in which there are 12 people with malignant tumors and 988 healthy. If our algorithm is not properly trained, it could make the mistake of learning only the Majority Class (in this case, observations of healthy people) and classify the entire dataset as healthy. Using accuracy as the performance criterion, the results would take the following form:
- Correct predictions: 988/1000 = 98.8%
- Incorrect predictions: 12/1000 = 1.2%
These results could lead us to the erroneous conclusion that we are using a good algorithm, since it has an accuracy of 98.8%. On second glance, however, it is easy to notice that the algorithm is deeply flawed: all it does is generalize to the dominant class, and it fails to identify the truly important class, the 12 cases with malignant tumors. A similar scenario can be observed in real time with the ongoing Coronavirus outbreak: while cases of infection remain a small percentage of the overall population, looking at the data this way does not tell an accurate story of the virus’s underlying dangers and is certainly not helpful in determining how to address it. To face this challenge, we need other criteria for rating the performance of our algorithm. There is a variety of metrics that more accurately rate and provide insight into an algorithm’s performance, which we’ll go over in the next section.
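To make the accuracy paradox concrete, here is a minimal sketch (assuming scikit-learn is installed) of a classifier that always predicts the majority class on the screening example above; the feature values are placeholders, since this classifier ignores them:

```python
# A classifier that always predicts the majority class scores 98.8%
# accuracy on the screening example above, yet finds zero malignant cases.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 0 = healthy (majority), 1 = malignant tumor (minority)
y_train = np.array([0] * 990 + [1] * 10)    # first community
y_test = np.array([0] * 988 + [1] * 12)     # neighboring community

# Placeholder features; DummyClassifier never looks at them
X_train = np.zeros((1000, 1))
X_test = np.zeros((1000, 1))

clf = DummyClassifier(strategy="most_frequent")
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(accuracy_score(y_test, y_pred))   # 0.988 -- looks impressive...
print(recall_score(y_test, y_pred))     # 0.0   -- ...but catches no tumors
```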
Performance Metrics
A very useful tool for measuring performance in a binary classification task is the confusion matrix, which allows us to easily compare the number of predicted observations in each class against the actual values of those observations. Figure 1 shows how a confusion matrix is built from the real values of the observations in the test set and the predictions our algorithm produces for them.
Figure 1: Confusion Matrix for Binary Classification Task
From the confusion matrix it is possible to obtain other useful metrics among which are:
- Precision: Predicted positive cases classified correctly = tp/(tp+fp)
- PCC (Percent Correct Classification): Overall accuracy = (tp+tn)/(tp+tn+fp+fn)
- False Alarm Rate (FA): Actual negative classified as positive (Type I Error Rate) = fp/(tn+fp)
- False Dismissal Rate (FD): Actual positive classified as negative (Type II Error Rate) = fn/(tp+fn)
- Specificity: Actual negative cases classified correctly (True Negative Rate) = tn/(tn+fp)
- Recall/Sensitivity: Actual positive cases classified correctly (True Positive Rate) = tp/(tp+fn) = 1 - FD
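As a quick illustration, the following sketch (assuming scikit-learn; the toy label vectors are purely illustrative) shows how each of these metrics falls out of the four cells of the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 0, 1, 0, 1, 0, 0]   # actual classes
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]   # model predictions

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision   = tp / (tp + fp)                    # 2/3
pcc         = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy, 6/8
false_alarm = fp / (tn + fp)                    # Type I error rate, 1/5
false_dism  = fn / (tp + fn)                    # Type II error rate, 1/3
specificity = tn / (tn + fp)                    # 4/5
recall      = tp / (tp + fn)                    # 2/3, equals 1 - false_dism
```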
Let’s examine some techniques for handling cases of unbalanced data.
1) Random Resampling Techniques: This approach aims to balance the classes in the training data (data preprocessing) before providing the data as input to the machine learning algorithm. The main objective of balancing classes is to either increase the frequency of the minority class (over-sampling) or decrease the frequency of the majority class (under-sampling), in order to obtain approximately the same number of instances for both classes. A minimal sketch follows the list below.
Advantages
- Easy to implement
- Does not lead to a significant increase in resources or execution time
Disadvantages
- Random under-sampling can discard useful information, and random over-sampling can lead to overfitting
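Here is a minimal sketch of random resampling, assuming the imbalanced-learn library (with a synthetic dataset standing in for real data):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Synthetic two-class data with roughly a 9:1 imbalance, for illustration
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
print(Counter(y))               # roughly 900 majority vs. 100 minority

# Over-sampling: randomly duplicate minority observations
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)
print(Counter(y_over))          # classes now balanced

# Under-sampling: randomly drop majority observations
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)
print(Counter(y_under))         # classes now balanced, but smaller dataset
```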
2) Clustering the majority class: This approach also obtains a sub-sample of the majority class, but instead of relying on random samples to cover the variety of the training data, it clusters the abundant class into n groups, with n being the number of cases in the minority class. For each group, only the centroid (center of the cluster) is kept, and the model is then trained with the rare class and the centroids only. A sketch follows the list below.
Advantages
- Unlike random under-sampling, this method allows for keeping most of the information from original samples intact
- It allows the user to perfectly balance the majority and minority classes without overfitting the model
Disadvantages
- This is a more complex technique, since a clustering model must be fit during the data preprocessing stage
- Run time and computational resource usage increase, since a clustering algorithm has to be run on the majority class
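Here is a sketch of this idea, assuming imbalanced-learn’s ClusterCentroids under-sampler, which replaces the majority class with the centroids of a K-means clustering:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Synthetic imbalanced data, for illustration only
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
print(Counter(y))

# The majority class is clustered and each cluster is replaced by its
# centroid, so the retained majority "observations" are synthetic centers
cc = ClusterCentroids(random_state=42)
X_res, y_res = cc.fit_resample(X, y)
print(Counter(y_res))   # both classes reduced to the minority class size
```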
3) Informed Over Sampling: Synthetic Minority Oversampling Technique (SMOTE): This is an over-sampling technique in which, rather than replicating existing minority class samples at random, new synthetic samples of the minority class are created and added to the original dataset. These synthetic observations are generated by interpolating between the features of existing minority class instances.
This approach aims to mitigate the overfitting that occurs when exact replicas of minority instances are added to the main dataset. The SMOTE method can be implemented easily with imbalanced-learn [1], a powerful Python library for handling imbalanced data; a sketch follows the list below.
Advantages
- It doesn’t lose useful information, as is common with some under-sampling techniques
- It mitigates the overfitting caused by random over-sampling, since synthetic examples are generated rather than replicas of existing instances
Disadvantages
- SMOTE is not very effective for high-dimensional data
- While generating synthetic examples, SMOTE does not take into consideration neighboring examples from other classes. This can result in an increased overlapping of classes and can introduce additional noise.
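A minimal sketch of SMOTE with imbalanced-learn [1], again using a synthetic dataset as a stand-in for real data:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced data, for illustration only
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
print(Counter(y))

# Each synthetic sample is interpolated between a minority instance and
# one of its k_neighbors nearest minority neighbors (5 is the default)
smote = SMOTE(random_state=42, k_neighbors=5)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y_res))   # minority class up-sampled to match the majority
```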
Conclusion
- There are easy and powerful methods available for balancing unbalanced data when training machine learning algorithms
- When evaluating model performance, it is essential to define criteria appropriate to the application
- There is a wide range of techniques for handling unbalanced data, and it is equally important to choose the appropriate one according to the circumstances of each case
At Growth Acceleration Partners, we have extensive expertise in many verticals. We can provide your organization with resources in the following areas:
- Software development for cloud and mobile applications
- Data analytics and data science
- Information systems
- Machine learning and artificial intelligence
- Predictive modeling
- QA Automation
If you have any further questions regarding our services, please reach out to us.
REFERENCES
[1] Imbalanced-learn documentation: https://imbalanced-learn.readthedocs.io/en/stable/over_sampling.html
About Emanuel Hernández Castillo
Emanuel Hernández Castillo is a Data Scientist at Growth Acceleration Partners. He is passionate about robotics and reinforcement learning, highly experienced in supervised/unsupervised learning, modeling techniques, and developing machine learning algorithms using Python. Outside of work he enjoys traveling, hiking and spending time in nature. You can connect with Emanuel on LinkedIn or by sending him an email.