The classification problem for US bank insurance business data has an imbalanced data distribution.
This means the ratio between the positive and negative classes is extremely skewed; for example, the ratio may be 100:1. Prediction models generated directly by supervised learning algorithms such as SVM or Logistic Regression are then biased toward the larger class, so such models are of little help in prediction. An imbalanced class distribution degrades the performance of a classifier.
Thus, some technique should be applied to deal with this problem. One approach to handling an unbalanced class distribution is sampling [2], which rebalances the dataset. Sampling techniques are broadly classified into two types:
under sampling and over sampling. Under sampling is applied to the majority class to reduce it (e.g. Random Under Sampling), while over sampling adds samples to the minority class (e.g.
Random Over Sampling). The drawback of ROS is redundancy in the dataset: the duplicated samples add no new information, so the classifier may still fail to recognize the minority class. To overcome this problem, SMOTE (Synthetic Minority Over-sampling Technique) is used. SMOTE creates additional synthetic samples that are close and similar to the near neighbors of the minority class samples, rebalancing the dataset with the help of K-Nearest Neighbors (KNN) [2]. Sampling methods are also divided into non-heuristic and heuristic methods.
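The three sampling techniques above can be sketched in plain Python. The toy dataset, the helper names, and the `smote` parameters are illustrative assumptions, not the paper's actual implementation; the `smote` function interpolates a randomly chosen minority sample toward one of its k nearest minority neighbors:

```python
import math
import random

random.seed(0)

# Toy imbalanced dataset: many majority points, few minority points.
majority = [(random.uniform(3, 5), random.uniform(3, 5)) for _ in range(10)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]

# Random Under Sampling: discard majority samples until the classes match.
rus_majority = random.sample(majority, len(minority))

# Random Over Sampling: duplicate minority samples until the classes match.
ros_minority = minority + [random.choice(minority)
                           for _ in range(len(majority) - len(minority))]

def smote(samples, n_new, k=2):
    """SMOTE sketch: synthesize n_new points, each lying between a random
    minority sample and one of its k nearest minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        x = random.choice(samples)
        neighbours = sorted((p for p in samples if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = random.choice(neighbours)
        gap = random.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

smote_minority = minority + smote(minority, len(majority) - len(minority))
print(len(rus_majority), len(ros_minority), len(smote_minority))  # 4 10 10
```

Unlike ROS, the SMOTE output contains new points inside the minority region rather than exact duplicates.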
Non-heuristic methods randomly remove samples from the majority class in order to reduce the degree of imbalance [10]. Heuristic sampling is another method, which distinguishes samples based on a nearest-neighbor algorithm [7]. Another difficulty in classification problems is data quality, in particular the existence of missing data. Frequent occurrence of missing data gives biased results. Mostly, dataset attributes are dependent on each other.
Thus, identifying the correlation between those attributes can be used to discover the missing data values. The approach of replacing missing values with probable values is called imputation [6]. Data quality is also one of the challenges in big data: if it is not ensured, it can mislead to wrong predictions, and missing data is one significant data quality problem. Imputation reconstructs the missing data with estimated values. It has the advantage of handling missing data without the help of learning algorithms, and it allows the researcher to select the imputation method suitable for a particular circumstance [3].
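Two simple imputation strategies can be sketched as follows: mean imputation for a numeric attribute and mode imputation for a categorical one. The attribute names and values are hypothetical, not from the paper's dataset:

```python
from statistics import mean, mode

# Toy records with missing values marked as None (hypothetical attributes).
ages    = [25, 30, None, 40, None, 35]        # numeric -> mean imputation
regions = ["east", None, "west", "east"]      # categorical -> mode imputation

# Estimate replacements from the observed (non-missing) values only.
age_mean    = mean(v for v in ages if v is not None)
region_mode = mode(v for v in regions if v is not None)

ages_filled    = [v if v is not None else age_mean for v in ages]
regions_filled = [v if v is not None else region_mode for v in regions]

print(ages_filled)     # missing ages replaced by the mean, 32.5
print(regions_filled)  # missing region replaced by the mode, "east"
```

A predictive-model approach would instead train a regressor or classifier on the complete records to estimate each missing value from the correlated attributes.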
There are many imputation methods for missing value treatment; some widely used ones are case substitution, mean and mode imputation, and predictive models. In this paper we build a predictive model for missing value treatment. There are a variety of machine learning algorithms to crack both classification and regression problems. Machine learning is the practice of designing systems that have the capability to repeatedly learn and improve without being explicitly programmed.
Machine learning algorithms are classified into three types: supervised learning, unsupervised learning, and reinforcement learning. In this paper, we use supervised machine learning algorithms to build the model. Some supervised learning algorithms are Regression, Decision Tree, Random Forest, KNN, and Logistic Regression [8].
A decision tree in machine learning can be used for both classification and regression. In decision analysis, a decision tree can visually and unambiguously represent decisions. The tree has two significant entities, known as decision nodes and leaves. The leaves are the verdict or final result, and the decision nodes are where the data is split. A classification tree is a type of decision tree where the outcome is a variable like 'fit' or 'unfit'.
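How a decision node splits the data can be sketched by finding the threshold that minimizes weighted Gini impurity, the criterion commonly used for classification trees. The 'fit'/'unfit' toy data and its features are invented for illustration:

```python
# Toy data: (age, exercise_hours, label) -- hypothetical fitness records.
data = [
    (25, 5, "fit"), (30, 4, "fit"), (45, 1, "unfit"),
    (50, 0, "unfit"), (35, 3, "fit"), (60, 1, "unfit"),
]

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, feature):
    """Return (weighted Gini, threshold) of the best split on `feature`."""
    best = (float("inf"), None)
    for threshold in sorted({r[feature] for r in rows}):
        left  = [r[-1] for r in rows if r[feature] <= threshold]
        right = [r[-1] for r in rows if r[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        best = min(best, (score, threshold))
    return best

score, threshold = best_split(data, feature=1)  # split on exercise_hours
print(threshold, score)  # 1 0.0 -- the split separates the classes perfectly
```

A full tree would recurse on each side of the split until the leaves are pure; here one decision node already yields pure 'fit' and 'unfit' leaves.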
Here the decision variable is categorical. One of the best ensemble methods is the random forest, which is used for both classification and regression [5]. A random forest is a collection of many decision trees, each grown to its full depth, and it has the advantage of automatic feature selection [4]. Gradient boosting seeks to consecutively decrease the error with each successive model until one final model is produced. The key aim of every machine learning algorithm is to construct the strongest predictive model while also accounting for computational efficiency.
This is where the XGBoost algorithm comes into play. XGBoost (eXtreme Gradient Boosting) is a direct application of gradient boosting to decision trees. It offers a more regularized model formalization to control over-fitting, which gives improved performance [8].
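The successive error-reduction idea behind gradient boosting can be sketched for squared error: each stage fits a weak learner (here, a one-split regression stump) to the current residuals and is added with a shrinkage factor. The toy data is an illustrative assumption; XGBoost adds regularization and many engineering refinements on top of this basic scheme:

```python
# Toy 1-D regression data (illustrative).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1]

def fit_stump(x, residuals):
    """One-split regression stump: pick the threshold whose left/right
    residual means give the lowest squared error."""
    best = None
    for t in x[:-1]:  # splitting after the largest x leaves one side empty
        left  = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - (lm if xi <= t else rm)) ** 2
                  for xi, r in zip(x, residuals))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

learning_rate = 0.1
pred = [sum(y) / len(y)] * len(y)       # stage 0: predict the mean
for _ in range(200):                    # each stage corrects the residuals
    residuals = [yi - pi for yi, pi in zip(y, pred)]
    stump = fit_stump(X, residuals)
    pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, X)]

mse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)
print(round(mse, 4))  # training error driven toward zero
```

The shrinkage factor (`learning_rate`) is what makes the reduction gradual: each new model corrects only a fraction of the remaining error, which is the sequential behavior described above.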