The classification problem for US bank insurance business data
has an imbalanced class distribution: the ratio between the positive and
negative classes is extremely skewed, for example as high as 100:1. Prediction
models generated directly by supervised learning algorithms such as SVM or
Logistic Regression are therefore biased toward the majority class, and such
a model is of little help in prediction.
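A quick numeric sketch (toy figures, not the paper's data) shows why such a model is unhelpful: on a 100:1 dataset, always predicting the majority class already scores about 99% accuracy while missing every positive case.

```python
import numpy as np

# Toy 100:1 imbalance: 1000 negatives, 10 positives (illustrative only)
y = np.array([0] * 1000 + [1] * 10)
pred = np.zeros_like(y)            # a "model" that always predicts negative

accuracy = (pred == y).mean()      # ~0.99 despite learning nothing
recall_pos = (pred[y == 1] == 1).mean()   # 0.0: every positive is missed
print(accuracy, recall_pos)
```

Accuracy alone is therefore a misleading metric under heavy imbalance, which is why the distribution itself must be addressed.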
The imbalanced distribution degrades the performance of the classifier, so
techniques must be applied to deal with this problem. One approach to handling
an unbalanced class distribution is sampling [2], which rebalances the dataset.
Sampling techniques are broadly classified into two types: under sampling and
over sampling. Under sampling reduces the majority class (e.g. Random Under
Sampling), while over sampling adds samples to the minority class (e.g. Random
Over Sampling). The drawback of ROS is redundancy in the dataset: because the
same minority samples are duplicated, the classifier may still fail to
recognize the minority class. To overcome this problem, SMOTE (Synthetic
Minority Over-sampling Technique) is used. With the help of K-Nearest
Neighbors (KNN), SMOTE creates additional synthetic samples that lie close to
the near neighbors of existing minority samples, rebalancing the dataset [2].
Sampling methods can also be divided into non-heuristic and heuristic methods.
Non-heuristic methods randomly remove samples from the majority class to
reduce the degree of imbalance [10], whereas heuristic methods decide which
samples to keep based on a nearest-neighbor algorithm [7].

Another difficulty in classification is data quality, in particular the
existence of missing data. Frequent occurrence of missing data gives biased
results. Dataset attributes are usually dependent on each other, so the
correlation between attributes can be used to discover missing values. The
approach of replacing missing values with probable ones is called
imputation [6].
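The nearest-neighbor interpolation behind SMOTE can be sketched as below; this is a minimal illustration (the function name and parameters are my own), not a production implementation such as the one in imbalanced-learn.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolating each
    chosen sample toward one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # brute-force pairwise distances within the minority class
    dist = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbour
    nn = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbours per point
    synth = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                     # random minority sample
        b = nn[a, rng.integers(min(k, n - 1))]  # one of its neighbours
        gap = rng.random()                      # interpolation factor in [0, 1)
        synth[i] = X_min[a] + gap * (X_min[b] - X_min[a])
    return synth

minority = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5]])
new_samples = smote_oversample(minority, n_new=10, k=3)
print(new_samples.shape)   # (10, 2)
```

Because each synthetic point lies on a segment between two real minority samples, it stays within the region the minority class already occupies, avoiding the exact-duplicate problem of ROS.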
One of the challenges in big data is data quality. We need to
ensure the quality of the data; otherwise it can mislead the model into wrong
predictions. A significant data-quality problem is missing data.
Imputation is a method for handling missing data that
reconstructs the missing values with estimated ones. Imputation has the
advantage of handling missing data without the help of learning algorithms,
and it also allows the researcher to select the imputation method best suited
to the particular circumstance [3].
There are many imputation methods for missing-value treatment;
some widely used ones are case substitution, mean and mode imputation, and
predictive models. In this paper we have built a predictive model for
missing-value treatment.
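As a sketch of predictive-model imputation (the function name and the choice of a least-squares linear model are my own for illustration; the paper's actual model may differ), a regression fitted on the complete rows can fill a column's missing entries from the correlated attributes:

```python
import numpy as np

def impute_by_regression(X, col):
    """Fill NaNs in column `col` with predictions from a linear model
    fitted on the rows where that column is observed."""
    X = X.copy()
    missing = np.isnan(X[:, col])
    others = [j for j in range(X.shape[1]) if j != col]
    # fit least squares (with an intercept term) on the complete rows
    A = np.column_stack([X[~missing][:, others], np.ones((~missing).sum())])
    coef, *_ = np.linalg.lstsq(A, X[~missing, col], rcond=None)
    # predict the missing entries from the other attributes
    B = np.column_stack([X[missing][:, others], np.ones(missing.sum())])
    X[missing, col] = B @ coef
    return X

# toy data where column 1 = 2 * column 0 + 1, with one value missing
data = np.array([[0., 1.], [1., 3.], [2., 5.], [3., np.nan], [4., 9.]])
filled = impute_by_regression(data, col=1)
print(filled[3, 1])   # ≈ 7.0, recovered from the correlation
```

This exploits exactly the attribute correlations discussed above: the stronger the dependence between attributes, the better the reconstructed values.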
There is a variety of machine learning algorithms for solving
both classification and regression problems. Machine learning is the practice
of designing systems that have the capability to repeatedly learn and improve
without being explicitly programmed. Machine learning algorithms are
classified into three types: supervised learning, unsupervised learning, and
reinforcement learning. In this paper, we use supervised machine learning
algorithms to build the model. Some of the supervised learning algorithms are
Regression, Decision Tree, Random Forest, KNN, and Logistic Regression [8].

A decision tree in machine learning can be used for both classification and
regression. In decision analysis, a decision tree can be used to visually and
unambiguously represent decisions. The tree has two significant entities,
known as decision nodes and leaves. The leaves carry the verdict or final
result, and the decision nodes are where the data is split. A classification
tree is a type of decision tree whose outcome is a categorical variable such
as 'fit' or 'unfit'.
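A minimal classification tree with a categorical 'fit'/'unfit' outcome can be sketched with scikit-learn; the toy features (age and weekly exercise hours) are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [age, weekly exercise hours] -> fitness label
X = [[25, 5], [40, 1], [30, 4], [55, 0], [22, 6], [48, 2]]
y = ['fit', 'unfit', 'fit', 'unfit', 'fit', 'unfit']

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict([[28, 5]]))   # a young, active person classifies as 'fit'
```

Internally the fitted tree's decision nodes split on one feature at a time (here, exercise hours or age), and each leaf holds one of the two categorical verdicts.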
One of the best ensemble methods is the random forest. It is used
for both classification and regression [5]. A random forest is a collection of
many decision trees, each grown to full depth, and it has the advantage of
performing automatic feature selection [4].
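The automatic feature selection can be seen through the impurity-based feature importances a scikit-learn forest exposes; in this synthetic sketch only the first feature drives the label:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)   # feature 0 is informative, feature 1 is noise

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(forest.feature_importances_)   # feature 0 dominates
```

The importances sum to 1, so low-scoring features can be dropped without retraining a separate selector.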
Gradient Boosting seeks to consecutively decrease the error with
each successive model, until one final model is produced. The key aim of every
machine learning algorithm is to construct the strongest predictive model
while also accounting for computational efficiency. This is where the
XGBoost algorithm comes into play.
XGBoost (eXtreme Gradient Boosting) is a direct application of Gradient
Boosting to decision trees. It adds a more regularized model
formalization to control over-fitting, which gives improved performance [8].
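The sequential error reduction can be observed with scikit-learn's GradientBoostingClassifier, used here in place of the XGBoost library so the sketch needs no extra dependency; `staged_predict` reports the ensemble's prediction after each added tree:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
gb = GradientBoostingClassifier(n_estimators=100, random_state=0).fit(X, y)

# training error after each successive tree is added to the ensemble
errors = [1 - (stage == y).mean() for stage in gb.staged_predict(X)]
print(errors[0], errors[-1])   # error shrinks as models accumulate
```

Each new tree is fitted to the residual error of the ensemble so far, which is the "consecutive decrease" described above; XGBoost follows the same scheme with additional regularization terms in its objective.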