The classification problem for US bank insurance business data has an imbalanced data distribution: the ratio between the positive and negative classes is extremely unbalanced, for example 100:1. Prediction models generated directly by supervised learning algorithms such as SVM or Logistic Regression are therefore biased toward the larger class, and such a model does not help in prediction.
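The bias described above can be illustrated with a small sketch (synthetic data, using scikit-learn's DummyClassifier as the always-majority model): at a 100:1 ratio, always predicting the negative class scores about 99% accuracy while detecting no positive cases at all.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic 100:1 imbalanced labels: 1000 negatives, 10 positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1010, 4))
y = np.array([0] * 1000 + [1] * 10)

# A "model" that always predicts the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = majority.predict(X)

acc = accuracy_score(y, pred)   # high accuracy...
recall = recall_score(y, pred)  # ...but zero recall on the minority class
```

High accuracy here is an artifact of the class ratio, which is why rebalancing techniques are needed before training.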

An imbalanced class distribution degrades the performance of a classifier, so techniques should be applied to deal with this problem. One approach to handling an unbalanced class distribution is sampling [2], which rebalances the dataset. Sampling techniques are broadly classified into two types: under-sampling and over-sampling. Under-sampling reduces the majority class (e.g., Random Under Sampling, RUS), while over-sampling adds samples to the minority class (e.g., Random Over Sampling, ROS). The drawback of ROS is redundancy in the dataset: duplicated samples again lead to a classification problem, because the classifier may still fail to recognize the minority class. To overcome this, SMOTE (Synthetic Minority Over-sampling Technique) is used. SMOTE creates additional synthetic samples that are close and similar to the near neighbors of minority-class samples, found with K-Nearest Neighbors (KNN), to rebalance the dataset [2]. Sampling methods are also divided into non-heuristic and heuristic methods. Non-heuristic methods randomly remove samples from the majority class to reduce the degree of imbalance [10]; heuristic methods distinguish which samples to keep based on a nearest-neighbor algorithm [7].

Another difficulty in classification is data quality, in particular the presence of missing data. Frequent missing values give biased results. Dataset attributes are mostly dependent on each other, so the correlations among attributes can be used to recover missing values. Replacing missing values with probable estimates is called imputation [6].
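Returning to the sampling discussion, the SMOTE interpolation described above can be sketched as follows (a minimal illustration with NumPy and scikit-learn on toy data, not the reference imbalanced-learn implementation): each synthetic point lies between a minority sample and one of its k nearest minority neighbors.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating
    between each sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: self is nearest
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority sample
        j = idx[i][rng.integers(1, k + 1)]   # pick one of its k neighbors
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Rebalance a toy 20:4 split to 20:20.
rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.0, size=(20, 2))
X_min = rng.normal(3.0, 0.5, size=(4, 2))
X_syn = smote_sketch(X_min, n_new=16, k=3)
X_bal = np.vstack([X_maj, X_min, X_syn])  # 20 majority + 20 minority
```

Because every synthetic point is a convex combination of two minority samples, it stays inside the minority class's region of the feature space, avoiding the exact duplicates that ROS produces.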

One of the challenges in big data is data quality. We need to ensure the quality of the data; otherwise it can mislead the model into wrong predictions. One significant data-quality problem is missing data. Imputation is a method for handling missing data: it reconstructs the missing values with estimated ones. Imputation has the advantage of handling missing data without the help of learning algorithms, and it also allows the researcher to select the imputation method best suited to a particular circumstance [3].
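Two of the simple imputation methods mentioned below can be sketched with scikit-learn's SimpleImputer on made-up data: missing numeric values are replaced by the column mean, and missing categorical values by the column mode.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric column with a missing entry (np.nan).
X_num = np.array([[1.0], [2.0], [np.nan], [3.0]])
mean_imputer = SimpleImputer(strategy="mean")
X_num_filled = mean_imputer.fit_transform(X_num)  # nan -> (1+2+3)/3 = 2.0

# Toy categorical column: mode ("most_frequent") imputation.
X_cat = np.array([["a"], ["b"], ["a"], [np.nan]], dtype=object)
mode_imputer = SimpleImputer(strategy="most_frequent")
X_cat_filled = mode_imputer.fit_transform(X_cat)  # nan -> "a"
```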

There are many imputation methods for missing-value treatment; some widely used ones are case substitution, mean and mode imputation, and predictive models. In this paper we build a predictive model for missing-value treatment.
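The idea of a predictive model for missing-value treatment can be outlined as follows (an illustrative sketch with a decision-tree regressor on toy data, not the exact model built in this paper): rows where the attribute is observed train a model, which then predicts the attribute for rows where it is missing.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy table: two observed features and one attribute with a missing value.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
target = np.array([10.0, 20.0, np.nan, 40.0])  # attribute to impute

# Train on rows where the attribute is observed.
observed = ~np.isnan(target)
model = DecisionTreeRegressor(random_state=0).fit(X[observed], target[observed])

# Predict the attribute for rows where it is missing.
filled = target.copy()
filled[~observed] = model.predict(X[~observed])
```

This exploits the correlation between attributes noted earlier: the better the observed features explain the attribute, the better the imputed values.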

There is a variety of machine learning algorithms for solving both classification and regression problems. Machine learning is the practice of designing systems that can repeatedly learn and improve their performance without being explicitly programmed. Machine learning algorithms are classified into three types: supervised learning, unsupervised learning, and reinforcement learning. In this paper, we use supervised machine learning algorithms to build the model. Some supervised learning algorithms are: regression, decision trees, random forests, KNN, logistic regression, etc. [8]. A decision tree in machine learning can be used for both classification and regression.

In decision analysis, a decision tree can be used to visually and explicitly represent decisions. The tree has two significant kinds of entities, known as decision nodes and leaves. The leaves are the verdicts, or final results, and the decision nodes are where the data is split. A classification tree is a type of decision tree whose outcome is a variable such as 'fit' or 'unfit'; here the decision variable is categorical.
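A minimal sketch of such a classification tree with a categorical 'fit'/'unfit' outcome, using scikit-learn on made-up data:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy features: [hours_exercise_per_week, hours_tv_per_day]
X = [[7, 1], [8, 0], [6, 2], [1, 6], [0, 5], [2, 7]]
y = ["fit", "fit", "fit", "unfit", "unfit", "unfit"]

# Decision nodes split the data; leaves hold the final verdict.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pred = tree.predict([[9, 1], [1, 8]])
```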

One of the best ensemble methods is the random forest, which is used for both classification and regression [5]. A random forest is a collection of many decision trees, each grown to its full depth, and it has advantages such as automatic feature selection [4].
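A brief random-forest sketch on toy data (illustrative only); the `feature_importances_` attribute reflects the automatic feature selection mentioned above, since trees simply stop using uninformative features.

```python
from sklearn.ensemble import RandomForestClassifier

X = [[7, 1], [8, 0], [6, 2], [1, 6], [0, 5], [2, 7]]
y = [1, 1, 1, 0, 0, 0]

# 100 fully grown trees (max_depth=None), each fit on a bootstrap sample.
forest = RandomForestClassifier(n_estimators=100, max_depth=None,
                                random_state=0).fit(X, y)

pred = forest.predict([[9, 0]])
importances = forest.feature_importances_  # relative usefulness of each feature
```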

Gradient boosting seeks to consecutively decrease the error with each successive model until one final model is produced. The key aim of every machine learning algorithm is to construct the strongest predictive model while also accounting for computational efficiency. This is where the XGBoost algorithm comes into play.

XGBoost (eXtreme Gradient Boosting) is a direct application of gradient boosting to decision trees. It provides a more regularized model formalization to control over-fitting, which gives improved performance [8].
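The sequential error-reduction idea can be sketched as follows. The xgboost package provides its own `XGBClassifier`; to keep this sketch self-contained it instead uses scikit-learn's GradientBoostingClassifier as a stand-in on toy data, which implements the same boosting scheme without XGBoost's extra regularization.

```python
from sklearn.ensemble import GradientBoostingClassifier

X = [[7, 1], [8, 0], [6, 2], [1, 6], [0, 5], [2, 7]]
y = [1, 1, 1, 0, 0, 0]

# Each of the 50 shallow trees fits the residual error of the ensemble so far;
# learning_rate shrinks every step, a simple form of regularization.
gb = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                max_depth=2, random_state=0).fit(X, y)

train_acc = gb.score(X, y)  # separable toy data, so the fit is perfect
```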