Algorithms Applied:

Artificial Neural Network:

Artificial neural networks are formed by interconnecting processing elements called nodes; models built this way are also called connectionist models. The strength of each connection is given by its weight, and the network solves a problem by adjusting these weights. The main purpose of using artificial neural networks is to find the hidden patterns present in a large dataset. They are highly tolerant of noise, can adapt themselves to new situations, and form a robust supervised classification technique.

Reason For Applying Artificial Neural Network:

Since the dataset I am working with is of very high dimension, an artificial neural network is an appropriate choice: it can find the hidden patterns, adapt itself to the dataset, and train the model accordingly. The accuracy produced by applying the artificial neural network was nearly 63%. In the following steps I describe what I did to improve the model's performance and increase the accuracy.

The dataset initially sent into the artificial neural network had a class imbalance in the target variable. After solving this imbalance by importing the imblearn package and using the SMOTE function, the accuracy of the model increased to 76%, which is reasonably good.

Furthermore, I tried changing the parameters of the artificial neural network classifier by increasing the number of hidden layers, but the accuracy of the model did not improve.

General applications of artificial neural networks:

Artificial neural networks have been used successfully for data classification (by training and testing a model, as seen above), for weather prediction, and for data association.

Gaussian Naïve Bayes:

Gaussian Naïve Bayes is a probabilistic classifier. It applies Bayes' theorem under the assumption that all attributes in the dataset are completely independent of each other; the word "Gaussian" indicates that the algorithm additionally assumes the feature values within each class follow a normal distribution. The main advantages of the Gaussian Naïve Bayes algorithm are that it is simple to understand and build, it can work on large datasets, and it can be used in complex real-world situations. Despite its simplicity it is often surprisingly effective. The algorithm proceeds in three main steps. Step 1: convert the dataset into a frequency table. Step 2: create a likelihood table from the class priors and the conditional probabilities of the features. Step 3: find the posterior probability of each class by applying Bayes' theorem, and predict the class with the highest posterior probability.
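The posterior-probability step can be illustrated with scikit-learn's GaussianNB. The toy one-feature data below is an assumption for demonstration; the report's real dataset is of course larger.

```python
# Sketch: GaussianNB computes P(class | x) and predicts the class with
# the highest posterior, as in step 3 above. Toy data for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two classes whose single feature is normally distributed
# around different means (0 and 4).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 50), rng.normal(4, 1, 50)]).reshape(-1, 1)
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB().fit(X, y)
probs = model.predict_proba([[3.5]])   # posterior probability per class
print(probs, model.predict([[3.5]]))   # predict() picks the larger posterior
```

Since 3.5 lies much closer to the class-1 mean, the posterior for class 1 dominates and it becomes the final prediction.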

Reason For Applying Gaussian Naïve Bayes:

The main reason for choosing the Gaussian Naïve Bayes algorithm is that it can deal with large datasets: after splitting the data into training and test sets, Naïve Bayes trains very quickly and predicts the class with the highest posterior probability. Naïve Bayes is also not sensitive to irrelevant features in the dataset. Applying Naïve Bayes to this dataset gave a relatively good accuracy of 87%. Furthermore, I tried to increase the performance of the model in the way discussed below.

In the previous step I measured the accuracy of the model without solving the class imbalance problem of the target variable. After solving the class imbalance with the built-in function and rerunning Naïve Bayes on the dataset, the accuracy did not change. The final accuracy of the model with Gaussian Naïve Bayes is therefore 87%.

General applications of Gaussian Naïve Bayes:

Gaussian Naïve Bayes has a huge application scope because of its fast processing of datasets. It can be used in text classification based on word-count frequencies, in spam detection, and in sentiment analysis.

Logistic Regression:

Logistic regression is an example of a discriminative classifier: a simple linear classifier that deals with binary classification. Its main components are the log odds, the logit score, and the beta coefficients. The odds are the ratio of the probability that an event occurs to the probability that it does not, and the log odds are the logarithm of this ratio. The logit score is computed from the target variable's important attributes as a weighted sum of their values. This logit score is passed into the sigmoid function, whose output is the probability of one of the two outcomes. The sigmoid function produces values between 0 and 1, but it can never output exactly 0 or 1; it only approaches these bounds. The beta coefficients are estimated from the training data by maximum likelihood estimation. Logistic regression typically uses a regularization factor to prevent overfitting: the regularization adds a penalty term that improves the quality of the output. With suitable feature transformations, logistic regression can also model non-linear relationships.
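The logit-to-probability chain described above can be worked through numerically. The beta coefficients and feature values below are made up purely for illustration.

```python
# Worked example of the log-odds -> sigmoid -> probability pipeline.
# The coefficients (beta values) here are hypothetical, not fitted.
import math

def sigmoid(z):
    # Maps any real-valued logit score into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

beta0, beta1, beta2 = -1.5, 0.8, 0.4   # hypothetical fitted coefficients
x1, x2 = 2.0, 1.0                      # feature values of one record

logit = beta0 + beta1 * x1 + beta2 * x2  # logit score (the log odds) = 0.5
p = sigmoid(logit)                       # probability of the positive class
odds = p / (1 - p)                       # recovering the odds: exp(logit)
print(round(logit, 3), round(p, 3))
```

Note that the sigmoid output stays strictly between 0 and 1 for any finite logit, matching the description above, and that exponentiating the logit recovers the odds.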

Reason For Applying Logistic Regression:

The main reason for choosing the logistic regression classification algorithm is that the target attribute of the dataset is suitable for binary classification. It is also robust and can work accurately on large datasets by computing logit scores for each record. The resulting accuracy of logistic regression on the dataset was 65%. To increase the accuracy of the model, I solved the class imbalance problem and passed the balanced dataset into the model, but there was no noticeable increase: the accuracy remained at 65%.

General Applications of Logistic Regression:

Logistic regression is applicable to image segmentation, handwriting recognition, image processing, and general prediction tasks.

Random Forest:

Random Forest is an ensemble extension of decision trees; it is also called a random decision forest. The algorithm handles both categorical and numerical values and is robust to noise and outliers. A key advantage of Random Forest is that increasing the number of trees generally improves, or at least stabilizes, the overall accuracy of the model. It is a bagging method built on bootstrapping: each tree is a CART model fit on a bootstrap sample of the records, using a random subset of the attributes at each split. The generalization error of the forest depends on the strength of the individual trees and on the correlation between them. Random Forests run efficiently on large datasets, produce highly accurate results, and are comparatively resistant to overfitting. They can handle thousands of input variables without deleting any, and they also provide an estimate of which variables in the dataset are important for prediction.
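The variable-importance estimate mentioned above can be sketched with scikit-learn. The synthetic dataset and forest size below are assumptions for illustration.

```python
# Sketch: a RandomForestClassifier exposing per-feature importance
# estimates. Each tree is fit on a bootstrap sample of the rows with a
# random subset of features considered at every split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data where only 3 of the 8 features carry signal.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(forest.feature_importances_)  # one score per feature, summing to 1
```

The importance scores make it easy to see which attributes the forest actually relies on for prediction.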

Reason For Applying Random Forests:

The main reason for choosing Random Forest is the freedom it has to sample any number of records and attributes when making the final prediction; additionally, its computational cost is very low. Applying the Random Forest algorithm to this dataset produced an accuracy of 83%. I tried to increase the accuracy by increasing the number of trees, which showed no noticeable change. Furthermore, I also tried balancing the unbalanced target attribute, but even then the model showed no noticeable increase in accuracy.

General applications of Random Forests:

Random forests are an important tool for prediction. The training process is fully parallelizable, sample proximities can be used to generate tree-based clusters, and predictor selection is performed automatically as part of the process.

K Nearest Neighbors:

K nearest neighbors is an algorithm used for prediction: given an unseen data point, the KNN algorithm searches for the k most similar training instances and produces a prediction from them. To determine the k nearest points we use the Euclidean distance for numerical data, and the Hamming distance for categorical or binary data. KNN is considered non-parametric because it does not assume any functional form and makes no assumptions about the dataset's distribution. It is referred to as an instance-based, competitive learning algorithm, because the data instances compete to be part of the prediction. KNN is also called a lazy model because it does no work until the model needs to make a prediction.
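The distance step described above can be written out directly. The tiny two-feature dataset below is invented purely to make the arithmetic visible.

```python
# Bare-bones sketch of KNN's core: compute the Euclidean distance from a
# query point to every training point, take the k closest, and vote.
# Toy data for illustration only.
import numpy as np

X_train = np.array([[1.0, 1.0], [2.0, 2.0], [8.0, 8.0], [9.0, 9.0]])
y_train = np.array([0, 0, 1, 1])
query = np.array([1.5, 1.5])
k = 3

# Euclidean distance from the query to every training point.
dists = np.linalg.norm(X_train - query, axis=1)
nearest = np.argsort(dists)[:k]            # indices of the k closest points

# Majority vote among the k neighbours gives the class prediction.
pred = np.bincount(y_train[nearest]).argmax()
print(nearest, pred)
```

Two of the three nearest neighbours carry label 0, so the vote predicts class 0 for the query point.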

Reason For Applying K Nearest Neighbors:

The main purpose of choosing KNN on this dataset is to use it for prediction along with classification; it can also be used for regression, but classification is preferred here. It predicts well because it observes the existing data points and infers where a new data point falls. I used the KNeighborsClassifier() function imported from the sklearn.neighbors package. Its n_neighbors parameter controls how many data points the algorithm takes into consideration. With n_neighbors set to 1, the accuracy produced was 84%. When I then solved the class imbalance problem and retrained the model, there was a noticeable change, but in the wrong direction: the accuracy decreased from 84% to 80%. I also tried increasing the n_neighbors value, but that did not show any significant improvement. An additional reason for choosing K nearest neighbors is that it works efficiently on large datasets and is very robust to noisy data.
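The n_neighbors experiment above can be sketched as a simple sweep. The synthetic stand-in dataset and the particular k values tried are assumptions for illustration, not the report's actual data.

```python
# Sketch: measure test-set accuracy for several n_neighbors values,
# mirroring the tuning experiment described above. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = knn.score(X_te, y_te)   # held-out accuracy for each k
print(scores)
```

Comparing held-out accuracy across k values like this is a quick way to check whether a larger neighbourhood actually helps before settling on a final model.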

General Applications of K Nearest Neighbors:

Since the K nearest neighbors algorithm can be used for both classification and regression, it has a wide variety of applications: it is used in text mining, in agriculture, and most importantly in finance, for stock price prediction, credit rating, and loan management.