Abstract Data mining is a progression of finding needed information from the large amount of data which is retrieves different databases. Classification is one of the techniques in data mining which is a function that assigns items in a collection to target categories or classes.
The goal of classification is to accurately predict the target class for each case in the data. Decision tree (DT) algorithm is based on classification technique which is powerful and popular tool. This algorithm represents some rules, which can be understood by humans and used in knowledge system such as database. DT is represented by hierarchical tree structure. It is used in many different disciplines including diagnosis, cognitive science, artificial intelligence, game theory, engineering and data mining. This paper is deliberate to study about Decision Tree learning Algorithms. Keywords : Hierarchical tree, Decision Tree algorithm, Classification I.INTRODUCTION Data mining is a type of categorizationtechnique which is actually used to take out unknown patterns from bulkydatabases.
Data mining uses two types of approaches i.e supervised learning orunsupervised learning. Classification is based on supervised learning approach.Classification methods aspire to recognize the classes that belongs objectsfrom some expressive qualities. It finds convenience in a wide range of humanactivities and predominantly in automated decision making.
II.DECISION TREE Decision trees are a simple, but powerfulform of multiple variable analysis. DT is a flow chart like tree structure.
DTis especially attractive in a data mining environment. DT can handle highdimensional data and having ability to break down a complex decision makingprocess in to collection of simpler decision, which is providing solution toeasy interpret. The reasons are:1. Intuitiverepresentation, the resulting model is easy to assimilate by humans2. DTdoes not require any parameter setting from the user and suited for exploratoryknowledge discovery.3. DTcan be constructed relatively fast and the accuracy.
DT easily derives the rulescorresponding to the tree by traversing each leaf of the tree starting from thenode. It denoted that many different leaves of the tree refer to the same classlabels, but each leaf refers to a different rule. Each node denotes asfollowing:-Ø Internal node :A test on an attribute.
Ø Branch node :Outcome of the test.Ø Leaf node :Class distributions.Ø Arc/edge :Split of one attribute.Ø Path :A disjunction of test to make the final decision. Rootnode Internal node Warm Cold Non-mammals Leaf node Non-mammals Mammals Yes No Fig 1: Decision TreeDT Features are:-Ø Produceresults that communicate very well in symbolic and visual terms and the abilityto incorporate multiple predictors in a simple step-by-step fashion. Ø Incorporatevarious levels of measurement, including qualitative (e.g., good-bad) andquantitative measures.
Ø Adaptto various twists and turns in data – unbalanced effects, nested effects,offsetting effects, interactions and nonlinearities – that frequently defeatother one-way and multi-way statistical and numeric approaches.Ø Nonparametricand highly robust and produce similar effects regardless of the level ofmeasurement of the fields that are used to construct decision tree branches.Decision Tree Parameters:1.
Minimalsize for split: Thesize of a node is the number of Examples in its subset. Only those nodes aresplit whose size is greater than or equal to the minimal size for split parameter. Thisrange is in -1 to +? and type must be an integer. Default size is 4.2. MinimalLeaf Size: The sizeof a leaf is the number of Examples in its subset.
The tree is generated insuch a way that every leaf has at least the minimal leaf size number of Examples. Thisrange is in -1 to +? and type must be an integer. Default size is 2.
3. MinimalGain: The gain of anode is calculated before splitting it. The node is split if its gain isgreater than the minimal gain. A higher value of minimal gain results in fewer splits and thus a smallertree. A value that is too high will completely prevent splitting and a treewith a single node is generated.
This range is in 0.0 to +? and typemust be real. Default size is 0.14. Maximaldepth of Tree: Thedepth of a tree varies depending upon the size and characteristics of the ExampleSet. This parameter is used to restrict the depth of the decision tree. If itsvalue is set to ‘-1’, the maximal depthparameter puts no bound on the depth of the tree. In this case the tree isbuilt until other stopping criteria are met.
If its value is set to ‘1’, a treewith a single node is generated. Default size is 20.5. DecisionCriterion: Selects the criterion on which Attributes will be selected for splitting.
It can have one of the following values: Ø Information_gain: The entropies of all the Attributes are calculated and the one with least entropy is selected for split. This method has a bias towards selecting Attributes with a large number of values. Ø Gain_ratio: A variant of information gain that adjusts the information gain for each Attribute to allow the breadth and uniformity of the Attribute values. Ø Gini_index: A measure of inequality between the distributions of label characteristics. Splitting on a chosen Attribute results in a reduction in the average gini index of the resulting subsets. Ø Accuracy: An Attribute is selected for splitting, which maximizes the accuracy of the whole tree. The construction of the decision treeinvolves the following three main phases:-Construction Phase: The initialdecision tree is constructed in this phased, based on the entire trainingdata-set. It requires recursively partitioning the training set into two ormore, sub-partition using a splitting criterion, until a stopping criteria ismet.
Pruning Phase: The tree constructed inthe previous phase may not result in the best possible set of rules due to over-fitting.The pruning phase removes some of the lower branches and nodes to improve itsperformance.Processing the Pruned tree: To improveunderstandability.III.
DECISION TREE ALGORITHMA decision tree(DT) model is a computational model consisting of three parts:1) A decision treeis defined.2) An algorithm tocreate the tree.3) An algorithmthat applies the tree to data and solves the problem under consideration.Algorithm:Input: D //DecisionTreeOutput: O// ModelPredictionDT Proc algorithm:Simplest algorithmto illustrate prediction technique using DT.For each t € D don = root node of D;While n not leafnode doObtain answer toquestion on n applied to t;Indentify arc fromt, which contains correct answer;n = node at end ofthis arc;Make predictionfor t based on label of n;Types of Decision tree algorithms:-Ø Hunt’sØ CARTØ SPRINTØ J48Ø C4.5Ø Random ForestIV. ADVANTAGES AND DISADVANTAGES OF DECISIONTREE S.No Advantages Disadvantages 1 Self explanatory and easy in terms of readability and who is not familiar with data mining before can interpret a decision tree if it is of small size.
Big majority of the inducers require the target value of the data instances to be discrete valued only. 2 Both categorical and numerical or continuous attributes can be handled. Cannot represent the same subtrees as one so it will replicate the trees since every path is mutually exclusive 3 The data may contain missing attribute values and decision trees will still handle it.
Decision trees are greedy, they sometimes become overly sensitive and tend to choose a split due to a noise or an irrelevant attribute which provides a wrong split and affect the accuracy poorly. 4 The data might have error and decision trees will still handle it and they are robust to noise. Not sufficient to make a statistically significant decision for the class representation of the nodes which is also known as the data fragmentation problem. 5 Constructing models that require large sized training datasets to be easier, computationally inexpensive (in comparison with other machine learning algorithms) and faster Handling the missing data require a lot of computation which is a drawback by means of computational time Table 1: Advantages and Disadvantages V. EVALUATION PERFORMANCE OF DECISION TREECONFUSION MATRIX A confusion matrix is a table thatis often used to describe the performance of a decision tree model (or”classifier”) on a set of test data for which the true values areknown. The rightdiagonal elements TP (true positive) and TN (true negative) correctly classifyInstances as well as FP (false positive) and FN (false negative) incorrectlyclassify Instances. Actual Vs Predicted Predicted Survived Not Survived Actual Survived True Positive (TP) False Negative (FN) Not Survived False Positive (FP) True Negative (TN) Table 2: Actual VsPredicted Confusion MatrixTotal number ofinstances = Correctly classified instance + Incorrectly classified instance Correctlyclassified instance = TP + TNIncorrectlyclassified instance = FP + FNPRECISION Precision is the ratio of modulescorrectly classified to the number of entire modules classified fault-prone.
Itis proportion of units correctly predicted as faulty. Precision=TP/TP+ FPRECALL Recall is the ratio of modulescorrectly classified as fault-prone to the number of entire faulty modules. Recall=TP/TP+ FNCOST MATRIX A cost matrix is similar toconfusion matrix but minor difference is with finding the value of costaccuracy through misclassification error rate.
Misclassificationerror rate = 1 – AccuracyF-MEASURE FM is a combination of recall andprecision. It is also defined as harmonic mean of precision and recall.F-Measure=2*Recall*Precision/(Recall+Precision)ACCURACY It is defined as the ratio ofcorrectly classified instances to total number of instances.Accuracy=TP+TN/(TP+FP+TN+FN)CALCULATE VALUE TPR, TNR, FPR and FNR For good classifiers, TPR and TNRboth should be nearer to 100%.
Similar is the case with precision and accuracyparameters. On the contrary, FPR and FNR both should be as close to 0% aspossible. Calculate the value of true positive rate, true negative rate, falsepositive rate and false negative rate by methods shown below:- TPR=TP/ (TP+FN)TNR=TN/(FP+TN)FPR=FN/(TP+FN)FNR=FP/(FP+TN) VI. CONCLUSION This paper presented about decisiontree which is highly efficient tools of machine learning and data mining,capable to produce accurate and easy-to-understand models. This paper providedsome fundamental characteristics about decision tree. According to this paper, performanceevaluation is strongly based on decision criterion such as information gain,gini-index and gain-ratio. DT can handle large size of training datasets and divideit to smaller form of data.
Finally it is concluded, DT is a readability treeform structure who is not familiar with classification techniques before caneasily understand about the results of this tree. In future, we will implementin DT with dataset and will check the performance evaluation.