1. Attribute-Value Description. The data to be analyzed must be in a flat-file form—all information about one object or example must be expressible in terms of a fixed collection of properties or attributes. Each attribute may have either discrete or numeric values, but the attributes used to describe samples must not vary from one case to another. This restriction rules out domains in which samples have an inherently variable structure.
2. Predefined Classes. The categories to which samples are to be assigned must have been established beforehand. In the terminology of machine learning this is supervised learning.
3. Discrete Classes. The classes must be sharply delineated: A case either does or does not belong to a particular class. It is expected that there will be far more samples than classes.
4. Sufficient Data. Inductive generalization in the form of a decision tree proceeds by identifying patterns in data. The approach is valid only if enough robust patterns can be distinguished from chance coincidences. As this differentiation usually depends on statistical tests, there must be a sufficient number of samples to allow these tests to be effective. The amount of data required is affected by factors such as the number of properties and classes and the complexity of the classification model. As these factors increase, more data will be needed to construct a reliable model.
5. “Logical” Classification Models. These methods construct only classifiers that can be expressed as decision trees or decision rules. These forms essentially restrict the description of a class to a logical expression whose primitives are statements about the values of particular attributes. Some applications require weighted attributes or arithmetic combinations of attributes for a reliable description of classes. In these situations logical models become very complex and, in general, they are not effective.
6.2 C4.5 ALGORITHM: GENERATING A DECISION TREE
The most important part of the C4.5 algorithm is the process of generating an initial decision tree from the set of training samples. As a result, the algorithm generates a classifier in the form of a decision tree, a structure with two types of nodes: a leaf indicating a class, or a decision node specifying a test to be carried out on a single attribute value, with one branch and subtree for each possible outcome of the test.
A decision tree can be used to classify a new sample by starting at the root of the tree and moving through it until a leaf is encountered. At each non-leaf decision node, the outcome of the node’s test is determined for the given sample, and attention shifts to the root of the selected subtree. For example, if the classification model of the problem is given by the decision tree in Figure 6.3a, and the sample for classification in Figure 6.3b, then the algorithm will create the path through the nodes A, C, and F (leaf node) until it makes the final classification decision: CLASS2.
Figure 6.3. Classification of a new sample based on the decision-tree model. (a) Decision tree; (b) An example for classification.
The skeleton of the C4.5 algorithm is based on Hunt’s Concept Learning System (CLS) method for constructing a decision tree from a set T of training samples. Let the classes be denoted as {C1, C2, … , Ck}. There are three possibilities for the content of the set T:
1. T contains one or more samples, all belonging to a single class Cj. The decision tree for T is a leaf identifying class Cj.
2. T contains no samples. The decision tree is again a leaf, but the class to be associated with the leaf must be determined from information other than T. The C4.5 algorithm uses as a criterion the most frequent class at the parent of the given node.
3. T contains samples that belong to a mixture of classes. In this situation, the idea is to refine T into subsets of samples that are heading toward a single-class collection of samples. Based on a single attribute, an appropriate test that has one or more mutually exclusive outcomes {O1, O2, … , On} is chosen. T is partitioned into subsets T1, T2, … , Tn, where Ti contains all the samples in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test and one branch for each possible outcome (examples of this type of node are nodes A, B, and C in the decision tree in Fig. 6.3a).
The same tree-building procedure is applied recursively to each subset of training samples, so that the ith branch leads to the decision tree constructed from the subset Ti of the training samples. The successive division of the set of training samples proceeds until all the subsets consist of samples belonging to a single class.
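This recursive skeleton is easy to express in code. The following Python sketch illustrates Hunt’s three cases under stated assumptions: samples are (attribute-dictionary, label) pairs, and choose_test is some splitting heuristic supplied by the caller (the function names are ours, not part of any library):

from collections import Counter

def build_tree(samples, choose_test, parent_majority=None):
    # samples: list of (record, label) pairs, where record maps
    # attribute names to values; choose_test returns the attribute
    # whose test looks best for the current subset.
    if not samples:                          # case 2: empty set becomes a leaf
        return parent_majority               # labeled with the parent's majority class
    counts = Counter(label for _, label in samples)
    if len(counts) == 1:                     # case 1: a single class becomes a leaf
        return next(iter(counts))
    attr = choose_test(samples)              # case 3: mixed classes are split further
    majority = counts.most_common(1)[0][0]
    subsets = {}
    for record, label in samples:            # partition T into T1, ..., Tn by outcome
        subsets.setdefault(record[attr], []).append((record, label))
    return {(attr, outcome): build_tree(subset, choose_test, majority)
            for outcome, subset in subsets.items()}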
The tree-building process is not uniquely defined. For different tests, even for a different order of their application, different trees will be generated. Ideally, we would like to choose a test at each stage of sample-set splitting so that the final tree is small. Since we are looking for a compact decision tree that is consistent with the training set, why not explore all possible trees and select the simplest? Unfortunately, the problem of finding the smallest decision tree consistent with a training data set is NP-complete. Enumeration and analysis of all possible trees will cause a combinatorial explosion for any real-world problem. For example, for a small database with five attributes and only 20 training examples, the possible number of decision trees is greater than 10^6, depending on the number of different values for every attribute. Therefore, most decision tree-construction methods are non-backtracking, greedy algorithms. Once a test has been selected using some heuristics to maximize the measure of progress and the current set of training cases has been partitioned, the consequences of alternative choices are not explored. The measure of progress is a local measure, and the gain criterion for a test selection is based on the information available for a given step of data splitting.
Suppose we have the task of selecting a possible test with n outcomes (n values for a given feature) that partitions the set T of training samples into subsets T1, T2, … , Tn. The only information available for guidance is the distribution of classes in T and its subsets Ti. If S is any set of samples, let freq(Ci, S) stand for the number of samples in S that belong to class Ci (out of k possible classes), and let |S| denote the number of samples in the set S.
The original ID3 algorithm used a criterion called gain to select the attribute to be tested; it is based on the information-theoretic concept of entropy. The following relation gives the computation of the entropy of the set T (measured in bits):
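Info(T) = − Σ_{i=1}^{k} (freq(Ci, T) / |T|) · log2(freq(Ci, T) / |T|)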
Now consider a similar measurement after T has been partitioned in accordance with n outcomes of one attribute test X. The expected information requirement can be found as the weighted sum of entropies over the subsets:
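Info_x(T) = Σ_{i=1}^{n} (|Ti| / |T|) · Info(Ti)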
The quantity
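Gain(X) = Info(T) − Info_x(T)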
measures the information that is gained by partitioning T in accordance with the test X. The gain criterion selects a test X to maximize Gain(X), that is, this criterion will select an attribute with the highest information gain.
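These measures are straightforward to compute. A minimal Python sketch of the definitions above (the function names info, info_x, and gain are ours):

import math
from collections import Counter

def info(labels):
    # Entropy Info(S) in bits, where labels is a list of class labels.
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_x(subsets):
    # Expected information after a test: weighted sum of subset entropies.
    total = sum(len(s) for s in subsets)
    return sum((len(s) / total) * info(s) for s in subsets)

def gain(labels, subsets):
    # Information gained by the split: Info(T) - Info_x(T).
    return info(labels) - info_x(subsets)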
Let us analyze the application of these measures and the creation of a decision tree for one simple example. Suppose that the database T is given in a flat form in which each of 14 examples (cases) is described by three input attributes and belongs to one of two given classes: CLASS1 or CLASS2. The database is given in tabular form in Table 6.1.
TABLE 6.1. A Simple Flat Database of Examples for Training
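A set of 14 training samples consistent with all the computations that follow (a relabeled version of Quinlan’s classic weather data; the ordering of samples here is illustrative) is:

Sample   Attribute1   Attribute2   Attribute3   Class
1        A            70           True         CLASS1
2        A            90           True         CLASS2
3        A            85           False        CLASS2
4        A            95           False        CLASS2
5        A            70           False        CLASS1
6        B            90           True         CLASS1
7        B            78           False        CLASS1
8        B            65           True         CLASS1
9        B            75           False        CLASS1
10       C            80           True         CLASS2
11       C            70           True         CLASS2
12       C            80           False        CLASS1
13       C            80           False        CLASS1
14       C            96           False        CLASS1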
Nine samples belong to CLASS1 and five samples to CLASS2, so the entropy before splitting is
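Info(T) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits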
After using Attribute1 to divide the initial set of samples T into three subsets (test x1 represents the selection of one of the three values A, B, or C), the resulting information is given by:
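Info_x1(T) = (5/14) [−(2/5) log2(2/5) − (3/5) log2(3/5)]
	+ (4/14) [−(4/4) log2(4/4)]
	+ (5/14) [−(3/5) log2(3/5) − (2/5) log2(2/5)] = 0.694 bits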
The information gained by this test x1 is
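Gain(x1) = 0.940 − 0.694 = 0.246 bits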
If the test and splitting are based on Attribute3 (test x2 represents the selection of one of the two values, True or False), a similar computation will give new results:
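Info_x2(T) = (6/14) [−(3/6) log2(3/6) − (3/6) log2(3/6)]
	+ (8/14) [−(6/8) log2(6/8) − (2/8) log2(2/8)] = 0.892 bits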
and the corresponding gain is
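Gain(x2) = 0.940 − 0.892 = 0.048 bits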
Based on the gain criterion, the decision-tree algorithm will select test x1 as an initial test for splitting the database T because this gain is higher. To find the optimal test it will be necessary to analyze a test on Attribute2, which is a numeric feature with continuous values. In general, C4.5 contains mechanisms for proposing three types of tests:
1. The “standard” test on a discrete attribute, with one outcome and one branch for each possible value of that attribute (in our example these are tests x1 on Attribute1 and x2 on Attribute3).
2. If attribute Y has continuous numeric values, a binary test with outcomes Y ≤ Z and Y > Z could be defined by comparing its value against a threshold value Z.
3. A more complex test is also based on a discrete attribute, in which the possible values are allocated to a variable number of groups with one outcome and branch for each group.
While we have already explained the standard test for categorical attributes, additional explanation is necessary for the procedure of establishing tests on attributes with numeric values. It might seem that tests on continuous attributes would be difficult to formulate, since they contain an arbitrary threshold for splitting all values into two intervals. But there is an algorithm for the computation of the optimal threshold value Z. The training samples are first sorted on the values of the attribute Y being considered. There are only a finite number of these values, so let us denote them in sorted order as {v1, v2, … , vm}. Any threshold value lying between vi and vi+1 will have the same effect as dividing the cases into those whose value of the attribute Y lies in {v1, v2, … , vi} and those whose value is in {vi+1, vi+2, … , vm}. There are thus only m − 1 possible splits on Y, all of which should be examined systematically to obtain the optimal split. It is usual to choose the midpoint of each interval, (vi + vi+1)/2, as the representative threshold. The algorithm C4.5 differs in choosing as the threshold the smaller value vi of each interval {vi, vi+1}, rather than the midpoint itself. This ensures that the threshold values appearing in the final decision tree, the derived rules, or both actually occur in the database.
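A sketch of this threshold search, reusing the gain function from the earlier Python sketch (best_threshold is our name for the procedure):

def best_threshold(values_and_labels):
    # values_and_labels: list of (numeric value, class label) pairs.
    # Examines the m-1 candidate splits and returns the threshold Z with
    # the highest information gain; following C4.5, Z is the largest
    # attribute value below each cut, so Z actually occurs in the data.
    labels = [c for _, c in values_and_labels]
    values = sorted({v for v, _ in values_and_labels})
    best_z, best_gain = None, -1.0
    for z in values[:-1]:                    # candidate thresholds v1, ..., v(m-1)
        left = [c for v, c in values_and_labels if v <= z]
        right = [c for v, c in values_and_labels if v > z]
        g = gain(labels, [left, right])
        if g > best_gain:
            best_z, best_gain = z, g
    return best_z, best_gain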
To illustrate this threshold-finding process, we could analyze, for our example of database T, the possibilities of Attribute2 splitting. After a sorting process, the set of values for Attribute2 is {65, 70, 75, 78, 80, 85, 90, 95, 96} and the set of potential threshold values Z is {65, 70, 75, 78, 80, 85, 90, 95}. Out of these eight values the optimal Z (with the highest information gain) should be selected. For our example, the optimal Z value is Z = 80 and the corresponding process of information-gain computation for the test x3 (Attribute2 ≤ 80 or Attribute2 > 80) is the following:
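Info_x3(T) = (9/14) [−(7/9) log2(7/9) − (2/9) log2(2/9)]
	+ (5/14) [−(2/5) log2(2/5) − (3/5) log2(3/5)] = 0.838 bits

Gain(x3) = 0.940 − 0.838 = 0.102 bits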
Now, if we compare the information gain for the three attributes in our example, we can see that Attribute1 still gives the highest gain of 0.246 bits and therefore this attribute will be selected for the first splitting in the construction of a decision tree. The root node will have the test for the values of Attribute1, and three branches will be created, one for each of the attribute values. This initial tree with the corresponding subsets of samples in the children nodes is represented in Figure 6.4.
Figure 6.4. Initial decision tree and subset cases for a database in Table 6.1.
After initial splitting, every child node has several samples from the database, and the entire process of test selection and optimization will be repeated for every child node. Because the child node for test x1, Attribute1 = B, has four cases, all of them in CLASS1, this node will be a leaf node, and no additional tests are necessary for this branch of the tree.
For the child node corresponding to Attribute1 = A, where subset T1 contains five cases, tests on the remaining attributes can be performed; an optimal test (with maximum information gain) will be test x4 with two alternatives: Attribute2 ≤ 70 or Attribute2 > 70.
Using Attribute2 to divide T1 into two subsets (test x4 represents the selection of one of two intervals), the resulting information is given by:
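Info_x4(T1) = (2/5) [−(2/2) log2(2/2)] + (3/5) [−(3/3) log2(3/3)] = 0 bits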
The information gained by this test is maximal:
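Gain(x4) = Info(T1) − Info_x4(T1) = 0.971 − 0 = 0.971 bits

where Info(T1) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971 bits,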
and two branches will create the final leaf nodes because the subsets of cases in each of the branches belong to the same class.
A similar computation will be carried out for the third child of the root node. For the subset T3 of the database T, the selected optimal test x5 is the test on Attribute3 values. Branches of the tree, Attribute3 = True and Attribute3 = False, will create uniform subsets of cases that belong to the same class. The final decision tree for database T is represented in Figure 6.5.
Figure 6.5. A final decision tree for database T given in Table 6.1.
Alternatively, a decision tree can be presented in the form of executable code (or pseudocode) with if-then constructions for branching into a tree structure. The transformation of a decision tree from one representation to the other is simple and straightforward. The final decision tree for our example is given in pseudocode in Figure 6.6.
Figure 6.6. A decision tree in the form of pseudocode for the database T given in Table 6.1.
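In Python, the same tree can be transcribed directly as nested branching (the function name classify is ours, and the class assigned in each leaf is our reading of the uniform subsets derived above):

def classify(attribute1, attribute2, attribute3):
    # The final decision tree for database T as if-then branching.
    if attribute1 == "A":
        return "CLASS1" if attribute2 <= 70 else "CLASS2"
    elif attribute1 == "B":
        return "CLASS1"
    else:                                    # attribute1 == "C"
        return "CLASS2" if attribute3 else "CLASS1"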
While the gain criterion has had some good results in the construction of compact decision trees, it also has one serious deficiency: a strong bias in favor of tests with many outcomes. A solution was found in a kind of normalization. By analogy with the definition of Info(S), an additional parameter was specified:
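Split-info(X) = − Σ_{i=1}^{n} (|Ti| / |T|) · log2(|Ti| / |T|)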
This represented the potential information generated by dividing set T into n subsets Ti. Now, a new gain measure could be defined:
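Gain-ratio(X) = Gain(X) / Split-info(X)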
This new gain measure expresses the proportion of information generated by the split that is useful, that is, that appears helpful in classification. The gain-ratio criterion selects a test that maximizes this ratio. This criterion is robust and typically gives a consistently better choice of a test than the previous gain criterion. A computation of the gain-ratio test can be illustrated for our example. To find the gain-ratio measure for the test x1, an additional parameter Split-info(x1) is calculated:
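Split-info(x1) = −(5/14) log2(5/14) − (4/14) log2(4/14) − (5/14) log2(5/14) = 1.577 bits

Gain-ratio(x1) = 0.246 / 1.577 = 0.156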
A similar procedure should be performed for other tests in the decision tree. Instead of gain measure, the maximal gain ratio will be the criterion for attribute selection, along with a test to split samples into subsets. The final decision tree created using this new criterion for splitting a set of samples will be the most compact.
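In the Python sketch introduced earlier, the normalization adds two small functions (again, the names are ours):

def split_info(subsets):
    # Potential information generated by splitting T into the given subsets.
    total = sum(len(s) for s in subsets)
    return -sum((len(s) / total) * math.log2(len(s) / total)
                for s in subsets if s)

def gain_ratio(labels, subsets):
    # Proportion of the split's information that is useful for classification.
    return gain(labels, subsets) / split_info(subsets)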
6.3 UNKNOWN ATTRIBUTE VALUES
The version of the C4.5 algorithm presented so far is based on the assumption that all values for all attributes are determined. But in a data set, some attribute values for some samples are often missing; such incompleteness is typical in real-world applications. This might occur because the value is not relevant to a particular sample, was not recorded when the data were collected, or an error was made when the data were entered into the database. To solve the problem of missing values, there are two choices:
1. Discard all samples in a database with missing data.
2. Define a new algorithm or modify an existing algorithm that will work with missing data.
The first solution is simple but unacceptable when a large number of missing values exists in a set of samples. To address the second alternative, several questions must be answered:
1. How does one compare two samples with different numbers of unknown values?
2. Training samples with unknown values cannot be associated with a particular value of the test, and so they cannot be assigned to any subsets of cases. How should these samples be treated in the partitioning?
3. In a testing phase of classification, how does one treat a missing value if the test is on the attribute with the missing value?
All these and many other questions arise with any attempt to find a solution for missing data. Several classification algorithms that work with missing data are usually based on filling in a missing value with the most probable value, or on looking at the probability distribution of all values for the given attribute. None of these approaches is uniformly superior.
In C4.5, it is an accepted principle that samples with unknown values are distributed probabilistically according to the relative frequency of known values. Let Info(T) and Infox(T) be calculated as before, except that only samples with known values of attributes are taken into account. Then the gain parameter can reasonably be corrected with a factor F, which represents the probability that a given attribute is known (F = number of samples in the database with a known value for a given attribute/total number of samples in a data set). The new gain criterion will have the form
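Gain(x) = F · (Info(T) − Info_x(T))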
Similarly, Split-info(x) can be altered by regarding the samples with unknown values as an additional group in splitting. If the test x has n outcomes, its Split-info(x) is computed as if the test divided the data set into n + 1 subsets. This modification has a direct influence on the final value of the modified criterion Gain-ratio(x).
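In the same Python sketch, both corrections might look as follows (hypothetical names; the subsets contain class labels of the known-value cases only):

def gain_with_unknowns(known_labels, subsets, total_count):
    # F = (number of cases with a known value) / (total number of cases);
    # Info and Info_x are computed over the known cases only.
    f = len(known_labels) / total_count
    return f * (info(known_labels) - info_x(subsets))

def split_info_with_unknowns(subsets, unknown_count):
    # Cases with unknown values are treated as one extra, (n+1)st subset.
    return split_info(subsets + [["?"] * unknown_count])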
Let us explain the modifications of the C4.5 decision-tree methodology applied to one example. The database is similar to the previous one (Table 6.1), except that one value for Attribute1 is now missing, denoted by “?”, as presented in Table 6.2.
TABLE 6.2. A Simple Flat Database of Examples with One Missing Value
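The only change from Table 6.1 is that the Attribute1 value of one CLASS1 sample (a former B value, as the computations below imply) is now recorded as “?”.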
The computation of the gain parameter for Attribute1 is similar to that given before, except that the missing value changes some of the previous steps. Eight out of the 13 cases with known values for Attribute1 belong to CLASS1 and five cases to CLASS2, so the entropy before splitting is
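Info(T) = −(8/13) log2(8/13) − (5/13) log2(5/13) = 0.961 bits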
After using Attribute1 to divide T into three subsets (test x1 represents the selection of one of the three values A, B, or C), the resulting information is given by
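Info_x1(T) = (5/13) [−(2/5) log2(2/5) − (3/5) log2(3/5)]
	+ (3/13) [−(3/3) log2(3/3)]
	+ (5/13) [−(3/5) log2(3/5) − (2/5) log2(2/5)] = 0.747 bits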
The information gained by this test is now corrected with the factor F (F = 13/14 for our example):
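Gain(x1) = (13/14) · (0.961 − 0.747) = 0.199 bits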
The gain for this test is slightly lower than the previous value of 0.246 bits. The split information, however, is still determined from the entire training set and is larger, since there is an extra category for unknown values.
