Data Mining (Mehmed Kantardzic) » p.31 » Global Archive Voiced Books Online Free



8 ENSEMBLE LEARNING

Chapter Objectives

Explain the basic characteristics of ensemble learning methodologies.

Distinguish between the different implementations of combination schemes for different learners.

Compare bagging and boosting approaches.

Introduce the AdaBoost algorithm and its advantages.

One of the primary goals of data mining is to predict an “unknown” value of a new sample from observed samples. Such a prediction is achieved in two sequential phases, as shown in Figure 8.1: (a) the training phase, which produces a predictive model from training samples using one of the available supervised learning algorithms; and (b) the testing phase, which evaluates the generated predictive model using test samples that were not used in the training phase. Numerous applications of the data-mining process have shown the validity of the so-called “No-Free-Lunch Theorem.” It states that there is no single learning algorithm that is the best and most accurate in all applications. Each algorithm determines a certain model that comes with a set of assumptions. Sometimes these assumptions hold, sometimes not; therefore, no single algorithm “wins” all the time.

Figure 8.1. Training phase and testing phase for a predictive model. (a) Training phase; (b) testing phase.

To improve the accuracy of a predictive model, a promising approach called ensemble learning has been introduced. The idea is to combine results from various predictive models generated using training samples. The key motivation behind the approach is to reduce the error rate. The initial assumption is that an ensemble is much less likely to misclassify a new sample than a single predictive model. When combining multiple, independent, and diverse “decision makers,” each of which is at least more accurate than random guessing, correct decisions should be reinforced. The idea may be demonstrated by some simple decision processes where single-human performances are compared with human ensembles. For example, given the question “How many jelly beans are in the jar?”, the group average will outperform individual estimates. Similarly, in the TV series “Who Wants to be a Millionaire?” the audience (ensemble) vote supports a candidate who is not sure of the answer.

This idea was proven theoretically by Hansen and colleagues through the following statement: if N classifiers make independent errors and each has an error probability e < 0.5, then it can be shown that the error of the ensemble E is a monotonically decreasing function of N. Clearly, performance degrades quickly for dependent classifiers.

8.1 ENSEMBLE-LEARNING METHODOLOGIES

The ensemble-learning methodology consists of two sequential phases: (a) the training phase, and (b) the testing phase. In the training phase, the ensemble method generates several different predictive models from training samples, as presented in Figure 8.2a. To predict an unknown value of a test sample, the ensemble method aggregates the outputs of all predictive models (Fig. 8.2b). An integrated predictive model generated by an ensemble approach consists of several predictive models (Predictive model.1, Predictive model.2, … , Predictive model.n) and a combining rule, as shown in Figure 8.2b. We will refer to such a predictive model as an ensemble. The field of ensemble learning is still relatively new, and several names are used as synonyms depending on which predictive task is performed, including combination of multiple classifiers, classifier fusion, mixture of experts, and consensus aggregation.

Figure 8.2. Training phase and testing phase for building an ensemble. (a) Training phase; (b) testing phase.

To perform better than a single predictive model, an ensemble should consist of predictive models that are independent of each other, that is, their errors are uncorrelated, and each of them has an accuracy rate of >0.5. The outcome of each predictive model is aggregated to determine the output value of a test sample. We may analyze all steps of ensemble prediction for a classification task. For example, we may analyze a classification task where the ensemble consists of 15 classifiers, each of which classifies test samples into one of two categorical values. The ensemble decides the categorical value based on the dominant frequency of classifiers’ outputs. If the 15 predictive models are different from each other, and each model has the identical error rate (ε = 0.3), the ensemble will make a wrong prediction only if more than half of the predictive models misclassify a test sample. Therefore, the error rate of the ensemble is

$$\varepsilon_{\text{ensemble}} = \sum_{i=8}^{15} \binom{15}{i}\, \varepsilon^{i} (1-\varepsilon)^{15-i} = 0.05,$$

which is considerably lower than the 0.3 error rate of a single classifier. The sum starts at eight because the ensemble errs only when eight or more models misclassify a test sample, that is, when seven or fewer models classify the sample correctly.
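The majority-vote error described above is a binomial tail sum, so it can be checked numerically. The following is a minimal sketch that assumes independent errors and an odd number of models; the function name `ensemble_error` is introduced here for illustration only.

```python
from math import comb

def ensemble_error(n: int, eps: float) -> float:
    """Probability that a majority of n independent classifiers,
    each with error rate eps, misclassify the same sample (n odd)."""
    k_min = n // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(comb(n, k) * eps**k * (1 - eps)**(n - k)
               for k in range(k_min, n + 1))

# 15 classifiers with a 0.3 error rate each
print(round(ensemble_error(15, 0.3), 3))

# The benefit disappears once the members are no better than chance
for eps in (0.4, 0.5, 0.6):
    print(eps, round(ensemble_error(15, eps), 3))
```

For ε above 0.5 the same sum grows larger than ε itself, which is why each member is required to be better than random guessing.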

Figure 8.3a shows the error rates of an ensemble, which consists of 15 predictive models (n = 15). The x-axis represents an error rate (ε) of a single classifier. The diagonal line represents the case in which all models in the ensemble are identical. The solid line represents error rates of an ensemble in which predictive models are different and independent from each other. An ensemble has a significantly lower error rate than a single predictive model only when the error rate (ε) of the members of the ensemble is lower than 0.5.

Figure 8.3. Changes in error rates of an ensemble. (a) Identical predictive models versus different predictive models in an ensemble; (b) the different number of predictive models in an ensemble.

We can also analyze the effect of the number of predictive models in an ensemble. Figure 8.3b shows error-rate curves for ensembles that consist of 5, 15, 25, and 35 predictive models, respectively. Observe that when the error rate of a predictive model is lower than 0.5, the larger the number of predictive models, the lower the error rate of the ensemble. For example, when each predictive model of an ensemble has an error rate of 0.4, the error rates of the ensembles (n = 5, n = 15, n = 25, and n = 35) are calculated as 0.317, 0.213, 0.153, and 0.114, respectively. However, this decrease in the ensemble error rate becomes less significant when the number of classifiers is very large, or when the error rate of each classifier becomes relatively small.

The basic questions in creating an ensemble learner are as follows: How to generate base learners, and how to combine the outputs from base learners? Diverse and independent learners can be generated by

(a) using different learning algorithms for different learning models such as support vector machines, decision trees, and neural networks;

(b) using different hyper-parameters in the same algorithm to tune different models (e.g., different numbers of hidden nodes in artificial neural networks);

(c) using different input representations, such as using different subsets of input features in a data set; or

(d) using different training subsets of input data to generate different models usually using the same learning methodology.

Stacked Generalization (or stacking) is a methodology that could be classified in the first group (a). Unlike other well-known techniques, stacking may be (and normally is) used to combine models of different types. One way of combining multiple models is specified by introducing the concept of a meta-learner. The learning procedure is as follows:

1. Split the training set into two disjoint sets.

2. Train several base learners on the first part.

3. Test the base learners on the second part.

4. Using the predictions from (3) as the inputs, and the correct responses as the outputs, train a higher level learner.

Note that steps (1) to (3) are the same as cross-validation, but instead of using a winner-take-all approach, the base learners are combined, possibly nonlinearly. Although stacking is an attractive idea, it is less theoretically analyzed and less widely used than bagging and boosting, the two most recognized ensemble-learning methodologies. The situation is similar for the second group of methodologies (b): although the approach is very simple, it is not used or analyzed intensively. The main reason may be that applying the same methodology with different parameters does not guarantee independence of the models.
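The four stacking steps can be sketched in a few lines. This is only an illustration under stated assumptions: the base learners are one-feature threshold rules (the same type on different features, a simplification of the usual mixed-type setting), the meta-learner is a lookup table over the base predictions, and the toy data and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_stump(samples, feature):
    """Base learner: the one-feature threshold rule with the fewest training errors."""
    best = None
    for x, _ in samples:
        t = x[feature]
        for polarity in (1, -1):
            errors = sum(1 for xi, yi in samples
                         if (polarity if xi[feature] >= t else -polarity) != yi)
            if best is None or errors < best[0]:
                best = (errors, t, polarity)
    _, t, polarity = best
    return lambda x: polarity if x[feature] >= t else -polarity

def train_meta(base_learners, holdout):
    """Meta-learner: map each tuple of base predictions to its majority true label."""
    table = defaultdict(Counter)
    for x, y in holdout:
        table[tuple(h(x) for h in base_learners)][y] += 1

    def predict(x):
        key = tuple(h(x) for h in base_learners)
        if key in table:
            return table[key].most_common(1)[0][0]
        return 1 if sum(key) >= 0 else -1  # unseen combination: plain majority vote
    return predict

def accuracy(model, samples):
    return sum(model(x) == y for x, y in samples) / len(samples)

# Toy data: the label is +1 only when both features are at least 0.5
values = [0.1, 0.3, 0.5, 0.7, 0.9]
data = [((a, b), 1 if a >= 0.5 and b >= 0.5 else -1) for a in values for b in values]

part_one, part_two = data[::2], data[1::2]          # step 1: two disjoint sets
base = [train_stump(part_one, f) for f in (0, 1)]   # step 2: train base learners
stacked = train_meta(base, part_two)                # steps 3-4: predictions -> meta-learner

print("single base learner:", accuracy(base[0], data))
print("stacked ensemble:   ", accuracy(stacked, data))
```

Each base learner sees only one feature and therefore cannot represent the target on its own; the higher-level learner recovers it by combining their outputs nonlinearly.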

Class (c) methodologies are based on manual or automatic feature selection/extraction that can be used for generating diverse classifiers using different feature sets. For example, subsets related to different sensors, or subsets of features computed with different algorithms, may be used. To form training data sets, different subsets of input features are chosen, and then each training sample with the selected input features becomes an element of the training data sets. In Figure 8.4, there are five training samples {S1, S2, S3, S4, S5} with four features {F1, F2, F3, F4}. When training data set 1 is generated, three features {F1, F2, F4} are randomly selected from the input features {F1, F2, F3, F4}, and all training samples with those features form the first training set. A similar process is performed for the other training sets. The main requirement is that classifiers use different subsets of features that are complementary.

Figure 8.4. Feature selection for ensemble classifiers methodology.

The random subspace method (RSM) is a relatively recent method of ensemble learning that is based on the theory of stochastic discrimination. Learning machines are trained on randomly chosen subspaces of the original input space, and the outputs of the models are then combined. An illustrative example for the classification of movies is given in Figure 8.5. RSM works well for large feature sets with redundant features. The random forest methodology, which utilizes such an approach, is implemented in many commercial data-mining tools.

Figure 8.5. RSM approach in ensemble classifier for movie classification.
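Generating training sets from random feature subsets, as in the five-sample, four-feature example above, might look like the sketch below. The helper names, the seed, and the placeholder feature values are assumptions made for illustration; training a model on each projected set is left out.

```python
import random

def random_subspaces(feature_names, n_models, subset_size, seed=0):
    """Draw one random feature subset (without replacement) per model."""
    rng = random.Random(seed)
    return [sorted(rng.sample(feature_names, subset_size)) for _ in range(n_models)]

def project(rows, feature_names, subset):
    """Keep only the selected feature columns of every training sample."""
    cols = [feature_names.index(f) for f in subset]
    return [[row[c] for c in cols] for row in rows]

features = ["F1", "F2", "F3", "F4"]
# Five samples S1..S5 with placeholder values (row i, feature j -> 10*i + j)
rows = [[10 * i + j for j in range(1, 5)] for i in range(1, 6)]

subsets = random_subspaces(features, n_models=3, subset_size=3)
training_sets = [project(rows, features, s) for s in subsets]

for s, ts in zip(subsets, training_sets):
    print(s, ts[0])  # chosen features and the first projected sample
```

Every training set keeps all five samples but only three of the four features, so the resulting models differ in their input representation rather than in their training samples.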

Methodologies based on different training subsets of input samples (d) are the most popular approaches in ensemble learning, and corresponding techniques such as bagging and boosting are widely applied in different tools. Before explaining these techniques in detail, however, it is necessary to explain one additional and final step in ensemble learning: combining the outcomes of different learners.

8.2 COMBINATION SCHEMES FOR MULTIPLE LEARNERS

Combination schemes include:

1. The global approach is based on learner fusion, where all learners produce an output and these outputs are combined by voting, averaging, or stacking. It relies on integration (fusion) functions where, for each pattern, all the classifiers contribute to the final decision.

2. The local approach is based on learner selection, where one or more learners responsible for generating the output are selected based on their closeness to the sample. A selection function is applied where, for each pattern, just one classifier, or a subset, is responsible for the final decision.

3. Multistage combination uses a serial approach where the next learner is trained with, or tested on, only those instances where previous learners were inaccurate.

Voting is the simplest way of combining classifiers on a global level; it represents the result as a linear combination of the outputs dj of the n learners:

$$y_i = \sum_{j=1}^{n} w_j\, d_{ij}, \qquad w_j \ge 0, \qquad \sum_{j=1}^{n} w_j = 1,$$

where dij is the output of learner j for class i and wj is the weight of learner j.

The result of the combination may differ depending on wj. Alternatives for combinations are simple sum (equal weights), weighted sum, median, minimum, maximum, and product of dij. Voting schemes can be seen as approximations under a Bayesian framework where weights wj approximate prior model probabilities.
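The combination rules listed above can be compared on a small example. In this sketch, `outputs[j][i]` holds the support that learner j gives to class i; the numbers, the function name `combine`, and the choice to return the index of the winning class are illustrative assumptions.

```python
import math
import statistics

# Rules applied to the column of supports that one class receives from all learners
COLUMN_RULES = {
    "median": statistics.median,
    "min": min,
    "max": max,
    "product": math.prod,
}

def combine(outputs, rule="sum", weights=None):
    """Return (per-class scores, index of the winning class)."""
    n_learners, n_classes = len(outputs), len(outputs[0])
    if rule == "sum":
        if weights is None:  # simple sum: equal weights
            weights = [1.0 / n_learners] * n_learners
        scores = [sum(w * d[i] for w, d in zip(weights, outputs))
                  for i in range(n_classes)]
    else:
        scores = [COLUMN_RULES[rule]([d[i] for d in outputs])
                  for i in range(n_classes)]
    winner = max(range(n_classes), key=lambda i: scores[i])
    return scores, winner

# Three learners, two classes
outputs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]

print("equal weights:", combine(outputs)[1])
print("weighted sum: ", combine(outputs, weights=[0.2, 0.4, 0.4])[1])
for rule in COLUMN_RULES:
    print(rule, combine(outputs, rule=rule)[1])
```

With equal weights the single confident learner carries class 0, while down-weighting it hands the decision to class 1, which shows how strongly the outcome can depend on the weights.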

Rank-level Fusion Method is applied for some classifiers that provide class “scores,” or some sort of class probabilities. In general, if Ω = {c1, … , ck} is the set of classes, each of these classifiers can provide an “ordered” (ranked) list of class labels. For example, if probabilities of output classes are 0.10, 0.75, and 0.20, corresponding ranks for the classes will be 1, 3, and 2, respectively. The highest rank is given to the class with the highest probability. Let us check an example, where the number of classifiers is N = 3, and the number of classes k = 4, Ω = {a, b, c, d}. For a given sample, the ranked outputs of the three classifiers are as follows:

In this case, final selection of the output class will be determined by accumulation of scores for each class:

The winner class is b because it has the maximum overall rank.
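Rank-level fusion can be sketched as a sum of ranks per class. The rankings below are made up for illustration (three classifiers, four classes, chosen so that class b wins, as in the text) and should not be read as the data of the original example; the function names are likewise hypothetical.

```python
def ranks_from_scores(scores):
    """Convert class scores to ranks: the highest score gets the highest rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def rank_fusion(rankings):
    """Each ranking lists class labels from best to worst.
    A class in position p (0-based) of a k-class list receives rank k - p."""
    totals = {}
    for ranking in rankings:
        k = len(ranking)
        for position, label in enumerate(ranking):
            totals[label] = totals.get(label, 0) + (k - position)
    winner = max(totals, key=totals.get)
    return totals, winner

print(ranks_from_scores([0.10, 0.75, 0.20]))  # ranks for the three-class example

# Hypothetical ranked outputs of N = 3 classifiers over classes {a, b, c, d}
rankings = [
    ["c", "b", "d", "a"],
    ["a", "b", "d", "c"],
    ["b", "a", "c", "d"],
]
totals, winner = rank_fusion(rankings)
print(totals, winner)
```

Class b is never ranked last and is ranked first once, so its accumulated rank exceeds that of classes that are first for one classifier but last for another.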

Finally, the Dynamic Classifier Selection (DCS) algorithm, representing a local approach, assumes the following steps:

1. Find the k nearest training samples to the test input.

2. Look at the accuracies of the base classifiers on these samples.

3. Choose the one classifier (or the top N classifiers) that performs best on these samples.

4. Combine decisions for selected classifiers.
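The four DCS steps translate almost directly into code. The sketch below assumes Euclidean distance, local accuracy measured on the k nearest training samples, and a majority vote over the selected classifiers; the two toy classifiers, each reliable in only one half of the input space, are invented for the example.

```python
import math
from collections import Counter

def dcs_predict(x, train, classifiers, k=3, top_n=1):
    # 1. Find the k nearest training samples to the test input
    neighbors = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    # 2. Accuracy of each base classifier on these samples
    local_acc = [sum(clf(xi) == yi for xi, yi in neighbors) / len(neighbors)
                 for clf in classifiers]
    # 3. Choose the top_n locally most accurate classifiers
    chosen = sorted(range(len(classifiers)),
                    key=lambda j: local_acc[j], reverse=True)[:top_n]
    # 4. Combine the decisions of the selected classifiers by majority vote
    votes = Counter(classifiers[j](x) for j in chosen)
    return votes.most_common(1)[0][0]

def truth(x):
    return 1 if x[1] >= 0.5 else -1

def clf_left(x):
    """Correct where x[0] < 0.5, systematically wrong elsewhere."""
    return truth(x) if x[0] < 0.5 else -truth(x)

def clf_right(x):
    """Correct where x[0] >= 0.5, systematically wrong elsewhere."""
    return truth(x) if x[0] >= 0.5 else -truth(x)

grid = [0.1, 0.3, 0.7, 0.9]
train = [((a, b), truth((a, b))) for a in grid for b in grid]

for x in [(0.2, 0.8), (0.8, 0.2)]:
    print(x, dcs_predict(x, train, [clf_left, clf_right]))
```

A global vote over these two classifiers would always tie, because they disagree everywhere; selecting by local accuracy picks whichever one is competent near the test input.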

8.3 BAGGING AND BOOSTING

Bagging and boosting are well-known procedures with a solid theoretical background. They belong to class (d) of ensemble methodologies, and essentially they are based on resampling of a training data set.

Bagging, a name derived from bootstrap aggregation, was the first effective method of ensemble learning and is one of the simplest methods. It was originally designed for classification and is usually applied to decision tree models, but it can be used with any type of model for classification or regression. The method uses multiple versions of a training set by using the bootstrap, that is, sampling with replacement. Each of these data sets is used to train a different model. The outputs of the models are combined by averaging (in the case of regression) or voting (in the case of classification) to create a single output.

In the bagging methodology, a training data set for a predictive model consists of samples taken with replacement from an initial set of samples according to a sampling distribution. The sampling distribution determines how likely it is that a sample will be selected. For example, when the sampling distribution is predefined as the uniform distribution, all N training samples have the same probability, 1/N, of being selected. Because of sampling with replacement, some training samples may appear multiple times in the same training data set, while other training samples may not appear at all. In Figure 8.6, there are five training samples {S1, S2, S3, S4, S5} with four features {F1, F2, F3, F4}. Suppose that three training data sets are formed by samples that are randomly selected with replacement from the training samples according to the uniform distribution. Each training sample has a 1/5 probability of being selected as an element of a training data set. In training data set 1, S2 and S4 appear twice, while S1 and S3 do not appear.

Figure 8.6. Bagging methodology distributes samples taken with replacement from initial set of samples.
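Bootstrap sampling and the final aggregation step can be sketched as follows. The seed, the helper names, and the use of sample labels in place of real feature vectors are assumptions for illustration; fitting a model on each bootstrap set is omitted.

```python
import random
from collections import Counter

def bootstrap_sets(samples, n_sets, seed=0):
    """Draw n_sets training sets, each the size of the original set,
    sampling uniformly with replacement."""
    rng = random.Random(seed)
    return [rng.choices(samples, k=len(samples)) for _ in range(n_sets)]

def aggregate(predictions, task="classification"):
    """Combine model outputs: average for regression, majority vote for classification."""
    if task == "regression":
        return sum(predictions) / len(predictions)
    return Counter(predictions).most_common(1)[0][0]

samples = ["S1", "S2", "S3", "S4", "S5"]
for i, training_set in enumerate(bootstrap_sets(samples, 3), start=1):
    missing = sorted(set(samples) - set(training_set))
    print(f"training set {i}: {training_set}  (not selected: {missing})")

# Chance that a given sample is absent from one bootstrap set of size N = 5
p_left_out = (1 - 1 / len(samples)) ** len(samples)
print(round(p_left_out, 3))
```

With N = 5, a given sample is missing from a bootstrap set with probability (4/5)^5, about one third, which is why the resulting training sets, and the models trained on them, differ from one another.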

Bagging is only effective when using unstable nonlinear models where small changes in training data lead to significantly different classifiers and large changes in accuracy. It decreases error by decreasing the variance in the results of unstable learners.

Boosting is the most widely used ensemble method and one of the most powerful learning ideas introduced in the ensemble-learning community. Originally designed for classification, it can also be extended to regression. The algorithm first creates a “weak” classifier, that is, it suffices that its accuracy on the training set is slightly better than random guessing. Samples are given initial weights, and usually it starts with uniform weighting. For the following iterations, the samples are reweighted to focus the system on samples that are not correctly classified with a recently learned classifier. During each step of learning: (1) increase weights of the samples that are not correctly learned by the weak learner, and (2) decrease weights of the samples that are correctly learned by the weak learner. Final classification is based on a weighted vote of weak classifiers generated in iterations.

8.4 ADABOOST

The original boosting algorithm combined three weak learners to generate a strong, high-quality learner. AdaBoost, short for “adaptive boosting,” is the most popular boosting algorithm. AdaBoost combines “weak” learners into a highly accurate classifier to solve difficult, highly nonlinear problems. Instead of sampling, as in the bagging approach, AdaBoost reweights samples. It uses the same training set over and over again (thus the set need not be large), and it may keep adding weak learners until a target training error is reached.

Given a training data set: {(x1, y1), … , (xm, ym)} where xi ∈ X and yi ∈ {−1, +1}, when a weak classifier is trained with the data, for each input sample xi the classifier will give classification h(xi) (where h(xi) ∈ {−1, +1}). With these assumptions the main steps of the AdaBoost algorithm are presented in Figure 8.8.
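As a rough companion to the algorithm's main steps, the sketch below follows the standard discrete AdaBoost formulation: uniform initial weights, a weak learner chosen to minimize the weighted error ε, a vote weight α = ½ ln((1 − ε)/ε), and exponential reweighting followed by normalization. Using decision stumps as weak learners, the function names, and the ten-point toy data set are assumptions made for illustration and may differ in detail from the presentation in the figure.

```python
import math

def stump_predict(x, feature, threshold, sign):
    return sign if x[feature] >= threshold else -sign

def best_stump(X, y, w):
    """Weak learner: the threshold rule with the smallest weighted error."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, f, t, s) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, s)
    return best

def adaboost(X, y, rounds):
    m = len(X)
    w = [1.0 / m] * m                       # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        err, f, t, s = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this weak learner
        ensemble.append((alpha, f, t, s))
        # Increase weights of misclassified samples, decrease the others
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, f, t, s))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]        # renormalize to a distribution
    return ensemble

def predict(ensemble, x):
    """Final classification: sign of the weighted vote of the weak learners."""
    score = sum(a * stump_predict(x, f, t, s) for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1

# One feature, labels that no single threshold can separate
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 2, 3):
    model = adaboost(X, y, rounds)
    errors = sum(predict(model, xi) != yi for xi, yi in zip(X, y))
    print(f"{rounds} weak learner(s): {errors} training errors")
```

The same ten samples are reused in every round; only their weights change, which steers each new stump toward the samples the previous ones got wrong.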

Simplicity and easy implementation are the main reasons why AdaBoost is very popular. It can be combined with any classifiers including neural networks, decision trees, or nearest neighbor classifiers. The algorithm requires almost no parameters to tune, and is still very effective even for the most complex classification problems, but at the same time it could be sensitive to noise and outliers.

The ensemble-learning approach showed all of its advantages in one very famous application, the Netflix $1 million competition. The Netflix prize required a substantial improvement in the accuracy of predictions of how much someone is going to love a movie based on his or her previous movie preferences. Users rated movies from 1 to 5 stars; therefore, the problem was a classification task with five classes. Most of the top-ranked competitors used some variation of ensemble learning, showing its advantages in practice. The top competitor, the BellKor team, explains the ideas behind its success: “Our final solution consists of blending 107 individual predictors. Predictive accuracy is substantially improved when blending multiple predictors. Our experience is that most efforts should be concentrated in deriving substantially different approaches, rather than refining a single technique. Consequently, our solution is an ensemble of many methods.”
