Data mining, p.11


  What are the results of a feature-reduction process, and why do we need this process for every specific application? The purposes vary, depending upon the problem at hand, but, generally, we want

  1. to improve the performance of the model-generation process and of the resulting model (typical criteria are speed of learning, predictive accuracy, and simplicity of the model);

  2. to reduce the dimensionality of the model without reducing its quality, through

  (a) elimination of irrelevant features,

  (b) detection and elimination of redundant data and features,

  (c) identification of highly correlated features, and

  (d) extraction of independent features that determine the model; and

  3. to help the user visualize alternative results, which have fewer dimensions, to improve decision making.

  3.3 RELIEF ALGORITHM

  Relief is a feature weight-based algorithm for feature selection, inspired by so-called instance-based learning. It relies on a relevance evaluation of each feature in a training data set where the samples are labeled (classification problems). The main idea of Relief is to compute a ranking score for every feature indicating how well that feature separates neighboring samples. The authors of the Relief algorithm, Kira and Rendell, proved that the ranking score is large for relevant features and small for irrelevant ones.

  The core of the Relief algorithm is to estimate the quality of features according to how well their values distinguish between samples close to each other. Given training data S, the algorithm randomly selects a subset of m samples, where m is a user-defined parameter. Relief analyzes each feature based on the selected subset of samples. For each randomly selected sample X from the training data set, Relief searches for its two nearest neighbors: one from the same class, called the nearest hit H, and the other from a different class, called the nearest miss M. An example for two-dimensional data is given in Figure 3.2.

  Figure 3.2. Determining nearest hit H and nearest miss M samples.

  The Relief algorithm updates the quality score W(Ai) for every feature Ai depending on the differences in their values for the samples X, M, and H:

  W(Ai) = W(Ai) − diff(X[Ai], H[Ai])²/m + diff(X[Ai], M[Ai])²/m

  The process is repeated m times for randomly selected samples from the training data set, and the scores W(Ai) are accumulated over these samples. Finally, using a relevancy threshold τ, the algorithm detects those features that are statistically relevant to the target classification: the features with W(Ai) ≥ τ. We assume the scale of every feature is either nominal (including Boolean) or numerical (integer or real). The main steps of the Relief algorithm may be formalized as follows:

  Initialize: W(Aj) = 0, j = 1, … , p (p is the number of features)

  For i = 1 to m

  Randomly select sample X from training data set S.

  Find nearest hit H and nearest miss M samples.

  For j = 1 to p

  W(Aj) = W(Aj) − diff(X[Aj], H[Aj])²/m + diff(X[Aj], M[Aj])²/m

  End.

  End.

  Output: Subset of features where W(Aj) ≥ τ
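  The steps above can be sketched in Python (a minimal illustration, not the authors' original code; it assumes numeric features on comparable scales, uses Manhattan distance for the neighbor search, and takes diff as the absolute difference of feature values):

```python
import numpy as np

def relief(X, y, m=None, rng=None):
    """Basic Relief feature scoring.

    X : (n, p) array of numeric features, assumed to be on comparable scales.
    y : (n,) array of class labels; every class needs at least two samples.
    m : number of sampled iterations (defaults to n).
    Returns the accumulated weight W(Aj) for each of the p features.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    m = n if m is None else m
    w = np.zeros(p)
    idx = np.arange(n)
    for _ in range(m):
        i = rng.integers(n)
        x = X[i]
        hits = idx[(y == y[i]) & (idx != i)]   # same-class candidates
        misses = idx[y != y[i]]                # different-class candidates
        # nearest hit H and nearest miss M by Manhattan distance
        H = X[hits[np.argmin(np.abs(X[hits] - x).sum(axis=1))]]
        M = X[misses[np.argmin(np.abs(X[misses] - x).sum(axis=1))]]
        # decrease weight for features that separate X from its hit,
        # increase it for features that separate X from its miss
        w += (np.abs(x - M) - np.abs(x - H)) / m
    return w
```

  Features whose accumulated score reaches the chosen threshold τ are then retained.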

  For example, if the available training set is given in Table 3.2, with three features (the last of which is the classification decision) and four samples, the scoring values W for the features F1 and F2 may be calculated using Relief:

  TABLE 3.2. Training Data Set for Applying Relief Algorithm

  In this example the number of samples is low, and therefore we use all samples (m = n) to estimate the feature scores. Based on these results, feature F1 is much more relevant for classification than feature F2. If we assign the threshold value τ = 5, it is possible to eliminate feature F2 and build the classification model based on feature F1 alone.

  Relief is a simple algorithm that relies entirely on a statistical methodology. The algorithm employs few heuristics, and therefore it is efficient: its computational complexity is O(mpn). With m, the number of randomly selected training samples, being a user-defined constant, the time complexity becomes O(pn), which makes it very scalable to data sets with both a huge number of samples n and a very high dimensionality p. When n is very large, it is assumed that m ≪ n. It is one of the few algorithms that can evaluate features for real-world problems with a large feature space and a large number of samples. Relief is also noise-tolerant and robust to feature interactions, which is especially important for hard data-mining applications. However, Relief does not help with removing redundant features: as long as features are deemed relevant to the class concept, they will all be selected even though many of them are highly correlated with each other.

  One of the problems with Relief is picking a proper value of τ. Theoretically, using the so-called Chebyshev's inequality, τ may be estimated as

  0 < τ ≤ 1/√(α · m)

  While the above formula determines τ in terms of α (the data-mining model accuracy) and m (the number of randomly selected training samples), experiments show that the score levels display a clear contrast between relevant and irrelevant features, so τ can easily be determined by inspection.

  Relief has been extended to deal with multiclass problems, noise, and redundant and missing data. Recently, additional feature-selection methods based on feature weighting have been proposed, including ReliefF, Simba, and I-Relief; they are improvements of the basic Relief algorithm.

  3.4 ENTROPY MEASURE FOR RANKING FEATURES

  A method for unsupervised feature selection or ranking based on an entropy measure is a relatively simple technique, but with a large number of features its complexity increases significantly. The basic assumption is that all samples are given as vectors of feature values, without any classification of the output samples. The approach is based on the observation that removing an irrelevant feature, a redundant feature, or both from a data set may not change its basic characteristics. The idea is to remove as many features as possible while maintaining the level of distinction between the samples in the data set as if no features had been removed. The algorithm is based on a similarity measure S that is in inverse proportion to the distance D between two n-dimensional samples. The similarity measure Sij is large (close to 1) for close samples and small (close to 0) for very distinct pairs. When the features are numeric, the similarity measure of two samples can be defined as

  Sij = e^(−α · Dij)

  where Dij is the distance between samples xi and xj, and α is a parameter mathematically expressed as

  α = −ln(0.5)/D

  D is the average distance among samples in the data set. Hence, α is determined by the data. However, in a successfully implemented practical application, a constant value of α = 0.5 was used. A normalized Euclidean distance measure is used to calculate the distance Dij between two samples xi and xj:

  Dij = [ Σk=1..n ( (xik − xjk) / (maxk − mink) )² ]^(1/2)

  where n is the number of dimensions and maxk and mink are maximum and minimum values used for normalization of the kth dimension.

  Not all features are numeric. The similarity for nominal variables is measured directly using the Hamming distance:

  Sij = ( Σk=1..n |xik = xjk| ) / n

  where |xik = xjk| is 1 if xik = xjk, and 0 otherwise. The total number of variables is equal to n. For mixed data, we can discretize numeric values and transform numeric features into nominal features before we apply this similarity measure. Figure 3.3a is an example of a simple data set with three categorical features; corresponding similarities are given in Figure 3.3b.

  Figure 3.3. A tabular representation of similarity measures S. (a) Data set with three categorical features; (b) a table of similarity measures Sij between samples.
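  A Figure 3.3b-style table of similarities can be computed directly (a small sketch; the three-feature categorical samples below are hypothetical stand-ins, since the actual values are in the figure):

```python
import numpy as np

def hamming_similarity(a, b):
    """S_ij for nominal features: the fraction of attributes on which
    the two samples agree (1 minus the normalized Hamming distance)."""
    return float(np.mean(np.asarray(a) == np.asarray(b)))

# hypothetical categorical samples with three features each
samples = [("A", "X", "T"), ("A", "Y", "T"), ("B", "Y", "F")]
n = len(samples)
S = [[hamming_similarity(samples[i], samples[j]) for j in range(n)]
     for i in range(n)]
```

  The resulting matrix S is symmetric with 1s on the diagonal, so only the upper triangle needs to be tabulated, as in the figure.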

  The distribution of all similarities (distances) for a given data set is a characteristic of the organization and order of data in an n-dimensional space. This organization may be more or less ordered. Changes in the level of order in a data set are the main criteria for inclusion or exclusion of a feature from the feature set; these changes may be measured by entropy.

  From information theory, we know that entropy is a global measure, and that it is lower for ordered configurations and higher for disordered configurations. The proposed technique compares the entropy measure for a given data set before and after removal of a feature. If the two measures are close, then the reduced set of features will satisfactorily approximate the original set. For a data set of N samples, the entropy measure is

  E = − Σi=1..N−1 Σj=i+1..N ( Sij × log Sij + (1 − Sij) × log (1 − Sij) )

  where Sij is the similarity between samples xi and xj. This measure is computed in each of the iterations as a basis for deciding the ranking of features. We rank features by gradually removing the least important feature in maintaining the order in the configurations of data. The steps of the algorithm are based on sequential backward ranking, and they have been successfully tested on several real-world applications.

  1. Start with the initial full set of features F.

  2. For each feature f ∈ F, remove the feature f from F and obtain a subset Ff. Find the difference between the entropy for F and the entropy for each Ff. In the example in Figure 3.3, we have to compare the differences (EF − EF−F1), (EF − EF−F2), and (EF − EF−F3).

  3. Let fk be the feature for which the difference between the entropy for F and the entropy for Ffk is minimum.

  4. Update the set of features: F = F − {fk}, where "−" is the difference operation on sets. In our example, if the difference (EF − EF−F1) is minimum, then the reduced set of features is {F2, F3}, and F1 is placed at the bottom of the ranked list.

  5. Repeat steps 2–4 until there is only one feature in F.

  A ranking process may be stopped in any iteration and transformed into a process of feature selection by using an additional criterion in step 4: the difference between the entropy for F and the entropy for Ffk must be smaller than an approved threshold value in order to remove the feature fk from the set F. Computational complexity is the basic disadvantage of this algorithm, and a parallel implementation could overcome the problems of working sequentially with large data sets and a large number of features.
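  The sequential backward-ranking loop can be sketched as follows (a minimal sketch, not the book's implementation; it assumes numeric features, uses Sij = exp(−0.5 · Dij) with the normalized Euclidean distance described above, and natural logarithms in the entropy):

```python
import numpy as np

def entropy(S):
    """E = -sum over i<j of [S_ij log S_ij + (1 - S_ij) log(1 - S_ij)]."""
    iu = np.triu_indices_from(S, k=1)
    s = np.clip(S[iu], 1e-12, 1 - 1e-12)   # guard against log(0)
    return -np.sum(s * np.log(s) + (1 - s) * np.log(1 - s))

def similarity_matrix(X, cols):
    """S_ij = exp(-0.5 * D_ij) over the chosen columns, with each
    dimension normalized to [0, 1] before the Euclidean distance."""
    Z = X[:, cols]
    span = Z.max(axis=0) - Z.min(axis=0)
    span[span == 0] = 1.0                  # constant columns contribute 0
    Zn = (Z - Z.min(axis=0)) / span
    D = np.sqrt(((Zn[:, None, :] - Zn[None, :, :]) ** 2).sum(-1))
    return np.exp(-0.5 * D)

def rank_features(X):
    """Repeatedly drop the feature whose removal changes the entropy
    least; the returned list runs from least to most important."""
    remaining = list(range(X.shape[1]))
    ranked_low_to_high = []
    while len(remaining) > 1:
        E_full = entropy(similarity_matrix(X, remaining))
        diffs = [abs(E_full - entropy(similarity_matrix(
                 X, [c for c in remaining if c != f]))) for f in remaining]
        worst = remaining[int(np.argmin(diffs))]
        remaining.remove(worst)
        ranked_low_to_high.append(worst)
    return ranked_low_to_high + remaining
```

  A constant (clearly irrelevant) feature changes no distances, so its entropy difference is zero and it is the first one removed.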

  3.5 PCA

  The most popular statistical method for dimensionality reduction of a large data set is the Karhunen-Loeve (K-L) method, also called PCA. In various fields, it is also known as the singular value decomposition (SVD), the Hotelling transform, and the empirical orthogonal function (EOF) method. PCA is a method of transforming the initial data set represented by vector samples into a new set of vector samples with derived dimensions. The goal of this transformation is to concentrate the information about the differences between samples into a small number of dimensions. Practical applications confirmed that PCA is the best linear dimension-reduction technique in the mean-square error sense. Being based on the covariance matrix of the features, it is a second-order method. In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations of the original features with the largest variance. Since the variance depends on the scale of the variables, it is customary to first standardize each variable to have mean 0 and standard deviation 1. After the standardization, the original variables with possibly different units of measurement are all in comparable units.

  More formally, the basic idea can be described as follows: A set of n-dimensional vector samples X = {x1, x2, x3, … , xm} should be transformed into another set Y = {y1, y2, … , ym} of the same dimensionality, but y-s have the property that most of their information content is stored in the first few dimensions. This will allow us to reduce the data set to a smaller number of dimensions with low information loss.

  The transformation is based on the assumption that high information corresponds to high variance. So, if we want to reduce a set of input dimensions X to a single dimension Y, we should transform X into Y as the matrix computation

  Y = A · X

  choosing A such that Y has the largest variance possible for a given data set. The single dimension Y obtained in this transformation is called the first principal component. The first principal component is an axis in the direction of maximum variance. It minimizes the distance of the sum of squares between data points and their projections on the component axis, as shown in Figure 3.4 where a two-dimensional space is transformed into a one-dimensional space in which the data set has the highest variance.

  Figure 3.4. The first principal component is an axis in the direction of maximum variance.

  In practice, it is not possible to determine matrix A directly, and therefore we compute the covariance matrix S as a first step in the feature transformation. For the sample set X = {x1, x2, … , xm} with mean vector x̄, the matrix S is defined as

  S = (1/(m − 1)) Σi=1..m (xi − x̄)(xi − x̄)^T

  The eigenvalues of the covariance matrix S for the given data should be calculated in the next step. Finally, the m eigenvectors corresponding to the m largest eigenvalues of S define a linear transformation from the n-dimensional space to an m-dimensional space in which the features are uncorrelated. To specify the principal components we need the following additional explanations about the notation in matrix S:

  1. The eigenvalues of S n×n are λ1, λ2, … , λn, where λ1 ≥ λ2 ≥ … ≥ λn ≥ 0.

  2. The eigenvectors e1, e2, … , en correspond to eigenvalues λ1, λ2, … , λn, and they are called the principal axes.

  Principal axes are new, transformed axes of an n-dimensional space, where the new variables are uncorrelated, and the variance of the ith component is equal to the ith eigenvalue. Because the λi's are sorted, most of the information about the data set is concentrated in the first few principal components. The fundamental question is: how many of the principal components are needed to get a good representation of the data? In other words, what is the effective dimensionality of the data set? The easiest way to answer the question is to analyze the proportion of variance. Dividing the sum of the first m eigenvalues by the sum of all the variances (all eigenvalues), we get a measure of the quality of representation based on the first m principal components. The result is expressed as a percentage, and if, for example, the projection accounts for over 90% of the total variance, it is considered to be good. More formally, we can express this ratio in the following way. The criterion for feature selection is based on the ratio of the sum of the m largest eigenvalues of S to the trace of S. That is the fraction of the variance retained in the m-dimensional space. If the eigenvalues are labeled so that λ1 ≥ λ2 ≥ … ≥ λn, then the ratio can be written as

  R = ( Σi=1..m λi ) / ( Σi=1..n λi )

  When the ratio R is sufficiently large (greater than a threshold value), all analyses of the subset of m features represent a good initial estimate of the n-dimensional space. This method is computationally inexpensive, but it requires characterizing the data with the covariance matrix S.
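  The computation just described can be sketched in NumPy (a minimal illustration under the stated assumptions: features are standardized first, the covariance matrix is diagonalized, and the smallest m whose retained-variance ratio R clears the threshold is kept):

```python
import numpy as np

def pca(X, threshold=0.9):
    """PCA via eigen-decomposition of the covariance matrix of the
    standardized data; returns the projected samples, the eigenvalues
    in descending order, and the cumulative variance ratios R."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize
    S = np.cov(Z, rowvar=False)                        # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)               # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1] # sort descending
    R = np.cumsum(eigvals) / eigvals.sum()             # retained variance
    m = int(np.searchsorted(R, threshold)) + 1         # smallest m with R >= threshold
    return Z @ eigvecs[:, :m], eigvals, R
```

  Note that on standardized data the covariance matrix coincides with the correlation matrix, which is why correlation-matrix eigenvalues are used in the Iris example below.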

  We will use one example from the literature to show the advantages of PCA. The initial data set is the well-known set of Iris data, available on the Internet for data-mining experimentation. This data set has four features, so every sample is a four-dimensional vector. The correlation matrix, calculated from the Iris data after normalization of all values, is given in Table 3.3.

  TABLE 3.3. The Correlation Matrix for Iris Data

  Based on the correlation matrix, the calculation of eigenvalues is straightforward (in practice, one of the standard statistical packages is usually used), and the final results for the Iris data are given in Table 3.4.

  TABLE 3.4. The Eigenvalues for Iris Data

  Feature      Eigenvalue
  Feature 1    2.91082
  Feature 2    0.92122
  Feature 3    0.14735
  Feature 4    0.02061

  By setting a threshold value of R* = 0.95, we choose the first two features as the subset of features for further data-mining analysis, because

  R = (2.91082 + 0.92122) / (2.91082 + 0.92122 + 0.14735 + 0.02061) = 3.83204/4.00000 = 0.958 ≥ 0.95

  For the Iris data, the first two principal components should therefore be an adequate description of the characteristics of the data set. The third and fourth components have small eigenvalues, and therefore they contain very little variation; their influence on the information content of the data set is minimal. Additional analysis shows that, based on the reduced set of features in the Iris data, the model has the same quality using different data-mining techniques (sometimes the results were even better than with the original features).
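  The choice of the first two components can be verified directly from the eigenvalues in Table 3.4:

```python
# eigenvalues from Table 3.4 (Iris data)
eigenvalues = [2.91082, 0.92122, 0.14735, 0.02061]

# retained-variance ratio for the first two principal components
R = sum(eigenvalues[:2]) / sum(eigenvalues)
print(round(R, 3))   # 0.958, which exceeds the threshold R* = 0.95
```

  Because the data were standardized, the four eigenvalues sum to the number of features (4), so the denominator equals the trace of the correlation matrix.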

  The interpretation of the principal components can be difficult at times. Although they are uncorrelated features constructed as linear combinations of the original features, and they have some desirable properties, they do not necessarily correspond to meaningful physical quantities. In some cases such a loss of interpretability is not satisfactory to domain scientists, who then prefer other methods, usually feature-selection techniques.

  3.6 VALUE REDUCTION

  A reduction in the number of discrete values for a given feature is the goal of the second set of techniques in the data-reduction phase: the feature-discretization techniques. The task of feature-discretization techniques is to discretize the values of continuous features into a small number of intervals, where each interval is mapped to a discrete symbol. The benefits of these techniques are a simplified data description and easy-to-understand data and final data-mining results. Also, more data-mining techniques are applicable with discrete feature values. An "old-fashioned" discretization is made manually, based on our a priori knowledge about the feature. For example, using common sense or consensus, a person's age, given at the beginning of a data-mining process as a continuous value (between 0 and 150 years), may be classified into categorical segments: child, adolescent, adult, middle age, and elderly. The cutoff points are subjectively defined (Fig. 3.5). Two main questions exist about this reduction process:

 
