
Economy-point.org



Principal component analysis


Page modified: Friday, June 23, 2006 20:29:56

Principal component analysis (PCA; Pearson, 1901), also rendered as main component analysis, is a method of multivariate statistics. Particularly in image processing it is also called the Karhunen-Loève transformation (after Kari Karhunen and Michel Loève). It is frequently used as an extraction method in factor analysis, but it is not actually bound to the statistical assumptions of a model; it is a purely numerical procedure in its own right. It was introduced in the 1930s by Harold Hotelling, but has been used more widely only since the 1970s, with the arrival of more powerful computers. Computing it by hand is extremely laborious and requires repeated work, check calculations and therefore staff.

Concept of principal component analysis

This procedure tries to extract, from data with many characteristics (variables), a few latent factors that determine these characteristics. Mathematically, a principal axis transformation is carried out: the correlation between multidimensional characteristics is minimized by transferring them into a vector space with a new basis. The principal axis transformation can be specified by a matrix formed from the eigenvectors of the covariance matrix. Principal component analysis is therefore problem dependent, because a separate transformation matrix must be computed for each data set.
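The decorrelating effect of the principal axis transformation can be illustrated with a minimal sketch. It assumes NumPy is available; the function name `principal_axis_transform` and the data are hypothetical and made up purely for illustration.

```python
import numpy as np

def principal_axis_transform(data):
    """Rotate centered data onto the eigenvectors of its covariance matrix."""
    centered = data - data.mean(axis=0)               # subtract the column means
    cov = np.cov(centered, rowvar=False)              # p x p covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh is meant for symmetric matrices
    order = np.argsort(eigenvalues)[::-1]             # indices for descending eigenvalues
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    transformed = centered @ eigenvectors             # coordinates in the new basis
    return transformed, eigenvalues, eigenvectors

# Hypothetical, made-up data: two strongly correlated characteristics.
rng = np.random.default_rng(0)
size = rng.normal(size=500)
data = np.column_stack([size + 0.1 * rng.normal(size=500),
                        2.0 * size + 0.1 * rng.normal(size=500)])

transformed, eigenvalues, eigenvectors = principal_axis_transform(data)
cov_after = np.cov(transformed, rowvar=False)

print(np.round(np.corrcoef(data, rowvar=False), 3))   # correlation before the transformation
print(np.round(cov_after, 6))                         # covariance after the transformation
```

In the new basis the off-diagonal covariances vanish up to rounding error, and the diagonal holds the eigenvalues. Because the eigenvectors are recomputed from each data set, the transformation matrix differs from one data set to the next.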

Because principal component analysis is not entirely simple, an example comes first.

Example

Consider artillery ships of the Second World War. They are divided into the classes battleships, heavy cruisers, light cruisers and destroyers. Data are available for approximately 200 ships. The characteristics length, width, water displacement, depth (draught), engine power, speed (the maximum speed sustainable over a longer period), radius of action and crew size were recorded. The characteristics length, width, water displacement and depth all measure much the same thing, so one could speak here of a factor "size". The question is whether other factors also determine the data. There is in fact a second clear factor, determined mainly by engine power and maximum speed. These could be combined into a factor "speed".

Further applications of principal component analysis

If principal component analysis is applied to the purchasing behavior of consumers, there may be latent factors such as social status, age or marital status that motivate certain purchases. Targeted advertising could then channel the inclination to buy accordingly.

If a statistical model has a very large number of characteristics, the number of variables in the model can be reduced where necessary with the help of principal component analysis, which usually improves the quality of the model.

Principal component analysis is also applied in image processing, in particular in remote sensing. Satellite images can be analyzed and conclusions drawn from them.

A further area in which PCA is used is artificial intelligence, together with neural networks. There PCA serves to separate features in the context of automatic classification.

Procedure

It should be noted first that the variance of data is a measure of their information content.

The data are present as a point cloud in an n-dimensional Cartesian coordinate system. A new coordinate system is now placed in the point cloud and rotated: the first axis is placed through the point cloud so that the variance of the data in this direction is maximal. The second axis is perpendicular to the first; in its direction the variance is the second largest, and so on. For n-dimensional data there are thus in principle n axes that are mutually perpendicular, i.e. orthogonal. The total variance of the data is the sum of these "axis variances". If the first p (p < n) axes cover most of the total variance, the factors represented by these new axes appear sufficient to capture the information content of the data.
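The rotation described above can be tried out directly in two dimensions. The following is only a sketch under stated assumptions: the point cloud is made up, and the brute-force scan over angles is a stand-in for the eigenvector computation that is normally used.

```python
import math

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_along(points, angle):
    """Variance of 2-D points projected onto the axis at the given angle."""
    ux, uy = math.cos(angle), math.sin(angle)
    return variance([x * ux + y * uy for x, y in points])

# Hypothetical, made-up point cloud stretched roughly along the diagonal.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1), (6.0, 5.7)]

# Rotate the first axis in 0.1 degree steps and keep the angle of maximal variance.
angles = [math.radians(step / 10) for step in range(1800)]
best_angle = max(angles, key=lambda a: variance_along(points, a))

first = variance_along(points, best_angle)                  # variance on the first axis
second = variance_along(points, best_angle + math.pi / 2)   # variance on the perpendicular axis
total = variance([x for x, _ in points]) + variance([y for _, y in points])

print(f"first axis at {math.degrees(best_angle):.1f} degrees")
print(f"share of total variance on first axis: {first / total:.3f}")
```

The two "axis variances" add up to the total variance for any pair of perpendicular axes; the rotation only changes how that total is split between them.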

Frequently the factors cannot be interpreted in terms of content. In statistics one says that no comprehensible hypothesis can be assigned to them. They are then useless (see factor analysis).

Statistical model

Consider p random variables $X_j$ that are centered with respect to their expected values, i.e. their expected values have been subtracted from them. These random variables are collected in a (p×1) random vector $\underline x$. The expected value vector of $\underline x$ is the zero vector, and its (p×p) covariance matrix is $\underline\Sigma$, which is symmetric and positive definite. The eigenvalues $\lambda_j$ (j = 1, …, p) of the matrix $\underline\Sigma$ are arranged in descending order of size. They are listed as the diagonal elements of the diagonal matrix $\underline\Lambda$. The eigenvectors belonging to them form the orthogonal matrix $\underline\Gamma$. It then holds that $\underline\Lambda = \underline\Gamma^T \underline\Sigma \, \underline\Gamma$.

The random vector $\underline x$ is linearly transformed: $\underline x \rightarrow \underline y = \underline\Gamma^T \underline x$.

For illustration, consider a three-dimensional random vector

\underline x = \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix}.

The matrix of the eigenvalues is

\underline\Lambda = \begin{pmatrix} \lambda_A & 0 & 0 \\ 0 & \lambda_B & 0 \\ 0 & 0 & \lambda_C \end{pmatrix},

where $\lambda_A > \lambda_B > \lambda_C$.

The (3×1) eigenvectors can be collected in the matrix $\underline\Gamma$:

\underline\Gamma = \begin{pmatrix} \underline\gamma_A & \underline\gamma_B & \underline\gamma_C \end{pmatrix} = \begin{pmatrix} \gamma_{1A} & \gamma_{1B} & \gamma_{1C} \\ \gamma_{2A} & \gamma_{2B} & \gamma_{2C} \\ \gamma_{3A} & \gamma_{3B} & \gamma_{3C} \end{pmatrix}.

The multiplication

\underline x \rightarrow \underline y = \underline\Gamma^T \underline x

results in the equations

Y_A = \gamma_{1A} X_1 + \gamma_{2A} X_2 + \gamma_{3A} X_3
Y_B = \gamma_{1B} X_1 + \gamma_{2B} X_2 + \gamma_{3B} X_3
Y_C = \gamma_{1C} X_1 + \gamma_{2C} X_2 + \gamma_{3C} X_3.

The variance of $Y_A$ is

\operatorname{var} Y_A = \lambda_A,

so the principal component $Y_A$ accounts for the largest share of the total variance of the data, $Y_B$ for the second largest, and so on. The elements $\gamma_{jk}$ (j = 1, 2, 3; k = A, B, C) could be called the contribution of the variable $X_j$ to the factor k. In this context the matrix $\underline\Gamma$ is called the loading matrix; it indicates "how highly a variable X loads on a factor Y".
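Why the variance of each principal component equals its eigenvalue can be sketched in one line. The derivation assumes, as stated in the model, that the random vector x is centered with covariance matrix Σ and that Γ is orthogonal; the operator name Cov is notation introduced here.

```latex
\operatorname{Cov}(\underline y)
  = E\left[\underline y \, \underline y^T\right]
  = E\left[\underline\Gamma^T \underline x \, \underline x^T \underline\Gamma\right]
  = \underline\Gamma^T \underline\Sigma \, \underline\Gamma
  = \underline\Lambda,
\qquad
\operatorname{tr}(\underline\Lambda)
  = \operatorname{tr}(\underline\Gamma^T \underline\Sigma \, \underline\Gamma)
  = \operatorname{tr}(\underline\Sigma).
```

The diagonal of Λ gives the variances λ_A, λ_B, λ_C of the principal components. The zero off-diagonal entries show that the components are uncorrelated, and the trace identity shows that the total variance is preserved by the transformation.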

Estimation of the model parameters

When concretely collected data with p characteristics are available, the sample correlation matrix is calculated from the observed values. From this matrix one then determines the eigenvalues and eigenvectors for the principal component analysis.
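The estimation step might look as follows in practice. This is a minimal sketch assuming NumPy; the function name `pca_from_correlation` is hypothetical, and the measurements are randomly generated stand-ins, not the ship data.

```python
import numpy as np

def pca_from_correlation(data):
    """Eigenvalues and eigenvectors of the sample correlation matrix.

    data: array of shape (n_observations, p), one column per characteristic.
    """
    corr = np.corrcoef(data, rowvar=False)             # p x p sample correlation matrix
    eigenvalues, eigenvectors = np.linalg.eigh(corr)   # returned in ascending order
    order = np.argsort(eigenvalues)[::-1]              # re-sort in descending order
    return corr, eigenvalues[order], eigenvectors[:, order]

# Hypothetical, made-up measurements (not the ship data from the text).
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
data = np.hstack([
    100 + 20 * base + rng.normal(size=(200, 1)),       # a "length"-like column
    15 + 3 * base + rng.normal(size=(200, 1)),         # a "width"-like column
    rng.normal(loc=30, scale=2, size=(200, 1)),        # an unrelated column
])

corr, eigenvalues, eigenvectors = pca_from_correlation(data)
print(np.round(eigenvalues, 3))
print(np.round(eigenvalues / eigenvalues.sum() * 100, 2))   # percent of total variance
```

Working from the correlation matrix rather than the covariance matrix amounts to standardizing every characteristic first, so the eigenvalues sum to p, the number of characteristics.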

Example with three variables

The example above is now illustrated with numbers:

Consider the variables length, width and knot (speed in knots). The scatter plots give an impression of the joint distribution of the variables.

A principal component analysis of these three variables was carried out with the statistics package SPSS. The loading matrix is

Factor        A        B        C
Length      0.862    0.481   -0.159
Width       0.977    0.083    0.198
Knot       -0.679    0.730    0.082

The factor $Y_A$ is thus composed as

Y_A = 0.862 \cdot \mbox{length} + 0.977 \cdot \mbox{width} - 0.679 \cdot \mbox{knot},

so above all the contributions of length and width to the first factor are large. For the second factor, the contribution of knot is the largest. The third factor is unclear and probably insignificant.

The total variance of the data is distributed among the principal components as follows:

Factor   Eigenvalue   Percent of total variance   Cumulative percent of total variance
A           2.16               71.97                        71.97
B           0.77               25.67                        97.64
C           0.07                2.36                       100.00

The first two principal components thus already cover 97.64% of the total variance of the data. The third factor adds nothing of note to the information content.
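The loading table and the variance table can be cross-checked against each other. The sketch below rests on an assumption not stated in the text: that the reported loadings are scaled so that the sum of squared loadings in each column equals the eigenvalue of that factor, as is usual for SPSS component matrices. The figures in the tables are consistent with this up to rounding.

```python
# Loadings copied from the three-variable table (rows: length, width, knot).
loadings = {
    "A": [0.862, 0.977, -0.679],
    "B": [0.481, 0.083, 0.730],
    "C": [-0.159, 0.198, 0.082],
}

# Assumption: each column's sum of squared loadings is the factor's eigenvalue.
eigenvalues = {factor: sum(v * v for v in column) for factor, column in loadings.items()}
total = sum(eigenvalues.values())   # close to 3, the number of standardized variables

cumulative = 0.0
for factor, eigenvalue in eigenvalues.items():
    percent = 100 * eigenvalue / total
    cumulative += percent
    print(f"{factor}  {eigenvalue:.2f}  {percent:6.2f}  {cumulative:6.2f}")
```

The eigenvalues recovered this way agree with the tabulated 2.16, 0.77 and 0.07 to two decimals. The percentages differ slightly in the second decimal because the loadings are only given to three digits.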

Example with eight variables

Next, eight characteristics of the artillery ships were subjected to a principal component analysis. The table of the loading matrix, here called the "component matrix", shows that above all the variables length, width, depth, water displacement and crew size load highly on the first principal component. One could call this component "size". The second component is explained mostly by HP (engine power) and knot. It could be called "speed". A third component still loads highly on radius of action.

The first two factors already cover approx. 84% of the information in the ship data; the third factor captures a further approx. 10%. The additional contribution of the remaining components is insignificant.

