Web Site

Economy-point.org



» Economics » Political economy » Topics begins with C » Cluster analysis


Page modified: Friday, June 23, 2006 20:48:29

By cluster analysis one understands different multivariate procedures of the data analysis for the determination about groups (cluster) of matching objects from a fundamental set about numerically described objects. The objects can be for example data records of measured values or pixels, in which arranged accumulations or hierarchies are to be found.

Procedures of the cluster analysis can be begun for automatic classification, for the recognition of samples in the image processing and to the DATA Mining.

Principle

The objects which can be examined are understood in the cluster analysis as multivariate distributed variates and usually summarized in the form of vectors as points in a vector space. The number of components of the data vectors forms the dimension of the vector space. A cluster is an accumulation of points. The distance of the points is to each other smaller within a cluster than the distance to points other cluster. Cluster can be defined also as group of points, which have among themselves or regarding a computed emphasis a minimum compensation. In addition the choice of a distance measure is necessary. In certain cases the distances (and/or turned around the similarities) of the objects are among themselves directly well-known and must not from the representation in the vector space be determined.

History

Historically seen the procedure originates from the taxonomy in biology, where over a Clusterung of related kinds an order of the organisms is determined - however no automatic computation methods were originally used there. In the meantime among other things its gene sequences can be compared for the determination of the relationship by organism.

See also: Kladistik

Algorithms

Data clustering algorithms can be partitionierend hierarchical or, whereby one divides first still into agglomerierende (bottom UP) or partitioning (top down) algorithms. Further one differentiates between supervised (supervised) and non-supervised (unsupervised) algorithms.

Depending upon algorithm a distance function must for the determination of the distance of two elements (D (A, b), for example the Euclidean distance) and/or a method to the computation of the center or Zentroiden of a cluster (\ without A, for example the average value) admits to be. Instead of a distance function some algorithms work also with a similarity function.

Hierarchical Clustern

Principle

Accumulating procedures (agglomerative clustering) and dividing procedures (divisive clustering) can fundamental be differentiated. With the accumulating procedures, which are in practice more frequently used, gradual individual objects into Clustern and these are combined to larger groups, while with the dividing procedures larger groups are partitioned gradually ever more finely. With the hierarchical Clusterung developing tree structure is usually visualized with a Dendrogramm.

With the accumulating Clustern first each object is understood as its own cluster with an element. Now in each step each other in each case the next cluster into a cluster are combined. The procedure can be terminated, if all cluster exceed a certain distance to each other or if a sufficient small number of Clustern were determined. From different methods for the determination of the distance of two cluster different procedures result. A distance function must be given D for the distance from two individual elements.

Spacer distances from Clustern

For the distance of two cluster A and B can be used the among other things following distances:

  • The minimum distance of two elements from the two Clustern (single linkage clustering) min_ {A \ in A, b \ in B} \ {D (A, b) \}
  • The maximum distance of two elements from the two Clustern (linkage complete clustering) max_ {A \ in A, b \ in B} \ {D (A, b) \}
  • The average distance of all pairs of elements from the two Clustern (AVERAGE linkage clustering) \ frac {1} {|A||B|} \ sum_ {A \ in A, b \ in B} D (A, b)
  • The average distance of all pairs of elements from the combination of A and B (AVERAGE group linkage) \ frac {1} {|C|} \ sum_ {x, y \ in C, C=A \ cup B} D (x, y)
  • The distance of the average values of the two cluster (centroid method) D (\ without A, \ without b)
  • The increase of the variance when combining A and B (Ward's method)

\ frac {D (\ without A, \ without b)}{1/|A|+1/|B|}

Further methods: Density Linkage, uniform Kernel, Wong's hybrid, EML, flexible beta, McQuitty's Similarity analysis, median


Page cached: Wednesday, July 5, 2006 14:47:04
Valid XHTML 1.0!  Valid CSS!

Page copy protected against web site content infringement by Copyscape