Mahalanobis distance

Information about Mahalanobis distance

In statistics, Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables by which different patterns can be identified and analysed. It is a useful way of determining similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant, i.e. not dependent on the scale of measurements.

Formally, the Mahalanobis distance from a group of values with mean and covariance matrix for a multivariate vector is defined as:



Mahalanobis distance can also be defined as dissimilarity measure between two random vectors and of the same distribution with the covariance matrix :



If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean distance:



where is the standard deviation of the over the sample set.

Intuitive explanation

Consider the problem of estimating the probability that a test point in N-dimensional Euclidean space belongs to a set, where we are given sample points that definitely belong to that set. Our first step would be to find the average or center of mass of the sample points. Intuitively, the closer the point in question is to this center of mass, the more likely it is to belong to the set.

However, we also need to know how large the set is. The simplistic approach is to estimate the standard deviation of the distances of the sample points from the center of mass. If the distance between the test point and the center of mass is less than one standard deviation, then we conclude that it is highly probable that the test point belongs to the set. The farther away it is, the more likely that the test point should not be classified as belonging to the set.

This intuitive approach can be made quantitative by defining the normalized distance between the test point and the set to be . By plugging this into the normal distribution we get the probability of the test point belonging to the set.

The drawback of the above approach was that we assumed that the sample points are distributed about the center of mass in a spherical manner. Were the distribution to be decidedly non-spherical, for instance ellipsoidal, then we would expect the probability of the test point belonging to the set to depend not only on the distance from the center of mass, but also on the direction. In those directions where the ellipsoid has a short axis the test point must be closer, while in those where the axis is long the test point can be further away from the center.

Putting this on a mathematical basis, the ellipsoid that best represents the set's probability distribution can be estimated by building the covariance matrix of the samples. The Mahalanobis distance is simply the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point.

Relationship to leverage

Mahalanobis distance is closely related to the leverage statistic h. The Mahalanobis distance of a data point from the centroid of a multivariate data set is (N − 1) times the leverage of that data point, where N is the number of data points in the set.

Applications

Mahalanobis distance is widely used in cluster analysis and other classification techniques. It is closely related to Hotelling's T-square distribution used for multivariate statistical testing.

In order to use the Mahalanobis distance to classify a test point as belonging to one of N classes, one first estimates the covariance matrix of each class, usually based on samples known to belong to each class. Then, given a test sample, one computes the Mahalanobis distance to each class, and classifies the test point as belonging to that class for which the Mahalanobis distance is minimal. Using the probabilistic interpretation given above, this is equivalent to selecting the class with the highest probability.

Also, Mahalanobis distance and leverage are often used to detect outliers especially in the development of linear regression models. A point that has a greater Mahalanobis distance from the rest of the sample population of points is said to have higher leverage since it has a greater influence on the slope or coefficients of the regression equation.

References

  • P.C. Mahalanobis, On the generalised distance in statistics, Proceedings of the National Institute of Science of India 12 (1936) 49-55
Statistics is a mathematical science pertaining to the collection, analysis, interpretation or explanation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities.
..... Click the link for more information.
Distance is a numerical description of how far apart objects are at any given moment in time. In physics or everyday discussion, distance may refer to a physical length, a period of time, or an estimation based on other criteria (e.g. "two counties over").
..... Click the link for more information.
Prasanta Chandra Mahalanobis (Bangla: প্রশান্ত চন্দ্র মহলানবিস) (June 29 1893–June 28, 1972) was an Indian scientist and applied statistician.
..... Click the link for more information.
19th century - 20th century - 21st century
1900s  1910s  1920s  - 1930s -  1940s  1950s  1960s
1933 1934 1935 - 1936 - 1937 1938 1939

Year 1936 (MCMXXXVI
..... Click the link for more information.
correlation, also called correlation coefficient, indicates the strength and direction of a linear relationship between two random variables. In general statistical usage, correlation or co-relation refers to the departure of two variables from independence.
..... Click the link for more information.
sample is a subset of a population. Typically, the population is very large, making a census or a complete enumeration of all the values in the population impractical or impossible. The sample represents a subset of manageable size.
..... Click the link for more information.
List of topics named after Euclid (Euclidean or less commonly Euclidian)
  • Euclidean space
  • Euclidean geometry
  • Euclid's Elements
  • Euclidean domain
  • Euclidean distance
  • Euclidean ball
  • Euclidean algorithm
  • Euclidean distance map

..... Click the link for more information.
data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question.
..... Click the link for more information.
In statistics and probability theory, the covariance matrix is a matrix of covariances between elements of a vector. It is the natural generalization to higher dimensions of the concept of the variance of a scalar-valued random variable.
..... Click the link for more information.
A multivariate random variable or random vector is a vector X = (X1, ..., Xn) whose components are scalar-valued random variables on the same probability space (Ω, P).
..... Click the link for more information.
probability distribution that assigns a probability to every subset (more precisely every measurable subset) of its state space in such a way that the probability axioms are satisfied.
..... Click the link for more information.
In mathematics, the Euclidean distance or Euclidean metric is the "ordinary" distance between two points that one would measure with a ruler, which can be proven by repeated application of the Pythagorean theorem.
..... Click the link for more information.
In probability and statistics, the standard deviation of a probability distribution, random variable, or population or multiset of values is a measure of the spread of its values. It is usually denoted with the letter σ (lower case sigma).
..... Click the link for more information.
Euclidean space. Most of this article is devoted to developing the modern language necessary for the conceptual leap to higher dimensions.

An essential property of a Euclidean space is its flatness. Other spaces exist in geometry that are not Euclidean.
..... Click the link for more information.
In probability and statistics, the standard deviation of a probability distribution, random variable, or population or multiset of values is a measure of the spread of its values. It is usually denoted with the letter σ (lower case sigma).
..... Click the link for more information.
Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure.
..... Click the link for more information.
Statistical classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred to as traits, variables, characters, etc) and based on a training set of previously labeled
..... Click the link for more information.
In statistics, Hotelling's T-square statistic,[1] named for Harold Hotelling, is a generalization of Student's t statistic that is used in multivariate hypothesis testing.
..... Click the link for more information.
outlier is an observation that is numerically distant from the rest of the data, or Renee Masse. Statistics derived from data sets that include outliers will often be misleading.
..... Click the link for more information.
In statistics, linear regression is a regression method that models the relationship between a dependent variable Y, independent variables Xi, i = 1, ..., p, and a random term ε.
..... Click the link for more information.

This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.