Distance and Dissimilarities Metrics

Author

Affiliation

Published

June 24, 2024

vegan provides a variety of distance and dissimilarity measures through the vegdist() function. Here is some background on some commonly used distance and dissimilarity measures that you might find useful.

Bray-Curtis Dissimilarity

Definition: The Bray-Curtis dissimilarity index is calculated as: $D_{B C} = \frac{\sum | x_{i} - y_{i} |}{\sum (x_{i} + y_{i})}$ where $x_{i}$ and $y_{i}$ are the counts or abundances of species $i$ in samples $x$ and $y$ , respectively.
Properties: Ranges from 0 (identical communities) to 1 (completely different communities). It is sensitive to the abundance of species and is unaffected by joint absences.
Ecological Use: Used for species abundance data. Commonly used to compare the composition of different communities, track changes in community structure over time, and assess the impact of environmental gradients on community composition.
Limitations: It is sensitive to outliers, so the index can be influenced by extreme values or rare species with high abundances. It assumes a linear relationship between species abundances and dissimilarity, which may not always hold in ecological communities.

Sørensen (Dice) Dissimilarity

Definition: The Sørensen dissimilarity is given by: $D_{S} = 1 - \frac{2 C}{A + B}$ where $A$ and $B$ are the total counts of species in samples $x$ and $y$ , respectively, and $C$ is the sum of the minimum counts of shared species.
Properties: Similar to Bray-Curtis but puts more emphasis on the presence of shared species. It ranges from 0 (identical species composition) to 1 (no shared species).
Ecological Use: Focus on presence-absence data. Used to compare the similarity of species composition between two communities. The Sørensen dissimilarity index is fundamentally rooted in the concept of beta diversity, which quantifies the difference in species composition between two or more communities.
Limitations: The index does not account for the relative abundance of species: two communities with very different species abundances but similar species richness could have the same Sørensen dissimilarity. The presence or absence of rare species can disproportionately influence the Sørensen dissimilarity.

Jaccard Index

Definition: The Jaccard index is defined as: $D_{J} = 1 - \frac{C}{A + B - C}$ where $A$ and $B$ are the total counts of species in samples $x$ and $y$ , and $C$ is the count of shared species.
Properties: Ranges from 0 (complete similarity) to 1 (complete dissimilarity). It only considers the presence or absence of species, not their abundance.
Ecological Use: Often used for presence-absence data to compare species diversity and composition between sites.
Limitations: The index does not account for the relative abundance of species: two communities with very different species abundances but similar species richness could have the same Jaccard dissimilarity. The Jaccard index can be biased by differences in sampling effort between sites. It is sensitive to the presence of rare species.

Hellinger Distance

Definition: Hellinger distance is calculated using the Hellinger transformation: $h_{i j} = \sqrt{x_{i j} / \sum x_{i .}}$ where $x_{i j}$ is the abundance of species $j$ in sample $i$ . The distance is then the Euclidean distance between these transformed values.
Properties: The Hellinger distance is a measure of dissimilarity between two probability distributions. These probability distributions often represent the relative abundances of species in different communities. Unlike the Jaccard and Sørensen indices, which focus on presence/absence, the Hellinger distance accounts for both the presence/absence and the abundance of species. It ranges from 0 to 1. It reduces the influence of dominant species and is suited for relative abundance data.
Ecological Use: Used for ordination and clustering of community data to minimise the effect of large differences in species abundances. It reduces the influence of highly abundant species, making it less sensitive to outliers than some other dissimilarity measures. It can handle situations where a species is absent from one community but present in another, avoiding the “double zero” problem encountered by some other metrics.
Limitations: Can be less intuitive to interpret than measures like Bray-Curtis, which directly relate to differences in abundances.

Gower Distance

Definition: The Gower distance is a general similarity coefficient that can handle different types of variables (quantitative, binary, categorical). It is calculated as: $D_{G} = \frac{\sum w_{i} d_{i}}{\sum w_{i}}$ where $d_{i}$ is the distance between samples $x$ and $y$ for variable $i$ , and $w_{i}$ is the weight associated with variable $i$ .
Properties: Ranges from 0 to 1. It can incorporate various types of ecological data, making it very flexible. The specific calculation of $d_{i}$ depends on the data type of variable $i$ : for quantitative data, it typically uses the Manhattan distance (absolute difference) after range normalisation; for categorical data it usually uses the Dice coefficient (proportion of mismatches); and for ordinal data, Manhattan distance on ranked values with adjustments for ties is used.
Ecological Use: Suitable for mixed data types, such as ecological surveys combining species counts, presence-absence data, and environmental variables.
Limitations: The composite nature of the Gower distance can make the interpretation of the resulting dissimilarity values less straightforward than for simpler metrics like Euclidean distance. The choice of weights and distance measures for different variable types can affect the results.

Euclidean Distance

Definition: The Euclidean distance between two samples is the straight-line distance in multivariate space: $D_{E} = \sqrt{\sum (x_{i} - y_{i})^{2}}$ where $x_{i}$ and $y_{i}$ are the values of variable $i$ in samples $x$ and $y$ , respectively.
Properties: Sensitive to differences in scale and magnitude of data. Often used as a baseline comparison.
Ecological Use: Useful for quantitative data where differences in magnitude are important. Not commonly used for species abundance data due to sensitivity to large values.

Manhattan Distance

Definition: The Manhattan distance (or city block distance) is: $D_{M} = \sum | x_{i} - y_{i} |$ where $x_{i}$ and $y_{i}$ are the values of variable $i$ in samples $x$ and $y$ , respectively.
Properties: The Manhattan distance is also known as city block distance or L1 distance. It is a geometric measure of distance between two points in a multidimensional space. Unlike the Euclidean distance, which measures the straight-line distance, the Manhattan distance calculates the distance along the axes of the coordinate system. In multivariate community data, each dimension often represents a different species, and the coordinates of a point represent the abundances of those species in a particular community or sample. The Manhattan distance between two points then reflects the total difference in species abundances between those communities. Less sensitive to outliers compared to Euclidean distance. Values are always non-negative and is zero only if the two points are identical. Larger values indicate greater dissimilarity between the communities.
Ecological Use: Used for quantitative ecological data, especially when dealing with high-dimensional datasets where large outliers can skew results.
Limitations: It treats each variable (species) independently and doesn’t account for potential correlations between them. It can be sensitive to differences in scale between variables.

Canberra Distance

Definition: The Canberra distance is calculated as: $D_{C} = \sum \frac{| x_{i} - y_{i} |}{| x_{i} | + | y_{i} |}$
Properties: The Canberra distance is a numerical measure of the dissimilarity between two sets of data points. These data points often represent the abundances of different species in two communities or samples. The Canberra distance is calculated by summing the weighted absolute differences between the values of each variable (species) in the two sets. The weights are the inverse of the sum of the absolute values of the variables, which means that variables with larger values have less influence on the distance. Sensitive to differences in small values and zeroes, but unaffected by differences in the scales of variables. The Canberra distance is always non-negative and is zero only if the two points are identical. Larger values indicate greater dissimilarity between the communities.
Ecological Use: Suitable for ecological data where small differences are important, such as in studies of rare species or trace element concentrations.
Limitations: The Canberra distance can be sensitive to small differences in variables with very low values, which may not be biologically meaningful. It can also be affected by the presence of zeros in the data.

Reuse

CC BY-NC-SA 4.0

Citation

BibTeX citation:

@online{smit,_a._j.2024,
  author = {Smit, A. J.,},
  title = {Distance and {Dissimilarities} {Metrics}},
  date = {2024-06-24},
  url = {http://tangledbank.netlify.app/BCB743/dis-metrics.html},
  langid = {en}
}

For attribution, please cite this work as:

Smit, A. J. (2024) Distance and Dissimilarities Metrics. http://tangledbank.netlify.app/BCB743/dis-metrics.html.

--- date: "2024-06-24" title: "Distance and Dissimilarities Metrics" --- **vegan** provides a variety of distance and dissimilarity measures through the `vegdist()` function. Here is some background on some commonly used distance and dissimilarity measures that you might find useful. ## Bray-Curtis Dissimilarity - **Definition**: The Bray-Curtis dissimilarity index is calculated as: $$ D_{BC} = \frac{\sum |x_i - y_i|}{\sum (x_i + y_i)} $$ where $x_i$ and $y_i$ are the counts or abundances of species $i$ in samples $x$ and $y$, respectively. - **Properties**: Ranges from 0 (identical communities) to 1 (completely different communities). It is sensitive to the abundance of species and is unaffected by joint absences. - **Ecological Use**: Used for species abundance data. Commonly used to compare the composition of different communities, track changes in community structure over time, and assess the impact of environmental gradients on community composition. - **Limitations:** It is sensitive to outliers, so the index can be influenced by extreme values or rare species with high abundances. It assumes a linear relationship between species abundances and dissimilarity, which may not always hold in ecological communities. ## Sørensen (Dice) Dissimilarity - **Definition**: The Sørensen dissimilarity is given by: $$ D_S = 1 - \frac{2C}{A + B} $$ where $A$ and $B$ are the total counts of species in samples $x$ and $y$, respectively, and $C$ is the sum of the minimum counts of shared species. - **Properties**: Similar to Bray-Curtis but puts more emphasis on the presence of shared species. It ranges from 0 (identical species composition) to 1 (no shared species). - **Ecological Use**: Focus on presence-absence data. Used to compare the similarity of species composition between two communities. The Sørensen dissimilarity index is fundamentally rooted in the concept of beta diversity, which quantifies the difference in species composition between two or more communities. - **Limitations:** The index does not account for the relative abundance of species: two communities with very different species abundances but similar species richness could have the same Sørensen dissimilarity. The presence or absence of rare species can disproportionately influence the Sørensen dissimilarity. ## Jaccard Index - **Definition**: The Jaccard index is defined as: $$ D_J = 1 - \frac{C}{A + B - C} $$ where $A$ and $B$ are the total counts of species in samples $x$ and $y$, and $C$ is the count of shared species. - **Properties**: Ranges from 0 (complete similarity) to 1 (complete dissimilarity). It only considers the presence or absence of species, not their abundance. - **Ecological Use**: Often used for presence-absence data to compare species diversity and composition between sites. - **Limitations:** The index does not account for the relative abundance of species: two communities with very different species abundances but similar species richness could have the same Jaccard dissimilarity. The Jaccard index can be biased by differences in sampling effort between sites. It is sensitive to the presence of rare species. ## Hellinger Distance - **Definition**: Hellinger distance is calculated using the Hellinger transformation: $$ h_{ij} = \sqrt{x_{ij} / \sum x_{i.}} $$ where $x_{ij}$ is the abundance of species $j$ in sample $i$. The distance is then the Euclidean distance between these transformed values. - **Properties**: The Hellinger distance is a measure of dissimilarity between two probability distributions. These probability distributions often represent the relative abundances of species in different communities. Unlike the Jaccard and Sørensen indices, which focus on presence/absence, the Hellinger distance accounts for both the presence/absence and the abundance of species. It ranges from 0 to 1. It reduces the influence of dominant species and is suited for relative abundance data. - **Ecological Use**: Used for ordination and clustering of community data to minimise the effect of large differences in species abundances. It reduces the influence of highly abundant species, making it less sensitive to outliers than some other dissimilarity measures. It can handle situations where a species is absent from one community but present in another, avoiding the "double zero" problem encountered by some other metrics. - **Limitations:** Can be less intuitive to interpret than measures like Bray-Curtis, which directly relate to differences in abundances. ## Gower Distance - **Definition**: The Gower distance is a general similarity coefficient that can handle different types of variables (quantitative, binary, categorical). It is calculated as: $$ D_G = \frac{\sum w_i d_i}{\sum w_i} $$ where $d_i$ is the distance between samples $x$ and $y$ for variable $i$, and $w_i$ is the weight associated with variable $i$. - **Properties**: Ranges from 0 to 1. It can incorporate various types of ecological data, making it very flexible. The specific calculation of $d_i$ depends on the data type of variable $i$: for quantitative data, it typically uses the Manhattan distance (absolute difference) after range normalisation; for categorical data it usually uses the Dice coefficient (proportion of mismatches); and for ordinal data, Manhattan distance on ranked values with adjustments for ties is used. - **Ecological Use**: Suitable for mixed data types, such as ecological surveys combining species counts, presence-absence data, and environmental variables. - **Limitations:** The composite nature of the Gower distance can make the interpretation of the resulting dissimilarity values less straightforward than for simpler metrics like Euclidean distance. The choice of weights and distance measures for different variable types can affect the results. ## Euclidean Distance - **Definition**: The Euclidean distance between two samples is the straight-line distance in multivariate space: $$ D_E = \sqrt{\sum (x_i - y_i)^2} $$ where $x_i$ and $y_i$ are the values of variable $i$ in samples $x$ and $y$, respectively. - **Properties**: Sensitive to differences in scale and magnitude of data. Often used as a baseline comparison. - **Ecological Use**: Useful for quantitative data where differences in magnitude are important. Not commonly used for species abundance data due to sensitivity to large values. ## Manhattan Distance - **Definition**: The Manhattan distance (or city block distance) is: $$ D_M = \sum |x_i - y_i| $$ where $x_i$ and $y_i$ are the values of variable $i$ in samples $x$ and $y$, respectively. - **Properties**: The Manhattan distance is also known as city block distance or L1 distance. It is a geometric measure of distance between two points in a multidimensional space. Unlike the Euclidean distance, which measures the straight-line distance, the Manhattan distance calculates the distance along the axes of the coordinate system. In multivariate community data, each dimension often represents a different species, and the coordinates of a point represent the abundances of those species in a particular community or sample. The Manhattan distance between two points then reflects the total difference in species abundances between those communities. Less sensitive to outliers compared to Euclidean distance. Values are always non-negative and is zero only if the two points are identical. Larger values indicate greater dissimilarity between the communities. - **Ecological Use**: Used for quantitative ecological data, especially when dealing with high-dimensional datasets where large outliers can skew results. - **Limitations:** It treats each variable (species) independently and doesn't account for potential correlations between them. It can be sensitive to differences in scale between variables. ## Canberra Distance - **Definition**: The Canberra distance is calculated as: $$ D_C = \sum \frac{|x_i - y_i|}{|x_i| + |y_i|} $$ - **Properties**: The Canberra distance is a numerical measure of the dissimilarity between two sets of data points. These data points often represent the abundances of different species in two communities or samples. The Canberra distance is calculated by summing the weighted absolute differences between the values of each variable (species) in the two sets. The weights are the inverse of the sum of the absolute values of the variables, which means that variables with larger values have less influence on the distance. Sensitive to differences in small values and zeroes, but unaffected by differences in the scales of variables. The Canberra distance is always non-negative and is zero only if the two points are identical. Larger values indicate greater dissimilarity between the communities. - **Ecological Use**: Suitable for ecological data where small differences are important, such as in studies of rare species or trace element concentrations. - **Limitations:** The Canberra distance can be sensitive to small differences in variables with very low values, which may not be biologically meaningful. It can also be affected by the presence of zeros in the data.