# Distance Metrics

The `metrics/distance`

module typically gathers metrics aiming at finding a theoretical “distance” between two sequences.

They range from computing the edit distance between two strings to retrieving the distance between two points in space.

**Note**: in all the testers below, know that you can separate sequences’ elements by using a comma if you want to compare things different from single strings.

## Summary

Modules under the `talisman/metrics/distance/distance`

namespace:

- canberra
- chebyshev
- cosine
- damerau-levenshtein
- dice
- euclidean
- eudex
- hamming
- jaccard
- jaro
- jaro-winkler
- levenshtein
- manhattan
- minkowski
- mra
- overlap
- sorensen
- tversky

## Use case

*Edit distance*

Let’s assume we want to determine whether two strings are similar or not.

For instance, let’s compare *healed* & *sealed*.

They look somewhat the same, don’t they?

Well, a computer would have a hard time stating this fact, since, at the end of the day, it only dabbles in mathematics.

That’s why we have to use distance metrics to assess whether both words are similar.

In the example below, we used the Dice coefficient, rating the similarity of two words and ranging from 0 to 1:

```
// Let's use the Dice coefficient
import dice from 'talisman/metrics/distance/dice';
// We'll say two strings are similar if their Dice coefficient >= 0.8
function compare(a, b) {
return dice(a, b) >= 0.8;
}
compare('healed', 'sealed');
>>> true
compare('healed', 'sold');
>>> false
```

## canberra

Reference: https://en.wikipedia.org/wiki/Canberra_distance

The Canberra distance is a weighted version of the Manhattan distance.

```
import canberra from 'talisman/metrics/distance/canberra';
canberra([1, 3], [4, 5]);
>>> 0.85
```

## chebyshev

Reference: https://en.wikipedia.org/wiki/Chebyshev_distance

The Chebyshev distance between two vectors is the greatest of their differences along any coordinate dimension.

```
import chebyshev from 'talisman/metrics/distance/chebyshev';
chebyshev([1, 3], [4, 5]);
>>> 3
```

## cosine

Reference: https://en.wikipedia.org/wiki/Cosine_similarity

Computes the cosine similarity between two vectors of same dimension.

```
import cosine from 'talisman/metrics/distance/cosine';
// Alternatively
import {similarity, distance} from 'talisman/metrics/distance/cosine';
cosine([1, 3], [4, 5]);
>>> 0.94
distance(a, b) === 1 - similarity(a, b);
```

## damerau-levenshtein

Reference: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

The Damerau-Levenshtein is an improvement over the classical Levenshtein distance.

That is to say this variant will handle transpositions as well as the usual additions, deletions & substitions.

```
import damerauLevenshtein from 'talisman/metrics/distance/damerau-levenshtein';
damerauLevenshtein('this', 'tihs');
>>> 1
```

## dice

Reference: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

*Dice, Lee R. (1945). “Measures of the Amount of Ecologic Association Between Species”. Ecology 26 (3): 297–302.*

Computes the Dice coefficient between two sequences, usually strings.

Note that the Dice & Sorensen coefficients are the same thing and that you remain free to use the module you prefer.

```
import dice from 'talisman/metrics/distance/dice';
// Alternatively
import {
index,
coefficient,
similarity,
distance
} from 'talisman/metrics/distance/dice';
dice('healed', 'sealed');
>>> 0.8
distance(a, b) === 1 - similarity(a, b)
```

## euclidean

Reference: https://en.wikipedia.org/wiki/Euclidean_distance

Computes the euclidean distance between two points in a N-dimensions space.

This distance is equal to the length of the straight line drawn between both points.

```
import euclidean from 'talisman/metrics/distance/euclidean';
// Alternatively
import {squared} from 'talisman/metrics/distance/euclidean';
euclidean([1, 3], [4, 5]);
>>> ~3.61
// The squared distance is sometimes used to same some computation
squared([1, 3], [4, 5]);
>>> 13
```

*Euclidean distance*

*Squared euclidean distance*

## eudex

Reference: https://github.com/ticki/eudex

Author: ticki

Functions tailored to compare two string using Eudex hashes to assess whether they seem similar or not.

```
import {distance, isSimilar} from 'talisman/metrics/distance/eudex';
distance('jumbo', 'jumpo');
>>> 2
isSimilar('jumbo', 'jumpo');
>>> true
isSimilar('hello', 'world');
>>> false
```

*Distance*

*Similarity*

## hamming

Reference: https://en.wikipedia.org/wiki/Hamming_distance

*Hamming, Richard W. (1950), “Error detecting and error correcting codes”, Bell System Technical Journal 29 (2): 147–160*

Computes the Hamming distance between two equal-length sequences.

```
import hamming from 'talisman/metrics/distance/hamming';
hamming('night', 'nacht');
>>> 2
hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]);
>>> 2
```

## jaccard

Reference: https://en.wikipedia.org/wiki/Jaccard_index

*Jaccard, Paul (1912), “The distribution of the flora in the alpine zone”, New Phytologist 11: 37–50*

Computes the Jaccard index between two sequences, usually strings.

```
import jaccard from 'talisman/metrics/distance/jaccard';
// Alternatively
import {
index,
similarity,
distance
} from 'talisman/metrics/distance/jaccard';
index === similarity === jaccard
jaccard('context', 'contact');
>>> ~0.57
distance(a, b) === 1 - similarity(a, b)
```

## jaro

Reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

*Jaro, M. A. (1989). “Advances in record linkage methodology as applied to the 1985 census of Tampa Florida”. Journal of the American Statistical Association 84 (406): 414–20*

*Jaro, M. A. (1995). “Probabilistic linkage of large public health data file”. Statistics in Medicine 14 (5–7): 491–8.*

Computes the Jaro distance between two sequences, usually strings.

See also the Jaro-Winkler distance.

```
import jaro from 'talisman/metrics/distance/jaro';
// Alternatively
import {
similarity,
distance
} from 'talisman/metrics/distance/jaro';
similarity === jaro
jaro('Duane', 'Dwayne');
>>> 0.82
distance(a, b) === 1 - similarity(a, b)
```

## jaro-winkler

Computes the Jaro-Winkler distance between two sequences, usually strings.

Reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

*Winkler, W. E. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage”. Proceedings of the Section on Survey Research Methods (American Statistical Association): 354–359.*

```
import jaroWinkler from 'talisman/metrics/distance/jaro-winkler';
// Alternatively
import {
similarity,
distance
} from 'talisman/metrics/distance/jaro-winkler';
similarity === jaroWinkler
jaroWinkler('Duane', 'Dwayne');
>>> 0.84
distance(a, b) === 1 - similarity(a, b)
```

*Using custom parameters*

```
import {custom} from 'talisman/metrics/distance/jaro-winkler';
const parameters = {
boostThreshold: 0.7,
scalingFactor: 0.1
};
custom(parameters, 'Duane', 'Dwayne');
>>> 0.84
```

Options are the following:

**[boostThreshold]**`number`

(0.7): boost threshold comprised between 0 and 1.**[scalingFactor]**`number`

(0.1): scaling factor. Should not exceed 0.25.

## levenshtein

Reference: https://en.wikipedia.org/wiki/Levenshtein_distance

*Levenshtein, Vladimir I. (February 1966). “Binary codes capable of correcting deletions, insertions, and reversals”. Soviet Physics Doklady 10 (8): 707–710.*

The Levenshtein distance is probably the most known metric dealing with edit distance.

This function therefore computes the absolute edit distance between two sequences, usually strings.

```
import levenshtein from 'talisman/metrics/distance/levenshtein';
levenshtein('book', 'back');
>>> 2
```

## manhattan

Reference: https://en.wikipedia.org/wiki/Taxicab_geometry

Computes the Manhattan distance between two points in a N-dimensions space.

This distance if also often called the “city block” distance because it won’t draw a straight line between both points but rather follow the other edges of the triangle.

```
import manhattan from 'talisman/metrics/distance/manhattan';
manhattan([1, 3], [4, 5]);
>>> 5
```

## minkowski

Reference: https://en.wikipedia.org/wiki/Minkowski_distance

The Minkowski distance is a generalization of both the euclidean & the Manhattan distance.

```
import minkowski from 'talisman/metrics/distance/minkowski';
minkowski(1, [1, 3], [4, 5]);
>>> 5
const manhattan = minkowski.bind(null, 1);
const euclidean = minkowski.bind(null, 2);
```

*Arguments*

**p**`number`

- value for p. Must >= 1.**a**`array`

- the first vector.**b**`array`

- the second vector.

## mra

Reference: https://en.wikipedia.org/wiki/Match_rating_approach

*Moore, G B.; Kuhns, J L.; Treffzs, J L.; Montgomery, C A. (Feb 1, 1977). Accessing Individual Records from Personal Data Files Using Nonunique Identifiers. US National Institute of Standards and Technology. p. 17. NIST SP - 500-2.*

Assesses the similarity between two names by using the Match Rating Approach comparison.

Note that if the difference between the strings’ codex’ lengths is superior to 3, the comparison won’t be done as the algorithm will deem the strings too different.

```
import mra from 'talisman/metrics/distance/mra';
mra('Catherine', 'Kathryn');
>>> {
minimum: 3,
similarity: 4,
codex: ['CTHRN', 'KTHRYN'],
matching: true
}
```

*Output*

**minimum**`number`

- The minimum similarity score needed for the names to be considered similar.**similarity**`number`

- The similarity score.**codex**`array`

- Both codex produced by`talisman/phonetics/mra`

.**matching**`boolean`

- Whether the comparison deemed both names similar or not.

## overlap

Reference: https://en.wikipedia.org/wiki/Overlap_coefficient

Computes the overlap coefficient between two sequences.

```
import overlap from 'talisman/metrics/distance/overlap';
overlap('abc', 'abcde');
>>> 1
```

## sorensen

The `sorensen`

module is just an alias of the dice one.

## tversky

Reference: https://en.wikipedia.org/wiki/Tversky_index

*Tversky, Amos (1977). “Features of Similarity”. Psychological Reviews 84 (4): 327–352.*

Computes the Tversky index between two strings.

The Tversky index can be seen as a generalization of both the Jaccard index and the Dice coefficient.

```
import tversky from 'talisman/metrics/distance/tversky';
const parameters = {
symmetric: false,
alpha: 1,
beta: 1
};
// Tversky with alpha = beta = 1 is the Jaccard index
const jaccard = tversky.bind(null, parameters);
jaccard('context', 'contact');
>>> ~0.57
```

*Parameters*

**[symmetric]**`boolean`

(false) - whether to compute the symmetric or the asymmetric index.**[alpha]**`number`

(1) - alpha parameter. Must be >= 0.**[beta]**`number`

(1) - beta parameter. Must be >= 0.