Metrics
The metrics
module typically gathers various distance and similarity functions.
They range from computing the edit distance between two strings to retrieving the distance between two points in space.
Note: in all the testers below, know that you can separate sequences’ elements by using a comma if you want to compare things different from single strings.
Summary
Modules under the talisman/metrics
namespace:
- bag
- canberra
- chebyshev
- cosine
- damerau-levenshtein
- dice
- euclidean
- eudex
- guth
- hamming
- identity
- jaccard
- jaro
- jaro-winkler
- lcs
- length
- levenshtein
- lig
- manhattan
- minkowski
- mlipns
- monge-elkan
- mra
- overlap
- prefix
- ratcliff-obershelp
- sift4
- sorensen
- suffix
- tversky
Use case
Edit distance
Let’s assume we want to determine whether two strings are similar or not.
For instance, let’s compare healed & sealed.
They look somewhat the same, don’t they?
Well, a computer would have a hard time stating this fact, since, at the end of the day, it only dabbles in mathematics.
That’s why we have to use distance metrics to assess whether both words are similar.
In the example below, we used the Dice coefficient, rating the similarity of two words and ranging from 0 to 1:
// Let's use the Dice coefficient
import dice from 'talisman/metrics/dice';
// We'll say two strings are similar if their Dice coefficient >= 0.8
function compare(a, b) {
return dice(a, b) >= 0.8;
}
compare('healed', 'sealed');
>>> true
compare('healed', 'sold');
>>> false
bag
Reference: http://www-db.disi.unibo.it/research/papers/SPIRE02.pdf
String Matching with Metric Trees Using an Approximate Distance. Ilaria Bartolini, Paolo Ciaccia, and Marco Patella.
The bag distance is the maximum of the size of both differences of two multisets.
import bag from 'talisman/metrics/bag';
bag('Niall', 'Neil');
>>> 2
canberra
Reference: https://en.wikipedia.org/wiki/Canberra_distance
The Canberra distance is a weighted version of the Manhattan distance.
import canberra from 'talisman/metrics/canberra';
canberra([1, 3], [4, 5]);
>>> 0.85
chebyshev
Reference: https://en.wikipedia.org/wiki/Chebyshev_distance
The Chebyshev distance between two vectors is the greatest of their differences along any coordinate dimension.
import chebyshev from 'talisman/metrics/chebyshev';
chebyshev([1, 3], [4, 5]);
>>> 3
cosine
Reference: https://en.wikipedia.org/wiki/Cosine_similarity
Computes the cosine similarity between two vectors of same dimension.
import cosine from 'talisman/metrics/cosine';
// Alternatively
import {similarity, distance} from 'talisman/metrics/cosine';
cosine([1, 3], [4, 5]);
>>> 0.94
distance(a, b) === 1 - similarity(a, b);
damerau-levenshtein
Reference: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance
The Damerau-Levenshtein is an improvement over the classical Levenshtein distance.
That is to say this variant will handle transpositions as well as the usual additions, deletions & substitions.
import damerauLevenshtein from 'talisman/metrics/damerau-levenshtein';
damerauLevenshtein('this', 'tihs');
>>> 1
dice
Reference: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient
Dice, Lee R. (1945). “Measures of the Amount of Ecologic Association Between Species”. Ecology 26 (3): 297–302.
Computes the Dice coefficient between two sequences, usually strings.
Note that the Dice & Sorensen coefficients are the same thing and that you remain free to use the module you prefer.
import dice from 'talisman/metrics/dice';
// Alternatively
import {
index,
coefficient,
similarity,
distance
} from 'talisman/metrics/dice';
dice('healed', 'sealed');
>>> 0.8
distance(a, b) === 1 - similarity(a, b)
euclidean
Reference: https://en.wikipedia.org/wiki/Euclidean_distance
Computes the euclidean distance between two points in a N-dimensions space.
This distance is equal to the length of the straight line drawn between both points.
import euclidean from 'talisman/metrics/euclidean';
// Alternatively
import {squared} from 'talisman/metrics/euclidean';
euclidean([1, 3], [4, 5]);
>>> ~3.61
// The squared distance is sometimes used to same some computation
squared([1, 3], [4, 5]);
>>> 13
Euclidean distance
Squared euclidean distance
eudex
Reference: https://github.com/ticki/eudex
Author: ticki
Functions tailored to compare two string using Eudex hashes to assess whether they seem similar or not.
import {distance, isSimilar} from 'talisman/metrics/eudex';
distance('jumbo', 'jumpo');
>>> 2
isSimilar('jumbo', 'jumpo');
>>> true
isSimilar('hello', 'world');
>>> false
Distance
Similarity
hamming
Reference: https://en.wikipedia.org/wiki/Hamming_distance
Hamming, Richard W. (1950), “Error detecting and error correcting codes”, Bell System Technical Journal 29 (2): 147–160
Computes the Hamming distance between two equal-length sequences.
import hamming from 'talisman/metrics/hamming';
hamming('night', 'nacht');
>>> 2
hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]);
>>> 2
guth
Gloria J. A. Guth (1976) Surname Spellings and Computerized Record Linkage, Historical Methods Newsletter, 10:1, 10-19 DOI: 10.1080/00182494.1976.10112645
The Guth distance function basically perfoms a linear scan of both strings while allowing a small window of error. This is quite close to a linear time approximation of the Damerau-Levenshtein distance with a threshold of 1 or 2.
import guth from 'talisman/metrics/guth';
guth('SMITH', 'SMYSS');
>>> 3
identity
The identity similarity is simply 1
if given strings are identical and 0
if they are not.
The module also exports a similarity function which is just the reverse, for convenience.
import {distance, similarity} from 'talisman/metrics/identity';
distance('Niall', 'Neil');
>>> 1
similarity('Niall', 'Neil');
>>> 0
jaccard
Reference: https://en.wikipedia.org/wiki/Jaccard_index
Jaccard, Paul (1912), “The distribution of the flora in the alpine zone”, New Phytologist 11: 37–50
Computes the Jaccard index between two sequences, usually strings.
import jaccard from 'talisman/metrics/jaccard';
// Alternatively
import {
index,
similarity,
distance
} from 'talisman/metrics/jaccard';
index === similarity === jaccard
jaccard('context', 'contact');
>>> ~0.57
distance(a, b) === 1 - similarity(a, b)
jaro
Reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Jaro, M. A. (1989). “Advances in record linkage methodology as applied to the 1985 census of Tampa Florida”. Journal of the American Statistical Association 84 (406): 414–20
Jaro, M. A. (1995). “Probabilistic linkage of large public health data file”. Statistics in Medicine 14 (5–7): 491–8.
Computes the Jaro distance between two sequences, usually strings.
See also the Jaro-Winkler distance.
import jaro from 'talisman/metrics/jaro';
// Alternatively
import {
similarity,
distance
} from 'talisman/metrics/jaro';
similarity === jaro
jaro('Duane', 'Dwayne');
>>> 0.82
distance(a, b) === 1 - similarity(a, b)
jaro-winkler
Computes the Jaro-Winkler distance between two sequences, usually strings.
Reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
Winkler, W. E. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage”. Proceedings of the Section on Survey Research Methods (American Statistical Association): 354–359.
import jaroWinkler from 'talisman/metrics/jaro-winkler';
// Alternatively
import {
similarity,
distance
} from 'talisman/metrics/jaro-winkler';
similarity === jaroWinkler
jaroWinkler('Duane', 'Dwayne');
>>> 0.84
distance(a, b) === 1 - similarity(a, b)
Using custom parameters
import {custom} from 'talisman/metrics/jaro-winkler';
const parameters = {
boostThreshold: 0.7,
scalingFactor: 0.1
};
custom(parameters, 'Duane', 'Dwayne');
>>> 0.84
Options are the following:
- [boostThreshold]
number
(0.7): boost threshold comprised between 0 and 1. - [scalingFactor]
number
(0.1): scaling factor. Should not exceed 0.25.
lcs
The LCS similarity is the ratio between the length of the Longest Common Subsequence of both strings and the length of the longest string.
import {distance, similarity} from 'talisman/metrics/lcs';
similarity('aluminum', 'Catalan');
>>> 0.25
distance('aluminum', 'Catalan');
>>> 0.75
length
The length similarity is the ratio between the length of the shortest string and the length of the longest one.
The module also exports a similarity function which is just the reverse, for convenience.
import {distance, similarity} from 'talisman/metrics/length';
distance('Niall', 'Neil');
>>> 0.2
similarity('Niall', 'Neil');
>>> 0.8
levenshtein
Reference: https://en.wikipedia.org/wiki/Levenshtein_distance
Levenshtein, Vladimir I. (February 1966). “Binary codes capable of correcting deletions, insertions, and reversals”. Soviet Physics Doklady 10 (8): 707–710.
The Levenshtein distance is probably the most known metric dealing with edit distance.
This function therefore computes the absolute edit distance between two sequences, usually strings.
import levenshtein from 'talisman/metrics/levenshtein';
levenshtein('book', 'back');
>>> 2
lig
An Interface for Mining Genealogical Nominal Data Using the Concept of linkage and a Hybrid Name Matching Algorithm. Chakkrit Snae, Bernard Diaz Department of Computer Science, The University of Liverpool Peach Street, Liverpool, UK, L69 7ZF
The LIG2 and LIG3 similarity functions are both attempts to normalize the Levenshtein distance.
import {lig2, lig3} from 'talisman/metrics/lig';
lig2('Glavin', 'Glawyn');
>>> 0.67
lig3('Glavin', 'Glawyn');
>>> 0.80
manhattan
Reference: https://en.wikipedia.org/wiki/Taxicab_geometry
Computes the Manhattan distance between two points in a N-dimensions space.
This distance if also often called the “city block” distance because it won’t draw a straight line between both points but rather follow the other edges of the triangle.
import manhattan from 'talisman/metrics/manhattan';
manhattan([1, 3], [4, 5]);
>>> 5
minkowski
Reference: https://en.wikipedia.org/wiki/Minkowski_distance
The Minkowski distance is a generalization of both the euclidean & the Manhattan distance.
import minkowski from 'talisman/metrics/minkowski';
minkowski(1, [1, 3], [4, 5]);
>>> 5
const manhattan = minkowski.bind(null, 1);
const euclidean = minkowski.bind(null, 2);
Arguments
- p
number
- value for p. Must >= 1. - a
array
- the first vector. - b
array
- the second vector.
mlipns
Reference: http://www.sial.iias.spb.su/files/386-386-1-PB.pdf
Shannaq, Boumedyen A. N. and Victor V. Alexandrov. 2010. “Using Product Similarity for Adding Business.” Global Journal of Computer Science and Technology. 10(12). 2-8.
Modified Language-Independent Product Name Search similarity function. It will only return 0
or 1
.
import mlipns, {custom} from 'talisman/metrics/mlipns';
mlipns('Niall', 'Neil');
>>> 0
// It is possible to customize the function
custom({threshold: 0.3, maxMismatches: 3}, 'Niall', 'Neil');
Custom options
- [threshold]
boolean
(0.25) - threshold parameter, i.e. fraction of total length the function will search for transpositions. - [maxMismatches]
number
(2) - maximum number of mismatches to tolerate.
monge-elkan
Reference: http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf
Monge, Alvaro E. and Charles P. Elkan. 1996. “The field matching problem: Algorithms and applications.” KDD-9 Proceedings.
The Monge-Elkan function is a higher-order similarity function returning the averaged maximum of token to token similarity.
The module also exports a symmetric version of the procedure that is basically the mean of applying the function on A & B, then on B & A.
Note that both functions can also take arbitrary sequences given as arrays.
import identity from 'talisman/metrics/identity';
import mongeElkan, {symmetric} from 'talisman/metrics/monge-elkan';
mongeElkan(identity, 'A', 'AB');
>>> 1
symmetric(identity, 'AB', 'A');
>>> 0.75
Symmetric version
mra
Reference: https://en.wikipedia.org/wiki/Match_rating_approach
Moore, G B.; Kuhns, J L.; Treffzs, J L.; Montgomery, C A. (Feb 1, 1977). Accessing Individual Records from Personal Data Files Using Nonunique Identifiers. US National Institute of Standards and Technology. p. 17. NIST SP - 500-2.
Assesses the similarity between two names by using the Match Rating Approach comparison.
Note that if the difference between the strings’ codex’ lengths is superior to 3, the comparison won’t be done as the algorithm will deem the strings too different.
import mra from 'talisman/metrics/mra';
mra('Catherine', 'Kathryn');
>>> {
minimum: 3,
similarity: 4,
codex: ['CTHRN', 'KTHRYN'],
matching: true
}
Output
- minimum
number
- The minimum similarity score needed for the names to be considered similar. - similarity
number
- The similarity score. - codex
array
- Both codex produced bytalisman/phonetics/mra
. - matching
boolean
- Whether the comparison deemed both names similar or not.
overlap
Reference: https://en.wikipedia.org/wiki/Overlap_coefficient
Computes the overlap coefficient between two sequences.
import overlap from 'talisman/metrics/overlap';
overlap('abc', 'abcde');
>>> 1
prefix
The prefix similarity is the ratio between the number of characters matching at the beginning of both string and the length of the shortest string.
The module also exports a similarity function which is just the reverse, for convenience.
import {distance, similarity} from 'talisman/metrics/prefix';
distance('Niall', 'Neil');
>>> 0.75
similarity('Niall', 'Neil');
>>> 0.25
ratcliff-obershelp
Reference: https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html
PATTERN MATCHING: THE GESTALT APPROACH - John W. Ratcliff, David E. Metzener, Paul E. Black, “Ratcliff/Obershelp pattern recognition”, in Dictionary of Algorithms and Data Structures [online], Vreda Pieterse and Paul E. Black, eds. 17 December 2004.
The Ratcliff-Obershelp similarity is a normalized number of matches between both strings, matches being found by identifying the longest common subsquence between two strings then recursively applying the same logic on the left and right of this common subsequence.
import {similarity, distance} from 'talisman/metrics/ratcliff-obershelp';
similarity('mathematics', 'matematica');
>>> ~0.86
distance('mathematics', 'matematica');
>>> ~0.14
sift4
The SIFT4 distance function, invented by Siderite Zackwehdex, is a linear time approximation of the Damerau-Levenshtein distance.
import sift4, {custom} from 'talisman/metrics/sift4';
sift4('cat', 'hat');
>>> 1
// It is possible to customize the function
custom({
symmetric: true,
transpositions: false,
maxDistance: 2,
maxOffset: 5
}, 'cat', 'hat');
Custom options
- [symmetric]
boolean
(false) - whether to ensure the function’s output will be symmetric. - [transpositions]
boolean
(true) - whether to allow transpositions, like with the Damerau-Levenshtein distance. - [maxDistance]
number
(undefined) - whether to break computations as soon as this maximum distance is reached. - [maxOffset]
number
(5) - maximum window size to search for editions. Bigger numbers will hurt performance.
sorensen
The sorensen
module is just an alias of the dice one.
suffix
The suffix similarity is the ratio between the number of characters matching at the end of both string and the length of the shortest string.
The module also exports a similarity function which is just the reverse, for convenience.
import {distance, similarity} from 'talisman/metrics/suffix';
distance('Niall', 'Neil');
>>> 0.75
similarity('Niall', 'Neil');
>>> 0.25
tversky
Reference: https://en.wikipedia.org/wiki/Tversky_index
Tversky, Amos (1977). “Features of Similarity”. Psychological Reviews 84 (4): 327–352.
Computes the Tversky index between two strings.
The Tversky index can be seen as a generalization of both the Jaccard index and the Dice coefficient.
import tversky from 'talisman/metrics/tversky';
const parameters = {
symmetric: false,
alpha: 1,
beta: 1
};
// Tversky with alpha = beta = 1 is the Jaccard index
const jaccard = tversky.bind(null, parameters);
jaccard('context', 'contact');
>>> ~0.57
Parameters
- [symmetric]
boolean
(false) - whether to compute the symmetric or the asymmetric index. - [alpha]
number
(1) - alpha parameter. Must be >= 0. - [beta]
number
(1) - beta parameter. Must be >= 0.