Metrics

The metrics module typically gathers various distance and similarity functions.

They range from computing the edit distance between two strings to retrieving the distance between two points in space.

Note: in all the testers below, know that you can separate sequences’ elements by using a comma if you want to compare things different from single strings.

Summary

Modules under the talisman/metrics namespace:

bag
canberra
chebyshev
cosine
damerau-levenshtein
dice
euclidean
eudex
guth
hamming
identity
jaccard
jaro
jaro-winkler
lcs
length
levenshtein
lig
manhattan
minkowski
mlipns
monge-elkan
mra
overlap
prefix
ratcliff-obershelp
sift4
sorensen
suffix
tversky

Use case

Edit distance

Let’s assume we want to determine whether two strings are similar or not.

For instance, let’s compare healed & sealed.

They look somewhat the same, don’t they?

Well, a computer would have a hard time stating this fact, since, at the end of the day, it only dabbles in mathematics.

That’s why we have to use distance metrics to assess whether both words are similar.

In the example below, we used the Dice coefficient, rating the similarity of two words and ranging from 0 to 1:

// Let's use the Dice coefficient
import dice from 'talisman/metrics/dice';

// We'll say two strings are similar if their Dice coefficient >= 0.8
function compare(a, b) {
  return dice(a, b) >= 0.8;
}

compare('healed', 'sealed');
>>> true

compare('healed', 'sold');
>>> false

bag

Reference: http://www-db.disi.unibo.it/research/papers/SPIRE02.pdf

String Matching with Metric Trees Using an Approximate Distance. Ilaria Bartolini, Paolo Ciaccia, and Marco Patella.

The bag distance is the maximum of the size of both differences of two multisets.

import bag from 'talisman/metrics/bag';

bag('Niall', 'Neil');
>>> 2

canberra

Reference: https://en.wikipedia.org/wiki/Canberra_distance

The Canberra distance is a weighted version of the Manhattan distance.

import canberra from 'talisman/metrics/canberra';

canberra([1, 3], [4, 5]);
>>> 0.85

chebyshev

Reference: https://en.wikipedia.org/wiki/Chebyshev_distance

The Chebyshev distance between two vectors is the greatest of their differences along any coordinate dimension.

import chebyshev from 'talisman/metrics/chebyshev';

chebyshev([1, 3], [4, 5]);
>>> 3

cosine

Reference: https://en.wikipedia.org/wiki/Cosine_similarity

Computes the cosine similarity between two vectors of same dimension.

import cosine from 'talisman/metrics/cosine';
// Alternatively
import {similarity, distance} from 'talisman/metrics/cosine';

cosine([1, 3], [4, 5]);
>>> 0.94

distance(a, b) === 1 - similarity(a, b);

damerau-levenshtein

Reference: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

The Damerau-Levenshtein is an improvement over the classical Levenshtein distance.

That is to say this variant will handle transpositions as well as the usual additions, deletions & substitions.

import damerauLevenshtein from 'talisman/metrics/damerau-levenshtein';

damerauLevenshtein('this', 'tihs');
>>> 1

dice

Reference: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

Dice, Lee R. (1945). “Measures of the Amount of Ecologic Association Between Species”. Ecology 26 (3): 297–302.

Computes the Dice coefficient between two sequences, usually strings.

Note that the Dice & Sorensen coefficients are the same thing and that you remain free to use the module you prefer.

import dice from 'talisman/metrics/dice';
// Alternatively
import {
  index,
  coefficient,
  similarity,
  distance
} from 'talisman/metrics/dice';

dice('healed', 'sealed');
>>> 0.8

distance(a, b) === 1 - similarity(a, b)

euclidean

Reference: https://en.wikipedia.org/wiki/Euclidean_distance

Computes the euclidean distance between two points in a N-dimensions space.

This distance is equal to the length of the straight line drawn between both points.

import euclidean from 'talisman/metrics/euclidean';
// Alternatively
import {squared} from 'talisman/metrics/euclidean';

euclidean([1, 3], [4, 5]);
>>> ~3.61

// The squared distance is sometimes used to same some computation
squared([1, 3], [4, 5]);
>>> 13

Euclidean distance

Squared euclidean distance

eudex

Reference: https://github.com/ticki/eudex

Author: ticki

Functions tailored to compare two string using Eudex hashes to assess whether they seem similar or not.

import {distance, isSimilar} from 'talisman/metrics/eudex';

distance('jumbo', 'jumpo');
>>> 2

isSimilar('jumbo', 'jumpo');
>>> true

isSimilar('hello', 'world');
>>> false

Distance

Similarity

hamming

Reference: https://en.wikipedia.org/wiki/Hamming_distance

Hamming, Richard W. (1950), “Error detecting and error correcting codes”, Bell System Technical Journal 29 (2): 147–160

Computes the Hamming distance between two equal-length sequences.

import hamming from 'talisman/metrics/hamming';

hamming('night', 'nacht');
>>> 2

hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]);
>>> 2

guth

Gloria J. A. Guth (1976) Surname Spellings and Computerized Record Linkage, Historical Methods Newsletter, 10:1, 10-19 DOI: 10.1080/00182494.1976.10112645

The Guth distance function basically perfoms a linear scan of both strings while allowing a small window of error. This is quite close to a linear time approximation of the Damerau-Levenshtein distance with a threshold of 1 or 2.

import guth from 'talisman/metrics/guth';

guth('SMITH', 'SMYSS');
>>> 3

identity

The identity similarity is simply 1 if given strings are identical and 0 if they are not.

The module also exports a similarity function which is just the reverse, for convenience.

import {distance, similarity} from 'talisman/metrics/identity';

distance('Niall', 'Neil');
>>> 1

similarity('Niall', 'Neil');
>>> 0

jaccard

Reference: https://en.wikipedia.org/wiki/Jaccard_index

Jaccard, Paul (1912), “The distribution of the flora in the alpine zone”, New Phytologist 11: 37–50

Computes the Jaccard index between two sequences, usually strings.

import jaccard from 'talisman/metrics/jaccard';
// Alternatively
import {
  index,
  similarity,
  distance
} from 'talisman/metrics/jaccard';

index === similarity === jaccard

jaccard('context', 'contact');
>>> ~0.57

distance(a, b) === 1 - similarity(a, b)

jaro

Reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Jaro, M. A. (1989). “Advances in record linkage methodology as applied to the 1985 census of Tampa Florida”. Journal of the American Statistical Association 84 (406): 414–20

Jaro, M. A. (1995). “Probabilistic linkage of large public health data file”. Statistics in Medicine 14 (5–7): 491–8.

Computes the Jaro distance between two sequences, usually strings.

jaro-winkler

Computes the Jaro-Winkler distance between two sequences, usually strings.

Reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Winkler, W. E. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage”. Proceedings of the Section on Survey Research Methods (American Statistical Association): 354–359.

import jaroWinkler from 'talisman/metrics/jaro-winkler';
// Alternatively
import {
  similarity,
  distance
} from 'talisman/metrics/jaro-winkler';

similarity === jaroWinkler

jaroWinkler('Duane', 'Dwayne');
>>> 0.84

distance(a, b) === 1 - similarity(a, b)

Using custom parameters

import {custom} from 'talisman/metrics/jaro-winkler';

const parameters = {
  boostThreshold: 0.7,
  scalingFactor: 0.1
};

custom(parameters, 'Duane', 'Dwayne');
>>> 0.84

Options are the following:

[boostThreshold] number (0.7): boost threshold comprised between 0 and 1.
[scalingFactor] number (0.1): scaling factor. Should not exceed 0.25.

lcs

The LCS similarity is the ratio between the length of the Longest Common Subsequence of both strings and the length of the longest string.

import {distance, similarity} from 'talisman/metrics/lcs';

similarity('aluminum', 'Catalan');
>>> 0.25

distance('aluminum', 'Catalan');
>>> 0.75

length

The length similarity is the ratio between the length of the shortest string and the length of the longest one.

The module also exports a similarity function which is just the reverse, for convenience.

import {distance, similarity} from 'talisman/metrics/length';

distance('Niall', 'Neil');
>>> 0.2

similarity('Niall', 'Neil');
>>> 0.8

levenshtein

Reference: https://en.wikipedia.org/wiki/Levenshtein_distance

Levenshtein, Vladimir I. (February 1966). “Binary codes capable of correcting deletions, insertions, and reversals”. Soviet Physics Doklady 10 (8): 707–710.

The Levenshtein distance is probably the most known metric dealing with edit distance.

This function therefore computes the absolute edit distance between two sequences, usually strings.

import levenshtein from 'talisman/metrics/levenshtein';

levenshtein('book', 'back');
>>> 2

lig

An Interface for Mining Genealogical Nominal Data Using the Concept of linkage and a Hybrid Name Matching Algorithm. Chakkrit Snae, Bernard Diaz Department of Computer Science, The University of Liverpool Peach Street, Liverpool, UK, L69 7ZF

The LIG2 and LIG3 similarity functions are both attempts to normalize the Levenshtein distance.

import {lig2, lig3} from 'talisman/metrics/lig';

lig2('Glavin', 'Glawyn');
>>> 0.67


lig3('Glavin', 'Glawyn');
>>> 0.80

manhattan

Reference: https://en.wikipedia.org/wiki/Taxicab_geometry

Computes the Manhattan distance between two points in a N-dimensions space.

This distance if also often called the “city block” distance because it won’t draw a straight line between both points but rather follow the other edges of the triangle.

import manhattan from 'talisman/metrics/manhattan';

manhattan([1, 3], [4, 5]);
>>> 5

minkowski

Reference: https://en.wikipedia.org/wiki/Minkowski_distance

The Minkowski distance is a generalization of both the euclidean & the Manhattan distance.

import minkowski from 'talisman/metrics/minkowski';

minkowski(1, [1, 3], [4, 5]);
>>> 5

const manhattan = minkowski.bind(null, 1);
const euclidean = minkowski.bind(null, 2);

Arguments

p number - value for p. Must >= 1.
a array - the first vector.
b array - the second vector.

mlipns

Reference: http://www.sial.iias.spb.su/files/386-386-1-PB.pdf

Shannaq, Boumedyen A. N. and Victor V. Alexandrov. 2010. “Using Product Similarity for Adding Business.” Global Journal of Computer Science and Technology. 10(12). 2-8.

Modified Language-Independent Product Name Search similarity function. It will only return 0 or 1.

import mlipns, {custom} from 'talisman/metrics/mlipns';

mlipns('Niall', 'Neil');
>>> 0

// It is possible to customize the function
custom({threshold: 0.3, maxMismatches: 3}, 'Niall', 'Neil');

Custom options

[threshold] boolean (0.25) - threshold parameter, i.e. fraction of total length the function will search for transpositions.
[maxMismatches] number (2) - maximum number of mismatches to tolerate.

monge-elkan

Reference: http://www.aaai.org/Papers/KDD/1996/KDD96-044.pdf

Monge, Alvaro E. and Charles P. Elkan. 1996. “The field matching problem: Algorithms and applications.” KDD-9 Proceedings.

The Monge-Elkan function is a higher-order similarity function returning the averaged maximum of token to token similarity.

The module also exports a symmetric version of the procedure that is basically the mean of applying the function on A & B, then on B & A.

Note that both functions can also take arbitrary sequences given as arrays.

import identity from 'talisman/metrics/identity';
import mongeElkan, {symmetric} from 'talisman/metrics/monge-elkan';

mongeElkan(identity, 'A', 'AB');
>>> 1

symmetric(identity, 'AB', 'A');
>>> 0.75

Symmetric version

mra

Reference: https://en.wikipedia.org/wiki/Match_rating_approach

Moore, G B.; Kuhns, J L.; Treffzs, J L.; Montgomery, C A. (Feb 1, 1977). Accessing Individual Records from Personal Data Files Using Nonunique Identifiers. US National Institute of Standards and Technology. p. 17. NIST SP - 500-2.

Assesses the similarity between two names by using the Match Rating Approach comparison.

Note that if the difference between the strings’ codex’ lengths is superior to 3, the comparison won’t be done as the algorithm will deem the strings too different.

import mra from 'talisman/metrics/mra';

mra('Catherine', 'Kathryn');
>>> {
  minimum: 3,
  similarity: 4,
  codex: ['CTHRN', 'KTHRYN'],
  matching: true
}

Output

minimum number - The minimum similarity score needed for the names to be considered similar.
similarity number - The similarity score.
codex array - Both codex produced by talisman/phonetics/mra.
matching boolean - Whether the comparison deemed both names similar or not.

overlap

Reference: https://en.wikipedia.org/wiki/Overlap_coefficient

Computes the overlap coefficient between two sequences.

import overlap from 'talisman/metrics/overlap';

overlap('abc', 'abcde');
>>> 1

prefix

The prefix similarity is the ratio between the number of characters matching at the beginning of both string and the length of the shortest string.

The module also exports a similarity function which is just the reverse, for convenience.

import {distance, similarity} from 'talisman/metrics/prefix';

distance('Niall', 'Neil');
>>> 0.75

similarity('Niall', 'Neil');
>>> 0.25

ratcliff-obershelp

Reference: https://xlinux.nist.gov/dads/HTML/ratcliffObershelp.html

PATTERN MATCHING: THE GESTALT APPROACH - John W. Ratcliff, David E. Metzener, Paul E. Black, “Ratcliff/Obershelp pattern recognition”, in Dictionary of Algorithms and Data Structures [online], Vreda Pieterse and Paul E. Black, eds. 17 December 2004.

The Ratcliff-Obershelp similarity is a normalized number of matches between both strings, matches being found by identifying the longest common subsquence between two strings then recursively applying the same logic on the left and right of this common subsequence.

import {similarity, distance} from 'talisman/metrics/ratcliff-obershelp';

similarity('mathematics', 'matematica');
>>> ~0.86

distance('mathematics', 'matematica');
>>> ~0.14

sift4

Reference: http://web.archive.org/web/20190613223908/https://siderite.blogspot.com/2014/11/super-fast-and-accurate-string-distance.html

The SIFT4 distance function, invented by Siderite Zackwehdex, is a linear time approximation of the Damerau-Levenshtein distance.

import sift4, {custom} from 'talisman/metrics/sift4';

sift4('cat', 'hat');
>>> 1

// It is possible to customize the function
custom({
  symmetric: true,
  transpositions: false,
  maxDistance: 2,
  maxOffset: 5
}, 'cat', 'hat');

Custom options

[symmetric] boolean (false) - whether to ensure the function’s output will be symmetric.
[transpositions] boolean (true) - whether to allow transpositions, like with the Damerau-Levenshtein distance.
[maxDistance] number (undefined) - whether to break computations as soon as this maximum distance is reached.
[maxOffset] number (5) - maximum window size to search for editions. Bigger numbers will hurt performance.

sorensen

The sorensen module is just an alias of the dice one.

suffix

The suffix similarity is the ratio between the number of characters matching at the end of both string and the length of the shortest string.

The module also exports a similarity function which is just the reverse, for convenience.

import {distance, similarity} from 'talisman/metrics/suffix';

distance('Niall', 'Neil');
>>> 0.75

similarity('Niall', 'Neil');
>>> 0.25

tversky

Reference: https://en.wikipedia.org/wiki/Tversky_index

Tversky, Amos (1977). “Features of Similarity”. Psychological Reviews 84 (4): 327–352.

Computes the Tversky index between two strings.

The Tversky index can be seen as a generalization of both the Jaccard index and the Dice coefficient.

import tversky from 'talisman/metrics/tversky';

const parameters = {
  symmetric: false,
  alpha: 1,
  beta: 1
};

// Tversky with alpha = beta = 1 is the Jaccard index
const jaccard = tversky.bind(null, parameters);

jaccard('context', 'contact');
>>> ~0.57

Parameters

[symmetric] boolean (false) - whether to compute the symmetric or the asymmetric index.
[alpha] number (1) - alpha parameter. Must be >= 0.
[beta] number (1) - beta parameter. Must be >= 0.