Distance Metrics

The metrics/distance module typically gathers metrics aiming at finding a theoretical “distance” between two sequences.

They range from computing the edit distance between two strings to retrieving the distance between two points in space.

Note: in all the testers below, know that you can separate sequences’ elements by using a comma if you want to compare things different from single strings.

Summary

Modules under the talisman/metrics/distance/distance namespace:

Use case

Edit distance

Let’s assume we want to determine whether two strings are similar or not.

For instance, let’s compare healed & sealed.

They look somewhat the same, don’t they?

Well, a computer would have a hard time stating this fact, since, at the end of the day, it only dabbles in mathematics.

That’s why we have to use distance metrics to assess whether both words are similar.

In the example below, we used the Dice coefficient, rating the similarity of two words and ranging from 0 to 1:

// Let's use the Dice coefficient
import dice from 'talisman/metrics/distance/dice';

// We'll say two strings are similar if their Dice coefficient >= 0.8
function compare(a, b) {
  return dice(a, b) >= 0.8;
}

compare('healed', 'sealed');
>>> true

compare('healed', 'sold');
>>> false

canberra

Reference: https://en.wikipedia.org/wiki/Canberra_distance

The Canberra distance is a weighted version of the Manhattan distance.

import canberra from 'talisman/metrics/distance/canberra';

canberra([1, 3], [4, 5]);
>>> 0.85

chebyshev

Reference: https://en.wikipedia.org/wiki/Chebyshev_distance

The Chebyshev distance between two vectors is the greatest of their differences along any coordinate dimension.

import chebyshev from 'talisman/metrics/distance/chebyshev';

chebyshev([1, 3], [4, 5]);
>>> 3

cosine

Reference: https://en.wikipedia.org/wiki/Cosine_similarity

Computes the cosine similarity between two vectors of same dimension.

import cosine from 'talisman/metrics/distance/cosine';
// Alternatively
import {similarity, distance} from 'talisman/metrics/distance/cosine';

cosine([1, 3], [4, 5]);
>>> 0.94

distance(a, b) === 1 - similarity(a, b);

damerau-levenshtein

Reference: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

The Damerau-Levenshtein is an improvement over the classical Levenshtein distance.

That is to say this variant will handle transpositions as well as the usual additions, deletions & substitions.

import damerauLevenshtein from 'talisman/metrics/distance/damerau-levenshtein';

damerauLevenshtein('this', 'tihs');
>>> 1

dice

Reference: https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient

Dice, Lee R. (1945). “Measures of the Amount of Ecologic Association Between Species”. Ecology 26 (3): 297–302.

Computes the Dice coefficient between two sequences, usually strings.

Note that the Dice & Sorensen coefficients are the same thing and that you remain free to use the module you prefer.

import dice from 'talisman/metrics/distance/dice';
// Alternatively
import {
  index,
  coefficient,
  similarity,
  distance
} from 'talisman/metrics/distance/dice';

dice('healed', 'sealed');
>>> 0.8

distance(a, b) === 1 - similarity(a, b)

euclidean

Reference: https://en.wikipedia.org/wiki/Euclidean_distance

Computes the euclidean distance between two points in a N-dimensions space.

This distance is equal to the length of the straight line drawn between both points.

import euclidean from 'talisman/metrics/distance/euclidean';
// Alternatively
import {squared} from 'talisman/metrics/distance/euclidean';

euclidean([1, 3], [4, 5]);
>>> ~3.61

// The squared distance is sometimes used to same some computation
squared([1, 3], [4, 5]);
>>> 13

Euclidean distance

Squared euclidean distance

eudex

Reference: https://github.com/ticki/eudex

Author: ticki

Functions tailored to compare two string using Eudex hashes to assess whether they seem similar or not.

import {distance, isSimilar} from 'talisman/metrics/distance/eudex';

distance('jumbo', 'jumpo');
>>> 2

isSimilar('jumbo', 'jumpo');
>>> true

isSimilar('hello', 'world');
>>> false

Distance

Similarity

hamming

Reference: https://en.wikipedia.org/wiki/Hamming_distance

Hamming, Richard W. (1950), “Error detecting and error correcting codes”, Bell System Technical Journal 29 (2): 147–160

Computes the Hamming distance between two equal-length sequences.

import hamming from 'talisman/metrics/distance/hamming';

hamming('night', 'nacht');
>>> 2

hamming([1, 0, 1, 1, 1, 0, 1], [1, 0, 0, 1, 0, 0, 1]);
>>> 2

jaccard

Reference: https://en.wikipedia.org/wiki/Jaccard_index

Jaccard, Paul (1912), “The distribution of the flora in the alpine zone”, New Phytologist 11: 37–50

Computes the Jaccard index between two sequences, usually strings.

import jaccard from 'talisman/metrics/distance/jaccard';
// Alternatively
import {
  index,
  similarity,
  distance
} from 'talisman/metrics/distance/jaccard';

index === similarity === jaccard

jaccard('context', 'contact');
>>> ~0.57

distance(a, b) === 1 - similarity(a, b)

jaro

Reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Jaro, M. A. (1989). “Advances in record linkage methodology as applied to the 1985 census of Tampa Florida”. Journal of the American Statistical Association 84 (406): 414–20

Jaro, M. A. (1995). “Probabilistic linkage of large public health data file”. Statistics in Medicine 14 (5–7): 491–8.

Computes the Jaro distance between two sequences, usually strings.

See also the Jaro-Winkler distance.

import jaro from 'talisman/metrics/distance/jaro';
// Alternatively
import {
  similarity,
  distance
} from 'talisman/metrics/distance/jaro';

similarity === jaro

jaro('Duane', 'Dwayne');
>>> 0.82

distance(a, b) === 1 - similarity(a, b)

jaro-winkler

Computes the Jaro-Winkler distance between two sequences, usually strings.

Reference: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

Winkler, W. E. (1990). “String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage”. Proceedings of the Section on Survey Research Methods (American Statistical Association): 354–359.

import jaroWinkler from 'talisman/metrics/distance/jaro-winkler';
// Alternatively
import {
  similarity,
  distance
} from 'talisman/metrics/distance/jaro-winkler';

similarity === jaroWinkler

jaroWinkler('Duane', 'Dwayne');
>>> 0.84

distance(a, b) === 1 - similarity(a, b)

Using custom parameters

import {custom} from 'talisman/metrics/distance/jaro-winkler';

const parameters = {
  boostThreshold: 0.7,
  scalingFactor: 0.1
};

custom(parameters, 'Duane', 'Dwayne');
>>> 0.84

Options are the following:

levenshtein

Reference: https://en.wikipedia.org/wiki/Levenshtein_distance

Levenshtein, Vladimir I. (February 1966). “Binary codes capable of correcting deletions, insertions, and reversals”. Soviet Physics Doklady 10 (8): 707–710.

The Levenshtein distance is probably the most known metric dealing with edit distance.

This function therefore computes the absolute edit distance between two sequences, usually strings.

import levenshtein from 'talisman/metrics/distance/levenshtein';

levenshtein('book', 'back');
>>> 2

manhattan

Reference: https://en.wikipedia.org/wiki/Taxicab_geometry

Computes the Manhattan distance between two points in a N-dimensions space.

This distance if also often called the “city block” distance because it won’t draw a straight line between both points but rather follow the other edges of the triangle.

import manhattan from 'talisman/metrics/distance/manhattan';

manhattan([1, 3], [4, 5]);
>>> 5

minkowski

Reference: https://en.wikipedia.org/wiki/Minkowski_distance

The Minkowski distance is a generalization of both the euclidean & the Manhattan distance.

import minkowski from 'talisman/metrics/distance/minkowski';

minkowski(1, [1, 3], [4, 5]);
>>> 5

const manhattan = minkowski.bind(null, 1);
const euclidean = minkowski.bind(null, 2);

Arguments

mra

Reference: https://en.wikipedia.org/wiki/Match_rating_approach

Moore, G B.; Kuhns, J L.; Treffzs, J L.; Montgomery, C A. (Feb 1, 1977). Accessing Individual Records from Personal Data Files Using Nonunique Identifiers. US National Institute of Standards and Technology. p. 17. NIST SP - 500-2.

Assesses the similarity between two names by using the Match Rating Approach comparison.

Note that if the difference between the strings’ codex’ lengths is superior to 3, the comparison won’t be done as the algorithm will deem the strings too different.

import mra from 'talisman/metrics/distance/mra';

mra('Catherine', 'Kathryn');
>>> {
  minimum: 3,
  similarity: 4,
  codex: ['CTHRN', 'KTHRYN'],
  matching: true
}

Output

overlap

Reference: https://en.wikipedia.org/wiki/Overlap_coefficient

Computes the overlap coefficient between two sequences.

import overlap from 'talisman/metrics/distance/overlap';

overlap('abc', 'abcde');
>>> 1

sorensen

The sorensen module is just an alias of the dice one.

tversky

Reference: https://en.wikipedia.org/wiki/Tversky_index

Tversky, Amos (1977). “Features of Similarity”. Psychological Reviews 84 (4): 327–352.

Computes the Tversky index between two strings.

The Tversky index can be seen as a generalization of both the Jaccard index and the Dice coefficient.

import tversky from 'talisman/metrics/distance/tversky';

const parameters = {
  symmetric: false,
  alpha: 1,
  beta: 1
};

// Tversky with alpha = beta = 1 is the Jaccard index
const jaccard = tversky.bind(null, parameters);

jaccard('context', 'contact');
>>> ~0.57

Parameters