Phonetics

Reference: https://en.wikipedia.org/wiki/Phonetic_algorithm

The phonetics module gathers various algorithms that produce an approximate phonetic representation of the given strings.

This phonetic representation is particularly useful when performing fuzzy matching.

The algorithms presented on this page generally target the English language (even if some of them extend, to some degree, to a variety of European languages).

That said, the library also offers phonetic algorithms targeting other languages, such as French.

Summary

Modules under the talisman/phonetics namespace:

alpha-sis
caverphone
daitch-mokotoff
double-metaphone
eudex
fuzzy-soundex
lein
metaphone
mra
nysiis
onca
phonex
roger-root
sound-d
soundex
statcan

Phonetic algorithms for other languages

Use case

Let’s say we want to compare two fairly similar names like Catherine & Kathryn.

A human would readily agree that those two names sound the same.

But for a computer, establishing this simple fact is difficult since:

'Catherine' !== 'Kathryn'

Phonetic algorithms address this problem: they produce a phonetic representation of each string, and two strings that sound roughly the same should end up with the same representation, which can then be used to match them.

// Using the metaphone algorithm, for instance
import metaphone from 'talisman/phonetics/metaphone';

const catherineCode = metaphone('Catherine'),
      kathrynCode = metaphone('Kathryn');

catherineCode
>>> 'K0RN'

kathrynCode
>>> 'K0RN'

catherineCode === kathrynCode
>>> true
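Once every name has a phonetic code, fuzzy matching boils down to grouping names under that code. The sketch below illustrates the idea with a pluggable `encode` function. Both `groupByPhoneticKey` and `toyKey` are hypothetical names introduced for illustration, and `toyKey` is a deliberately crude stand-in, not one of the library's algorithms; in practice you would pass a real encoder such as `metaphone`.

```javascript
// Toy stand-in encoder (hypothetical, NOT one of the library's algorithms):
// maps a leading C to K, then drops vowels, H and Y after the first letter.
function toyKey(name) {
  const upper = name.toUpperCase().replace(/[^A-Z]/g, '');
  const head = upper.charAt(0) === 'C' ? 'K' : upper.charAt(0);
  return head + upper.slice(1).replace(/[AEIOUHY]/g, '');
}

// Groups names under the key produced by the given `encode` function.
function groupByPhoneticKey(names, encode) {
  const groups = new Map();

  for (const name of names) {
    const key = encode(name);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(name);
  }

  return groups;
}

const groups = groupByPhoneticKey(['Catherine', 'Kathryn', 'Robert'], toyKey);
// 'Catherine' and 'Kathryn' share the key 'KTRN'; 'Robert' sits alone under 'RBRT'.
```

Passing the encoder as an argument keeps the grouping logic independent of the chosen algorithm, so swapping one phonetic algorithm for another only changes the second argument.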

alpha-sis

Reference: https://archive.org/stream/accessingindivid00moor#page/15/mode/1up

“Accessing individual records from personal data files using non-unique identifiers” / Gwendolyn B. Moore, et al.; prepared for the Institute for Computer Sciences and Technology, National Bureau of Standards, Washington, D.C (1977)

This algorithm, from IBM’s Alpha Search Inquiry System (Alpha SIS), produces 14-character Soundex-like codes.

Note that it returns a list rather than a single code because some character sequences, such as “DZ”, can be encoded in two or three different ways (every resulting combination is returned).

import alphaSis from 'talisman/phonetics/alpha-sis';

alphaSis('Rogers');
>>> ['04740000000000']

caverphone

Reference: https://en.wikipedia.org/wiki/Caverphone

Original algorithm: http://caversham.otago.ac.nz/files/working/ctp060902.pdf

Revisited algorithm: http://caversham.otago.ac.nz/files/working/ctp150804.pdf

The Caversham project: http://caversham.otago.ac.nz/

The caverphone algorithm, written by David Hood for the Caversham project, aims at encoding names, and more specifically names from New Zealand.

However, this shouldn’t stop you from trying it on any dataset.

The library packs both the original & the revisited version of the algorithm.

import caverphone from 'talisman/phonetics/caverphone';
// Alternatively
import {original, revisited} from 'talisman/phonetics/caverphone';

caverphone === original
>>> true

caverphone('Henrichsen');
>>> 'ANRKSN'

revisited('Henrichsen')
>>> 'ANRKSN1111'

Original version

Revisited version

daitch-mokotoff

Reference: https://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex

The Daitch-Mokotoff Soundex is a refinement of the American Soundex designed to better match Slavic & Yiddish names.

Note that this algorithm sometimes gives several ways to encode a sound.

Thus, the function always returns an array listing every possible encoding (at least one, obviously).

import daitchMokotoff from 'talisman/phonetics/daitch-mokotoff';

daitchMokotoff('Peters');
>>> ['739400', '734000']
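To see why several codes can come back, here is a minimal sketch of how per-sound alternatives multiply into full codes. The `expandAlternatives` helper and the hand-written segment list are assumptions made for illustration, not the library's internal representation. The segments follow the usual Daitch-Mokotoff table, where P maps to 7, T to 3 and the sequence RS to either 94 or 4.

```javascript
// Expands a list of per-segment alternatives into every possible full code,
// then pads with zeros (or truncates) to a fixed length and drops duplicates.
function expandAlternatives(segments, length) {
  let codes = [''];

  for (const alternatives of segments) {
    const next = [];
    for (const prefix of codes) {
      for (const alternative of alternatives) next.push(prefix + alternative);
    }
    codes = next;
  }

  const padded = codes.map(code => code.padEnd(length, '0').slice(0, length));
  return [...new Set(padded)];
}

// Hand-written segments for 'Peters': P -> 7, T -> 3, RS -> 94 or 4.
const petersCodes = expandAlternatives([['7'], ['3'], ['94', '4']], 6);
// petersCodes is ['739400', '734000']
```

A name containing several ambiguous sounds yields the product of the alternatives, which is why the returned array can grow quickly.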

double-metaphone

Reference: https://en.wikipedia.org/wiki/Metaphone

The double metaphone algorithm, created in 2000 by Lawrence Philips, is an improvement over the original metaphone algorithm.

It is called “double” because the algorithm will try to produce two possibilities for the phonetic encoding of the given string.

Note, however, that unlike the original metaphone, the length of the produced code will never exceed 4 characters.

import doubleMetaphone from 'talisman/phonetics/double-metaphone';

doubleMetaphone('Smith');
>>> ['SM0', 'XMT']
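Since the function returns two codes, a common way to compare two names is to check whether they share at least one code. The sketch below assumes this matching rule; `sharesCode` is a hypothetical helper, and the pair used for 'Schmidt' is the commonly cited double metaphone output, not a value verified against this library.

```javascript
// Returns true when the two lists of codes have at least one code in common.
function sharesCode(codesA, codesB) {
  const seen = new Set(codesA);
  return codesB.some(code => seen.has(code));
}

const smith = ['SM0', 'XMT'];   // output shown above for 'Smith'
const schmidt = ['XMT', 'SMT']; // assumed: commonly cited pair for 'Schmidt'

const sameSound = sharesCode(smith, schmidt);
// sameSound is true because both lists contain 'XMT'
```

The same helper works for any function of this module that returns an array of possible encodings.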

eudex

Reference: https://github.com/ticki/eudex

Author: ticki

Eudex is a phonetic hashing algorithm that produces a 64-bit integer holding information about the given word.

The produced hash can be used afterwards by specific distance metrics to determine whether two given words seem phonetically similar or not.

Important: this function returns a 64-bit integer wrapped in a Long object from the long node library, since JavaScript numbers cannot represent such integers exactly.

import eudex from 'talisman/phonetics/eudex';

eudex('Guillaume');
>>> <Long>288230378836066816
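To give an intuition of how such hashes can be compared, the sketch below counts the bits that differ between two 64-bit values held as BigInt. This plain Hamming distance is a simplification chosen for illustration, not the metric Eudex itself defines, and `hammingDistance64` is a hypothetical helper. Converting a Long into a BigInt through its decimal string, as hinted in the comment, is also an assumption.

```javascript
// Counts the bits that differ between two 64-bit hashes held as BigInt.
// A Long could presumably be converted with BigInt(long.toString()).
function hammingDistance64(a, b) {
  // Restrict the XOR to 64 unsigned bits so negative inputs are handled.
  let diff = BigInt.asUintN(64, a ^ b);
  let count = 0;

  while (diff > 0n) {
    count += Number(diff & 1n);
    diff >>= 1n;
  }

  return count;
}

// Small hypothetical hash values, chosen so the result is easy to check.
const distance = hammingDistance64(0b1011n, 0b0010n);
// distance is 2: the values differ on their lowest and highest set bits
```

A low distance suggests similar-sounding words, but how bits should be weighted depends on the hashing scheme, which this sketch ignores.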

fuzzy-soundex

Reference: http://wayback.archive.org/web/20100629121128/http://www.ir.iit.edu/publications/downloads/IEEESoundexV5.pdf

Holmes, David and M. Catherine McCabe. “Improving Precision and Recall for Soundex Retrieval.”

This algorithm is designed as an improvement over the classical Soundex.

This improvement is achieved by performing some substitutions, in the style of the NYSIIS algorithm, and by fuzzing some name beginnings & endings.

import fuzzySoundex from 'talisman/phonetics/fuzzy-soundex';

fuzzySoundex('Rogers');
>>> 'R769'

lein

Reference: http://naldc.nal.usda.gov/download/27833/PDF

The Lein name coding procedure is a Soundex-like algorithm that will produce a 4-character code for the given name.

import lein from 'talisman/phonetics/lein';

lein('Michael');
>>> 'M530'

metaphone

Reference: https://en.wikipedia.org/wiki/Metaphone

The metaphone algorithm, created in 1990 by Lawrence Philips, is a phonetic algorithm working on dictionary words (rather than only processing names, as phonetic algorithms usually do).

Note also that the algorithm will not truncate its output to a fixed number of letters.

Today, however, we often prefer to use the “improved” version of the algorithm called the double metaphone.

import metaphone from 'talisman/phonetics/metaphone';

metaphone('Michael');
>>> 'MXL'

mra

Reference: https://en.wikipedia.org/wiki/Match_rating_approach

Moore, G B.; Kuhns, J L.; Treffzs, J L.; Montgomery, C A. (Feb 1, 1977). Accessing Individual Records from Personal Data Files Using Nonunique Identifiers. US National Institute of Standards and Technology. p. 17. NIST SP - 500-2.

This algorithm computes the Match Rating Approach codex, which that method then uses to establish the similarity between two names.

This function is exported by the library for reference, but you should probably use talisman/metrics/mra instead.

import mra from 'talisman/phonetics/mra';

mra('Kathryn');
>>> 'KTHRYN'
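For reference, the codex rules as usually described are: drop every vowel except a leading one, drop the second letter of doubled consonants, then keep only the first three and last three letters when more than six remain. The sketch below is a simplified reading of those rules, not the library's implementation, and `mraCodex` is a hypothetical name.

```javascript
// Simplified sketch of the Match Rating Approach codex rules.
function mraCodex(name) {
  const letters = name.toUpperCase().replace(/[^A-Z]/g, '');
  let codex = '';

  for (let i = 0; i < letters.length; i++) {
    const letter = letters[i];

    // Rule 1: drop vowels, except when the vowel starts the name.
    if (i > 0 && 'AEIOU'.includes(letter)) continue;

    // Rule 2: drop the second letter of a doubled consonant.
    if (codex[codex.length - 1] === letter) continue;

    codex += letter;
  }

  // Rule 3: keep the first three and last three letters of long codices.
  return codex.length > 6 ? codex.slice(0, 3) + codex.slice(-3) : codex;
}

const kathrynCodex = mraCodex('Kathryn');
// kathrynCodex is 'KTHRYN', since Y is not treated as a vowel here
```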

nysiis

Reference: https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System

The New York State Identification and Intelligence System is basically a more modern alternative to the original Soundex.

Like its counterpart, it was created to match names and is not really suited for dictionary words.

The library packs both the original and the refined version of the algorithm.

import nysiis from 'talisman/phonetics/nysiis';
// Alternatively
import {original, refined} from 'talisman/phonetics/nysiis';

nysiis === original
>>> true

nysiis('Philbert');
>>> 'FFALBAD'

refined('Philbert');
>>> 'FALBAD'

Original version

Refined version

onca

The Oxford Name Compression Algorithm (ONCA).

It is basically a glorified combination of the NYSIIS & Soundex algorithms.

import onca from 'talisman/phonetics/onca';

onca('Dionne');
>>> 'D500'

phonex

Reference: http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf

Lait, A. J. and B. Randell. “An Assessment of Name Matching Algorithms”.

This algorithm is an improved version of the Soundex algorithm.

Its main change is to better fuzz some very common cases missed by the Soundex algorithm in order to match more orthographic variations.

import phonex from 'talisman/phonetics/phonex';

phonex('Rogers');
>>> 'R26'

roger-root

Reference: http://naldc.nal.usda.gov/download/27833/PDF

The Roger Root name coding procedure is a Soundex-like algorithm that will produce a 5-character code (completely numerical) for the given name.

Its particularity is that it encodes the beginning of a name differently from the rest.

import rogerRoot from 'talisman/phonetics/roger-root';

rogerRoot('Michael');
>>> '03650'

sound-d

Hybrid Matching Algorithm for Personal Names. Cihan Varol, Coskun Bayrak.

The SoundD algorithm is a slight variant of the Soundex algorithm.

import soundD from 'talisman/phonetics/sound-d';

soundD('Martha');
>>> '5630'

soundex

Reference: https://en.wikipedia.org/wiki/Soundex

The Soundex algorithm, created by Robert Russell and Margaret Odell, is often considered to be the first phonetic algorithm in history.

Note that it aims at matching Anglo-Saxon names and won’t work well on dictionary words.

You are also free to use the refined version of this algorithm, as found in the Apache projects.

import soundex from 'talisman/phonetics/soundex';
// Alternatively
import {refined} from 'talisman/phonetics/soundex';

soundex('Michael');
>>> 'M240'

refined('Michael');
>>> 'M80307'
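As an illustration of how the original code is built, here is a minimal sketch of the classic American Soundex rules: keep the first letter, map the following consonants to digits, skip repeated digits, ignore H and W, and pad the result to four characters. It is written from the commonly documented rules, not taken from the library's implementation, and `simpleSoundex` is a hypothetical name.

```javascript
// Digit assigned to each consonant by the classic Soundex rules.
const SOUNDEX_CODES = {
  B: '1', F: '1', P: '1', V: '1',
  C: '2', G: '2', J: '2', K: '2', Q: '2', S: '2', X: '2', Z: '2',
  D: '3', T: '3',
  L: '4',
  M: '5', N: '5',
  R: '6'
};

function simpleSoundex(name) {
  const letters = name.toUpperCase().replace(/[^A-Z]/g, '');
  if (!letters) return '';

  let code = letters[0];
  let previous = SOUNDEX_CODES[letters[0]] || '';

  for (let i = 1; i < letters.length && code.length < 4; i++) {
    const letter = letters[i];

    // H and W are transparent: they do not separate identical digits.
    if (letter === 'H' || letter === 'W') continue;

    const digit = SOUNDEX_CODES[letter] || '';

    // Only emit a digit when it differs from the previous one.
    if (digit && digit !== previous) code += digit;

    // Vowels (and Y) reset `previous`, so a repeated digit after them counts.
    previous = digit;
  }

  return code.padEnd(4, '0');
}

const michaelCode = simpleSoundex('Michael');
// michaelCode is 'M240', the same code as in the example above
```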

Original version

Refined version

statcan

Reference: http://naldc.nal.usda.gov/download/27833/PDF

The census modified Statistics Canada name coding procedure is a Soundex-like algorithm that will produce a 4-character code (alphabetical) for the given name.

import statcan from 'talisman/phonetics/statcan';

statcan('Michael');
>>> 'MCHL'