Phonetics
Reference: https://en.wikipedia.org/wiki/Phonetic_algorithm
The phonetics module gathers various algorithms whose goal is to produce an approximate phonetic representation of the given strings.
This phonetic representation is then useful when performing fuzzy matching.
The algorithms presented on this page generally target the English language, even though some of them extend reasonably well to other European languages.
That said, the library also offers phonetic algorithms targeting other languages, such as French.
Summary
Modules under the talisman/phonetics namespace:
- alpha-sis
- caverphone
- daitch-mokotoff
- double-metaphone
- eudex
- fuzzy-soundex
- lein
- metaphone
- mra
- nysiis
- onca
- phonex
- roger-root
- sound-d
- soundex
- statcan
Phonetic algorithms for other languages
Use case
Let’s say we want to compare two fairly similar names such as Catherine & Kathryn.
A human would readily agree that those two names sound the same.
But for a computer, establishing this simple fact is not straightforward, since:
'Catherine' !== 'Kathryn'
Phonetic algorithms address this problem: they try to produce a phonetic representation of the given strings, so that strings sounding roughly the same end up with the same code and can be matched.
// Using the metaphone algorithm, for instance
import metaphone from 'talisman/phonetics/metaphone';
const catherineCode = metaphone('Catherine'),
kathrynCode = metaphone('Kathryn');
catherineCode
>>> 'K0RN'
kathrynCode
>>> 'K0RN'
catherineCode === kathrynCode
>>> true
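To illustrate how such codes are typically used for fuzzy matching, here is a minimal sketch that groups names sharing the same code. The helper `groupByCode` and the encoder `toyEncode` are hypothetical names introduced for this example. `toyEncode` is a deliberately crude stand-in, not one of the library's algorithms; in practice you would pass a real encoder such as metaphone instead.

```javascript
// Toy stand-in encoder (an assumption for illustration only, NOT a library algorithm):
// uppercases, maps C to K, drops vowels plus H, W and Y, then collapses repeated letters.
function toyEncode(name) {
  return name
    .toUpperCase()
    .replace(/[^A-Z]/g, '')
    .replace(/C/g, 'K')
    .replace(/[AEIOUHWY]/g, '')
    .replace(/(.)\1+/g, '$1');
}

// Groups names by the code returned by the given encoder.
function groupByCode(names, encode) {
  const groups = new Map();

  for (const name of names) {
    const code = encode(name);

    if (!groups.has(code)) groups.set(code, []);
    groups.get(code).push(name);
  }

  return groups;
}

const groups = groupByCode(['Catherine', 'Kathryn', 'Michael'], toyEncode);

// 'Catherine' and 'Kathryn' both encode to 'KTRN' and land in the same group,
// while 'Michael' encodes to 'MKL' and stays on its own.
console.log(groups);
```

Indexing by code like this avoids comparing every pair of names: only names falling under the same key need to be considered as candidate matches.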
alpha-sis
Reference: https://archive.org/stream/accessingindivid00moor#page/15/mode/1up
“Accessing individual records from personal data files using non-unique identifiers” / Gwendolyn B. Moore, et al.; prepared for the Institute for Computer Sciences and Technology, National Bureau of Standards, Washington, D.C (1977)
This algorithm, from IBM’s Alpha Search Inquiry System (Alpha SIS), produces Soundex-like codes that are 14 characters long.
Note that it returns a list rather than a single code because some character sequences, such as “DZ”, can be encoded in two or three ways, and every resulting combination is returned.
import alphaSis from 'talisman/phonetics/alpha-sis';
alphaSis('Rogers');
>>> ['04740000000000']
caverphone
Reference: https://en.wikipedia.org/wiki/Caverphone
Original algorithm: http://caversham.otago.ac.nz/files/working/ctp060902.pdf
Revisited algorithm: http://caversham.otago.ac.nz/files/working/ctp150804.pdf
The Caversham project: http://caversham.otago.ac.nz/
The caverphone algorithm, written by David Hood for the Caversham project, aims at encoding names, specifically names from New Zealand.
However, this shouldn’t stop you from trying it on any dataset.
The library packs both the original & the revisited version of the algorithm.
import caverphone from 'talisman/phonetics/caverphone';
// Alternatively
import {original, revisited} from 'talisman/phonetics/caverphone';
caverphone === original
>>> true
caverphone('Henrichsen');
>>> 'ANRKSN'
revisited('Henrichsen')
>>> 'ANRKSN1111'
daitch-mokotoff
Reference: https://en.wikipedia.org/wiki/Daitch%E2%80%93Mokotoff_Soundex
The Daitch-Mokotoff Soundex is a refinement of the American Soundex designed to better match Slavic & Yiddish names.
Note that this algorithm sometimes gives several possible encodings for a sound.
Thus, the function always returns an array listing all the possible encodings (at least one).
import daitchMokotoff from 'talisman/phonetics/daitch-mokotoff';
daitchMokotoff('Peters');
>>> ['739400', '734000']
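Since the function returns several candidate codes, a reasonable way to compare two names is to check whether their code lists share at least one entry. The sketch below assumes this matching rule; the helper name `shareCode` and the second code list are hypothetical and only serve the illustration.

```javascript
// Returns true if the two lists of candidate codes have at least one code in common.
function shareCode(codesA, codesB) {
  const seen = new Set(codesA);

  return codesB.some(code => seen.has(code));
}

// Codes taken from the 'Peters' example above.
const petersCodes = ['739400', '734000'];

// Hypothetical code lists standing in for the encodings of other names.
const overlappingCodes = ['734000'];
const unrelatedCodes = ['596740'];

shareCode(petersCodes, overlappingCodes); // true
shareCode(petersCodes, unrelatedCodes);   // false
```

The same approach applies to any function on this page that returns a list of codes rather than a single one.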
double-metaphone
Reference: https://en.wikipedia.org/wiki/Metaphone
The double metaphone algorithm, created in 2000 by Lawrence Philips, is an improvement over the original metaphone algorithm.
It is called “double” because the algorithm will try to produce two possibilities for the phonetic encoding of the given string.
Note, however, that unlike the original metaphone, the produced codes never exceed 4 characters.
import doubleMetaphone from 'talisman/phonetics/double-metaphone';
doubleMetaphone('Smith');
>>> ['SM0', 'XMT']
eudex
Reference: https://github.com/ticki/eudex
Author: ticki
Eudex is a phonetic hashing algorithm that produces a 64-bit integer holding information about the given word.
The produced hash can then be used by specific distance metrics to determine whether two given words seem phonetically similar.
Important: this function returns a 64-bit integer wrapped in a Long object from the long Node.js library, since JavaScript numbers cannot represent such integers exactly.
import eudex from 'talisman/phonetics/eudex';
eudex('Guillaume');
>>> <Long>288230378836066816
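As an illustration of what a distance metric over such hashes can look like, the sketch below computes a plain bitwise Hamming distance between two 64-bit values held as BigInt. This is an assumption made for the example: the metrics meant to be used with Eudex may weigh bits differently, and the function name `hammingDistance64` is hypothetical. A Long value can be converted to a BigInt through its string form, for instance `BigInt(hash.toString())`.

```javascript
// Counts the bits that differ between two 64-bit integers given as BigInt values.
function hammingDistance64(a, b) {
  // XOR leaves a 1 wherever the two hashes disagree; clamp the result to 64 bits.
  let diff = BigInt.asUintN(64, a ^ b);
  let count = 0;

  while (diff > 0n) {
    count += Number(diff & 1n);
    diff >>= 1n;
  }

  return count;
}

hammingDistance64(0b1010n, 0b0110n); // 2
hammingDistance64(5n, 5n);           // 0
```

A smaller distance means fewer differing bits, which under this assumption suggests more similar hashes.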
fuzzy-soundex
Holmes, David and M. Catherine McCabe. “Improving Precision and Recall for Soundex Retrieval.”
This algorithm is designed as an improvement over the classical Soundex.
This improvement is achieved by performing substitutions in the style of the NYSIIS algorithm and by fuzzing some name beginnings & endings.
import fuzzySoundex from 'talisman/phonetics/fuzzy-soundex';
fuzzySoundex('Rogers');
>>> 'R769'
lein
Reference: http://naldc.nal.usda.gov/download/27833/PDF
The Lein name coding procedure is a Soundex-like algorithm that will produce a 4-character code for the given name.
import lein from 'talisman/phonetics/lein';
lein('Michael');
>>> 'M530'
metaphone
Reference: https://en.wikipedia.org/wiki/Metaphone
The metaphone algorithm, created in 1990 by Lawrence Philips, is a phonetic algorithm working on dictionary words (rather than only processing names, as phonetic algorithms usually do).
Note also that the algorithm will not truncate the given word to output a code limited to a specific number of letters.
Today, however, we often prefer to use the “improved” version of the algorithm called the double metaphone.
import metaphone from 'talisman/phonetics/metaphone';
metaphone('Michael');
>>> 'MXL'
mra
Reference: https://en.wikipedia.org/wiki/Match_rating_approach
Moore, G B.; Kuhns, J L.; Treffzs, J L.; Montgomery, C A. (Feb 1, 1977). Accessing Individual Records from Personal Data Files Using Nonunique Identifiers. US National Institute of Standards and Technology. p. 17. NIST SP - 500-2.
This algorithm computes the Match Rating Approach codex, which that method then uses to establish the similarity between two names.
This function is exported by the library for reference, but you should probably use talisman/metrics/mra instead.
import mra from 'talisman/phonetics/mra';
mra('Kathryn');
>>> 'KTHRYN'
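For reference, the codex rules described on the Wikipedia page linked above (drop vowels unless they start the name, drop the second letter of doubled consonants, then keep only the first three and last three letters of anything longer than six) can be sketched as follows. This is a minimal illustration, not the library's implementation: the name `mraCodexSketch` is hypothetical, and the order in which the rules are applied is an assumption.

```javascript
// Minimal sketch of the Match Rating Approach codex rules (not the library's code).
function mraCodexSketch(name) {
  const letters = name.toUpperCase().replace(/[^A-Z]/g, '');

  if (!letters) return '';

  // Assumption: doubled letters are collapsed before vowels are removed.
  const collapsed = letters.replace(/(.)\1+/g, '$1');

  // Keep the first letter as is, then drop every remaining vowel.
  let codex = collapsed[0] + collapsed.slice(1).replace(/[AEIOU]/g, '');

  // Codes longer than 6 letters keep only their first 3 and last 3 letters.
  if (codex.length > 6) codex = codex.slice(0, 3) + codex.slice(-3);

  return codex;
}

mraCodexSketch('Kathryn');   // 'KTHRYN'
mraCodexSketch('Catherine'); // 'CTHRN'
```

Note that the codices of Catherine and Kathryn differ, which is why the Match Rating Approach relies on a dedicated comparison step rather than on strict equality of codes.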
nysiis
Reference: https://en.wikipedia.org/wiki/New_York_State_Identification_and_Intelligence_System
The New York State Identification and Intelligence System is basically a more modern alternative to the original Soundex.
Like its counterpart, it was created to match names and is not really suited for dictionary words.
The library packs both the original and the refined version of the algorithm.
import nysiis from 'talisman/phonetics/nysiis';
// Alternatively
import {original, refined} from 'talisman/phonetics/nysiis';
nysiis === original
>>> true
nysiis('Philbert');
>>> 'FFALBAD'
refined('Philbert');
>>> 'FALBAD'
onca
The Oxford Name Compression Algorithm (ONCA).
Basically a glorified combination of the NYSIIS algorithm & the Soundex one.
import onca from 'talisman/phonetics/onca';
onca('Dionne');
>>> 'D500'
phonex
Reference: http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf
Lait, A. J. and B. Randell. “An Assessment of Name Matching Algorithms”.
This algorithm is an improved version of the Soundex algorithm.
Its main change is to better fuzz some very common cases missed by the Soundex algorithm in order to match more orthographic variations.
import phonex from 'talisman/phonetics/phonex';
phonex('Rogers');
>>> 'R26'
roger-root
Reference: http://naldc.nal.usda.gov/download/27833/PDF
The Roger Root name coding procedure is a Soundex-like algorithm that will produce a 5-character, fully numerical code for the given name.
Its particularity is that it encodes the beginning of a name differently from the rest.
import rogerRoot from 'talisman/phonetics/roger-root';
rogerRoot('Michael');
>>> '03650'
sound-d
Hybrid Matching Algorithm for Personal Names. Cihan Varol, Coskun Bayrak.
The SoundD algorithm is a slight variant of the Soundex algorithm.
import soundD from 'talisman/phonetics/sound-d';
soundD('Martha');
>>> '5630'
soundex
Reference: https://en.wikipedia.org/wiki/Soundex
The Soundex algorithm, created by Robert Russell and Margaret Odell, is often considered to be the first phonetic algorithm in history.
Note that it aims at matching Anglo-Saxon names and won’t work well on dictionary words.
You are also free to use the refined version of this algorithm, as found in the Apache projects.
import soundex from 'talisman/phonetics/soundex';
// Alternatively
import {refined} from 'talisman/phonetics/soundex';
soundex('Michael');
>>> 'M240'
refined('Michael');
>>> 'M80307'
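To give an idea of how this family of algorithms works, here is a minimal sketch of the classic American Soundex rules: keep the first letter, map the following consonants to digits, skip letters that repeat the previous digit, and pad or cut the result to 4 characters. It is not the library's implementation, and the names `SOUNDEX_CODES` and `soundexSketch` are hypothetical.

```javascript
// Digit assigned to each consonant; vowels, H, W and Y carry no digit.
const SOUNDEX_CODES = {
  B: '1', F: '1', P: '1', V: '1',
  C: '2', G: '2', J: '2', K: '2', Q: '2', S: '2', X: '2', Z: '2',
  D: '3', T: '3',
  L: '4',
  M: '5', N: '5',
  R: '6'
};

// Minimal sketch of the classic Soundex encoding (not the library's code).
function soundexSketch(name) {
  const letters = name.toUpperCase().replace(/[^A-Z]/g, '');

  if (!letters) return '';

  let code = letters[0];
  let previous = SOUNDEX_CODES[letters[0]] || '';

  for (let i = 1; i < letters.length && code.length < 4; i++) {
    const letter = letters[i];
    const digit = SOUNDEX_CODES[letter] || '';

    // Only emit a digit when it differs from the previous one.
    if (digit && digit !== previous) code += digit;

    // H and W do not separate letters sharing a digit, whereas vowels do.
    if (letter !== 'H' && letter !== 'W') previous = digit;
  }

  // Pad short codes with zeros up to 4 characters.
  return code.padEnd(4, '0');
}

soundexSketch('Michael'); // 'M240'
soundexSketch('Robert');  // 'R163'
```

Most of the Soundex-like algorithms on this page follow the same overall shape and mainly differ in their substitution tables and in the length of the output.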
statcan
Reference: http://naldc.nal.usda.gov/download/27833/PDF
The census-modified Statistics Canada name coding procedure is a Soundex-like algorithm that will produce a 4-character alphabetical code for the given name.
import statcan from 'talisman/phonetics/statcan';
statcan('Michael');
>>> 'MCHL'