Keyers

The keyers module gathers several methods that produce string fingerprints suitable for fuzzy matching.

Summary

fingerprint
ngram-fingerprint
name-sig
omission
skeleton

Use case

Let’s say we stumbled upon these three lines in a table:

“University of north Carolina”
“University of of North Carolina.”
“Carolina, North university of”

One would easily agree that they are in fact duplicates, and the goal of this module’s functions is precisely to match them.

These methods produce normalized “fingerprints” for the given strings, so that users can match lines that look the same to a human but are not strictly equal for a computer.

For instance, the basic fingerprint method would produce the following key for all three examples above:

carolina north of university

This key means little to a human, of course, but it enables a computer to match those three different lines.
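To make the matching step concrete, here is a minimal sketch of how such keys can be used to group duplicate lines. The names `groupByKey` and `toyKeyer` are hypothetical and not part of the module; `toyKeyer` is only a crude stand-in so that the example runs on its own, and any keyer from this module could be passed instead.

```javascript
// Groups strings sharing the same key.
// `keyer` is any function mapping a string to its fingerprint.
function groupByKey(strings, keyer) {
  const groups = new Map();

  for (const string of strings) {
    const key = keyer(string);

    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(string);
  }

  return groups;
}

// Hypothetical stand-in keyer, far cruder than the module's functions:
// lowercase, keep ASCII letters and whitespace, then sort the distinct words.
const toyKeyer = string => {
  const words = string
    .toLowerCase()
    .replace(/[^a-z\s]/g, '')
    .split(/\s+/)
    .filter(Boolean);

  return Array.from(new Set(words)).sort().join(' ');
};

const rows = [
  'University of north Carolina',
  'University of of North Carolina.',
  'Carolina, North university of',
  'Duke University'
];

// The three variants end up under one key, the unrelated row under another.
const groups = groupByKey(rows, toyKeyer);
```

Any group holding more than one line is then a candidate set of duplicates to review or merge.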

N.B. For different keying mechanisms involving phonetic representation of the given strings, be sure to check this other module.

fingerprint

The fingerprint method applies the following transformations to the given string:

Trimming
Lowercasing
Dropping punctuation & control characters
Splitting the string into word tokens
Dropping duplicate words
Sorting the tokens alphabetically
Rejoining the tokens, separated by a single space
Normalizing accents

import fingerprint from 'talisman/keyers/fingerprint';

fingerprint('University of north Carolina');
>>> 'carolina north of university'
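The listed steps can be sketched in plain JavaScript as follows. This is an illustration under stated assumptions, not the library's source: `simpleFingerprint` is a hypothetical name, "punctuation & control characters" is approximated as anything that is neither a letter, a digit, nor whitespace, and accents are folded through Unicode NFD decomposition.

```javascript
// Simplified stand-in for the fingerprint keyer (not the library's code).
function simpleFingerprint(string) {
  const tokens = string
    .trim()                            // Trimming
    .toLowerCase()                     // Lowercasing
    .replace(/[^\p{L}\p{N}\s]/gu, '')  // Keeping only letters, digits & whitespace
    .split(/\s+/)                      // Splitting into word tokens
    .filter(Boolean);                  // Ignoring empty tokens

  return Array.from(new Set(tokens))   // Dropping duplicate words
    .sort()                            // Sorting the tokens
    .join(' ')                         // Rejoining with a single space
    .normalize('NFD')                  // Normalizing accents: decompose...
    .replace(/[\u0300-\u036f]/g, '');  // ...then drop the combining marks
}

const examples = [
  'University of north Carolina',
  'University of of North Carolina.',
  'Carolina, North university of'
];

// All three variants collapse to 'carolina north of university'.
const keys = examples.map(simpleFingerprint);
```

The sketch follows the order of the list above, so accents are folded after deduplication and sorting.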

ngram-fingerprint

The ngram-fingerprint method is quite similar to the fingerprint one, except that it works on the n-grams of the given string rather than on its words.

import ngramFingerprint from 'talisman/keyers/ngram-fingerprint';

ngramFingerprint(2, 'University of north Carolina');
>>> 'arcaerfnhcinitivlinaninoofolorrorsrtsithtyunveyo'

Arguments

n number: size of the grams.
string string: target string.
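As a rough sketch of the idea, the following plain JavaScript function reproduces the output shown above for bigrams. It assumes that whitespace and punctuation are removed before the n-grams are taken, which is consistent with the documented result but is not taken from the library's source; `simpleNgramFingerprint` is a hypothetical name, and accent normalization is left out for brevity.

```javascript
// Simplified stand-in for the ngram-fingerprint keyer (not the library's code).
function simpleNgramFingerprint(n, string) {
  // Lowercase and keep only letters and digits (this also removes whitespace).
  const cleaned = string.toLowerCase().replace(/[^\p{L}\p{N}]/gu, '');

  // Collect every distinct run of n consecutive characters.
  const grams = new Set();

  for (let i = 0; i + n <= cleaned.length; i++) {
    grams.add(cleaned.slice(i, i + n));
  }

  // Sort the grams and concatenate them without any separator.
  return Array.from(grams).sort().join('');
}

const bigramKey = simpleNgramFingerprint(2, 'University of north Carolina');
// 'arcaerfnhcinitivlinaninoofolorrorsrtsithtyunveyo'
```

Because word boundaries disappear before the grams are computed, this kind of key also tolerates missing or extra spaces, at the price of being looser than the word-based fingerprint.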

Bigrams

Trigrams

name-sig

Reference:

Shahidul Islam Khan and Abu Sayed Md. Latiful Hoque. “Similarity Analysis of Patients’ Data: Bangladesh Perspective.” December 17, 2016.

The name significance (“NameSig”) similarity key is a keyer that attempts to simplify names so that their variations match.

import namesig from 'talisman/keyers/name-sig';

namesig('Mr. Abdul Haque');
>>> 'abdlhk'

omission

Reference:
http://dl.acm.org/citation.cfm?id=358048

Pollock, Joseph J. and Antonio Zamora. 1984. “Automatic Spelling Correction in Scientific and Scholarly Text.” Communications of the ACM, 27(4). 358–368.

This method computes the omission key devised by Joseph Pollock & Antonio Zamora.

import omission from 'talisman/keyers/omission';

omission('University of north Carolina');
>>> 'VYFHCLNTSRUIEOA'
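For readers unfamiliar with the omission key, here is a minimal sketch of the construction as it is usually described: the distinct consonants of the string, arranged in a fixed order reflecting how often each letter is omitted in misspellings, followed by the distinct vowels in order of first occurrence. The consonant order below is an assumption recalled from the paper rather than taken from the library, and `simpleOmission` is a hypothetical name; the sketch does reproduce the output shown above.

```javascript
// Simplified stand-in for the omission keyer (not the library's code).
function simpleOmission(string) {
  // Assumed consonant order, from least to most frequently omitted.
  const omissionOrder = 'JKQXZVWYBFMGPDHCLNTSR';
  const vowelSet = 'AEIOU';

  // Uppercase and keep only ASCII letters.
  const letters = string.toUpperCase().replace(/[^A-Z]/g, '');

  // A Set keeps distinct letters in order of first occurrence.
  const seen = new Set(letters);

  // Consonants present in the string, in the fixed omission order.
  const consonants = Array.from(omissionOrder)
    .filter(letter => seen.has(letter))
    .join('');

  // Distinct vowels, in the order they first appear in the string.
  const vowels = Array.from(seen)
    .filter(letter => vowelSet.includes(letter))
    .join('');

  return consonants + vowels;
}

const omissionKey = simpleOmission('University of north Carolina');
// 'VYFHCLNTSRUIEOA'
```

Note that Y is treated as a consonant here, which is what the documented output implies.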

skeleton

Reference:
http://dl.acm.org/citation.cfm?id=358048

Pollock, Joseph J. and Antonio Zamora. 1984. “Automatic Spelling Correction in Scientific and Scholarly Text.” Communications of the ACM, 27(4). 358–368.

This method computes the skeleton key devised by Joseph Pollock & Antonio Zamora.

import skeleton from 'talisman/keyers/skeleton';

skeleton('University of north Carolina');
>>> 'UNVRSTYFHCLIEOA'
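The skeleton key is usually described as the first letter of the string, followed by its remaining distinct consonants in order of occurrence, then its remaining distinct vowels in order of occurrence. The sketch below follows that description and reproduces the output shown above; `simpleSkeleton` is a hypothetical name, and restricting the input to ASCII letters is a simplifying assumption rather than the library's behavior.

```javascript
// Simplified stand-in for the skeleton keyer (not the library's code).
function simpleSkeleton(string) {
  const vowelSet = 'AEIOU';

  // Uppercase and keep only ASCII letters.
  const letters = string.toUpperCase().replace(/[^A-Z]/g, '');

  if (!letters) return '';

  const first = letters[0];

  // Distinct letters in order of first occurrence, minus the leading one.
  const rest = Array.from(new Set(letters)).filter(letter => letter !== first);

  const consonants = rest.filter(letter => !vowelSet.includes(letter)).join('');
  const vowels = rest.filter(letter => vowelSet.includes(letter)).join('');

  return first + consonants + vowels;
}

const skeletonKey = simpleSkeleton('University of north Carolina');
// 'UNVRSTYFHCLIEOA'
```

Unlike the fingerprint method, this key depends on the order of the letters, so reordered words such as “Carolina, North university of” yield a different key.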