Keyers
The keyers module gathers several methods that produce string fingerprints fit for fuzzy matching.
Use case
Let’s say we stumbled upon these three lines in a table:
- “University of north Carolina”
- “University of of North Carolina.”
- “Carolina, North university of”
One would easily agree that they are in fact duplicates, and matching them is exactly what this module’s functions are for.
These methods produce normalized “fingerprints” for the given strings, so that their users can match lines that look the same to a human but are not strictly equal for a computer.
For instance, the basic fingerprint method would produce the following key for all three examples above:
carolina north of university
which is meaningless to a human, of course, but lets a computer match those three different lines.
N.B. For different keying mechanisms involving phonetic representation of the given strings, be sure to check this other module.
fingerprint
The fingerprint method applies the following transformations to the given string:
- Trimming
- Lowercasing
- Dropping punctuation & control characters
- Splitting the string into word tokens
- Dropping duplicate words
- Sorting the tokens alphabetically
- Rejoining the tokens separating them with a whitespace
- Normalizing accents
import fingerprint from 'talisman/keyers/fingerprint';
fingerprint('University of north Carolina');
>>> 'carolina north of university'
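To make the steps listed above concrete, here is a minimal sketch that applies them in the same order. It is not the library's actual implementation: `sketchFingerprint` is a hypothetical name, and the regular expressions used to drop punctuation, control characters and accents are assumptions chosen for illustration.

```javascript
// Hypothetical helper illustrating the listed steps; not talisman's own code.
function sketchFingerprint(string) {
  const tokens = string
    .trim()                          // Trimming
    .toLowerCase()                   // Lowercasing
    // Dropping punctuation & control characters (assumed Unicode classes).
    // Note: tabs and newlines are control characters and are removed too.
    .replace(/[\p{P}\p{Cc}]/gu, '')
    .split(/\s+/)                    // Splitting into word tokens
    .filter(Boolean);                // Ignore empty tokens

  // Dropping duplicate words, sorting alphabetically, rejoining with a space.
  const key = Array.from(new Set(tokens)).sort().join(' ');

  // Normalizing accents: decompose, then strip the combining marks.
  return key.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(sketchFingerprint('University of north Carolina'));
// 'carolina north of university'
console.log(sketchFingerprint('University of of North Carolina.'));
// 'carolina north of university'
console.log(sketchFingerprint('Carolina, North university of'));
// 'carolina north of university'
```

The three lines from the use case collapse to the same key because word order, case, duplicates and punctuation are all discarded before the comparison.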
ngram-fingerprint
The ngram-fingerprint method is quite similar to the fingerprint one, except that it works on the character n-grams of the given string rather than on its word tokens.
import ngramFingerprint from 'talisman/keyers/ngram-fingerprint';
ngramFingerprint(2, 'University of north Carolina');
>>> 'arcaerfnhcinitivlinaninoofolorrorsrtsithtyunveyo'
Arguments
- n number: size of the grams.
- string string: target string.
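As an illustration of how such a key can be assembled, the sketch below lowercases the string, removes punctuation, control characters and whitespace, collects the character n-grams, then deduplicates, sorts and concatenates them. This is an assumption about the general technique rather than the library's code, and `sketchNgramFingerprint` is a hypothetical name.

```javascript
// Hypothetical helper; a sketch of the technique, not talisman's own code.
function sketchNgramFingerprint(n, string) {
  // Keep only the characters that matter: no punctuation, control
  // characters or whitespace (assumed character classes).
  const squeezed = string
    .toLowerCase()
    .replace(/[\p{P}\p{Cc}\s]/gu, '');

  // Collect every run of n consecutive characters, without duplicates.
  const grams = new Set();
  for (let i = 0; i + n <= squeezed.length; i++) {
    grams.add(squeezed.slice(i, i + n));
  }

  // Sort the grams, glue them together, then strip accents.
  const key = Array.from(grams).sort().join('');
  return key.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(sketchNgramFingerprint(2, 'University of north Carolina'));
// 'arcaerfnhcinitivlinaninoofolorrorsrtsithtyunveyo'
```

Smaller values of `n` make the key more tolerant of typos, at the cost of matching strings that are less alike.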
name-sig
Reference:
Khan, Shahidul Islam and Abu Sayed Md. Latiful Hoque. “Similarity Analysis of Patients’ Data: Bangladesh Perspective.” December 17, 2016.
The name significance (“NameSig”) similarity key: a keyer that simplifies names so that their variations match.
import namesig from 'talisman/keyers/name-sig';
namesig('Mr. Abdul Haque');
>>> 'abdlhk'
omission
Reference:
http://dl.acm.org/citation.cfm?id=358048
Pollock, Joseph J. and Antonio Zamora. 1984. “Automatic Spelling Correction in Scientific and Scholarly Text.” Communications of the ACM, 27(4). 358–368.
The omission key by Joseph Pollock & Antonio Zamora.
import omission from 'talisman/keyers/omission';
omission('University of north Carolina');
>>> 'VYFHCLNTSRUIEOA'
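The output above is consistent with the construction usually described for this key: the unique consonants of the word, arranged in a fixed order running from the least to the most frequently omitted letter, followed by the unique vowels in order of first appearance. The sketch below assumes that construction; the consonant ordering string and the name `sketchOmission` are assumptions made for illustration, not taken from the library.

```javascript
// Assumed consonant ordering, least to most frequently omitted.
const OMISSION_CONSONANTS = 'JKQXZVWYBFMGPDHCLNTSR';
const OMISSION_VOWELS = new Set('AEIOU');

// Hypothetical helper; a sketch under the assumptions stated above.
function sketchOmission(string) {
  // Work on uppercase ASCII letters only.
  const letters = string.toUpperCase().replace(/[^A-Z]/g, '');

  // A Set keeps each letter once, in order of first appearance.
  const seen = new Set(letters);

  // Consonants follow the fixed ordering, not their position in the word.
  const consonants = Array.from(OMISSION_CONSONANTS)
    .filter(letter => seen.has(letter))
    .join('');

  // Vowels follow their order of first appearance in the word.
  const vowels = Array.from(seen)
    .filter(letter => OMISSION_VOWELS.has(letter))
    .join('');

  return consonants + vowels;
}

console.log(sketchOmission('University of north Carolina'));
// 'VYFHCLNTSRUIEOA'
```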
skeleton
Reference:
http://dl.acm.org/citation.cfm?id=358048
Pollock, Joseph J. and Antonio Zamora. 1984. “Automatic Spelling Correction in Scientific and Scholarly Text.” Communications of the ACM, 27(4). 358–368.
The skeleton key by Joseph Pollock & Antonio Zamora.
import skeleton from 'talisman/keyers/skeleton';
skeleton('University of north Carolina');
>>> 'UNVRSTYFHCLIEOA'
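The output above is consistent with the construction usually described for this key: the first letter of the word, then the remaining unique consonants in order of appearance, then the remaining unique vowels in order of appearance. The sketch below assumes that construction; `sketchSkeleton` is a hypothetical name and the handling of non-letter characters is an assumption.

```javascript
const SKELETON_VOWELS = new Set('AEIOU');

// Hypothetical helper; a sketch under the assumptions stated above.
function sketchSkeleton(string) {
  // Work on uppercase ASCII letters only.
  const letters = string.toUpperCase().replace(/[^A-Z]/g, '');
  if (!letters) return '';

  // The first letter always leads the key, vowel or not.
  const first = letters[0];

  // Remaining letters, each kept once, in order of first appearance.
  const rest = Array.from(new Set(letters)).filter(letter => letter !== first);

  const consonants = rest.filter(letter => !SKELETON_VOWELS.has(letter)).join('');
  const vowels = rest.filter(letter => SKELETON_VOWELS.has(letter)).join('');

  return first + consonants + vowels;
}

console.log(sketchSkeleton('University of north Carolina'));
// 'UNVRSTYFHCLIEOA'
```

Unlike the omission key, the skeleton key preserves the order in which consonants occur, so it is more sensitive to transpositions early in the word.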