A handy collection of algorithms dealing with fuzzy strings and phonetics.
To install the latest version from Clojars, add the following vector to the :dependencies section of your project.clj file.
[clj-fuzzy "0.4.1"]
Then run lein deps to process your dependencies.
If you would rather install the latest version from the current source, clone the repository and install it with Leiningen.
git clone https://github.com/Yomguithereal/clj-fuzzy.git
cd clj-fuzzy
lein install
Then include the same vector in your project.clj and you should be good to go.
clj-fuzzy ships with three API namespaces: clj-fuzzy.metrics, clj-fuzzy.stemmers and clj-fuzzy.phonetics.
Just require or use those and their relevant functions to run the algorithms.
To keep things as simple as possible, the following examples :use the clj-fuzzy namespaces. In real code, you should prefer a cleaner :require.
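As a rough illustration of the :require style recommended above, the namespace declaration could look like the sketch below. The aliases `metrics` and `phonetics` are arbitrary names chosen for this example, not something the library prescribes; the functions and results are the ones documented in this README.

```clojure
(ns my.clojure-namespace
  ;; The aliases are hypothetical; pick whatever reads best in your code.
  (:require [clj-fuzzy.metrics :as metrics]
            [clj-fuzzy.phonetics :as phonetics]))

;; Same calls as in the :use examples, now qualified by their alias
(metrics/levenshtein "book" "back")
;; => 2
(phonetics/soundex "Andrew")
;; => "A536"
```

Qualifying each call keeps it obvious which namespace a function comes from and avoids name clashes with your own code.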
(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))
;; Compute the Dice coefficient of two words
(dice "healed" "sealed")
0.8
(dice "healed" "herded")
0.4
;; Or choose the n-gram size yourself
(dice "bar" "baz" :n 3)
0
;; There is also a Sorensen alias
(sorensen "healed" "herded")
0.4
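To make the numbers above easier to follow, here is a minimal sketch of the Dice coefficient over sets of character n-grams. It is not the library's implementation; `ngrams` and `dice-sketch` are hypothetical helper names, and the choice to use sets (ignoring repeated n-grams) and to return 0.0 for empty input are assumptions of this sketch.

```clojure
(require '[clojure.set :as set])

(defn ngrams
  "Returns the set of n-character substrings of s."
  [n s]
  (set (map #(apply str %) (partition n 1 s))))

(defn dice-sketch
  "Dice coefficient: twice the number of shared n-grams divided by the
  total number of n-grams in both strings. Uses bigrams by default."
  ([a b] (dice-sketch a b 2))
  ([a b n]
   (let [x     (ngrams n a)
         y     (ngrams n b)
         total (+ (count x) (count y))]
     (if (zero? total)
       0.0 ; assumption: nothing to compare yields 0.0
       (double (/ (* 2 (count (set/intersection x y))) total))))))

(dice-sketch "healed" "sealed")
;; => 0.8
(dice-sketch "healed" "herded")
;; => 0.4
```

"healed" and "sealed" each have five bigrams and share four of them, hence 2 * 4 / 10 = 0.8.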
(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))
;; Compute the Levenshtein distance between two words
(levenshtein "book" "back")
2
(levenshtein "hello" "helo")
1
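For readers curious about what the distance measures, the following is a minimal sketch of the classic row-by-row dynamic programming approach. It is written for illustration only and is not taken from clj-fuzzy; `levenshtein-sketch` is a hypothetical name.

```clojure
(defn levenshtein-sketch
  "Minimum number of single-character insertions, deletions and
  substitutions needed to turn a into b."
  [a b]
  (let [b         (vec b)
        ;; Distance from the empty prefix of a to every prefix of b
        first-row (vec (range (inc (count b))))]
    (peek
     (reduce
      (fn [prev-row [i ca]]
        (reduce
         (fn [row [j cb]]
           (conj row
                 (min (inc (peek row))               ; insertion
                      (inc (nth prev-row (inc j)))   ; deletion
                      (+ (nth prev-row j)            ; substitution or match
                         (if (= ca cb) 0 1)))))
         [(inc i)] ; distance from a's prefix of length i+1 to the empty string
         (map-indexed vector b)))
      first-row
      (map-indexed vector a)))))

(levenshtein-sketch "book" "back")
;; => 2
(levenshtein-sketch "hello" "helo")
;; => 1
```

Only the previous row is kept at each step, so memory use grows with the length of the second string rather than with the product of both lengths.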
(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))
;; Compute the Hamming distance between two words
(hamming "ramer" "cases")
3
(hamming '(0 1 0 1) '(1 1 0 1))
1
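The Hamming distance simply counts the positions at which two equal-length sequences differ. A minimal sketch, assuming both inputs have the same length (how the library itself treats inputs of unequal length is not covered here), with `hamming-sketch` as a hypothetical name:

```clojure
(defn hamming-sketch
  "Number of positions at which a and b differ.
  Assumption of this sketch: both sequences must have the same length."
  [a b]
  {:pre [(= (count a) (count b))]}
  (count (filter true? (map not= a b))))

(hamming-sketch "ramer" "cases")
;; => 3
(hamming-sketch '(0 1 0 1) '(1 1 0 1))
;; => 1
```

Because it only walks both inputs in parallel, the same function works on strings, lists and vectors.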
(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))
;; Compute the Jaccard distance between two words
;; 0 meaning two identical strings and 1 two totally different ones
(jaccard "abc" "xyz")
1
(jaccard "night" "nacht")
4/7
;; If you prefer the Tanimoto name, an alias exists
(tanimoto "night" "nacht")
4/7
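The 4/7 result can be reproduced by hand with a small set-based sketch. This is an illustration under the assumption that the strings are compared as sets of characters, not the library's own code; `jaccard-distance-sketch` is a hypothetical name.

```clojure
(require '[clojure.set :as set])

(defn jaccard-distance-sketch
  "1 minus the ratio of shared elements to all distinct elements."
  [a b]
  (let [x     (set a)
        y     (set b)
        union (set/union x y)]
    (if (empty? union)
      0 ; assumption: two empty inputs are treated as identical
      (- 1 (/ (count (set/intersection x y)) (count union))))))

(jaccard-distance-sketch "night" "nacht")
;; => 4/7
(jaccard-distance-sketch "abc" "xyz")
;; => 1
```

"night" and "nacht" share the three characters n, h and t out of seven distinct characters, so the distance is 1 - 3/7 = 4/7.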
(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))
;; Compute the Jaro distance between two words
(jaro "Dwayne" "Duane")
0.8222222222222223
;; Compute the Jaro-Winkler distance between two words
(jaro-winkler "Dwayne" "Duane")
0.8400000000000001
(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))
;; Compare two strings using the Match Rating Approach
(mra-comparison "Byrne" "Boern")
{:minimum 4
 :similarity 5
 :codex ["BYRN" "BRN"]
 :match true}
(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))
;; Compute the Tversky index of two sequences.
(tversky "night" "nacht")
3/7
;; Compute the same index for a precise alpha and beta value
;; The default is alpha = beta = 1, which produces the Jaccard coefficient
;; alpha = beta = 0.5 produces the Dice coefficient (without bigrams)
(tversky "healed" "sealed" :alpha 0.5 :beta 0.5)
0.8
;; You can also specify whether you want to compute the
;; symmetric variant of the index
(tversky "healed" "sealed" :alpha 1 :beta 1 :symmetric true)
0.8
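The relationship between alpha, beta and the Jaccard and Dice coefficients is easier to see in a small sketch of the standard (non-symmetric) Tversky index over character sets. This is an illustration only, not the library's implementation; `tversky-sketch` is a hypothetical name, it takes alpha and beta as positional arguments, and it does not cover the symmetric variant.

```clojure
(require '[clojure.set :as set])

(defn tversky-sketch
  "Shared elements divided by shared elements plus the weighted
  elements unique to each side."
  [a b alpha beta]
  (let [x      (set a)
        y      (set b)
        common (count (set/intersection x y))
        only-x (count (set/difference x y))
        only-y (count (set/difference y x))
        denom  (+ common (* alpha only-x) (* beta only-y))]
    (if (zero? denom)
      0 ; assumption: undefined cases (e.g. two empty inputs) yield 0
      (/ common denom))))

;; alpha = beta = 1 gives the Jaccard coefficient
(tversky-sketch "night" "nacht" 1 1)
;; => 3/7

;; alpha = beta = 0.5 gives the Dice coefficient over single characters
(tversky-sketch "healed" "sealed" 0.5 0.5)
;; => 0.8
```

Raising alpha penalises elements found only in the first sequence, and raising beta does the same for the second.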
(ns my.clojure-namespace
  (:use clj-fuzzy.stemmers))
;; Compute the stem of a word with the Lancaster stemmer
(lancaster "worker")
"work"
(lancaster "presumably")
"presum"
(ns my.clojure-namespace
  (:use clj-fuzzy.stemmers))
;; Compute the stem of a word with the Lovins stemmer
(lovins "nationality")
"nat"
(lovins "analytic")
"analys"
(ns my.clojure-namespace
  (:use clj-fuzzy.stemmers))
;; Compute the stem of a word with the Porter stemmer
(porter "adjective")
"adject"
(porter "building")
"build"
(ns my.clojure-namespace
  (:use clj-fuzzy.stemmers))
;; Compute the noun and verb stems of a Latin word with the Schinke stemmer
(schinke "aquila")
{:noun "aquil" :verb "aquila"}
(schinke "apparebunt")
{:noun "apparebu" :verb "apparebi"}
(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))
;; Compute the metaphone code for a single word
(metaphone "hypocrite")
"HPKRT"
(metaphone "discrimination")
"TSKRMNXN"
(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))
;; Compute the double metaphone of a word
(double-metaphone "Smith")
["SM0" "XMT"]
(double-metaphone "Schmidt")
["XMT" "SMT"]
(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))
;; Compute the soundex code of a single name
(soundex "Ashcroft")
"A261"
(soundex "Andrew")
"A536"
(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))
;; Compute the NYSIIS code of a single name
(nysiis "Andrew")
"ANDR"
(nysiis "Mclaughlin")
"MCLAGLAN"
;; Compute the refined NYSIIS code of a single name
(nysiis "Aegir" :refined)
"AGAR"
(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))
;; Compute the caverphone code of a single name
(caverphone "Henrichsen")
"ANRKSN1111"
(caverphone "Mclaverty")
"MKLFTA1111"
;; Compute the "revisited" caverphone code of a single name
(caverphone "Stevenson" :revisited)
"STFNSN1111"
(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))
;; Compute the cologne phonetic code of a single word
(cologne "Müller-Lüdenscheidt")
"65752682"
(cologne "Breschnew")
"17863"
(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))
;; Compute the MRA codex of a single name
(mra-codex "Catherine")
"CTHRN"
(mra-codex "Smith")
"SMTH"