clj-fuzzy

A handy collection of algorithms dealing with fuzzy strings and phonetics.

Clojure


Installation

To install the latest version from Clojars, add the following vector to the :dependencies section of your project.clj file.

[clj-fuzzy "0.4.1"]

Then run lein deps to process your dependencies.

If you would rather build the latest version from source, clone the repository and install it with Leiningen.

git clone https://github.com/Yomguithereal/clj-fuzzy.git
cd clj-fuzzy
lein install

Then include the same vector within your project.clj and you should be good to go.


Usage

clj-fuzzy ships with three API namespaces: clj-fuzzy.metrics, clj-fuzzy.stemmers and finally clj-fuzzy.phonetics.

Require (or use) the namespaces you need and call their functions to run the algorithms.

To keep them as short as possible, the following examples :use the clj-fuzzy namespaces. In real code, prefer a :require with an alias, which keeps the origin of each function explicit.
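
As a minimal sketch of the :require style, the declaration below loads two of the namespaces under aliases. The namespace name my.app.core and the aliases metrics and phonetics are illustrative choices, not part of the library. The calls and results are the ones shown in the sections of this README, and the snippet assumes clj-fuzzy is already on the classpath.

```clojure
;; Hypothetical application namespace; only the clj-fuzzy namespaces are real.
(ns my.app.core
  (:require [clj-fuzzy.metrics :as metrics]
            [clj-fuzzy.phonetics :as phonetics]))

;; Every call is now prefixed with the alias it comes from.
(metrics/levenshtein "book" "back")
;; => 2

(phonetics/soundex "Andrew")
;; => "A536"
```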


clj-fuzzy.metrics

clj-fuzzy.stemmers

clj-fuzzy.phonetics


clj-fuzzy.metrics

Sorensen / Dice coefficient

(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))

;; Compute the Dice coefficient of two words
(dice "healed" "sealed")
0.8

(dice "healed" "herded")
0.4

;; Or, to choose the n-gram size
(dice "bar" "baz" :n 3)
0

;; There is also a Sorensen alias
(sorensen "healed" "herded")
0.4
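
To make the numbers above easier to follow, here is a rough, self-contained sketch of a Dice coefficient computed over sets of n-grams. It is not the library's implementation: the names n-grams and dice-sketch are made up for this example, and the n-gram size is passed as a plain third argument rather than through the :n option.

```clojure
(require '[clojure.set :as set])

(defn n-grams
  "Returns the set of contiguous substrings of length n found in s."
  [n s]
  (set (map #(apply str %) (partition n 1 s))))

(defn dice-sketch
  "Dice coefficient: twice the number of shared n-grams divided by the
  total number of n-grams in both strings. Bigrams are used by default."
  ([a b] (dice-sketch a b 2))
  ([a b n]
   (let [x     (n-grams n a)
         y     (n-grams n b)
         total (+ (count x) (count y))]
     (if (zero? total)
       0.0
       (/ (* 2.0 (count (set/intersection x y))) total)))))

(dice-sketch "healed" "sealed")
;; => 0.8  (4 shared bigrams out of 5 + 5)

(dice-sketch "healed" "herded")
;; => 0.4

(dice-sketch "bar" "baz" 3)
;; => 0.0  (the two trigrams differ)
```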

Levenshtein distance

(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))

;; Compute the Levenshtein distance between two words
(levenshtein "book" "back")
2

(levenshtein "hello" "helo")
1
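
For readers curious about what the distance counts, the following is a minimal sketch of the classic row-by-row dynamic programming formulation, where each insertion, deletion or substitution costs 1. The name levenshtein-sketch is hypothetical and the code is only meant as an illustration, not as the way clj-fuzzy computes the result.

```clojure
(defn levenshtein-sketch
  "Edit distance between sequences a and b, computed one row at a time."
  [a b]
  (let [next-row
        (fn [prev ca]
          ;; prev holds the distances for the prefix of a that ends before ca.
          (reduce (fn [row [j cb]]
                    (conj row
                          (min (inc (peek row))          ; insert cb
                               (inc (nth prev (inc j)))  ; delete ca
                               (+ (nth prev j)           ; substitute, or match for free
                                  (if (= ca cb) 0 1)))))
                  [(inc (first prev))]
                  (map-indexed vector b)))]
    ;; The first row is the cost of building each prefix of b from nothing.
    (peek (reduce next-row (vec (range (inc (count b)))) a))))

(levenshtein-sketch "book" "back")
;; => 2

(levenshtein-sketch "hello" "helo")
;; => 1
```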

Hamming distance

(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))

;; Compute the Hamming distance between two words
(hamming "ramer" "cases")
3

(hamming '(0 1 0 1) '(1 1 0 1))
1
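
The Hamming distance is simply the number of positions at which two sequences of equal length differ. A minimal sketch follows; hamming-sketch is a hypothetical name, and rejecting inputs of different lengths is an assumption of this example rather than documented library behaviour.

```clojure
(defn hamming-sketch
  "Counts the positions at which a and b hold different elements.
  Assumes both sequences have the same length."
  [a b]
  {:pre [(= (count a) (count b))]}
  (count (filter true? (map not= a b))))

(hamming-sketch "ramer" "cases")
;; => 3

(hamming-sketch '(0 1 0 1) '(1 1 0 1))
;; => 1
```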

Jaccard / Tanimoto distance

(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))

;; Compute the Jaccard distance between two words
;; 0 means two identical strings, 1 means two totally different ones
(jaccard "abc" "xyz")
1

(jaccard "night" "nacht")
4/7

;; A tanimoto alias also exists
(tanimoto "night" "nacht")
4/7
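
The 4/7 above can be reproduced by hand by treating each word as a set of characters: "night" and "nacht" share 3 characters out of 7 distinct ones, so the distance is 1 - 3/7. The sketch below spells this out under that assumption; jaccard-distance-sketch is a hypothetical name, not a library function.

```clojure
(require '[clojure.set :as set])

(defn jaccard-distance-sketch
  "1 minus the ratio of shared elements to distinct elements overall.
  Two empty inputs are treated as identical (distance 0)."
  [a b]
  (let [x          (set a)
        y          (set b)
        union-size (count (set/union x y))]
    (if (zero? union-size)
      0
      (- 1 (/ (count (set/intersection x y)) union-size)))))

(jaccard-distance-sketch "abc" "xyz")
;; => 1

(jaccard-distance-sketch "night" "nacht")
;; => 4/7
```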

Jaro-Winkler distance

(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))

;; Compute the Jaro distance between two words
(jaro "Dwayne" "Duane")
0.8222222222222223

;; Compute the Jaro-Winkler distance between two words
(jaro-winkler "Dwayne" "Duane")
0.8400000000000001
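
The gap between the two scores comes from the Winkler adjustment, which rewards a shared prefix. As a hedged illustration, the sketch below applies the usual formulation, jaro + l * p * (1 - jaro), with a scaling factor p of 0.1 and a prefix length l capped at 4 characters. These constants are the commonly cited defaults and are assumed here, and both function names are made up for the example.

```clojure
(defn common-prefix-length
  "Length of the prefix shared by a and b, capped at limit characters."
  [a b limit]
  (count (take-while true? (take limit (map = a b)))))

(defn winkler-boost
  "Applies the Winkler prefix adjustment to an already computed Jaro score."
  [jaro-score a b]
  (let [l (common-prefix-length a b 4)  ; assumed cap of 4 characters
        p 0.1]                          ; assumed scaling factor
    (+ jaro-score (* l p (- 1 jaro-score)))))

;; "Dwayne" and "Duane" only share the prefix "D", so l is 1.
(winkler-boost 0.8222222222222223 "Dwayne" "Duane")
;; => approximately 0.84
```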

MRA Comparison

(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))

;; Compare two strings using the Match Rating Approach
(mra-comparison "Byrne" "Boern")
{:minimum 4
 :similarity 5
 :codex ["BYRN" "BRN"]
 :match true}

Tversky Index

(ns my.clojure-namespace
  (:use clj-fuzzy.metrics))

;; Compute the Tversky index of two sequences.
(tversky "night" "nacht")
3/7

;; Compute the same index for specific alpha and beta values
;; The default is alpha = beta = 1, which produces the Jaccard coefficient
;; alpha = beta = 0.5 produces the Dice coefficient (on single characters, not bigrams)
(tversky "healed" "sealed" :alpha 0.5 :beta 0.5)
0.8

;; You can also specify whether you want to compute the
;; symmetric variant of the index
(tversky "healed" "sealed" :alpha 1 :beta 1 :symmetric true)
0.8
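
To show how alpha and beta shape the result, here is a small sketch of the basic (non-symmetric) index over sets of characters: the shared elements divided by the shared elements plus the weighted elements unique to each side. The name tversky-sketch and the positional alpha and beta arguments are choices made for this example; the symmetric variant is left out.

```clojure
(require '[clojure.set :as set])

(defn tversky-sketch
  "Tversky index of a and b seen as sets.
  Assumes the denominator is non-zero, i.e. the inputs are not both empty
  and the weights do not cancel every term."
  [a b alpha beta]
  (let [x      (set a)
        y      (set b)
        common (count (set/intersection x y))
        only-x (count (set/difference x y))
        only-y (count (set/difference y x))]
    (/ common (+ common (* alpha only-x) (* beta only-y)))))

;; alpha = beta = 1 gives the Jaccard coefficient ...
(tversky-sketch "night" "nacht" 1 1)
;; => 3/7

;; ... whose complement is the Jaccard distance.
(- 1 (tversky-sketch "night" "nacht" 1 1))
;; => 4/7

;; alpha = beta = 0.5 gives the Dice coefficient on single characters.
(tversky-sketch "healed" "sealed" 0.5 0.5)
;; => 0.8
```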

clj-fuzzy.stemmers

Lancaster stemmer

(ns my.clojure-namespace
  (:use clj-fuzzy.stemmers))

;; Compute the stem of a word
(lancaster "worker")
"work"

(lancaster "presumably")
"presum"

Lovins stemmer

(ns my.clojure-namespace
  (:use clj-fuzzy.stemmers))

;; Compute the stem of a word
(lovins "nationality")
"nat"

(lovins "analytic")
"analys"

Porter stemmer

(ns my.clojure-namespace
  (:use clj-fuzzy.stemmers))

;; Compute the stem of a word
(porter "adjective")
"adject"

(porter "building")
"build"

Schinke stemmer

(ns my.clojure-namespace
  (:use clj-fuzzy.stemmers))

;; Compute the stem of a word
(schinke "aquila")
{:noun "aquil" :verb "aquila"}

(schinke "apparebunt")
{:noun "apparebu" :verb "apparebi"}

clj-fuzzy.phonetics

Metaphone

(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))

;; Compute the metaphone code for a single word
(metaphone "hypocrite")
"HPKRT"

(metaphone "discrimination")
"TSKRMNXN"

Double Metaphone

(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))

;; Compute the double metaphone of a word
(double-metaphone "Smith")
["SM0" "XMT"]

(double-metaphone "Schmidt")
["XMT" "SMT"]

Soundex

(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))

;; Compute the soundex code of a single name
(soundex "Ashcroft")
"A261"

(soundex "Andrew")
"A536"

NYSIIS

(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))

;; Compute the NYSIIS code of a single name
(nysiis "Andrew")
"ANDR"

(nysiis "Mclaughlin")
"MCLAGLAN"

;; Compute the refined NYSIIS code of a single name
(nysiis "Aegir" :refined)
"AGAR"

Caverphone

(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))

;; Compute the caverphone code of a single name
(caverphone "Henrichsen")
"ANRKSN1111"

(caverphone "Mclaverty")
"MKLFTA1111"

;; Compute the "revisited" caverphone code of a single name
(caverphone "Stevenson" :revisited)
"STFNSN1111"

Cologne Phonetic

(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))

;; Compute the cologne phonetic code of a single word
(cologne "Müller-Lüdenscheidt")
"65752682"

(cologne "Breschnew")
"17863"

MRA Codex

(ns my.clojure-namespace
  (:use clj-fuzzy.phonetics))

;; Compute the MRA codex of a single name
(mra-codex "Catherine")
"CTHRN"

(mra-codex "Smith")
"SMTH"