Words tokenizers

The tokenizers/words module gathers the library’s various words tokenizers.

Words tokenizers take raw text and output a list of the text's words.

Summary

Modules under the talisman/tokenizers/words namespace:

- gersam
- naive
- treebank

gersam

Reference: http://www.statmt.org/moses/

This heuristic word tokenizer is inspired by the one used by the Moses machine translation system. It supports many languages and is able to handle a number of tricky cases.

Gersam is the Latin name of Moses’ firstborn child.

import gersam from 'talisman/tokenizers/words/gersam';

gersam('en', 'Hello World!');
>>> ['Hello', 'World', '!']

Arguments

- lang string: the language of the text to tokenize.
- text string: the text to tokenize.

English version
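
With English text, Moses-style rules split contractions so that the apostrophe stays attached to the trailing part. A hedged sketch (the exact tokens depend on the library's English rules):

gersam('en', "It wasn't me!");
>>> ['It', 'wasn', '\'t', 'me', '!']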

French version
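
With French text, on the contrary, elisions such as l' keep the apostrophe on the leading token. Again a sketch, assuming Moses-like French rules:

gersam('fr', "L'amour de l'humanité.");
>>> ['L\'', 'amour', 'de', 'l\'', 'humanité', '.']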

naive

Reference: https://lodash.com/

This words tokenizer is actually lodash’s words function re-exported for convenience.

It’s quite robust and works well in a lot of simple cases.

import words from 'talisman/tokenizers/words/naive';

words('Hello World!');
>>> ['Hello', 'World']
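
Note that, since it relies on lodash's words, punctuation is dropped altogether and common English contractions typically remain attached to their word. A hedged sketch (compare with treebank below):

words("It wasn't me!");
>>> ['It', 'wasn\'t', 'me']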

treebank

Reference: http://www.cis.upenn.edu/~treebank/tokenizer.sed

This words tokenizer is one of the most popular regular expression tokenizers for the English language.

It is able to split expressions such as "isn't" into two proper tokens.

Note that it won’t strip punctuation but will keep it as separate tokens.

import treebank from 'talisman/tokenizers/words/treebank';

treebank("It wasn't me!");
>>> ['It', 'was', 'n\'t', 'me', '!']
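
The reference sed script also normalizes double quotes into Penn Treebank-style paired backticks and quotes. Assuming the port follows this behavior, one would expect something along these lines:

treebank('"Why?" she asked.');
>>> ['``', 'Why', '?', '\'\'', 'she', 'asked', '.']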