Words tokenizers
The tokenizers/words module gathers the library’s various words tokenizers.
Words tokenizers take raw text and output a list of this text’s words.
gersam
Reference: http://www.statmt.org/moses/
This heuristic word tokenizer is inspired by the one used in the Moses machine translation system. It supports many languages and handles many tricky cases.
Gersam is the Latin name of Moses’ firstborn child.
import gersam from 'talisman/tokenizers/words/gersam';
gersam('en', 'Hello World!');
>>> ['Hello', 'World', '!']
Arguments
- lang string: target language in ISO 639-1 format.
- text string: text to tokenize.
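To illustrate why a tokenizer of this kind needs a lang argument, here is a minimal sketch of one language-dependent rule. It assumes the behavior typical of Moses-style tokenizers, where French and Italian elisions stay attached to the word on the left while English clitics stay attached to the right. The splitApostrophes helper is hypothetical and is not the code used by gersam.

```javascript
// Hypothetical helper illustrating a language-dependent rule.
// This is NOT Talisman's gersam implementation, only a sketch.
function splitApostrophes(lang, text) {
  // An apostrophe sitting between two letters.
  const rule = /(\p{L})'(\p{L})/gu;

  // Assumption: French and Italian keep the apostrophe on the left
  // ("l'homme" -> "l' homme"), other languages keep it on the right
  // ("John's" -> "John 's").
  const replacement = (lang === 'fr' || lang === 'it') ? "$1' $2" : "$1 '$2";

  return text.replace(rule, replacement).split(/\s+/).filter(Boolean);
}

console.log(splitApostrophes('fr', "l'homme"));
console.log(splitApostrophes('en', "John's book"));
```

The same input text can therefore yield different tokens depending on the language code, which is why the language comes first in the call.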
naive
Reference: https://lodash.com/
This words tokenizer is simply lodash’s words function, re-exported for convenience.
It is robust enough for many simple cases, but note that it drops punctuation.
import words from 'talisman/tokenizers/words';
words('Hello World!');
>>> ['Hello', 'World']
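For readers curious about what a naive approach looks like, the following is a minimal sketch that keeps runs of letters, digits, and apostrophes and treats everything else as a separator. The naiveWords function is hypothetical and much simpler than lodash’s words function, which handles more cases.

```javascript
// Simplified stand-in for a "naive" words tokenizer.
// This is NOT lodash's implementation, only an illustration.
function naiveWords(text) {
  // Match runs of Unicode letters, digits, and apostrophes.
  // match() returns null when nothing is found, hence the fallback.
  return text.match(/[\p{L}\p{N}']+/gu) || [];
}

console.log(naiveWords('Hello World!'));
// [ 'Hello', 'World' ]
```

As with the tokenizer above, punctuation is discarded rather than returned as tokens.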
treebank
Reference: http://www.cis.upenn.edu/~treebank/tokenizer.sed
This words tokenizer is one of the most popular regular expression tokenizers for the English language.
It is able to split expressions such as isn’t into two proper tokens.
Note that it does not strip punctuation but keeps it as separate tokens.
import treebank from 'talisman/tokenizers/words/treebank';
treebank("It wasn't me!");
>>> ['It', 'was', 'n\'t', 'me', '!']
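The contraction splitting described above can be sketched with a few regular expressions. This is a heavily simplified illustration under the assumption that only a handful of English clitics and punctuation marks matter. The splitContractions helper is hypothetical and does not reproduce the full Treebank rule set or Talisman’s code.

```javascript
// Hypothetical helper, NOT Talisman's treebank implementation.
function splitContractions(text) {
  return text
    // Detach common punctuation so each mark becomes its own token.
    .replace(/([!?.,;:])/g, ' $1 ')
    // Split negative contractions: "wasn't" -> "was n't".
    .replace(/n't\b/gi, " n't")
    // Split other clitics: "I'm" -> "I 'm", "they'll" -> "they 'll".
    .replace(/'(s|m|d|ll|re|ve)\b/gi, " '$1")
    .trim()
    .split(/\s+/);
}

console.log(splitContractions("It wasn't me!"));
// [ 'It', 'was', "n't", 'me', '!' ]
```

A real Treebank tokenizer applies many more rules, for instance to avoid splitting the periods inside abbreviations, which this sketch would break apart.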