Sentences tokenizers
The tokenizers/sentences
module gathers the library's various sentence tokenizers.
Sentence tokenizers take raw text and output a list of the text's sentences.
Summary
naive
Author: Guillaume Plique
This tokenizer is called “naive” because it relies only on regular expressions and a short list of exceptions.
It should work reasonably well on most texts correctly written in Western European languages.
import sentences from 'talisman/tokenizers/sentences';
sentences('Hello World! Goodbye everyone.');
>>> [
'Hello World!',
'Goodbye everyone.'
]
Creating a tokenizer with custom exceptions
The default tokenizer provided by the library has limited knowledge of the exceptions you might encounter and only targets English and, to some extent, French.
If you need to feed your own list of exceptions to the tokenizer, you can easily create a custom one:
import {createTokenizer} from 'talisman/tokenizers/sentences';
// Pass your exceptions without the '.'
const customExceptions = ['Sgt', 'M', 'Mr'];
const customTokenizer = createTokenizer({exceptions: customExceptions});
customTokenizer('Hello Sgt. Loyall. How are you?');
>>> [
'Hello Sgt. Loyall',
'How are you?'
]