Sentences tokenizers
The tokenizers/sentences
module gathers the library's various sentence tokenizers.
Sentence tokenizers take raw text and output a list of the text's sentences.
Summary
naive
Author: Guillaume Plique
This tokenizer is called “naive” because it relies only on regular expressions and a short list of exceptions.
It should work reasonably well on most texts correctly written in Western European languages.
import sentences from 'talisman/tokenizers/sentences';
sentences('Hello World! Goodbye everyone.');
>>> [
'Hello World!',
'Goodbye everyone.'
]
Creating a tokenizer with custom exceptions
The default tokenizer provided by the library has limited knowledge of the exceptions you might encounter and only targets English and, to some extent, French.
If you need to feed your own list of exceptions to the tokenizer, you can easily create a custom one:
import {createTokenizer} from 'talisman/tokenizers/sentences';
// Pass your exceptions without the '.'
const customExceptions = ['Sgt', 'M', 'Mr'];
const customTokenizer = createTokenizer({exceptions: customExceptions});
customTokenizer('Hello Sgt. Loyall. How are you?');
>>> [
'Hello Sgt. Loyall',
'How are you?'
]