library(porter_stem) implements the stemming
algorithm described in Porter (1980), “An algorithm for
suffix stripping”, Program, Vol. 14, No. 3, pp. 130-137. The
library also provides some additional predicates that are commonly used in
the context of stemming.
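As a minimal sketch of how the stemmer might be called: the example below assumes that the library exports its stemming predicate as porter_stem/2, a name the text above does not state. The helper stem_words/2 is hypothetical and only serves to illustrate the call.

```prolog
:- use_module(library(porter_stem)).

% stem_words(+Words, -Stems)
% Hypothetical helper (not part of the library): stems each word in a list.
% Assumes porter_stem(+In, -Stem) is the stemming predicate.
stem_words(Words, Stems) :-
    maplist(porter_stem, Words, Stems).

% Example query, using the word family from Porter's paper:
% ?- stem_words([connected, connecting, connection], Stems).
% Stems = [connect, connect, connect].
```

Different inflections of a word collapse to one stem, which is why stemming is typically applied before indexing or comparing terms.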
The tokenizer classifies its input according to the following patterns:

Pattern | Token type |
[-+][0-9]+(\.[0-9]+)?([eE][-+][0-9]+)? | number |
[:alpha:][:alnum:]+ | word |
[:space:]+ | skipped |
anything else | single-character |
Character classification relies on the C-library wide-character functions (iswalnum() and related). Recognised numbers are passed to Prolog read/1, so integers of unbounded size are supported.
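The token rules above can be explored with a short sketch. It assumes the tokenizer is available as tokenize_atom/2, the two-argument form implied by the mention of tokenize_atom/3 below, and that words and single-character tokens are returned as atoms. The helpers token_kind/2, tag_token/2 and tagged_tokens/2 are hypothetical names introduced only for illustration.

```prolog
:- use_module(library(porter_stem)).

% token_kind(+Token, -Kind)
% Hypothetical helper: numbers arrive as Prolog numbers; every other
% token is assumed to be an atom (a word or a single character).
token_kind(Token, number) :- number(Token), !.
token_kind(_, text).

% tag_token(+Token, -Pair): pair a token with its kind.
tag_token(Token, Token-Kind) :-
    token_kind(Token, Kind).

% tagged_tokens(+Text, -Pairs)
% Tokenize Text (assumes tokenize_atom(+In, -Tokens)) and tag each token.
tagged_tokens(Text, Pairs) :-
    tokenize_atom(Text, Tokens),
    maplist(tag_token, Tokens, Pairs).

% Example query:
% ?- tagged_tokens('the ratio is about 3.14, roughly', Pairs).
```

Because recognised numbers are converted by read/1, a check with number/1 is enough to separate them from word and punctuation tokens.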
It is likely that future versions of this library will provide tokenize_atom/3 with additional options to modify space handling as well as the definition of words.
The code is based on the original public-domain implementation by Martin Porter, available at http://www.tartarus.org/martin/PorterStemmer/. It has been modified by Jan Wielemaker, who removed all global variables to make the code thread-safe, added the unaccent and tokenize code, and created the SWI-Prolog binding.