class Ferret::Analysis::StandardTokenizer
Summary
The standard tokenizer is an advanced tokenizer that handles most words correctly and also recognizes email addresses, web addresses, phone numbers, and similar tokens.
Example
"Dave's résumé, at http://www.davebalmain.com/ 1234" => ["Dave's", "résumé", "at", "http://www.davebalmain.com", "1234"]
Public Class Methods
new(lower = true) → tokenizer
Create a new StandardTokenizer which optionally downcases tokens. Downcasing is done according to the current locale.
- lower: set to false if you don't wish to downcase tokens
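The behaviour described above can be approximated in plain Ruby. The following is a minimal sketch, not Ferret's implementation: the `sketch_tokenize` method, the `SKETCH_*` patterns, and the `lower:` keyword are hypothetical names introduced for illustration, and the downcasing here uses Ruby's Unicode-aware `String#downcase` rather than the process locale.

```ruby
# Illustrative sketch only: a regex approximation, NOT Ferret's tokenizer.
# All names below are hypothetical and exist only for this example.
SKETCH_URL   = %r{[a-z][a-z0-9+.-]*://[^\s/]+(?:/\S*)?}i
SKETCH_EMAIL = /[\p{Alnum}._%+-]+@[\p{Alnum}.-]+\.\p{Alpha}{2,}/
SKETCH_WORD  = /\p{Alnum}+(?:'\p{Alnum}+)*/

# URLs and emails are listed first so they win over the plain word pattern
# when several alternatives could match at the same position.
SKETCH_TOKEN = Regexp.union(SKETCH_URL, SKETCH_EMAIL, SKETCH_WORD)

def sketch_tokenize(text, lower: true)
  text.scan(SKETCH_TOKEN).map do |tok|
    # Drop trailing slashes from URL tokens, as in the example above.
    tok = tok.sub(%r{/+\z}, "") if tok.include?("://")
    lower ? tok.downcase : tok
  end
end

tokens = sketch_tokenize("Dave's résumé, at http://www.davebalmain.com/ 1234", lower: false)
# tokens => ["Dave's", "résumé", "at", "http://www.davebalmain.com", "1234"]
```

Passing `lower: false` reproduces the casing shown in the example; with the default, every token is downcased. Phone numbers and other token types that the real tokenizer recognizes are deliberately left out of this sketch.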
static VALUE
frb_standard_tokenizer_init(VALUE self, VALUE rstr)
{
#ifndef POSH_OS_WIN32
    /* Set LC_CTYPE from the environment once; skipped on Windows builds. */
    if (!frb_locale) frb_locale = setlocale(LC_CTYPE, "");
#endif
    return get_wrapped_ts(self, rstr, mb_standard_tokenizer_new());
}