class Ferret::Analysis::RegExpTokenizer
Summary
A tokenizer that recognizes tokens using a regular expression passed to the constructor. Most tokenizers can be built with this class.
Example
Below is a simple implementation of a LetterTokenizer using a RegExpTokenizer. A token is a sequence of alphabetic characters; tokens are separated by one or more non-alphabetic characters.
  # of course you would add more than just é
  RegExpTokenizer.new(input, /[[:alpha:]é]+/)

  "Dave's résumé, at http://www.davebalmain.com/ 1234"
    => ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
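The token list above can be checked without Ferret installed. The following is a minimal sketch that approximates the matching step with Ruby's built-in String#scan. It does not call RegExpTokenizer, and the names TOKEN_RE and sketch_tokens are hypothetical, introduced only for illustration.

```ruby
# Pure-Ruby approximation of the example above; this does NOT use Ferret.
# TOKEN_RE and sketch_tokens are hypothetical names for illustration only.
TOKEN_RE = /[[:alpha:]é]+/

# Return every non-overlapping match of the regexp, in order of appearance.
def sketch_tokens(text, regexp = TOKEN_RE)
  text.scan(regexp)
end

p sketch_tokens("Dave's résumé, at http://www.davebalmain.com/ 1234")
# prints ["Dave", "s", "résumé", "at", "http", "www", "davebalmain", "com"]
```

The apostrophe, punctuation, and digits never match the character class, so they act as separators and "1234" yields no token.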
Constants
- REGEXP
Public Class Methods
new(input, /[[:alpha:]]+/)
Create a new tokenizer based on a regular expression
- input: text to tokenize
- regexp: regular expression used to recognize tokens in the input
static VALUE
frb_rets_init(int argc, VALUE *argv, VALUE self)
{
    VALUE rtext, regex, proc;
    TokenStream *ts;

    rb_scan_args(argc, argv, "11&", &rtext, &regex, &proc);
    ts = rets_new(rtext, regex, proc);

    Frt_Wrap_Struct(self, &frb_rets_mark, &frb_rets_free, ts);
    object_add(ts, self);
    return self;
}
Public Instance Methods
text → text
Get the text being tokenized by the tokenizer.
static VALUE
frb_rets_get_text(VALUE self)
{
    TokenStream *ts;
    GET_TS(ts, self);
    return RETS(ts)->rtext;
}
text = text → text
Set the text to be tokenized by the tokenizer. The tokenizer gets reset to tokenize the text from the beginning.
static VALUE
frb_rets_set_text(VALUE self, VALUE rtext)
{
    TokenStream *ts;
    GET_TS(ts, self);

    rb_hash_aset(object_space, ((VALUE)ts)|1, rtext);
    StringValue(rtext);
    RETS(ts)->rtext = rtext;
    RETS(ts)->curr_ind = 0;

    return rtext;
}
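The reset behaviour of the setter can be illustrated with a small pure-Ruby model. This is a sketch under stated assumptions, not Ferret's implementation: SketchTokenizer, next_token, and @pos are hypothetical names, with @pos standing in for the curr_ind field that the C source sets back to 0.

```ruby
# Hypothetical stand-in for illustration only; this does NOT use Ferret.
class SketchTokenizer
  # Reader for the text currently being tokenized.
  attr_reader :text

  def initialize(text, regexp = /[[:alpha:]]+/)
    @regexp = regexp
    self.text = text
  end

  # Assigning new text rewinds the scan position to the beginning,
  # mirroring the curr_ind = 0 assignment in the C source above.
  def text=(new_text)
    @text = String(new_text)
    @pos = 0
  end

  # Return the next matching token, or nil once the input is exhausted.
  def next_token
    match = @regexp.match(@text, @pos)
    return nil unless match

    @pos = match.end(0)
    match[0]
  end
end

tokenizer = SketchTokenizer.new("one two")
tokenizer.next_token          # "one"
tokenizer.text = "three four" # position is rewound here
tokenizer.next_token          # "three", not a token from the old text
```

Keeping the position alongside the text means a single tokenizer object can be reused for many inputs rather than constructing a new one each time.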