Monday, November 2, 2009

Some enhancements for FTS3

FTS3 works well but is quite unfriendly to developers. For example, it is easy to write
Tcl interface code for the Snowball stemmer utility "stemwords" and for a stopword dictionary,
but there is no way to use them in FTS3. User functions can easily be written in C or
any other language, but FTS3 does not work with them.

1. The FTS3 extension has no interfaces for a stemmer, a stopword dictionary, etc.
The code of the FTS3 extension is very difficult to understand and change. Would it be
possible to add calls to user-defined functions for these tasks?

The virtual table creation command could be extended as follows:

sql-command ::= CREATE VIRTUAL TABLE [ database-name .] table-name USING fts3
[( [ argument [, argument, [argument, ...] ]*] )]
argument ::= name | TOKENIZE tokenizer | FUNCTION user_function
tokenizer ::= SIMPLE | PORTER | user-defined

When a FUNCTION returns NULL, the word must be ignored; otherwise the tokenized word is
replaced by the value the function returns.
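
The rule above can be sketched in plain Tcl, without FTS3 itself. This is only an illustration of the proposed behaviour: apply_functions is a hypothetical helper, the stopword and stemmer bodies are toy stand-ins (not the Snowball algorithm), and an empty string stands in for SQL NULL. Applying the functions in the order they are listed is also an assumption.

```tcl
# Toy stand-ins for the user functions; real ones are up to the application.
proc stopword {word} {
    # An empty string plays the role of SQL NULL in this sketch.
    if {$word in {the a of}} { return "" }
    return $word
}
proc stemmer {word} {
    # Deliberately naive: strip one trailing "s".
    regsub {s$} $word "" word
    return $word
}

# Hypothetical helper: pass every token through each function in turn.
# A token is dropped as soon as one function returns the empty string;
# otherwise it is replaced by the returned value.
proc apply_functions {tokens functions} {
    set out {}
    foreach word $tokens {
        foreach fn $functions {
            set word [$fn $word]
            if {$word eq ""} break
        }
        if {$word ne ""} { lappend out $word }
    }
    return $out
}

set result [apply_functions {the cats of rome} {stopword stemmer}]
puts $result
```

Here "the" and "of" are dropped by stopword, and "cats" is replaced by "cat", so only "cat" and "rome" would reach the index.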

For example, an application could bind these functions as follows:

#!/usr/bin/tclsh8.5
package require sqlite3
sqlite3 db :memory:
proc stopword {word} {
...
}
proc stemmer {word} {
...
}
db function stopword stopword
db function stemmer stemmer
db eval {CREATE VIRTUAL TABLE t USING fts3(content, TOKENIZE icu ru_RU,
FUNCTION stopword, FUNCTION stemmer);}

Of course, the example above can be extended with a synonym dictionary function, the
built-in soundex() function, or others.

I think this feature is a "must have".

2. The snippet function currently has no way to change the snippet text size and returns
a very small text fragment. By comparison, the standalone Unix diff -u command returns 3
lines of context before and after each change, and this can easily be changed with
command-line arguments. Yes, an application can implement its own snippet function on top
of the offsets() information, but that creates additional difficulties.
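
A minimal sketch of such an application-side snippet, assuming the string returned by offsets() (groups of four integers: column, term, byte offset, byte size) has already been fetched. The proc name snippet_around, the context parameter and the <b> markers are hypothetical. Only the first match is used, and the text is assumed to be ASCII so that byte offsets and character indices coincide.

```tcl
# Hypothetical helper: cut a snippet around the first match.
#   text    - the full column value
#   offsets - the value of offsets(): column, term, byte offset, byte size, ...
#   context - characters of context to keep on each side (assumed parameter)
proc snippet_around {text offsets context} {
    if {[llength $offsets] < 4} { return "" }
    lassign $offsets col term start size
    set from [expr {max(0, $start - $context)}]
    set to   [expr {min([string length $text], $start + $size + $context) - 1}]
    set head [string range $text $from [expr {$start - 1}]]
    set hit  [string range $text $start [expr {$start + $size - 1}]]
    set tail [string range $text [expr {$start + $size}] $to]
    return "$head<b>$hit</b>$tail"
}

# "brown" starts at offset 10 and is 5 bytes long.
puts [snippet_around "the quick brown fox jumps" "0 0 10 5" 6]
```

Even this small version has to deal with offset arithmetic and text boundaries, and a real one would also need multi-byte text and multiple matches, which is the extra difficulty mentioned above.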

3. A user-defined tokenizer function would be very helpful. The tokenizer is a stream
interface and must track the stream position, so a user-defined tokenizer could have
an interface like

tokenizer (document_text, document_position)

This function could be called from the xNext() interface function.

I am not sure about the implementation, and the interface may turn out to be different.
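
As a rough illustration of that interface, here is a plain Tcl sketch. The return value (a list of token, start index and next position, or an empty list at the end of the text) is an assumption, as are the lowercasing and the alphanumeric token rule. The loop plays the role that repeated xNext() calls would play inside FTS3.

```tcl
# Hypothetical user-defined tokenizer: find the next token at or after pos.
# Returns {token start next}, or an empty list when no tokens are left.
proc tokenizer {text pos} {
    if {![regexp -indices -start $pos {[[:alnum:]]+} $text span]} {
        return {}
    }
    lassign $span first last
    set token [string tolower [string range $text $first $last]]
    return [list $token $first [expr {$last + 1}]]
}

# Driver loop standing in for repeated xNext() calls.
set text "Hello, FTS3 world"
set pos 0
set tokens {}
while {[llength [set t [tokenizer $text $pos]]]} {
    lassign $t token first next
    lappend tokens $token
    set pos $next
}
puts $tokens
```

Passing the position in and getting the next position back keeps the function itself stateless, so the stream state could stay in the caller.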

(C) Alexey Pechnikov aka MBG, mobigroup.ru