Tests of the Zlib-patched FTS3

November 01, 2009

The test setup is simple. A Tcl script scans a directory tree and converts each supported file to plain text with a filter; some of the filters are based on the Tracker project's filters. I don't use transactions in my tests, but the Tcl script is easy to change for other tests. My SQLite build uses the pragma values page_size=4096 and cache_size=128000.

The scanned directory contains subdirectories and files of different types.


$ du -sh /mnt/backup/project/offline1/www/share
290M    /mnt/backup/project/offline1/www/share
$ find /mnt/backup/project/offline1/www/share -type f|wc -l
7122
$ find /mnt/backup/project/offline1/www/share -type f|grep .html|wc -l
4321
$ find /mnt/backup/project/offline1/www/share -type f|grep .doc|wc -l
89
$ find /mnt/backup/project/offline1/www/share -type f|grep .xsl|wc -l
37
$ find /mnt/backup/project/offline1/www/share -type f|grep .pdf|wc -l
4
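
Note that `grep .html` treats the dot as "any character" and matches anywhere in the path, so the counts above may include a few extra files. A minimal sketch of a stricter count by file name suffix, assuming a POSIX find; the count_ext helper name and the default directory are made up for this example:

```shell
#!/bin/sh
# count_ext DIR EXT: count regular files under DIR whose name ends in ".EXT".
# -name '*.EXT' matches only the suffix of the file name, unlike `grep .EXT`,
# where the dot is a wildcard and the match may occur anywhere in the path.
count_ext() {
    find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}

# Hypothetical usage: scan the directory given as the first argument,
# or the current directory when none is given.
dir=${1:-.}
for ext in html doc xsl pdf; do
    printf '%s\t%s\n' "$ext" "$(count_ext "$dir" "$ext")"
done
```

File names that contain newlines would be counted more than once; for a quick overview of a document set that is usually acceptable.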

The test Tcl script.


#!/usr/bin/tclsh8.5
# ./scan.tcl /mnt/backup/project/offline1/www/share
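package require sqlite3

# Check the arguments first so that a bad call does not create scan.db.
if {$argc < 1} {
    puts "Use as $argv0 directory"
    exit 1
}
set root [lindex $argv 0]

sqlite3 db scan.db
db eval {CREATE VIRTUAL TABLE t USING fts3(content, TOKENIZE icu ru_RU)}

# One path per line; paths containing newlines are not supported.
set files [split [exec find $root] \n]

foreach file $files {
    # catch skips files whose type detection or filter fails.
    catch {
        if {[file type $file] ne {file}} continue
        set type [exec file --brief --mime-type $file]
        # A filter is a script named after the MIME type, e.g. ./filters/text/html_filter.
        if {[file exists ./filters/${type}_filter]} {
            # The checksum is computed but not stored in this test.
            set md5 [string range [exec md5sum $file] 0 31]
            set text [exec ./filters/${type}_filter $file]
#            puts "$file => $type"
            db eval {insert into t (content) values($text)}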
        }
    }
}
db eval {vacuum}

The HTML-to-plaintext filter, ./filters/text/html_filter:


#!/bin/sh

nice -n19 w3m \
    -o indent_incr=0 \
    -o multicol=false \
    -o no_cache=true \
    -o use_cookie=false \
    -o display_charset=utf8 \
    -o system_charset=utf8 \
    -o follow_locale=false \
    -o use_language_tag=true \
    -o ucs_conv=true \
    -T text/html \
    -dump \
    "$1"

The MS Word-to-plaintext filter, ./filters/application/msword_filter:


#!/bin/sh

nice -n19 wvWare --nographics "$1" |w3m \
    -o indent_incr=0 \
    -o multicol=false \
    -o no_cache=true \
    -o use_cookie=false \
    -o display_charset=utf8 \
    -o system_charset=utf8 \
    -o follow_locale=false \
    -o use_language_tag=true \
    -o ucs_conv=true \
    -T text/html \
    -dump
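
The scanner picks a filter by building the path ./filters/<MIME type>_filter from the output of file(1). To try a single file by hand, the same lookup can be sketched in shell. This is only an illustration: the filter_path and extract_text names are made up here, and unlike scan.tcl the sketch also requires the filter to be executable.

```shell
#!/bin/sh
# filter_path FILTER_DIR MIME_TYPE: print the filter script for a MIME type,
# following the FILTER_DIR/<type>_filter layout used by scan.tcl.
# Returns non-zero when no executable filter exists for that type.
filter_path() {
    candidate="$1/$2_filter"
    if [ -x "$candidate" ]; then
        printf '%s\n' "$candidate"
    else
        return 1
    fi
}

# extract_text FILTER_DIR FILE: detect the MIME type with file(1) and run
# the matching filter on FILE; returns non-zero when there is no filter.
extract_text() {
    mime=$(file --brief --mime-type "$2") || return 1
    filter=$(filter_path "$1" "$mime") || return 1
    "$filter" "$2"
}
```

For example, `extract_text ./filters page.html | head` would show the first lines of the text that gets indexed for page.html, assuming the HTML filter above is installed.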

The FTS3 extension with Zlib compression of both the documents and the metadata:


$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real    10m18.639s
user    5m24.404s
sys     2m8.632s

$ ls -lh scan.db
-rw-r--r-- 1 veter veter 19M Ноя  1 17:33 scan.db

sqlite> select count(*) from t;
5032

sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.016001 sys 0.004000

sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%%                             │
│     [Алкатель] с     │ %%
CPU Time: user 0.184011 sys 0.008000

sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.468030 sys 0.024002

The upstream FTS3 extension:


$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real    11m36.446s
user    5m20.564s
sys     3m14.592s

$ ls -lh scan.db
-rw-r--r-- 1 veter veter 55M Ноя  1 18:24 scan.db

sqlite> select count(*) from t;
5032

sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.004000 sys 0.000000

sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%%                             │
│     [Алкатель] с     │ %%
CPU Time: user 0.176011 sys 0.004001

sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.064004 sys 0.044003

Update.

The FTS3 extension with Zlib compression of the documents only:


$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real    10m41.661s
user    5m23.912s
sys     2m9.004s

$ ls -lh scan.db
-rw-r--r-- 1 veter veter 19M Ноя  1 19:42 scan.db

sqlite> select count(*) from t;
5032

sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.016001 sys 0.004000

sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%%                             │
│     [Алкатель] с     │ %%
CPU Time: user 0.184012 sys 0.008000

sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.468029 sys 0.020001

Results.

1. Compressing the metadata is not useful: the database is 19M with or without it, and the query times are the same. I think the existing metadata implementation is already good.
2. Compressing the documents slows down count(*) queries (0.47 s versus 0.06 s of user time for 'абонент'). I think this is an error in the FTS3 virtual table implementation or in my compression code.
3. Compressing the documents reduces the database size by about a factor of 3 on my test document set (from 55M to 19M).
4. The speed of selecting document snippets is practically independent of document compression.
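
The ratios behind these conclusions can be recomputed from the figures reported in the runs above. A small sketch, assuming awk is available; the ratio helper is made up for this example:

```shell
#!/bin/sh
# ratio A B: print A/B rounded to one decimal place.
# LC_ALL=C keeps the decimal point independent of the user's locale.
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Figures copied from the test runs (sizes in MB, user CPU time in seconds).
echo "database size, upstream / compressed: $(ratio 55 19)"
echo "count(*) for the frequent term, compressed / upstream: $(ratio 0.468030 0.064004)"
echo "snippet() query, compressed / upstream: $(ratio 0.184011 0.176011)"
```

These are single runs, so the timing ratios are only a rough indication.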

Update. Modified scanner Tcl script, which reads the file list from stdin:


#!/usr/bin/tclsh8.5
# find /mnt/backup/project/offline1/www/share | ./scan.tcl
package require sqlite3
sqlite3 db scan.db
db eval {CREATE VIRTUAL TABLE t USING fts3(content, TOKENIZE icu ru_RU)}

# gets returns -1 at end of input, so no empty path is processed after the last line.
while {[gets stdin file] >= 0} {
    catch {
        if {[file type $file] ne {file}} continue
        set type [exec file --brief --mime-type $file]
        if {[file exists ./filters/${type}_filter]} {
            set md5 [string range [exec md5sum $file] 0 31]
            set text [exec ./filters/${type}_filter $file]
            puts "$file => $type"
            db eval {insert into t (content) values($text)}
        }
    }
}
db eval {vacuum}
