Sunday, November 1, 2009

Tests of the Zlib-patched FTS3

The test suite is simple: a Tcl script scans a directory tree and passes each file through a filter that converts it to plain text. Some of the filters are based on the filters from the Tracker project. I don't use transactions in my tests, but the Tcl script is easy to change for other tests. My SQLite build uses the pragma values page_size=4096 and cache_size=128000.

The scanned directory contains subdirectories and files of different types.

$ du -sh /mnt/backup/project/offline1/www/share
290M /mnt/backup/project/offline1/www/share
$ find /mnt/backup/project/offline1/www/share -type f|wc -l
7122
$ find /mnt/backup/project/offline1/www/share -type f|grep .html|wc -l
4321
$ find /mnt/backup/project/offline1/www/share -type f|grep .doc|wc -l
89
$ find /mnt/backup/project/offline1/www/share -type f|grep .xsl|wc -l
37
$ find /mnt/backup/project/offline1/www/share -type f|grep .pdf|wc -l
4
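
Note that a pattern such as `grep .html` treats the dot as "any character" and matches anywhere in the path, so the counts above may include files that merely have the string in a directory name. A stricter count can be sketched as follows; the helper name `count_ext` is hypothetical and not part of the original test suite.

```shell
# count_ext DIR EXT: count regular files under DIR whose name ends in ".EXT".
# Hypothetical helper for illustration; file names containing newlines
# would be counted more than once.
count_ext() {
    find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}
```

It would be called as `count_ext /mnt/backup/project/offline1/www/share html`.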


The test Tcl script.

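#!/usr/bin/tclsh8.5
# Usage: ./scan.tcl /mnt/backup/project/offline1/www/share
package require sqlite3
sqlite3 db scan.db
db eval {CREATE VIRTUAL TABLE t USING fts3(content, TOKENIZE icu ru_RU)}

if {$argc < 1} {
    puts "Use as $argv0 directory"
    exit
}
set root [lindex $argv 0]

# List everything under the root; entries that are not regular files
# are skipped inside the loop.
set files [exec find $root]
set files [split $files \n]

foreach file $files {
    # catch swallows errors from file, md5sum and the filter,
    # so one bad file does not stop the scan.
    catch {
        if {[file type $file] ne {file}} continue
        # The MIME type selects the filter,
        # e.g. text/html -> ./filters/text/html_filter
        set type [exec file --brief --mime-type $file]
        if {[file exists ./filters/${type}_filter]} {
            # The checksum is computed but not stored in this test.
            set md5 [string range [exec md5sum $file] 0 31]
            set text [exec ./filters/${type}_filter $file]
            # puts "$file => $type"
            db eval {insert into t (content) values($text)}
        }
    }
}
db eval {vacuum}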
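
The scanner picks a filter purely by path: the MIME type reported by `file` is turned into ./filters/<type>_filter, and files without a matching filter are skipped. The same lookup can be sketched in shell; the function name `filter_path` and its optional directory argument are assumptions made for illustration only.

```shell
# filter_path MIME_TYPE [FILTER_DIR]
# Print the path of the filter for MIME_TYPE, or return 1 if there is none.
# FILTER_DIR defaults to ./filters, the directory used by the scanner.
filter_path() {
    filter="${2:-./filters}/${1}_filter"
    [ -f "$filter" ] || return 1
    printf '%s\n' "$filter"
}
```

For example, `filter_path "$(file --brief --mime-type "$f")"` prints ./filters/text/html_filter for an HTML file `$f` when that filter exists, and fails for a type with no filter installed.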


The HTML-to-plaintext filter ./filters/text/html_filter

#!/bin/sh

nice -n19 w3m \
-o indent_incr=0 \
-o multicol=false \
-o no_cache=true \
-o use_cookie=false \
-o display_charset=utf8 \
-o system_charset=utf8 \
-o follow_locale=false \
-o use_language_tag=true \
-o ucs_conv=true \
-T text/html \
-dump \
"$1"


The MS Word-to-plaintext filter ./filters/application/msword_filter

#!/bin/sh

nice -n19 wvWare --nographics "$1" |w3m \
-o indent_incr=0 \
-o multicol=false \
-o no_cache=true \
-o use_cookie=false \
-o display_charset=utf8 \
-o system_charset=utf8 \
-o follow_locale=false \
-o use_language_tag=true \
-o ucs_conv=true \
-T text/html \
-dump


The FTS3 extension with zlib compression of both the documents and the metadata:

$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real 10m18.639s
user 5m24.404s
sys 2m8.632s

$ ls -lh scan.db
-rw-r--r-- 1 veter veter 19M Ноя 1 17:33 scan.db

sqlite> select count(*) from t;
5032

sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.016001 sys 0.004000

sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%% │
│ [Алкатель] с │ %%
CPU Time: user 0.184011 sys 0.008000

sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.468030 sys 0.024002


The upstream FTS3 extension:

$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real 11m36.446s
user 5m20.564s
sys 3m14.592s

$ ls -lh scan.db
-rw-r--r-- 1 veter veter 55M Ноя 1 18:24 scan.db

sqlite> select count(*) from t;
5032

sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.004000 sys 0.000000

sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%% │
│ [Алкатель] с │ %%
CPU Time: user 0.176011 sys 0.004001

sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.064004 sys 0.044003


Update.

The FTS3 extension with zlib compression of the documents only:

$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real 10m41.661s
user 5m23.912s
sys 2m9.004s

$ ls -lh scan.db
-rw-r--r-- 1 veter veter 19M Ноя 1 19:42 scan.db

sqlite> select count(*) from t;
5032

sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.016001 sys 0.004000

sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%% │
│ [Алкатель] с │ %%
CPU Time: user 0.184012 sys 0.008000

sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.468029 sys 0.020001


Results.

1. Metadata compression is not useful: the database size and the query times are the same with and without it. I think the existing metadata implementation is already good.
2. Document compression slows down count(*) queries (0.468 s versus 0.064 s of user CPU time for 'абонент'). I think this is a bug either in the FTS3 virtual table implementation or in my compression code.
3. Compressing the documents reduces the database size by a factor of about 3 on my test document set.
4. The speed of selecting document snippets is practically independent of document compression.
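
The factors behind these conclusions can be checked directly from the measurements listed above. A small sketch, in which the helper name `ratio` is hypothetical:

```shell
# ratio A B: print A / B rounded to one decimal place.
# Hypothetical helper; LC_ALL=C keeps the decimal point locale-independent.
ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 55 19               # database size in MB, upstream vs. compressed: 2.9
ratio 0.468030 0.064004   # user CPU time of count(*) for 'абонент': 7.3
```

The size ratio matches the "about 3" figure, and the count(*) query is roughly seven times slower with compressed documents.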

Update. The modified scanner Tcl script, which reads the file list from standard input:


#!/usr/bin/tclsh8.5
# Usage: find /mnt/backup/project/offline1/www/share | ./scan.tcl
package require sqlite3
sqlite3 db scan.db
db eval {CREATE VIRTUAL TABLE t USING fts3(content, TOKENIZE icu ru_RU)}

# Read one path per line; gets returns -1 at end of input,
# so no empty path is processed after the last line.
while {[gets stdin file] >= 0} {
    # catch swallows errors from file, md5sum and the filter,
    # so one bad file does not stop the scan.
    catch {
        if {[file type $file] ne {file}} continue
        set type [exec file --brief --mime-type $file]
        if {[file exists ./filters/${type}_filter]} {
            # The checksum is computed but not stored in this test.
            set md5 [string range [exec md5sum $file] 0 31]
            set text [exec ./filters/${type}_filter $file]
            puts "$file => $type"
            db eval {insert into t (content) values($text)}
        }
    }
}
db eval {vacuum}



(C) Alexey Pechnikov aka MBG, mobigroup.ru