Tests of the Zlib-patched FTS3
The test suite is simple: a Tcl script scans directories, and some of the filters are based on the Tracker project's filters. I don't use transactions in my tests, but it is easy to modify the Tcl script for other tests. My SQLite build uses page_size=4096 and cache_size=128000.
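For reference, the same pragmas can be applied from Python's sqlite3 module (a minimal sketch; the tests themselves use the Tcl script below):

```python
import sqlite3

# Open a fresh database; page_size must be set before the first write.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA page_size = 4096")
con.execute("PRAGMA cache_size = 128000")

page_size = con.execute("PRAGMA page_size").fetchone()[0]
cache_size = con.execute("PRAGMA cache_size").fetchone()[0]
print(page_size, cache_size)  # → 4096 128000
```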
The scanned directory contains subdirectories and files of various types.
$ du -sh /mnt/backup/project/offline1/www/share
290M /mnt/backup/project/offline1/www/share
$ find /mnt/backup/project/offline1/www/share -type f|wc -l
7122
$ find /mnt/backup/project/offline1/www/share -type f|grep .html|wc -l
4321
$ find /mnt/backup/project/offline1/www/share -type f|grep .doc|wc -l
89
$ find /mnt/backup/project/offline1/www/share -type f|grep .xsl|wc -l
37
$ find /mnt/backup/project/offline1/www/share -type f|grep .pdf|wc -l
4
The test Tcl script.
#!/usr/bin/tclsh8.5
# ./scan.tcl /mnt/backup/project/offline1/www/share
package require sqlite3
sqlite3 db scan.db
db eval {CREATE VIRTUAL TABLE t USING fts3(content, TOKENIZE icu ru_RU)}
if {$argc < 1} {
    puts "usage: $argv0 directory"
    exit
}
set root [lindex $argv 0]
set files [exec find $root]
set files [split $files \n]
foreach file $files {
    catch {
        if {[file type $file] ne {file}} continue
        set type [exec file --brief --mime-type $file]
        if {[file exists ./filters/${type}_filter]} {
            # md5 of the file contents; computed but not used yet
            set md5 [string range [exec md5sum $file] 0 31]
            set text [exec ./filters/${type}_filter $file]
            # puts "$file => $type"
            db eval {insert into t (content) values($text)}
        }
    }
}
db eval {vacuum}
The html to plaintext filter ./filters/text/html_filter
#!/bin/sh
nice -n19 w3m \
-o indent_incr=0 \
-o multicol=false \
-o no_cache=true \
-o use_cookie=false \
-o display_charset=utf8 \
-o system_charset=utf8 \
-o follow_locale=false \
-o use_language_tag=true \
-o ucs_conv=true \
-T text/html \
-dump \
"$1"
The ms word documents to plaintext filter ./filters/application/msword_filter
#!/bin/sh
nice -n19 wvWare --nographics "$1" |w3m \
-o indent_incr=0 \
-o multicol=false \
-o no_cache=true \
-o use_cookie=false \
-o display_charset=utf8 \
-o system_charset=utf8 \
-o follow_locale=false \
-o use_language_tag=true \
-o ucs_conv=true \
-T text/html \
-dump
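If w3m is unavailable, a bare-bones HTML-to-plaintext filter can also be sketched with Python's standard library (an illustrative fallback, not the filter used in the tests above; it reads HTML on stdin rather than taking a filename):

```python
#!/usr/bin/env python3
# Minimal HTML-to-plaintext filter using only the standard library.
import sys
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip = 0          # depth inside <script>/<style> elements
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    p = TextExtractor()
    p.feed(html)
    return "\n".join(p.chunks)

if __name__ == "__main__":
    print(html_to_text(sys.stdin.read()))
```

Unlike w3m, this drops all layout (tables, indentation), which is fine for full-text indexing where only the words matter.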
The FTS3 extension with the Zlib-compression of the document and the metadata:
$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real 10m18.639s
user 5m24.404s
sys 2m8.632s
$ ls -lh scan.db
-rw-r--r-- 1 veter veter 19M Nov 1 17:33 scan.db
sqlite> select count(*) from t;
5032
sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.016001 sys 0.004000
sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%% │
│ [Алкатель] с │ %%
CPU Time: user 0.184011 sys 0.008000
sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.468030 sys 0.024002
The upstream FTS3 extension:
$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real 11m36.446s
user 5m20.564s
sys 3m14.592s
$ ls -lh scan.db
-rw-r--r-- 1 veter veter 55M Nov 1 18:24 scan.db
sqlite> select count(*) from t;
5032
sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.004000 sys 0.000000
sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%% │
│ [Алкатель] с │ %%
CPU Time: user 0.176011 sys 0.004001
sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.064004 sys 0.044003
Upd.
The FTS3 extension with the Zlib-compression of document only:
$ time ./scan.tcl /mnt/backup/project/offline1/www/share
real 10m41.661s
user 5m23.912s
sys 2m9.004s
$ ls -lh scan.db
-rw-r--r-- 1 veter veter 19M Nov 1 19:42 scan.db
sqlite> select count(*) from t;
5032
sqlite> select count(*) from t where t match 'алкатель';
10
CPU Time: user 0.016001 sys 0.004000
sqlite> select snippet(t, '[', ']', '%%') from t where t match 'алкатель';
%% рамках акции "Телефон [Алкатель] с SIM-картой МТС за 990 %%
...
%% │
│ [Алкатель] с │ %%
CPU Time: user 0.184012 sys 0.008000
sqlite> select count(*) from t where t match 'абонент';
922
CPU Time: user 0.468029 sys 0.020001
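As an illustration of the mechanism being benchmarked, document compression can be sketched with zlib-backed SQL functions on a plain table (a hypothetical Python sketch; the actual patch performs the zlib calls inside the FTS3 C code, and the function names `zip`/`unzip` are made up for this example):

```python
import sqlite3
import zlib

con = sqlite3.connect(":memory:")

# SQL wrappers around zlib, analogous to compress/uncompress hooks.
con.create_function("zip", 1, lambda t: zlib.compress(t.encode("utf-8")))
con.create_function("unzip", 1, lambda b: zlib.decompress(b).decode("utf-8"))

con.execute("CREATE TABLE docs(content BLOB)")
doc = "абонент " * 1000  # repetitive text compresses well
con.execute("INSERT INTO docs VALUES (zip(?))", (doc,))

restored = con.execute("SELECT unzip(content) FROM docs").fetchone()[0]
stored = con.execute("SELECT length(content) FROM docs").fetchone()[0]
print(restored == doc, stored < len(doc.encode("utf-8")))  # → True True
```

The extra zlib round-trip on every row access is also a plausible place to look for the count(*) slowdown noted below.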
Results.
1. Compressing the metadata is not useful, though the metadata implementation itself is nice.
2. Compressing the documents slows down count(*) selects. I suspect a bug either in the FTS3 virtual-table implementation or in my compression code.
3. Compressing the documents shrinks the database by about a factor of 3 on my test docset.
4. The speed of selecting document snippets is effectively independent of document compression.
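The ~3x figure is plausible for text extracted from HTML-heavy pages; zlib's ratio on a given sample is easy to check directly (illustrative only, on a deliberately repetitive made-up sample, so the ratio here comes out much higher than on a real corpus):

```python
import zlib

# Repetitive markup-like text; real filter output is far less regular.
sample = ("<p>Абонент получает телефон Алкатель с SIM-картой МТС "
          "в рамках акции.</p>\n" * 200).encode("utf-8")

compressed = zlib.compress(sample, 6)
ratio = len(sample) / len(compressed)
print(f"{len(sample)} -> {len(compressed)} bytes, ratio {ratio:.1f}x")
```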
Upd. Modified scanner Tcl script.
#!/usr/bin/tclsh8.5
# find /mnt/backup/project/offline1/www/share | ./scan.tcl
package require sqlite3
sqlite3 db scan.db
db eval {CREATE VIRTUAL TABLE t USING fts3(content, TOKENIZE icu ru_RU)}
while {[gets stdin file] >= 0} {
    catch {
        if {[file type $file] ne {file}} continue
        set type [exec file --brief --mime-type $file]
        if {[file exists ./filters/${type}_filter]} {
            # md5 of the file contents; computed but not used yet
            set md5 [string range [exec md5sum $file] 0 31]
            set text [exec ./filters/${type}_filter $file]
            puts "$file => $type"
            db eval {insert into t (content) values($text)}
        }
    }
}
db eval {vacuum}