We provide a collection of C functions and data structures
to support compression and searching over textual files.The starting point
of our sofwtare libraries is the following result:
deMoura, Navarro, Ziviani,
Baeza-Yates, "Fast and Flexible Word Searching
on Compressed Text", ACM Trans Info Syst 2000.
in which the authors present a variant of the Huffman
compression scheme that allows to perform pattern-matching directly over
the compressed file. The resulting tool has been called CGrep, just to underline the fact that it is
a scan-based pattern-matching routine, similar to Grep, but now acting over compressed
files.
We have actually
developed two software libraries which implement the ideas above with some
added features:
Huffword Library:
This library offers some algorithms and data structures for the efficient
(de)compression of textual files. The compression paradigm is a variant of
the Huffman algorithm, here adapted to be byte-aligned, tagged and word oriented. More precisely, the tokens
are alphanumeric words, and the codewords are byte sequences with some tagged
bits that allow to guarantee synchronization. The tagging is obtained as
follows: the Huffman tree has fan-out 128, and the arc labels of 7 bits are
stored in the least significant bits of a byte; the codeword is finally constructed
by setting the first bit of the first byte to 1, and the first bit of each
other byte to 0. This compression paradigm can be improved by removing
each single space that occurs between two alphanumeric tokens, hereafter
called Spaceless Model.
CGrep Library:
This library offers all the pattern-matching functionalities that allow to
search over Huffword compressed files. Actually the search consists of three
steps: first the complex pattern query (possibly involving errors, reg exp,
....) is resolved against the dictionary of tokens, built at compression
time by the Huffword algorithm. The tokens that satisfy the query are coded
and then their codewords are searched directly onto the compressed file (note
that this is an exact multiple-patterns search). This approach induces a
twofold speedup: the search cost is actually proportional to the length of
the compressed file, no decompression at all is required; and the search
cost for complicated patterns (like reg exp, errors,....) depends on the
dictionary size, which is small compared to the entire file. The features
that we have implemented in addition to the original paper are: proximity
query, arbitrary number of query-patters each with its search options, snippet
extraction.
We remark
that the above are libraries of C functions that
can be adopted by anyone wishing to play with its features and possibly build
more sophisticated search engines. Actually, the packages that you can download
below come with some programming examples and a detailed html documentation
that you are suggested to read carefully before starting to use the two libraries.
Two necessary softwares
In order to use our libraries you need two software tools.
doxygen:is required to automatically build the html
documentation. It should be available under your Linux distribution; in any
case, you can download it from the official
site. In case, you can download directly the documentation from
here.
agrep:is a sophisticated pattern-matching utility
for uncompressed files proposed by Wu-Manber (1992). It is actively
maintained, and it is free for non-commercial use, downloadable from its
official site (rpmhere). Agrep is used by our software
to perform the complicated searches over the dictionary of tokens; so that
if you wish to substitute it, you should find or implement an equivalent (both
in functionality and interface) program.
The command line
We also provide
two commands for the impatient users that wish to play with our software:
huffw and cgrep.
huffw
allows to (de)compress a textual file according to the Huffword paradigm.
Some of the options to be used are:
-d
for decompression
-p nto choose the parsing rule: 0 plain, 1 spaceless
example: huffw -p 1 filename.txt
# produces a compressed file with extension .hwz according to the spaceless model
cgrep
allows to search for complicated patterns over the Huffword-compressed files.
The options to be used follow a simple scheme:
The double dash distinguishes between the cgrep and agrep options. The cgrepand agrep options are numerous, so the we
refer the user to the manpage of cgrep and agrep. Nonetheless we point out
here some of them in ordre to give a glimpse of the features of this tool.
cgrep
options:
-c
count
the occurrences
-w
number of words before every occurrence to be printed (snippet)
-p n
proximity window of n words
-x
prints the snippet by escaping the unprintable bytes of value n as [\n]
-b
prints
boldface the occurrences in the snippet
-m "x"
"y" prints the snippet by preceding the
occurrences by x and following them
by y (useful for cgi)
Before
detailing some agrep options, we
stress here that
each pattern to be searched by cgrep
is resolved initially by agrep within
the dictionary, which consists of a token per line. So that when you formulate
a pattern query, you must remember that things like: "^pippo.*" will match
all tokens (not lines) beginning with 'pippo'.
-#
number of errors allowed
in the pattern
-w
the pattern must
match an entire word
-i
case insensitive search
"regexp"
the regexp is resolved against the words in the dictionary
example:
cgrep
-b -p 5 -- -w -i -1 "god" "^ev" filename.txt.hwz
# searches for all the occurrences of "god" (with one admitted error and
case-insensitive) and of a token starting with "ev" within a window of 5
words inside the file compressed via huffwd. Each occurrence is printed in
VT100 boldface.
We remark
here that the user has to check the manpage
of the two commands, and the README files of the two libraries, in order
to find all the details and options for those commands above.
The competitor: ZGrep
ZGrep is available under Linux/Unix systems
and allows to search over gzip-ped
files. Since ZGrep is just a combination of Grepand Gunzip, its searching cost is proportional
to the length of the entire uncompressed file; moreover, the pattern-matching
functionalities offered by ZGrep are limited to exact or regexp matches.
Conversely
CGrep offers new IR-functionalities (errors, proximity, multi-pattern specification,
snippet extraction capabilities,....) and better searching performances (it
searches directly over the compressed file).
On a 100Mb
textual file containing AP-news, we experimented a compression ratio close
to 37%,
comparable to the one achieved by 'gzip-9'. Searching for a regular expression
on a PC with an Athlon 1.8Ghz and 1Gb main memory took: 0.5
secs with CGrep, and 2.2 secs with ZGrep. A proximity search, not supported
by ZGrep, took 1.6
secs with CGrep.
Notice that
by searching a word with two errors via AGrep would take 2.6 secs, cfr to
0.9 of CGrep.
In summary,CGrep is about four times faster than ZGrep, and additionally it supports fast advanced IR-searchesthat are completely absent
in the Grep-family tools.
Then, download agrep from its official site,or
take a copy of its rpm from here.
Now, you are ready to install the software following these steps.
> tar xvfz CompressedSearch.tgz
> cd CompressedSearch;
make
... this creates two new dirs: lib/ and bin/ and the html-documentation
(provided that you have doxygen). lib/ contains
.a and .h necessary to use our libraries in your software; bin/ contains
the two commands huffw
and cgrep.
If you change your identity to be root
and run
> make install
...then you will install huffw
and cgrep into /usr/local/bin/
and their documentation as a manpage.
If you have doxygen installed
on your machine and you wish to create by yourself the documentation you must
issue:
> make doc
If you wish to play with the software, then you can download a test file:
bible.txt.bz2,
uncompress it with bzip2 -d, and
then try the following:
> huffw -p 1 bible.txt
...generates a .hwz compressed file
achieving 33% compression ratio.
...prints the snippets (with unprintable chars escaped and one empty line
in between) that contain all occurrences of one word equal to "Jesus" and
one word equal to "Jude" (case-insensitive and with at most one error) at
distance no more than 5 words.
The Windows package and its installation
We have also available (thanks to Iwona Bialynicka-Birula) a user-friendly interface for CGREP under
Windows OS, that allows you to use a reduced set of its
functionalities without learning its command-line syntax. By this
interface you can choose which files or directories to (de)compress,
or you can compose a search for an exact word or a substring, or an
approximate variant of it. The search can be executed over a set of
selected files or directories, and the results (text excerpts of
pattern occurrences) will be displayed graphically. To install the
package you just need to download the executable and click on it, thus selecting its
destination directory (here is the source).
The license
CGrepLib and HuffwLib is licensed under LGPL. LGPL means that
you are free to copy, modify and redistribute this code as long as it remains
under LGPL terms. However, you are allowed to use this code as library in
non-LGPL packages. Note, however, that this software uses 'agrep'. agrep
is freely usable only for non-commercial purposes, therefore if you plan
on using this library for commercial purposes, you should either get a license
of agrep, or substitute agrep with an
equivalent (both in functionality and interface) program.