C Library to search over compressed texts

We counted hits.


A brief overview

We provide a collection of C functions and data structures to support compression and searching over textual files.The starting point of our sofwtare libraries is the following result:

in which the authors present a variant of the Huffman compression scheme that allows to perform pattern-matching directly over the compressed file. The resulting tool has been called CGrep, just to underline the fact that it is a scan-based pattern-matching routine, similar to Grep, but now acting over compressed files.

We have actually developed two software libraries which implement the ideas above with some added features:

We remark that the above are libraries of C functions that can be adopted by anyone wishing to play with its features and possibly build more sophisticated search engines. Actually, the packages that you can download below come with some programming examples and a detailed html documentation that you are suggested to read carefully before starting to use the two libraries.


Two necessary softwares

In order to use our libraries you need two software tools.


The command line

We also provide two commands for the impatient users that wish to play with our software: huffw and cgrep.

        example:   huffw -p 1 filename.txt       # produces a compressed file with extension .hwz according to the spaceless model
cgrep [cgrep options]  --  [[agrep options] "pattern"]+  file_name

The double dash distinguishes between the cgrep and agrep options. The cgrepand agrep options are numerous, so the we refer the user to the manpage of cgrep and agrep. Nonetheless we point out here some of them in ordre to give a glimpse of the features of this tool.
example:   cgrep -b -p 5  -- -w -i -1 "god" "^ev" filename.txt.hwz    
# searches for all the occurrences of "god" (with one admitted error and case-insensitive) and of a token starting with "ev" within a window of 5 words inside the file compressed via huffwd. Each occurrence is printed in VT100 boldface.

We remark here that the user has to check the manpage of the two commands, and the README files of the two libraries, in order to find all the details and options for those commands above.


The competitor: ZGrep

ZGrep is available under Linux/Unix systems and allows to search over gzip-ped files. Since ZGrep is just a combination of Grepand Gunzip, its searching cost is proportional to the length of the entire uncompressed file; moreover, the pattern-matching functionalities offered by ZGrep are limited to exact or regexp matches.

Conversely CGrep offers new IR-functionalities (errors, proximity, multi-pattern specification, snippet extraction capabilities,....) and better searching performances (it searches directly over the compressed file).

On a 100Mb textual file containing AP-news, we experimented a compression ratio close to 37%, comparable to the one achieved by 'gzip -9'. Searching for a regular expression on a PC with an Athlon 1.8Ghz and 1Gb main memory took: 0.5 secs with CGrep, and 2.2 secs with ZGrep. A proximity search, not supported by ZGrep, took 1.6 secs with CGrep.

Notice that by searching a word with two errors via AGrep would take 2.6 secs, cfr to 0.9 of CGrep.

In summary,CGrep is about four times faster than ZGrep, and additionally it supports fast advanced IR-searches that are completely absent in the Grep-family tools.


The Linux package and its installation

First of all, please download the software package and possibly the html documentation, if you do not have doxygen.

Then, download agrep from its official site, or take a copy of its rpm from here.

Now, you are ready to install the software following these steps.

> tar xvfz CompressedSearch.tgz                          

> cd CompressedSearch; make                                           

... this creates two new dirs: lib/ and bin/ and the html-documentation (provided that you have doxygen)lib/ contains .a and .h  necessary to use our libraries in your software; bin/ contains the two commands huffw and cgrep.

If you change your identity to be root and run

> make install

...then you
will install huffw and cgrep into /usr/local/bin/ and their documentation as a manpage.

If you have doxygen installed on your machine and you wish to create by yourself the documentation you must issue:

> make doc


If you wish to play with the software, then you can download a test file: bible.txt.bz2,  uncompress it with bzip2 -d, and then try the following:


> huffw -p 1 bible.txt    

...generates a .hwz compressed file achieving 33% compression ratio.


> cgrep -x -s "" -p 5 --  -w "Jesus" -w -i -1 "Jude" bible.txt.hwz          

...prints the snippets (with unprintable chars escaped and one empty line in between) that contain all occurrences of one word equal to "Jesus" and one word equal to "Jude" (case-insensitive and with at most one error) at distance no more than 5 words.

 

The Windows package and its installation

We have also available (thanks to Iwona Bialynicka-Birula) a user-friendly interface for CGREP under Windows OS, that allows you to use a reduced set of its functionalities without learning its command-line syntax. By this interface you can choose which files or directories to (de)compress, or you can compose a search for an exact word or a substring, or an approximate variant of it. The search can be executed over a set of selected files or directories, and the results (text excerpts of pattern occurrences) will be displayed graphically. To install the package you just need to download the executable and click on it, thus selecting its destination directory (here is the source).
 

The license

CGrepLib and HuffwLib is licensed under LGPL. LGPL means that you are free to copy, modify and redistribute this code as long as it remains under LGPL terms. However, you are allowed to use this code as library in non-LGPL packages. Note, however, that this software uses 'agrep'. agrep is freely usable only for non-commercial purposes, therefore if you plan on using this library for commercial purposes, you should either get a license of agrep, or substitute agrep with an
equivalent (both in functionality and interface) program.


For any questions, comments and suggestions, please contact the authors:Paolo Ferragina, Alessandro Tommasi, Giovanni Manzini.

»»» Please have a look at the Mark Nelson's view of this piece of work. «««

Last update, 11 June 2004.