A tool to search compressed textual files

We provide a collection of C functions and data structures to support compression and searching over textual files. The starting point of our software libraries is the following result:

in which the authors present a variant of the Huffman compression scheme that allows pattern matching to be performed directly over the compressed file. The resulting tool is called CGrep, to underline the fact that it is a scan-based pattern-matching routine, similar to Grep, but acting over compressed files.
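To give a feeling for what searching directly over the compressed file can mean, here is a minimal sketch in C. It is not the library's code. It assumes a hypothetical byte-aligned encoding in which every word is mapped to a codeword of one or more bytes, the first byte of each codeword has its high bit set, and continuation bytes do not. Under that assumption a word can be searched by scanning the compressed bytes for its codeword, without decompressing anything. The function name count_codeword and the byte layout are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical layout (an assumption, not the libraries' actual format):
 * each codeword is 1 or more bytes; its first byte has the high bit set,
 * all following bytes have the high bit clear.
 *
 * Counts the occurrences of codeword `cw` in the compressed buffer `buf`.
 */
size_t count_codeword(const unsigned char *buf, size_t buf_len,
                      const unsigned char *cw, size_t cw_len)
{
    size_t count = 0;

    assert(buf != NULL && cw != NULL);

    /* A valid codeword is non-empty and starts with a tagged byte. */
    if (cw_len == 0 || cw_len > buf_len || !(cw[0] & 0x80))
        return 0;

    for (size_t i = 0; i + cw_len <= buf_len; i++) {
        if (memcmp(buf + i, cw, cw_len) != 0)
            continue;
        /* The match starts on a codeword boundary because cw[0] is tagged.
         * It must also end on one: the next byte is either the end of the
         * buffer or the tagged first byte of the following codeword. */
        if (i + cw_len == buf_len || (buf[i + cw_len] & 0x80))
            count++;
    }
    return count;
}
```

The boundary check is what makes the scan safe: a codeword that happens to be a prefix of a longer codeword is not reported as a match.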

We have actually developed two software libraries which implement the ideas above with some added features:

We remark that the above are libraries of C functions that can be adopted by anyone wishing to experiment with their features and possibly build more sophisticated search engines. The packages that you can download below come with some programming examples and detailed HTML documentation, which we suggest you read carefully before starting to use the two libraries.

For impatient users who wish to try our software, we also provide two commands: huffw and cgrep.

The double dash separates the cgrep options from the agrep options. The cgrep and agrep options are numerous, so we refer the user to the manpages of cgrep and agrep. Nonetheless, we point out some of them here in order to give a glimpse of the features of this tool.

cgrep options:

-c count the occurrences
-w number of words before every occurrence to be printed (snippet)
-p n proximity window of n words
-x prints the snippet by escaping the unprintable bytes of value n as [\n]
-b prints the occurrences in the snippet in boldface
-m "x" "y" prints the snippet by preceding the occurrences by x and following them by y (useful for cgi)

Before detailing some agrep options, we stress that each pattern to be searched by cgrep is first resolved by agrep within the dictionary, which consists of one token per line. So when you formulate a pattern query, you must remember that a pattern like "^pippo.*" will match all tokens (not lines) beginning with 'pippo'.

-# number of errors allowed in the pattern
-w the pattern must match an entire word
-i case insensitive search
"regexp" the regexp is resolved against the words in the dictionary
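The token-level behaviour described above can be illustrated with a small C sketch. It is only an approximation: it uses the POSIX regcomp/regexec functions as a stand-in for agrep, whose pattern syntax and matching engine differ, and the in-memory token array and the name count_matching_tokens are hypothetical. What it shows is that anchors such as ^ and $ refer to the beginning and end of a dictionary token, not of a line of the original text.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Counts the dictionary tokens matched by `pattern`.
 * Each token is matched on its own, so "^" anchors at the start of a token.
 * Returns -1 if the pattern cannot be compiled.
 * POSIX extended regular expressions are used here as a stand-in for agrep.
 */
int count_matching_tokens(const char *pattern,
                          const char *const *tokens, size_t n_tokens)
{
    regex_t re;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    for (size_t i = 0; i < n_tokens; i++) {
        if (regexec(&re, tokens[i], 0, NULL, 0) == 0)
            count++;
    }

    regfree(&re);
    return count;
}
```

With a toy dictionary made of the tokens pippo, pippone, filippo and pluto, the pattern "^pippo.*" selects pippo and pippone but not filippo, since the match must begin at the start of the token.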

example: cgrep -b -p 5 -- -w -i -1 "god" "^ev" filename.txt.hwz
# searches for all the occurrences of "god" (with one error allowed, case-insensitive) and of a token starting with "ev" that fall within a window of 5 words of each other, inside the file compressed via huffw. Each occurrence is printed in VT100 boldface.
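The proximity condition used in the example can be pictured with the following C sketch. It is an illustration under stated assumptions, not the way cgrep is implemented: it assumes that the occurrences of the two patterns are already available as sorted arrays of word positions, and that "within a window of n words" means that the two positions differ by at most n. The name count_within_window is hypothetical.

```c
#include <stddef.h>

/*
 * `a` and `b` hold the word positions (sorted in increasing order) of the
 * occurrences of two patterns. Returns how many positions in `a` have at
 * least one position in `b` at distance <= window (assumed definition).
 */
size_t count_within_window(const size_t *a, size_t na,
                           const size_t *b, size_t nb, size_t window)
{
    size_t count = 0;
    size_t j = 0;

    for (size_t i = 0; i < na; i++) {
        /* Skip the b positions that lie too far to the left of a[i].
         * Since a is sorted, j never needs to move backwards. */
        while (j < nb && b[j] + window < a[i])
            j++;
        /* b[j] is now the leftmost candidate; check the right bound. */
        if (j < nb && b[j] <= a[i] + window)
            count++;
    }
    return count;
}
```

Both arrays are scanned once, so the check costs time linear in the total number of occurrences.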

The user should consult the manpages of the two commands, and the README files of the two libraries, for all the details and options of the commands above.

ZGrep is available on Linux/Unix systems and allows searching over gzipped files. Since ZGrep is just a combination of Grep and Gunzip, its search cost is proportional to the length of the entire uncompressed file; moreover, the pattern-matching functionalities offered by ZGrep are limited to exact or regexp matches.

Conversely, CGrep offers new IR functionalities (errors, proximity, multi-pattern specification, snippet extraction, ...) and better search performance, since it searches directly over the compressed file.

On a 100 MB textual file containing AP news, we observed a compression ratio close to 37%, comparable to the one achieved by 'gzip -9'. Searching for a regular expression on a PC with a 1.8 GHz Athlon and 1 GB of main memory took 0.5 secs with CGrep and 2.2 secs with ZGrep. A proximity search, not supported by ZGrep, took 1.6 secs with CGrep.

Notice that searching for a word with two errors takes 2.6 secs with AGrep, compared to 0.9 secs with CGrep.

In summary, CGrep is about four times faster than ZGrep on regular-expression searches, and additionally it supports fast advanced IR searches that are completely absent in the Grep-family tools.