
CGrepLib.c File Reference

Library of C functions to be used in conjunction with HuffwordLib and agrep to achieve error-lenient pattern matching over compressed files.

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include "CGrepLib.h"


Functions

int * CGrep_SearchPattern (int *nres, const char *ctext, size_t ctext_sz, const char *pattern, char **options)
int * CGrep_SearchSubstring (int *nres, const char *pattern, const char *ctext, size_t ctext_len, int errors)
int * CGrep_SearchWord (int *nres, const char *word, const char *ctext, size_t ctext_len, int errors)
proximity_hit_t * CGrep_SearchProximity (int *nres, const char *ctext, size_t ctext_sz, int prox_window, char **patterns, char ***options)
int CGrep_GetMatchingCW (MyHash_table *ht, char *filter, char **options, int npattern, const Console *c)
int * CGrep_GetCWOccurrences (int *nocc, const MyHash_table *ht, const char *filter, const char *cbody, size_t cbody_len)
const char * CGrep_GetNextCWOccurrence (int *len, MyHash_table *ht, const char *filter, const char *cbody, size_t remaining)
proximity_hit_t * CGrep_GetOccurrencesProximitySpaceless (int *nocc, int prox_window, int npatterns, const MyHash_table *ht, const char filter[], const MyHash_table *separators, const char sepFilter[], const char *cbody, size_t cbody_len)
proximity_hit_t * CGrep_GetOccurrencesProximity (int *nocc, int prox_window, int npatterns, const MyHash_table *ht, const char filter[], const MyHash_table *separators, const char *sepFilter, const char *cbody, size_t cbody_len, const Hash_node *nl)
MyHash_node * CGrep_CheckIfIsPattern (const char *cbody, int cw_len, const MyHash_table *ht, MyHash_node *hn, const char filter[])
char * CGrep_escapeStringConfigurable (const char *s, size_t len, char min, char max, const char *exceptions)
char * CGrep_escapeString (const char *s, size_t len)
void MyHashtable_init (MyHash_table *ht, int n)
void MyHashtable_clear (MyHash_table *ht)
int MyHashtable_func (const char *s, int len, const MyHash_table *ht)
MyHash_node * MyHashtable_search (const char *s, int slen, const MyHash_table *ht)
int MyHashtable_insert (const char *s, int slen, int npattern, MyHash_table *ht)


Detailed Description

Library of C functions to be used in conjunction with HuffwordLib and agrep to achieve error-lenient pattern matching over compressed files.

Author:
Paolo Ferragina and Alessandro Tommasi, Dipartimento di Informatica, Pisa (Italy)
Date:
March 2003
CGrep is a C library that implements and extends the approach proposed in: de Moura, Navarro, Ziviani, Baeza-Yates, ACM Trans Info Syst, 2000. The main idea is to adopt a scan-based approach to search a file compressed with a byte-aligned, tagged Huffword code. The properties of this compression method ensure that the search does not need to decompress the file: it can be executed directly on the compressed file by compressing the searched patterns.

The search consists of three steps. First, the complex pattern query (possibly involving errors, regular expressions, ...) is resolved against the dictionary of tokens built to compress the file. Second, the tokens that satisfy the query are encoded. Third, their codewords are searched directly in the compressed file, thus avoiding its decompression. This yields a twofold speedup: the scan cost is proportional to the length of the compressed file (not of the original text), and the cost of searching for complicated patterns (regular expressions, errors, ...) is paid only on the dictionary of tokens, which is small compared to the entire file.

The additional features implemented with respect to the original paper are proximity queries, an arbitrary number of patterns each with its own search options, and snippet extraction. Furthermore, we provide a library that can be adopted by anyone wanting to play with its features and possibly build more sophisticated search engines.

For more details please have a look at the html documentation.

This file is licensed under LGPL terms (see file LICENSE)

Definition in file CGrepLib.c.


Define Documentation

 
#define CGREP_GET_NEXT_CW  
 

Value:

{ \
    prevpos = currpos; \
    do { \
      currpos++; \
    } while ( (currpos<endpos) && ((*currpos & 0x80) == 0) ); \
    cw_len = currpos - prevpos; \
}

Definition at line 689 of file CGrepLib.c.
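
The stepping rule in CGREP_GET_NEXT_CW relies on the tagging convention: a byte with its high bit (0x80) set starts a codeword, and the bytes that follow with the high bit clear belong to the same codeword. The sketch below applies the same rule to split a buffer into codeword lengths. split_codewords is a hypothetical helper written for illustration; it is not part of CGrepLib.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of CGrepLib: walks a tagged buffer and
   stores the length of each codeword in lens[]. Returns the number of
   codewords found; at most max_lens lengths are stored. */
size_t split_codewords(const unsigned char *buf, size_t len,
                       size_t *lens, size_t max_lens)
{
    const unsigned char *currpos, *endpos;
    size_t n = 0;

    if (len == 0)
        return 0;
    assert(buf != NULL);
    currpos = buf;
    endpos = buf + len;
    while (currpos < endpos) {
        const unsigned char *prevpos = currpos;
        /* Same rule as CGREP_GET_NEXT_CW: advance until the next byte
           whose tag bit is set, or until the end of the buffer. */
        do {
            currpos++;
        } while (currpos < endpos && (*currpos & 0x80) == 0);
        if (n < max_lens)
            lens[n] = (size_t)(currpos - prevpos);
        n++;
    }
    return n;
}
```

Because boundaries are visible from the tag bit alone, a scan can move from codeword to codeword without consulting the dictionary.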


Function Documentation

MyHash_node* CGrep_CheckIfIsPattern (const char *cbody, int cw_len, const MyHash_table *ht, MyHash_node *hn, const char filter[])
 

Checks whether a codeword matches one of the patterns sought. ht must hold the codewords relative to the patterns. Because the same codeword can represent a word that matches more than one pattern, the function can be called repeatedly. If the parameter hn is NULL, the first match is returned. If hn holds the value previously returned, the next one is returned.

Parameters:
cbody  pointer to the beginning of the codeword in the compressed body
cw_len  length of the codeword
ht  hashtable holding the codewords sought for
hn  pointer to the previously returned value, must be NULL for the first call
filter  array of 256 elements holding 1's in correspondence with the first byte of codewords in ht (use an array of all 1's if you don't want to filter on the first byte)
Returns:
pointer to the next MyHash_node holding the match, NULL if no match is found

Definition at line 1033 of file CGrepLib.c.

char* CGrep_escapeString (const char *s, size_t len)
 

"Default" version of CGrep_escapeStringConfigurable, using the defaults CGREP_MIN_PRINTABLE_CHAR (32), CGREP_MAX_PRINTABLE_CHAR (126), and CGREP_NONPRINTABLE_CHARS ("[]").

Definition at line 1106 of file CGrepLib.c.

char* CGrep_escapeStringConfigurable (const char *s, size_t len, char min, char max, const char *exceptions)
 

Utility function: given a string, it copies it to a newly allocated string where some of the characters are escaped as [<hex value>], where <hex value> is the hexadecimal value of the escaped character. The characters to escape are selectable by range and list: all characters whose value is smaller than min, greater than max, or equal to one in exceptions are escaped.

Parameters:
s  the string to be escaped
len  the length of s
min  minimum printable character
max  maximum printable character
exceptions  null-terminated array of characters to escape
Returns:
an allocated, escaped version of s

Definition at line 1068 of file CGrepLib.c.
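
The escaping rule described above can be sketched as follows. escape_sketch is a hypothetical re-implementation for illustration only; in particular, the two-digit lowercase hex format is an assumption, since the text only says the escape has the form [<hex value>].

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical re-implementation, not the code in CGrepLib.c.
   Assumption: escapes are written as two lowercase hex digits. */
char *escape_sketch(const char *s, size_t len, char min, char max,
                    const char *exceptions)
{
    char *out;
    size_t i, o = 0;

    assert(s != NULL && exceptions != NULL);
    /* Worst case: every byte becomes "[xx]" (4 chars), plus terminator. */
    out = malloc(len * 4 + 1);
    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        char ch = s[i];
        /* strchr would also match the terminator, so exclude '\0'. */
        int listed = ch != '\0' && strchr(exceptions, ch) != NULL;
        if (ch < min || ch > max || listed)
            o += (size_t)sprintf(out + o, "[%02x]",
                                 (unsigned int)(unsigned char)ch);
        else
            out[o++] = ch;
    }
    out[o] = '\0';
    return out;
}
```

With the defaults quoted for CGrep_escapeString (32, 126, "[]"), the brackets themselves are escaped, so an escaped string stays unambiguous. The caller is responsible for freeing the returned buffer.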

int* CGrep_GetCWOccurrences (int *nocc, const MyHash_table *ht, const char *filter, const char *cbody, size_t cbody_len)
 

Looks for the codewords contained within the hashtable in the compressed body. Returns the list of positions (in the compressed body) of the matches.

Parameters:
nocc  pointer to an integer that is filled with the length of the hitlist
ht  pointer to a MyHash_table holding the (tagged) codewords
filter  pointer to an array of 256 elements holding 1 corresponding to the positions of the first byte of codewords in ht.
cbody  buffer holding the compressed body
cbody_len  length of the compressed body
Returns:
an allocated array of positions within the compressed body.

Definition at line 607 of file CGrepLib.c.
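
The role of the 256-entry filter can be illustrated with a small scan loop: the first byte of each codeword in the body is tested against the filter, and only codewords that pass are looked up in full. This is a minimal sketch under stated assumptions: struct cw_entry, build_filter and scan_with_filter are hypothetical names, and a linear lookup stands in for the MyHash_table search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one hashtable entry (not MyHash_node). */
struct cw_entry {
    const unsigned char *bytes;
    size_t len;
};

/* Sets filter[b] = 1 for every first byte b of the sought codewords. */
void build_filter(unsigned char filter[256],
                  const struct cw_entry *set, size_t nset)
{
    size_t i;

    memset(filter, 0, 256);
    for (i = 0; i < nset; i++) {
        assert(set[i].len > 0);
        filter[set[i].bytes[0]] = 1;
    }
}

/* Scans cbody one codeword at a time and stores the offset of every
   codeword found in set[]. Returns the number of hits; at most
   max_hits offsets are stored. */
size_t scan_with_filter(const unsigned char *cbody, size_t cbody_len,
                        const unsigned char filter[256],
                        const struct cw_entry *set, size_t nset,
                        size_t *hits, size_t max_hits)
{
    size_t pos = 0, nhits = 0;

    while (pos < cbody_len) {
        size_t end = pos + 1, i;
        /* A codeword runs until the next byte with the tag bit set. */
        while (end < cbody_len && (cbody[end] & 0x80) == 0)
            end++;
        /* Cheap first-byte test before the full comparison. */
        if (filter[cbody[pos]]) {
            for (i = 0; i < nset; i++) {
                if (set[i].len == end - pos &&
                    memcmp(set[i].bytes, cbody + pos, end - pos) == 0) {
                    if (nhits < max_hits)
                        hits[nhits] = pos;
                    nhits++;
                    break;
                }
            }
        }
        pos = end;
    }
    return nhits;
}
```

Passing a filter of all 1's disables the shortcut and sends every codeword to the full lookup, which matches the advice given for the filter parameter of CGrep_CheckIfIsPattern.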

int CGrep_GetMatchingCW (MyHash_table *ht, char *filter, char **options, int npattern, const Console *c)
 

Fills the hashtable ht with the codewords corresponding to all the words in the Dictionary obtained from Console c that match the search pattern, as specified in agrep's options. This version uses the system's agrep, with I/O via files: the dictionary content is saved to '/tmp/agrep.tmp.PID.n' and the output of agrep to '/tmp/agrep.out.PID.n'. Temporary files are unlinked after the execution. "-n" or other flags that modify agrep's output format MUST NOT appear in the options, and the buffer must contain one word per line. options[0] must be agrep's executable name, and the array must be NULL-terminated. The hashtable must be initialized before the call. If filter is not NULL, it is interpreted as an array of 256 unsigned chars and is filled with 1's in correspondence with the first byte of every matching codeword found.

Parameters:
ht  pointer to a MyHash_table that is filled with the codewords corresponding to the tokens matching the search pattern
filter  if not NULL, this array of 256 unsigned chars is filled with 1's in correspondence with the first byte of every codeword inserted in ht
options  options passed to agrep (including the search pattern and the executable name)
npattern  id of the search pattern. Used to distinguish among matches against different patterns in proximity search (each pattern should have a different id)
c  pointer to the console of the compressed file
Returns:
the number of codewords inserted in ht.

Definition at line 580 of file CGrepLib.c.
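
The layout required for options resembles an argv array: executable name first, NULL terminator last. The sketch below shows one plausible array and a helper that counts its entries. count_options and example_options are hypothetical names; the flags "-1" (one error allowed) and "-i" (case-insensitive) are ordinary agrep flags, and placing the pattern after the flags is an assumption based on agrep's usual command line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts the entries of a NULL-terminated option
   array, terminator excluded. */
size_t count_options(char *const *options)
{
    size_t n = 0;

    assert(options != NULL);
    while (options[n] != NULL)
        n++;
    return n;
}

/* Example array (assumed layout): executable name, flags, pattern.
   No "-n" or other flag that changes agrep's output format. */
char *example_options[] = { "agrep", "-1", "-i", "compres", NULL };
```

A missing terminator is the easiest mistake to make here, since nothing else tells the callee where the array ends.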

const char* CGrep_GetNextCWOccurrence (int *len, MyHash_table *ht, const char *filter, const char *cbody, size_t remaining)
 

Looks for the codewords contained in the hashtable ht in the compressed body cbody. Returns the position of the first match found.

Parameters:
len  pointer to an integer that is filled with the length of the matching codeword
ht  pointer to a Hashtable holding the (tagged) codewords
filter  pointer to an array of 256 elements holding 1 corresponding to the positions of the first byte of codewords in ht.
cbody  buffer holding the compressed body
remaining  characters remaining to the end of the buffer
Returns:
a pointer to the first occurrence of a codeword, NULL if there are no occurrences

Definition at line 661 of file CGrepLib.c.

proximity_hit_t* CGrep_GetOccurrencesProximity (int *nocc, int prox_window, int npatterns, const MyHash_table *ht, const char filter[], const MyHash_table *separators, const char *sepFilter, const char *cbody, size_t cbody_len, const Hash_node *nl)
 

This function is the same as CGrep_GetOccurrencesProximitySpaceless, but is intended for files compressed with spaces. Has an additional argument: a pointer to the hashnode relative to the newline entry in the dictionary (NULL if there is no newline).

Definition at line 861 of file CGrepLib.c.

proximity_hit_t* CGrep_GetOccurrencesProximitySpaceless (int *nocc, int prox_window, int npatterns, const MyHash_table *ht, const char filter[], const MyHash_table *separators, const char sepFilter[], const char *cbody, size_t cbody_len)
 

Returns an array of proximity_hit_t, the proximity hits for the patterns over the compressed body of a spaceless-compressed text. The array returned is malloc'ed; each proximity_hit_t entry is composed of:
* byte_position, the position in the compressed body of the beginning of the matching window
* start_position, the rank of the first word of the matching window
* end_position, the rank of the last word in the matching window
* positions, an array of positions in the cbody of the match for each pattern
* ranks, an array of ranks giving the rank of each pattern

Parameters:
nocc  pointer to an integer where the length of the array is stored
prox_window  width of the proximity window
npatterns  number of patterns sought
ht  pointer to a hashtable holding the codewords corresponding to the patterns
filter  this array must contain 1's in correspondence with the first byte of each codeword in ht. Must be full of 1's if a filter is not to be applied.
separators  pointer to a hashtable holding the codewords corresponding to the separators
sepFilter  this array must contain 1's in correspondence with the first byte of each codeword in separators. Must be full of 1's if a filter is not to be applied.
cbody  the compressed body to search onto
cbody_len  the length of the compressed body
Returns:
an allocated array of proximity_hit_t, NULL if no results are found.

Definition at line 734 of file CGrepLib.c.
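
The idea of a proximity window over word ranks can be sketched independently of the compressed format. The code below counts the occurrences that start a window of prox_window words containing every pattern. Everything here is hypothetical: struct occ and count_proximity_windows are illustration names, and the window rule (last rank minus first rank smaller than prox_window) is an assumption, since the text does not define the width precisely.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PATTERNS 10  /* mirrors the CGREP_MAX_PATTERNS (10) limit */

/* Hypothetical record: one pattern occurrence at a given word rank. */
struct occ {
    int rank;     /* position of the word, counted in words */
    int pattern;  /* pattern id in [0, npatterns) */
};

/* occs[] must be sorted by rank. Returns how many occurrences start a
   window of prox_window words in which every pattern id appears. */
size_t count_proximity_windows(const struct occ *occs, size_t nocc,
                               int npatterns, int prox_window)
{
    size_t i, j, count = 0;

    assert(npatterns > 0 && npatterns <= MAX_PATTERNS);
    for (i = 0; i < nocc; i++) {
        int seen[MAX_PATTERNS] = {0};
        int distinct = 0;

        /* Collect the pattern ids seen inside the window opened at i. */
        for (j = i; j < nocc &&
                    occs[j].rank - occs[i].rank < prox_window; j++) {
            if (!seen[occs[j].pattern]) {
                seen[occs[j].pattern] = 1;
                distinct++;
            }
        }
        if (distinct == npatterns)
            count++;
    }
    return count;
}
```

The per-pattern id is what makes this possible: it is the same role the npattern argument plays when codewords are inserted into the hashtable.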

int* CGrep_SearchPattern (int *nres, const char *ctext, size_t ctext_sz, const char *pattern, char **options)
 

Searches for a pattern in a compressed text using agrep. agrep is invoked over the dictionary to find the tokens that match the query string. The dictionary is stored one token per line: when using agrep options or regexps, remember that a pattern like "^p.*" will match all WORDS (not lines) beginning with the letter 'p'. This is because agrep is invoked over the dictionary, which contains one token per line. Once the list of matching tokens is found, their corresponding codewords are inserted into a hashtable, and the compressed file is scanned looking for such codewords.

Parameters:
nres  pointer to an integer that will be set to the number of occurrences found. Also used as initial array size.
ctext  pointer to the memory area holding the compressed file
ctext_sz  size of the compressed file
pattern  pattern to search for
options  options to agrep (null-terminated)
Returns:
an allocated array of positions within the compressed body of the file, at which codewords corresponding to matching tokens occur, NULL if nothing is found.

Definition at line 85 of file CGrepLib.c.

proximity_hit_t* CGrep_SearchProximity (int *nres, const char *ctext, size_t ctext_sz, int prox_window, char **patterns, char ***options)
 

Performs a proximity search, returning an array of proximity_hit_t's. Up to CGREP_MAX_PATTERNS (10) patterns are allowed, with all of agrep's options (regexp search, approximate search, case sensitivity/insensitivity, etc.). In case of non-exact search, it may happen that a pattern matches multiple words. In this case, if a set of words matches the search in multiple ways, it is returned only once (the arrays positions and ranks in the corresponding proximity_hit_t will hold values for the first such match found).

Parameters:
nres  pointer to an integer that will hold the number of elements in the (allocated) array of results. If != 0 on call, it is used as the initial array size.
ctext  buffer holding the mmapped compressed file
ctext_sz  size of the compressed file
prox_window  width of the proximity window
patterns  array of strings holding the patterns to search for. The array must be NULL-terminated.
options  array of array of options to agrep (one array of options per pattern). Must be NULL-terminated in both dimensions.
Returns:
an allocated array of proximity_hit_t, NULL if no results are found.

Definition at line 262 of file CGrepLib.c.

int* CGrep_SearchSubstring (int *nres, const char *pattern, const char *ctext, size_t ctext_len, int errors)
 

Returns an array of positions at which the pattern occurs in the compressed body as a substring, with the number of errors bounded by the errors parameter. The array returned is malloc'ed. This function is a 'shortcut' invocation for CGrep_SearchPattern.

Parameters:
nres  pointer to an integer where the length of the array is stored, it is also used as the size of the first allocation.
pattern  is the pattern to search for
ctext  is the buffer holding the compressed file's data
ctext_len  length of the buffer
errors  number of errors allowed in the search
Returns:
a malloc'ed array of positions within the body.

Definition at line 152 of file CGrepLib.c.

int* CGrep_SearchWord (int *nres, const char *word, const char *ctext, size_t ctext_len, int errors)
 

Returns the array of positions of the word in the compressed body, with the number of errors bounded by the errors parameter. The array returned is malloc'ed. This function is a 'shortcut' invocation for CGrep_SearchPattern.

Parameters:
nres  pointer to an integer where the length of the returned array is stored. Also used as initial array size.
word  the word to search for
ctext  the buffer holding the compressed file data (not just the body)
ctext_len  length of ctext
errors  number of allowed errors
Returns:
an array of positions, malloc'ed by the function.

Definition at line 183 of file CGrepLib.c.

void MyHashtable_clear (MyHash_table *ht)
 

Frees all elements of a hashtable. After this call, ht is an empty, uninitialized MyHash_table.

Parameters:
ht  pointer to a MyHash_table

Definition at line 1145 of file CGrepLib.c.

int MyHashtable_func (const char *s, int len, const MyHash_table *ht)
 

Computes the hash value for the given string.

Parameters:
s  input string whose hash value has to be computed.
len  length of s.
ht  pointer to a hash table.

Definition at line 1167 of file CGrepLib.c.

void MyHashtable_init (MyHash_table *ht, int n)
 

Initialize the hash table according to the number of estimated tokens; the load factor is set to 0.1.

Parameters:
ht  pointer to an (empty) hash table.
n  estimated number of items to be inserted.

Definition at line 1124 of file CGrepLib.c.
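
A load factor of 0.1 means roughly ten buckets per expected item, since load factor = items / buckets. The arithmetic can be sketched as below. buckets_for is a hypothetical helper; the real MyHashtable_init may size the table differently (for example by rounding up to a prime).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizing rule, not taken from CGrepLib.c: a target load
   factor of 0.1 needs ten buckets per expected item. */
size_t buckets_for(size_t n_items)
{
    const size_t buckets_per_item = 10;   /* 1 / 0.1 */
    size_t buckets;

    /* Guard against overflow in the multiplication below. */
    assert(n_items <= (size_t)-1 / buckets_per_item);
    buckets = n_items * buckets_per_item;
    return buckets > 0 ? buckets : 1;     /* never size a table to zero */
}
```

A load factor this low trades memory for short collision chains, which keeps the per-codeword lookup cheap during the scan.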

int MyHashtable_insert (const char *s, int slen, int npattern, MyHash_table *ht)
 

Inserts the token in the hash table and returns 1 if new, 0 otherwise; it also updates the counter of occurrences for that token.

Parameters:
s  string to be inserted.
slen  length of s (to manage also the NULL char, WARNING slen MUST be smaller than 4).
npattern  pattern ID, useful for proximity search
ht  pointer to the hash table.

Definition at line 1211 of file CGrepLib.c.

MyHash_node* MyHashtable_search (const char *s, int slen, const MyHash_table *ht)
 

Searches for the given string in the given hash table. Returns NULL if the string is not found.

Parameters:
s  string to be searched.
slen  length of s (to manage also the NULL char).
ht  pointer to the hash table to be searched.

Definition at line 1186 of file CGrepLib.c.


Generated on Mon Mar 31 14:44:31 2003 by doxygen 1.2.14, written by Dimitri van Heesch, © 1997-2002