FM-Index Version 2

Paolo Ferragina and Rossano Venturini

Dipartimento di Informatica, University of Pisa, Italy

The present software offers a new implementation of the compressed full-text index data structure, called FM-index.
The FM-index was proposed by Paolo Ferragina and Giovanni Manzini in [FOCS '00]. An implementation of the FM-index (vers. 1.0) was presented in [SODA '01]. This data structure combines compression and indexing by encapsulating in a single compressed file both the original file plus some indexing information. For this reason it is named self-indexing tool. The space occupancy of the FM-index is close to the one required by the best known compressors, like bzip2. But additionally to a compressor, the FM-index is able to efficiently support substring search operations, and the decompression of portions of the original file. Every such operation is executed on the FM-index by looking only at a small portion of the compressed file, thus requiring few milliseconds on a commodity PC over files of several megabytes.

How does it work?

The novelty of the FM-index resides in the careful combination of the Burrows-Wheeler compression algorithm and the suffix-array data structure to obtain a sort of compressed suffix array. The resulting index is opportunistic since it takes advantage of the compressibility of the input data for decreasing the index space. This compression is achieved at no significant asymptotic slowdown in the query time with respect to the fastest known full-text indices (i.e. the suffix tree and the suffix array).

The FM-index supports four main operations:

  • Operation count computes the number of occurrences of an arbitrary pattern P as a substring of the indexed file T, in time proportional to the length of P. Hence, the running time of count does not asymptotically depend on the length of T.

  • Operation locate retrieves the position in T of one occurrence of the pattern P in O(logcn) time, where c is a positive constant chosen at the time the FM-index is built.

  • Operation extract receives in input a position Pos and a positive integer L, and returns the substring T[Pos, Pos+L-1]. This operation can be deployed to decompress the whole file T.

  • Operation display receives in input a pattern P and a positive integer L, and displays L characters surrounding each occurrence of the pattern P in T (overall 2L+|P| characters).

In practice all operations decompress just a tiny portion of the compressed file (typically a few kilobytes), thus taking a few milliseconds on a commodity PC even when the indexed files sum up to several megabytes.

The structure of the FM-index consists of two parts: a core containing the basic indexing data structure, and an auxiliary set of information which are necessary to support the locate and extract operations, only. The FM-index allows to trade space occupancy for search time, by increasing/decreasing the size of the core and/or of the auxiliary information.

The new implementation of the FM-index introduces some significant changes in the core and in the auxiliary information, as well it adopts new algorithmic solutions for the implementation of some basic steps of the FM-index inner working. This is the reason why we denote it as FM-index vers. 2 when we refer to the this new version.

What is changed?

The new version of FM-Index modifies the marking strategy in order to improve the performance of the locate operation. The index marks logicaly and uniformly the text by adding a special character every M characters of the original text. This way, all the BWT rows that start with that special character are contiguous, and for them we keep their explicit text position.

The count algorithm must take into account the presence of the special character within the indexed text. Therefore, if a pattern P[1,p] has to be searched, we actually search for (p-1) patterns obtained by inserting the special characters in P equally spaced of M positions, plus we search for the pattern P itself. This search is implemented in parallel over all patterns by exploiting the fact that at any step i, we have to search either for P[p-i] or for the special character. This way, the overall search cost becomes quadratic in the pattern length. The output is a set of at most p ranges of rows.

The locate algorithm proceeds for at most M phases as follows. At phase 0, S is the single range of rows to be located. At phase k, S contains rows that may be reached in k "backward" steps from the rows to be located. Therefore, at any step, S consists of a set of ranges of rows. To mantain the invariant, the algorithm picks up a range of S, let it be [a,b], and determines in parallel the z distinct characters that occur in the substring BWT[a,b]. Then it executes z backward steps, one per character, thus originating z new ranges of rows. The induction is preserved because the positions of these rows are at distance (k+1) from the rows to be located. The algorithm cycles over all ranges of S, and thus determines new ranges that will form the new set S. Notice that if a range of rows starts with the special character, its position in the indexed text is explicitly stored, so that the position of the corresponding row to be located, can be inferred by summing k. This range can be dropped from S. At most after M phases the set S will be empty, given that special characters are located at distance M in the original text.

Where can I find it?

The software package is available at the Pizza&Chili site. In order to work properly, the FM-index needs the ds_ssort library, designed by Giovanni Manzini. However, you do not need to download this library because it is included within the present package.

To compile the FM-index software you must issue the command make in the FM-index home directory. This creates three useful commands:

  • fm-build: a command to build the FM-index data structure from an input file;
  • fm-search: a command to search the FM-index, or visualize parts of the indexed text (snippets);
  • fm-index.a: the FM-index library including all compression and indexing functionalities.

The first two commands, fm-build and fm-search, allow to use the FM-index either as a full-text search engine, or as a compressed storage engine that allows to display parts of the indexed-compressed text without its full decompression.
The library fm-index.a allows to develop your own C/C++ programs that deploy the functionalities offered by the FM-index.

How to use command-line programs

Some examples of usage follow (try also fm-build -h and fm-search -h):

# fm-build MyFile -o IndexedMyFile
creates IndexedMyFile.fmi which contains the FM-index of the text myfile.

Other options are available for fm-build in order to modify the size of the information stored in the core and in the auxiliary block. This allows to trade space versus time efficiency of the indexing and searching operations. For example:

# fm-build -f 0.1 MyFile -o IndexedMyFile
creates a bigger IndexedMyFile.fmi that supports faster locate operations (10% of text positions are marked, see above)

# fm-build -f 0 myfile -o IndexedMyFile
creates a smaller IndexedMyFile.fmi that supports only count and display operations.

# fm-search -c foo IndexedMyFile
counts the number of occurrences of the pattern foo as a substring of MyFile. Note that the extention .fmi must be not specified.

# fm-search -l foo IndexedMyFile
finds and reports the positions of the occurrences of foo as a substring of MyFile

# fm-search -e 100 -n 10 IndexedMyFile
extracts the substring MyFile[100, 109]

# fm-search -s 10 foo IndexedMyFile
prints 10 characters surrounding each occurrences of the pattern foo in MyFile

# fm-search -d MyFileCopy IndexedMyFile
creates MyFileCopy which is a copy of the original file MyFile.

How to use the library

You can develop your own search engine by exploiting the functionalities offered by the FM-index library. In this case you need to include in your software the two files: fm-index.a and interface.h. Please have a look at the two sources fm_search_main.c and fm_build_main.c for working examples on the usage of the library.

We point out that the FM-index library follows the API suggested by the Pizza&Chili site for implementations of compressed full-text indexes. For further details on this initiative please have a look at one of the mirrors: the Italian mirror or the Chilean mirror. Please note that the function

int build_index( uchar *text, ulong length, char *build_options, void **index )

allows to indicate to the index its proper building options via the string parameter build_options. Each index available at the Pizza&Chili site has its own building options. Those ones of the FM-index are (see [SODA '01] for the terminology):

  • -B Bsize: where Bsize is the size in Kbytes of level 1 buckets (default 16);
  • -b bsize: where bsize is the size in bytes of level 2 buckets (default 512). bsize must divide Bsize*1024;
  • -f frequency: where frequency indicates the percentage of marked characters (default 0.02, namely 2%);
  • -a x: where x indicates the behavior of FM-index over text. [default x=1]
    • x == 0, the FM-index operates directly on text to build the suffix array. This means that you are responsible to allocate length+overshoot bytes for the text instead of length bytes, where overshoot is a value returned by the function init_ds_ssort() available within the ds_ssort library. You must include ds_ssort.h;
    • x == 1, the FM-index frees the allocated memory for text after using it. ds_ssort.h is not necessary;
    • x == 2, the FM-index makes its internal copy of text. ds_ssort.h is not necessary.

If you set build_options to NULL then you choose the default values.

References

Paolo Ferragina and Giovanni Manzini. Opportunistic data structures with applications. In Proc. 41st IEEE Symposium on Foundations of Computer Science (FOCS), Redondo Beach, CA, (USA), 2000.

Paolo Ferragina and Giovanni Manzini. An experimental study of an opportunistic index.In Proc. 12th ACM-SIAM Symposium on Discrete Algorithms (SODA), Washington DC, (USA), 2001.

Paolo Ferragina and Giovanni Manzini. Indexing compressed text. Journal of the ACM (JACM), 52, 4 (Jul. 2005), 552-581

Paolo Ferragina and Giovanni Manzini. An experimental study of a compressed index. Information Sciences. Vol 135, (2001), 13-28.



Send Mail to Us | © Paolo Ferragina and Rossano Venturini, Last update: September, 2005.