Xml Compressed Document Engine

Project leader: Paolo Ferragina
XCDE Kernel: Andrea Mastroianni
XCDE Query Engine: Alessandro Perrucci - Rosanna Rossi
XCDE Kernel windows porting: Giuseppe Prencipe
XCDE Query Engine windows porting: Alessandro Perrucci

 

 

 



Keywords: XML search engine, XML compression and indexing, XML query language, XML full-text query language.


We counted   hits.

What is it?

The XCDE Library is a native system written in C to compress, index and query XML files. The library includes an API that allows users to store, index, compress and query XML documents and some commands (written using the API) for implementing higher-level queries and/or for document (de)compression operations. State-of-the-art algorithms and data structures were adapted to reach good time/space performance and to support efficiently some innovative features, as the extraction of well-formed portions of XML documents (snippets), proximity queries on words, approximate or regular expression search for words and structural queries. The library is modular, easy to use, and efficient. It’s not public domain, but free for non-commercial purposes and it comes with source.

Overview of the library

The main characteristics of the library are the following:

·       native: it operates at the bottom level just upon the File System. This solution allowed designers to control all the details of the implementation, using specific algorithms and data structures by which it is possible to obtain better time/space performance;

·       modular: it was designed to ease future changes and expansions. The interface of the library is composed by C functions that implements access operations;

·       efficient: the space occupancy is minimized by using effective compression methods, while the fast access to the compressed document is allowed by using efficient data structures;

·       oriented to textual documents: it creates a set of data structures that make very efficient to work on documents with large textual parts. Nonetheless it has good performance also with other types of documents;

·       independent of the DTD: it doesn't require the presence of the DTD (Document Type Definition) to manage documents;

·       state-of-the-art algorithms and data structures: being the system native, it has been possible to choose in the design the best solutions for each implementation issue;

·       flexible: the user has the control of the behavior of the library. For example he/she can choose to build or not some data structures, thus avoiding the storage of data that aren't needed;

·       single document granularity: each document is compressed and indexed individually. This has one significant disadvantage and various advantages. It's clear that the fine granularity makes time consuming the search of highly populated collections (neverhless this will be managed in the next version of the library). On the other hand single file indexing allows to easily implement distributed indices, allows to trivially update the index as a result of insertions/deletions of documents, allows to easily take care of the heterogeneity of XML documents and also allows to customize the indices according to the feature of each indexed file;

·        free and portable: the library uses other libraries and commands (Zlib, Expat, Agrep) with licenses that allow the free distribution for non-commercial uses. The code has been written completely in standard ANSI C and has been tested under Linux (RedHat 7.2, Mandrake 8.2), and doesn't contain functions that depend from the O.S., so that can be easily ported onto other architectures.

·        query language which is simpleand IR oriented. Its specialties are described below:

·       The query syntax is similar to SQL: SELECT-FROM-RETURN, but here the SELECT clause is specified by means of an XML-piece of well-formed text. As a result, every user which knows a little bit of XML can formulate the query without being forced to learn the XPath syntax, as Xquery requires. Most of the IR functionalities detailed in the document http://www.w3.org/TR/xquery-full-text-requirements/ have been implemented as well other powerful string-based queries are supported, like regular expressions and error matches.

·       The output of the query (the snippet) can be formatted via the RETURN clause which, again, includes an XML-piece of well-formed text . This allow to identify one or more “interesting points” in each document subtree that matches the query, and indicate the way in which these interesting points must be visualized. Another specialty of the snippet extraction process is the possibility for the user to define the size of the snippet to be extracted, the presence or not of the tags, the well-formedness of the snippet, the retrieval of all elements including a given document part.

What's new?

The library provides some nice features:

·       snippets extraction: it's possible to extract efficiently well-formed portions of XML documents that can be processed or used to interact with other applications based on XML;

·       proximity query: using proximity lists is possible to implement queries on words proximity;

·       structural query: it's possible to implement efficiently stabbing queries, to relate document structure and its textual part. For example, it can be efficiently supported queries on tag-tag nesting and/or word-tag nesting;

·       full-text query: it's possible to perform full-text queries searching for word (prefix, suffix, substring, regular expression, approximate search, etc).

Applications

The library provides an API with a rich set of C functions developed with the aim of providing programmers with a basic block for designing complex IR applications on XML documents. For testing the library we developed a simple,inline query language by which is possible to make searches both on document structure (e.g. tag-tag nesting or tag-word nesting) and on classical search patterns (prefix, suffix, substring, regular expression, approximate search, etc). Through the API is possible to implement most of the basic functionalities of Xquery, and it may support more complex IR-like searches. It goes without saying that the XCDE Library is intended just as a kernel of a more complex XML query engine.

We have used the library to design an XCDE search engine for a collection of Italian literary texts marked with TEI. These texts contain significant textual parts, the hierarchical structure is deeply nested and rich of different tags, and the attribute values are complicated strings of numbers and letters. A humanist user wishes to execute full-text word-based, as well regular expression, searches both on the textual content of documents as well on its framework (tag names, attribute names and values).

We have also designed two forms of parallel variants of the XCDE library within the ASSIST framework (download from here the version of ASSIST we used for this test, and from here the XCDE variant used in the parallelization). The former variant uses a data-parallel approach which issues several processes scanning in parallel all the files interested by the user query. This tries to overcome the bottleneck induced by the single-file granularity of XCDE. The latter variant decomposes the tree-shape of the query into some subqueries that are executed in parallel and then recombines their results. Given the high engineering of XCDE on the single processor, the former variant seems the most efficient.
Linux distribution package

      XCDE Library (Kernel)
      XCDE Library (kernel + query engine, alpha version)
    XCDE - Kernel     XCDE - Query Engine
    XCDE - Kernel (Technical Paper in Italian)     XCDE - Query Engine (Technical Paper in Italian)
      XCDE - Kernel documentation       XCDE - Query Engine documentation


Windows distribution package - to be relased soon...

      XCDE Library
    To download ask login and passwd to prencipe@di.unipi.it

Copyright notes - see more...

Permission is granted to copy the XCDE Library package, to redistribute it on a non-profit basis, and to use it for research, teaching and fun, subject to the following restrictions and understandings:

·     Any copy made must include the copyright notice in full, included notices of other libraries used;

·     All materials developed as a consequence of the use of this software shall duly acknowledge such use, in accordance with the usual standards of acknowledging credit in academic research;

·     The use and redistribution is allowed only for research and teaching purposes. Any other use requires the express, written permission of the authors;

·     The authors have made no warranty or representation that the operation of this software will be error-free or suitable for any application, and they are under no obligation to provide any services, by way of maintenance, update, or otherwise. The software is an experimental prototype offered on an as-is basis;

·     Look at copyright notes of the other included libraries: Expat, Zlib, Agrep.