Paolo Ferragina --- Software Projects

Current Projects

`[2009:current]`	TAGME is a software tool for the on-the-fly annotation of short text fragments with Wikipedia pages. [short paper at CIKM 2010.]
`[2009:current]`	TagMySearch is an evolution of SnakeT for clustering web-search results into a hierarchy of folders labeled with variable-length sentences that capture the meaning of the results contained in each folder. This clustering engine deploys the algorithmic technology underlying TAGME.
`[2009:current]`	The BmF compressor combines Bentley-McIlroy's compressor with Gzip or Lzma as post-compression tools. The former combination implements the idea published in the BigTable paper [Proc. OSDI '06] by Google researchers. The latter is a simple idea of mine that deploys the powerful lzma compressor while retaining roughly the fast decompression performance of gzip. The software is designed to compress and randomly access pages from a Web collection in WARC format, but you can easily customize it!
`[2006:current]`	SmallText is a 100% pure Java library designed to compress and access a huge static textual DB at various levels of granularity. The DB consists of a single textual file, logically decomposed into records, each consisting of a variable number of variable-length fields. Given the DB, SmallText lets you compress it and access any of its records and fields efficiently, without decompressing the whole DB.
`[2005:current]`	The Pizza&Chili Site: A collection of software libraries on Compressed Full-Text Indexes, and some testbeds. In cooperation with Gonzalo Navarro [read the full paper].
`[2005:current]`	The XBzip tool: A set of Java functions and data structures to compress and index large XML files. The index efficiently supports basic search functionalities like tree navigation and fully-specified path searches, without decompressing the whole compressed XML data. This software implements the ideas introduced in IEEE FOCS '05 and WWW '06, here specialized to work on XML files. These ideas were granted a US patent in April 2012, owned by the University of Pisa and Rutgers University. The Java library has been developed in collaboration with Andrea Canciani.
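
Several of the tools above (BmF, SmallText, XBzip) promise access to a piece of the data without decompressing all of it. A minimal sketch of one common way to get that behavior is to compress records in small independent blocks, so that only the block holding the requested record is decompressed. This is an illustration only, not the SmallText or BmF format or API: the class name, the `block_size` parameter, the tab field separator, and the use of zlib are all assumptions, and records are assumed to contain no newline characters.

```python
import zlib


class BlockCompressedRecords:
    """Hypothetical illustration of block-wise compression with random access.

    Not the SmallText API: names, separators, and the zlib back end are assumptions.
    """

    def __init__(self, records, block_size=4, field_sep="\t"):
        self.block_size = block_size
        self.field_sep = field_sep
        self.num_records = len(records)
        self.blocks = []
        # Compress each group of `block_size` records independently,
        # so that a lookup only ever touches one block.
        for start in range(0, len(records), block_size):
            chunk = "\n".join(records[start:start + block_size])
            self.blocks.append(zlib.compress(chunk.encode("utf-8")))

    def record(self, i):
        """Return record i, decompressing only the block that contains it."""
        if not 0 <= i < self.num_records:
            raise IndexError(i)
        block = zlib.decompress(self.blocks[i // self.block_size]).decode("utf-8")
        return block.split("\n")[i % self.block_size]

    def field(self, i, j):
        """Return field j of record i."""
        return self.record(i).split(self.field_sep)[j]


db = BlockCompressedRecords(["a\tb", "c\td\te", "f", "g\th", "i\tj"], block_size=2)
print(db.record(2))
print(db.field(1, 2))
```

Larger blocks usually compress better but make each access slower, so `block_size` is the knob that trades space for access time in this sketch.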

Past Projects

`[2005:2007]`	The Compression Boosting Library: A set of C functions and data structures to implement the Compression Booster algorithmic tool introduced in the Journal of the ACM (52(4): 688-713, 2005).
`[2004:2007]`	The Tauro Search Engine: A novel search engine for XML data, with a user-friendly interface that lets you build, index and navigate your XML collection efficiently. In cooperation with Signum, the research center on humanities computing of the Scuola Normale Superiore, Pisa.
`[2006]`	Software to classify and cluster Biological Sequences and Structures via Compression.
`[2006]`	Google Search on Domains: A search engine that restricts a user query to a set of domains specified by the user. An example set of domains drawn from philosophy is offered (thanks to Signum), and a web service is available for use by other applications.
`[2006]`	The Bio-Prompt Box indexes UNIPROT data, and uses various ontology-based hierarchical clustering strategies to provide different views over the query results. The ultimate goal of the clustering process is to provide the biologist with several different readings of the (possibly numerous) query results, and to show possible hidden correlations among them, thus improving their browsing and understanding. This approach is efficient and effective, and could thus be applied successfully to larger databanks, like GenBank or EMBL.
`[2001:2006]`	The FM-index: A compressed full-text index for substring searches on raw data. See Dr. Dobb's Journal, CT Magazine, and the Journal of the ACM.
`[2004:2006]`	SnakeT: A personalized search engine based on web-snippet hierarchical clustering.
`[2005]`	Rank Comparison Engine: A tool to compare the rankings of 15 top web search engines.
`[2005]`	Anagrammando: A search engine to play with Italian anagrams (anagrams of Italian sentences, a student project for the AIR 2005 course).
`[2005]`	A C library with basic functions for compressing integer sequences.
`[2003:2004]`	The CompressedSearch Library (CGrep): A library to compress and search textual files with advanced pattern-matching features.
`[2002:2004]`	The XCDE Library: A library of state-of-the-art algorithms and data structures for compressing and indexing XML data. Used by Informatica Umanistica.
`[2004]`	The Lightweight Suffix Array Construction: A library to construct Suffix Arrays and LCP Arrays, fast and in small working space. It has been used in the LEDA Library.
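
To give a flavor of the substring search mentioned for the FM-index above, here is a minimal sketch of counting pattern occurrences by backward search over the Burrows-Wheeler transform. It is a toy under stated assumptions, not the code of the FM-index library: the function names are hypothetical, the suffixes are sorted naively, the occurrence tables are stored uncompressed (a real compressed index would not do this), and the text is assumed not to contain the `"\0"` sentinel.

```python
def build_fm_index(text, sentinel="\0"):
    """Build toy, uncompressed backward-search tables for `text` (illustration only)."""
    assert sentinel not in text
    s = text + sentinel
    # Naive suffix sorting; fine for a demo, too slow for large inputs.
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # BWT: the character preceding each sorted suffix (cyclically).
    bwt = [s[i - 1] for i in sa]
    counts = {}
    for ch in s:
        counts[ch] = counts.get(ch, 0) + 1
    # C[ch] = number of characters in s strictly smaller than ch.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(s) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return C, occ, len(s)


def count_occurrences(index, pattern):
    """Count occurrences of a non-empty `pattern` via backward search."""
    C, occ, n = index
    lo, hi = 0, n  # current range of sorted suffixes, as [lo, hi)
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("mississippi")
print(count_occurrences(index, "ssi"))
```

The search loop runs once per pattern character and never looks at the original text, which is what lets an index of this kind answer queries directly on a compressed representation.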