About me
I am currently a research scientist at Facebook. I am mainly interested in the storage and indexing of large amounts of data, and applications to information retrieval and search engines. I am also interested in machine learning and computer vision.
I received my Ph.D. from the University of Pisa, where I also did a postdoc. After that I did another postdoc at ISTI-CNR. During my studies I worked for Ask.com and Bing, and done two internships at Microsoft Research Cambridge. You can find my CV here, or you can have a look at my LinkedIn page.
Publications
-
Faster BlockMax WAND with Variable-sized Blocks
In Proceedings of the
International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR), 2017
-
Compressing Graphs and Indexes with Recursive Graph Bisection
In Proceeding of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2016
[arXiv] -
Fast Scalable Construction of (Minimal Perfect Hash) Functions
In Proceedings of the International Symposium on Experimental Algorithms (SEA), 2016
[arXiv] -
Optimal Space-Time Tradeoffs for Inverted Indexes
In Proceedings of the ACM
International Conference on Web Search and Data Mining (WSDM),
2015
[pdf] [code] -
Fast and Space-efficient Entity Linking in Queries
In Proceedings of the ACM
International Conference on Web Search and Data Mining (WSDM),
2015
[pdf] [code] -
Fast Compressed Tries through Path
Decompositions
In ACM Journal of Experimental
Algorithmics, October 2014
[pdf] [code]
An earlier version appeared in Proceedings of the Meeting on Algorithm Engineering and Experiments (ALENEX), 2012
[arXiv] [slides] -
Partitioned Elias-Fano Indexes
In Proceedings of the
International ACM SIGIR Conference on Research and Development in
Information Retrieval (SIGIR), 2014
Best Paper Award!
[pdf] [code] [slides (pptx)] [slides (pdf)] -
Cache-Oblivious Peeling of Random
Hypergraphs
In Proceedings of the Data
Compression Conference (DCC), 2014
[extended version on arXiv] [code] -
Space-Efficient Data Structures for
Collections of Textual Data
Ph.D. thesis, June 2013
[pdf] -
Design of Practical Succinct Data
Structures for Large Data Collections
In Proceedings of the Symposium on
Experimental Algorithms (SEA), 2013
Invited paper
[pdf] -
Compressible Motion Fields
In Proceedings of the Conference on
Computer Vision and Pattern Recognition (CVPR), 2013
[pdf] [supplementary material] -
Space-Efficient Data Structures
for Top-k Completion
In Proceedings of
the 22st World Wide Web Conference (WWW), 2013
[pdf] [slides] -
The Wavelet Trie: Maintaining an
Indexed Sequence of Strings in Compressed Space
In Proceedings of the
Symposium on Principles of Database Systems (PODS),
2012
[arXiv] [slides] [poster] -
Semi-Indexing Semi-Structured Data in Tiny Space
In Proceedings of the 20th ACM
Conference on Information and Knowledge Management (CIKM),
2011
[pdf] [slides] [code]
Code
- ds2i: A collection of algorithms and data structures for inverted indexes, described in the papers Partitioned Elias-Fano Indexes and Optimal Space-Time Tradeoffs for Inverted Indexes.
- emphf: A minimal perfect hashing library for large-scale key sets focused on speed and low memory usage. This implementation is significantly faster than similar libraries such as cmph. The algorithms implemented in the library are described in the paper Cache-Oblivious Peeling of Random Hypergraphs.
- path decomposed tries: An implementation of the data structures described in the paper Fast Compressed Tries using Path Decomposition.
- semi-index: An implementation of the algorithm described in the paper Semi-Indexing Semi-Structured Data in Tiny Space. The datasets used in the experiments can be downloaded here.
-
succinct:
A collection of C++ succinct data structures. Many other
implementations of the same structures already exist; this
library however has some rare or unique features: all the
code is 64-bit clean, so it can support very big
datasets. Furthermore, the serialization scheme is designed
to allow the binary data to be memory-mapped instead of
loaded into memory.
It also fares extremely well compared to other implementations. In particular, the data structure used for balanced parentheses is very fast.
-
bpt:
A tool to compile and install packages in a sandboxed
directory. Like virtualenv,
but not restricted to python packages.
It is very useful during development, when specific versions of the libraries must be compiled and installed. The directory containing the sandbox can be moved (even to another machine) after the packages have been compiled, even if during compilation the prefix path is hardcoded in the compiled files.
It works under Linux and OSX. -
python-llvm-jit:
An implementation of a JIT compiler on top of Python
2.5.2. It works by translating the Python bytecode into LLVM
bitcode which calls back the CPython runtime, basically
unrolling the interpreter loop.
No more work was done on this because of not very promising results (in particular the compilation times are very high with LLVM 2.5, but the situation may have improved with newer versions of the framework), and because Unladen Swallow was announced shortly after, which uses roughly the same approach. A short report on my experiments was posted in this thread in the Unladen Swallow mailing list. - inpytex: A Python-based pre-processor for (La)TeX files. It executes Python snippets in comments and inserts their output under the comment. This script was written while working on my master's thesis to automate the creation of some TiKz figures and tables. It has not been used or maintained since then (except minor cosmetic changes).
Contacts
- e-mail (personal): giuott@gmail.com
- e-mail (work): ott@fb.com
- Twitter: ot_y