Pisa - Dipartimento di Informatica - Research Evaluation Exercise 1999


Type based models
and query languages
for Web data




Proposer

Antonio Albano and Giorgio Ghelli.

Participants

Antonio Albano (Professor), Giorgio Ghelli (Associate Professor), Dario Colazzo, Paolo Manghi, Carlo Sartiani (PhD Students) (full time participation).

Title and Summary

Type based models and query languages for Web data. The research goal is the design of a type system and a programming language to query and manipulate semistructured data, with the following requirements: (a) flexibility of the type system, to make it possible to describe the common features of the elements of a semistructured collection; (b) expressive power of the query language, which must allow both data elements and their structure to be analyzed, and data with an irregular structure to be queried; (c) openness of the type system and the language, meaning that the collections to be queried may come from external sources, and in particular from the Web, and that the data produced by the language can be published on the Web, in particular in the XML format. The presence of a type system that allows some consistency checking is in contrast with the flexibility needs of semistructured data. The conciliation of these two aspects represents one of the crucial aspects of the proposed research.

State of the Art and Trends

The huge amount of data published via the World Wide Web has led to a number of research efforts on techniques to query and restructure semistructured data sources on the Web. In particular, several data models and query languages have been defined. According to [22], these languages can be classified as ``first generation languages'' - for example W3QL [34], WebSQL [35] - which only deal with a basic data model of pages and links, and ``second generation languages'', which also allow the data inside web pages to be modeled, by means of semistructured data models, such as the languages Lorel [1], UnQL [9], Strudel [21]. These proposals have been recently extended to the issue of querying XML documents; most of the proposals can be found in the proceedings of the W3C workshop QL'98.
Among second generation languages, two approaches have been considered to describe the structure of Web data. The first approach is based on the use of union types to define a schema to deal with data irregularities, while the second approach, the preferred one so far, is based on the use of self-describing data, modeled as a labeled graph, without a predefined schema.
An example of the first approach is the solution proposed to manage SGML documents in the O$_2$% WIDTH=12 HEIGHT=33 object-oriented database system with appropriate extensions of its data model [13]. An example of the second approach is the Stanford's Object Exchange Model (OEM), and a similar model has been also proposed at the University of Pennsylvania [8]. In OEM each object has an arbitrary number of attributes, with different or equal names, whose values are other objects or atomic data; a set S with n elements is represented as a record with n fields, all labeled S, each containing one element of the set. OEM is a ``lightweight object model'' in the sense that it supports a notion of ``object identity'' but it does not require the definition of classes or types, arbitrary structures with arbitrary attribute names can be modeled, and it does not support neither encapsulation nor object behavior.
The approach with tagged union types requires few modifications to a DBMS and preserves the advantages of traditional database languages with respect to the possibility of type-checking and query optimization [13]. However the use of tagged union types, and their related ``case'' operators, is not flexible enough to deal with semistructured data; for this reason, some researchers are currently investigating the possibility of switching to a more flexible approach based on untagged union types.
The approach with self-describing data has the advantage of flexibility, but suffers some important drawbacks: (a) data is inefficient to store, since the schema is replicated with each data item; (b) queries are hard to evaluate efficiently; (c) queries are hard to formulate and cannot be typecheked [23]. Indeed, a general problem with most SSD query language is the lack of any consistency control between data and query: when the query is written assuming a structure which does not correspond to the actual structure of data, no error is indicated, but an empty result is returned, even in the case of trivial errors such that the mistyping of a field name. The use of type systems as a foundation for languages for SSD is a new research direction, aiming to overcome this limitation, which is being followed in particular by the database group at University of Pennsylvania [10] and by the Hippo project in Glasgow [18].
Beside the data model, the other issue is the definition of the corresponding query language [8] [9] [1]. The common feature of query languages for SSD is the ability to traverse arbitrary long paths in the data, usually specified in the form of regular path expression, to allow users to write queries on data whose structure is not fully known (e.g. ``find the phone of a person which may be an attribute of a person or an attribute of an arbitrary subcomponent of a person''). Moreover one can query the schema components (e.g. ``find all attribute names''), or write queries which ignore part of the schema (e.g. ``retrieve all attribute values, regardless of the attribute name'') [23]. Query languages are used to express both queries and transformations: queries return a subset of a collection of data, while transformations may construct a new graph. The specific solutions differ with respect to their target application, expressive power, and restructuring capabilities [15].
Finally, query optimization for SSD is another issue which has received a limited attention. Among the subthemes of this issue, we cite in particular the evaluation techniques for path expressions and for queries on self-describing data [14] [32] [20].

Relevant Research Activities at the Department

The database group has focused in the last few years on the following research issues.

In the latest years, the activity on database programming languages has been supported by two research projects financed by the italian Research Council (CNR), by a project financed by the italian Ministry for Research and University (MURST), and by the following projects financed by the European Community: ESPRIT Basic Research Action FIDE 3070, ESPRIT Basic Research Action FIDE2 6309, and Esprit Working Group 22552 - PASTEL. The activity of type theory has been supported by ``Joint NSF/ESPRIT Workshop on Foundations of object-oriented programming languages'' and Esprit Working Groups 26142 - APPlied SEMantics.

Short Term Plans and Expected Results

The research goal is the design of a type system and a programming language to query and manipulate SSD coming from the Web, with the following requirements: (a) flexibility of the type system, so that the irregular structure of SSD can be described, (b) flexibility of the query language, to allow data structure to be analyzed and the irregular structure of SSD to be tolerated, (c) openness of the type system and the language, meaning that the collections to be queried may come from external sources, and in particular from the Web, and that the data produced by the language can be published on the Web, in particular in XML format. The presence of a type system that allows some consistency checking is in contrast with the flexibility needs of SSD. The conciliation of these two aspects is one of the crucial problems to be solved.
The research will be based on the theoretical framework of type theory. The past experience of the research group in database language design has shown that type theory has the necessary flexibility to deal with both regular and irregular structures. In particular, it seems that a combination of a variation of record types with recursive types can be used to typecheck OEM-like data. A promising research direction is the study of type systems which include also collection and untagged union types to achieve flexibility and expressivity. The relevance of the approach has been already recognized by other researchers (notably, R. Connor, P. Buneman, and B. Pierce), who have achieved interesting preliminary results. The query language will be OQL-like, extended with the operators provided by the proposed type constructors. Union types require a careful design of their operators to overcome the limitations of the "case" operator, provided by the traditional tagged unions, with operators based on pattern-matching on the expected value structure.
The openness of the type system and the language to data which comes from external sources is another basic requirement. Mapping techniques will be explored to import data with an XML format, or, more generally, with an OEM-like format. The focus will be on the mapping from an OEM-like type system to the reference type systems.
The possibility of publishing data on the Web raises two different related issues: the use of the language to build a Web site, and the use of a Web browser on data created by the language. The first issue requires a rich type system with constructors finalized to site design. On the other hand, the second issue requires that every piece of data which is described in our type system is representable on the Web. In both cases the attention will be on the structural aspects of data rather than on their presentation. The separation of these aspects will be made easier by the use of XML.
The research activity will be organized as follows.
The first phase will be focused on the definition of a data model, on the definition of the corresponding type system, with its operators, and on the definition of the basis of the technique to allow the integration of data coming from external sources.
In greater detail, we will define a data model and a type system, with the corresponding operators, based on record, collection, union, and recursive types, able to represent some important classes of semistrucured data. The crucial problem is the definition of a correspondence relationship between values and types, and between queries and types, which allows consistency checking without being too restrictive. We will also define a different type system, based on record types with repeatable labels, which is able to define SSD which comes from external sources. A notion of correspondence between these types and the ones defined in the language type systems will be defined, such that a mapping from external data to data described in the language type system will derive from this correspondence.
The second phase will be focused on the design of the language, and on the definition of the corresponding operational semantics and type checking algorithms.
We will first define a language to query and restructure SSD, and to write complete applications, based on the operators defined in the previous phase. The crucial problem is the definition of a query language which does not depend on rigid assumptions about the structure of the data, avoiding, at the same time, to end up with a language where every query can be applied to every piece of data without any control of the coherence between the two.
We will then define an operational semantics, a set of type rules, a subtype checking and a type checking algorithm for the proposed language. This formalization will be the basis to specify the language implementation.
During the third phase, the implementation of the defined language will start, in parallel with the study of the language properties, and with a study of the problem of access to external data.
The first implementation of the language will be based on a flexible architecture, but will be focussed on correctness and extensibility, rather than on efficiency. At the same time, we plan to prove the coherence between the language semantics and its type rules (strong typing). We will also work on the access to external data, designing the algorithms needed to operate the conversion between the external data and their internal mirrors.
The fourth phase will be focussed on experimentation with the prototype, and on laying the basis of a research on query optimization.
The prototype will be completed to implement the proposed language, including type-checking, query and application execution, and some external data access mechanism. The prototype will be used to experiment with the language usability and to gain some understanding about query execution.
In this phase we will also define a query execution model, which will be the basis for future studies on query optimization. The model shall deal, in particular, with the fact that different data sources have vastly different speed, offer different access mechanisms and different kinds of access paths.
In parallel with these four phases, we plan to carry on some foundational research on programming language and type theory, tackling at a more general level some of the semantic and type-theoretic problems which will arise. We will use these tools on those problems whose complexity requires the use of an appropriate generalization, and on those problems whose general interest make it useful a study at a general level.

Long Term Scenario

In a longer term period, we expect to expand our research from the language design field to the field of query optimization. This is a field which is rich of difficult problems, with important practical applications, even in simpler contexts.
Another interesting research direction which we may consider in a longer term scenario is the extension of the studied paradigm with code mobility. Code mobility is very convenient in the field of query execution, since moving a query where data resides may be vastly faster than moving input data towards the query. Here the basic problem to be solved is how to make it acceptable, for the data repository, the fact of spending its cycles to execute a query which comes from a different site, possibly containing some malicious code. This classical problem can be tackled in many different ways. An interesting approach is the certification of some properties of the mobile code, such as its general good-behavior, its origin, the data complexity of the query it intends to run. If we limited the mobile code to mobile queries, the problem would become much more tractable than in the general case of mobile agents.

Short CV's


Antonio Albano

Antonio Albano was born in S. Mango, Italy, on March 1, 1945. He received the Doctor Engineering degree in Electronic Engineering from the Politecnico di Milano, Italy, in 1968. From 1971 to 1987 he was Assistant Professor, and subsequently Associate Professor at the Department of Computer Science, University of Pisa. From October 1987 to November 1990 he was Professor at the Department of Mathematics and Computer Science, University of Udine. From November 1990 he is Professor at the Department of Computer Science, University of Pisa. >From September 1972 to August 1973 he was Research associate at the Department of Computer Science, New York University. Sabbatical semesters were spent as visiting professor at the Department of Computer Science, University of Toronto, the Escuela Superior Latino Americana de Informatica, Buenos Aires, Argentina, the Department of Computer Science, University of Sydney, Australia. Over the years A. Albano has worked in the area of image processing, computer aided design, artificial intelligence, and databases, in particular database programming languages. He is author of several journal articles and conference papers in these areas, author and co-author of four books (in italian), and co-editor of the books Computer-Aided Database Design: The DATAID Project, North-Holland, Amsterdam, 1985, and Persistent Object Systems, Springer-Verlag, 1993. He is currently engaged in research in object-oriented database programming languages and languages for semistructured data.
Selected Papers: [5,3,4,6,2].

Giorgio Ghelli

Giorgio Ghelli was born on 10/27/1961. He received his PhD in Computer Science from the Pisa University, Italy, in 1990. He is Associate Professor of Computer Science (Data Bases), Università di Pisa, since November 1992. He has been Visiting Professor at Ecole Normal Supérieure Paris (1993), and Visiting Researcher at Microsoft Research Center, Cambridge (UK) (1998). His research interests are: design and implementation of database languages; type theory, and its application to the previous theme; foundations of object-oriented languages; languages to query and program the Web. He has contributed to the design and implementation of the Galileo and Fibonacci object-oriented database languages. He has been involved in international projects in database languages and systems and on foundations of object oriented languages, either as a group member or as a site leader. He has been part of the program committee of international conferences and workshops devoted to database and object-oriented languages and systems. He published nine papers on refereed journals and twentyeight papers in international refereed conferences and workshops, coauthored with Antonio Albano, Luca Cardelli, Giuseppe Castagna, Richard Connor, Pierre-Louis Curien, Giuseppe Longo, Benjamin Pierce, and many others.
Selected Papers: [6,12,11,31,16].

Keywords

Database programming languages, Type systems, Data models, Semistructured data, World Wide Web.

Bibliography




1
S. Abiteboul, D. Quass, J. McHugh, J. Widom, and J. Wiener.
The Lorel query language for semistructured data.
Journal on Digital Libraries, 1:68-88, 1997.

2
A. Albano, G. Antognoni, and G. Ghelli.
View operations on objects with roles for a statically typed database language.
Submitted for publication, 1999.

3
A. Albano, R. Bergamini, G. Ghelli, and R. Orsini.
An object data model with roles.
In R. Agrawal, S. Baker, and D. Bell, editors, Proc. of the Nineteenth Intl. Conf. on Very Large Data Bases (VLDB), Dublin, Ireland, pages 39-51, San Mateo, CA, 1993. Morgan Kaufmann.

4
A. Albano, M. Diotallevi, and G. Ghelli.
Extensible objects for database evolution: Language features and implementation issues.
In Proc. of the Fifth Intl. Workshop on Data Base Programming Languages (DBPL), Gubbio, Italy, 1995.

5
A. Albano, G. Ghelli, and R. Orsini.
A relationship mechanism for a strongly typed object-oriented database programming language.
In G. Lohman, A. Sernadas, and R. Camps, editors, Proc. of the Seventeenth Intl. Conf. on Very Large Data Bases (VLDB), Barcelona, Spain, pages 565-576, San Mateo, CA, 1991. Morgan Kaufmann.

6
A. Albano, G. Ghelli, and R. Orsini.
Fibonacci: A programming language for object databases.
Journal of Very Large Data Bases, 4(3):403-444, 1995.

7
P. Baldan, G. Ghelli, and A. Raffaetà.
Basic theory of F-bounded quantification.
Information and Computation, 1999.
To appear.

8
P. Buneman, S. Davidson, and D. Suciu.
Programming constructs for unstructured data.
In Proc. of the Fifth Intl. Workshop on Data Base Programming Languages (DBPL), Gubbio, Italy, 1995.

9
P. Buneman, S. B. Davidson, G. G. Hillebrand, and D. Suciu.
A query language and optimization techniques for unstructured data.
In Jagadish and Mumick [33], pages 505-516.
SIGMOD Record 25(2), June 1996.

10
P. Buneman and B. Pierce.
Types for semistructured data.
Internal report, 1999.

11
L. Cardelli, G. Ghelli, and A. D. Gordon.
Mobility types for mobile ambients.
In Proc. of the 26th International Colloquium on Automata, Languages, and Programming (ICALP), Prague, Czech Republic, July 1999.

12
G. Castagna, G. Ghelli, and G. Longo.
A calculus for overloaded functions with subtyping.
Information and Computation, 117(1):115-135, 1995.

13
V. Christophides, S. Abiteboul, S. Cluet, and M. Scholl.
From structured documents to novel query facilities.
In R. T. Snodgrass and M. Winslett, editors, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, pages 313-324, Minneapolis, Minnesota, 24-27 May 1994.
SIGMOD Record 23(2), June 1994.

14
V. Christophides, S. Cluet, and G. Moerkotte.
Evaluating queries with generalized path expressions.
In Jagadish and Mumick [33], pages 413-422.
SIGMOD Record 25(2), June 1996.

15
S. Cluet.
Modeling and querying semi-structured data.
SCIE '97, 1997.

16
D. Colazzo and G. Ghelli.
Subtyping recursive types in kernel Fun, extended abstract.
In Proc. of the 14th Annual IEEE Symposium on Logic in Computer Science (LICS), Trento, Italy, June 1999.

17
R. Connor, G. Ghelli, and P. Manghi.
Modules and type abstraction in persistent systems.
In R. Connor and S. Nettles, editors, Proc. of the 7th International Workshop on Persistent Object Systems (POS), Cape May, New Jersey, pages 48-59, San Mateo, CA, May 1996. Morgan Kaufmann.

18
R. Connor, K. Sibson, and P. Manghi.
On the unification of persistent programming and the world wide web.
In Proc. of the Workshop on the Web and Data Bases (WebDB98), Valencia, Spain, 1998.

19
P. Curien and G. Ghelli.
Coherence of subsumption, minimum typing and type checking in F$_\leq$% WIDTH=16 HEIGHT=33 .
Mathematical Structures in Computer Science, 2(1):55-91, 1992.

20
M. Fernandez and D. Suciu.
Evaluating queries with generalized path expressions.
In Proc. International Conference on Data Engineering, pages 14-23, 1998.

21
M. F. Fernandez, D. Florescu, J. Kang, A. Y. Levy, and D. Suciu.
STRUDEL: A Web-site management system.
In J. Peckham, editor, Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pages 549-552, Tucson, Arizona, 13-15 June 1997.
SIGMOD Record 26(2), June 1997.

22
D. Florescu, A. Levy, and A. Mendelzon.
Database techniques for the World-Wide Web: A survey.
SIGMOD Record (ACM Special Interest Group on Management of Data), 27(3):59-74, 1998.

23
D. Florescu, A. Levy, and A. Mendelzon.
An overview of semistrucured data.
SIGACT News, 29(4):28-38, 1998.

24
G. Ghelli.
A class abstraction for a hierarchical type system.
In S. Abiteboul and P. C. Kanellakis, editors, Proc. of the Third Intl. Conf. on Database Theory (ICDT), Paris, France, number 470 in LNCS, pages 56-71, Berlin, 1990. Springer-Verlag.

25
G. Ghelli.
A static type system for message passing.
In A. Paepcke, editor, Proc. of the Sixth Intl. ACM Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA), Phoenix, Arizona, number 26 (11) in ACM SIGPLAN Notices, pages 129-145, Reading, MA, 1991. Addison-Wesley.

26
G. Ghelli.
Run-time support for hierarchic records in persistent languages.
In A. Albano and R. Morrison, editors, Proc. of the 5th International Workshop on Persistent Object Systems (POS), San Miniato, Pisa, Workshops in Computing, pages 107-123, Berlin, September 1992. Springer-Verlag.

27
G. Ghelli.
Recursive types are not conservative over F$_\leq$% WIDTH=16 HEIGHT=33 .
In M. Bezen and J. Groote, editors, Proc. of the International Conference on Typed Lambda Calculi and Applications (TLCA), Utrecht, The Netherlands, number 664 in LNCS, pages 146-162, Berlin, March 1993. Springer-Verlag.

28
G. Ghelli.
Divergence of F$_\leq$% WIDTH=16 HEIGHT=33 type checking.
Theoretical Computer Science, 139(1-2):131-162, 1995.

29
G. Ghelli.
Complexity of kernel Fun subtype checking.
In Proc. of ACM International Conference on Functional Programming (ICFP), Philadelphia, Pennsylvania, pages 134-145. ACM Press, May 1996.

30
G. Ghelli and D. Palmerini.
Foundations for extensible objects with roles, extended abstract.
In Proc. of the 6th Workshop on Foundations of Object-Oriented Languages (FOOL), San Antonio, Texas, January 1999.

31
G. Ghelli and B. Pierce.
Bounded existentials and minimal typing.
Theoretical Computer Science, 193(1-2):75-96, 1998.

32
R. Goldman and J. Widom.
Data guides: Enabling query formulation and optimization in semistructured databases.
In Proc. of the Intl. Conf. on Very Large Data Bases (VLDB), pages 436-445, San Mateo, CA, 1997.

33
H. V. Jagadish and I. S. Mumick, editors.
Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, 4-6 June 1996.
SIGMOD Record 25(2), June 1996.

34
D. Konopnicki and O. Shmueli.
W3QS: A query system for the world-wide web.
In U. Dayal, P. M. D. Gray, and S. Nishio, editors, VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, pages 54-65, Zurich, Switzerland, 11-15 Sept. 1995. Morgan Kaufmann.

35
A. O. Mendelzon, G. A. Mihaila, and T. Milo.
Querying the world wide web.
Journal on Digital Libraries, 1:54-67, 1997.


Index Page