Pisa - Dipartimento di Informatica - Research Evaluation Exercise 1999


ToMAKE:
Tools and Methodologies
for Data Analysis
and Knowledge Extraction



Proposer: Franco Turini (full prof.)

Participants: Antonio Brogi (associate prof.), Paolo Mancarella (associate prof.), Dino Pedreschi (associate prof.), Andrea Bracciali (PhD student), Francesco Bonchi (PhD student), Mirco Nanni (PhD student), Giuseppe Manco (PhD student), Alessandra Raffaetà (PhD student).

Collaborators: Patrizia Asirelli (IEI-CNR, Pisa), Simone Contiero (research assistant), Fosca Giannotti (CNUCE-CNR, Pisa), Gianni Mainetto (CNUCE-CNR, Pisa), Salvatore Ruggieri (Research Assistant), Krzysztof R. Apt (CWI, Amsterdam), V.S. Subrahmanian (Univ. Maryland, USA), Carlo Zaniolo (UCLA, USA).

Keywords: computational logic, logic-based database languages, knowledge discovery, data mining, spatio-temporal reasoning.

SUMMARY

We intend to address the problems arising in the construction of knowledge intensive applications for the analysis of large amounts of heterogeneous, multi-dimensional data.

To this end we aim at constructing and experimenting with a software production environment, and an associated methodology, which:

· allows one to specify the various data dimensions (domain, temporal, spatial);

· allows one to specify the data analysis process;

· supports the separation of concerns between the conceptual design of knowledge intensive applications and their logical and physical design.

Building on previous and current activities of the group, we intend to construct around a computational logic kernel a programming environment with the capabilities of:

· representation of domain knowledge;

· spatio-temporal reasoning;

· semantic integration of heterogeneous data sources;

· interoperation with data mining, data analysis and information extraction packages;

· geographical information systems.

STATE OF THE ART AND TRENDS

A class of applications, based on the intensive analysis of large amounts of data, is becoming of utmost importance. The strong demand for intelligent data analysis comes from the increasing availability of massive information sources produced by organizations and stored in databases and/or made available through the Internet. Database technology, which is nowadays mature, easily and efficiently supports data warehousing and data aggregation/reporting, but hardly supports intelligent data analysis and knowledge extraction.

Examples of applications in the mentioned class are the following.

· Data Mining.

· Fraud detection - In this case extracting knowledge means constructing models of fraudulent behavior, e.g. in the fiscal domain, in the stock market, or in the insurance domain. In these cases, different data mining techniques are useful and need to be combined, and the analysis needs to take into account dimensions of the data beyond the standard, thematic dimension. For example, in order to construct models of insider trading, stock market analysis must take into account the temporal dimension, which may not be explicitly recorded with the data in the needed form.

· Market basket analysis - In this case extracting knowledge means constructing models of purchase behavior, e.g. from supermarket sales data. Again, any such model is naturally referenced both in the temporal and spatial (store location) dimensions.

· GIS applications - In domains such as management of cadastral data and environmental monitoring, complex data analysis is needed, capable of reasoning over diverse data dimensions in a uniform way: the standard thematic dimension as well as the temporal and the spatial dimensions. For example, cadastral particles as well as environmental samples are naturally characterized by all three data dimensions.

· Integration of structured, semi-structured, and unstructured data. Many application contexts need to support planning and decision via the analysis of data held in structured databases, data held in semi-structured format, like WEB sites, and unstructured data, like plain texts.

Why are these applications challenging?

First, they require reasoning on data referenced on multiple dimensions: thematic, temporal and spatial.

Second, the overall knowledge extraction process needs to be carefully designed and controlled, in that it consists of different steps which need to be specified, combined and iterated.

The most critical problems are:

· Different data models exist for temporal and/or spatial data, which refer to different ontologies. The relevance of each such model is motivated by particular application needs.

· Data analysis tools (classification, clustering, association rules, ...), as well as OLAP and statistical tools, have reached a state of relative maturity, but the overall knowledge discovery (KDD) process is poorly supported by the available environments and systems.

· Knowledge-intensive applications require a very flexible software architecture in which components with different aims and structure can be combined.

RESEARCH AIMS AND APPROACH

Our general aim is to develop a software production environment and an associated methodology for the mentioned class of data analysis applications which:

1. allows one to specify the various data dimensions;

2. allows one to specify the data analysis process in all its steps;

3. supports clear separation of concerns among:

· the conceptual design of knowledge-intensive applications: data analysis requirements, integration of domain (background) knowledge, expert (business) rules and extracted (mined) knowledge, validation and evaluation of extracted knowledge, mediation and semantic integration between different data models and analysis paradigms;

· and the logical and physical design of knowledge-intensive applications, concerning the specification of logical and physical aspects of coupling and interoperability with existing external components (databases, data models, data mining tools, visualization tools, GIS's, desktop tools, etc.), and among the internal components.

Our main choice is to base our methodology for the data analysis process and our associated programming environment on computational logic. We have developed in the past, we are developing now, and we plan to develop in the future uses and extensions of computational logic to support:

1. A rich database query language. This is nowadays a classic use of logic-based programming languages. Logic-based databases, like LDL++, allow queries that cannot be handled by standard relational database systems and, at the same time, allow the integration of commercial database systems for massive data handling.
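As a concrete illustration of a query beyond plain relational SQL of the time, the following sketch (in Python rather than LDL++; the relation names are invented) evaluates a recursive, Datalog-style transitive-closure query semi-naively:

```python
# A hedged sketch, not LDL++ itself: semi-naive evaluation of the recursive
# query reachable(X, Z) <- edge(X, Z) | reachable(X, Y), edge(Y, Z).

def transitive_closure(edges):
    """Compute all reachable(X, Z) facts from a set of edge(X, Y) facts."""
    reachable = set(edges)          # base case: every edge is a path
    delta = set(edges)              # facts derived in the previous round
    while delta:
        # join only the new facts with the base relation (semi-naive step)
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - reachable     # keep only genuinely new facts
        reachable |= delta
    return reachable

edges = {("a", "b"), ("b", "c"), ("c", "d")}
paths = transitive_closure(edges)   # includes the derived fact ("a", "d")
```

The semi-naive loop terminates as soon as a round derives nothing new, which is exactly the fixpoint semantics a deductive database engine computes.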

2. Knowledge representation. A logic-based programming language is well suited for representing knowledge in the form of deductive rules. A typical use in our context is the representation of business rules to be applied to results of mining - e.g. exploiting association rules extracted from a basket analysis application in order to plan the articles to put on sale. As another example, consider an application to city planning where geographical data have to be handled together with the rules establishing the construction constraints. Furthermore, our own work aims at showing how temporal and spatial knowledge can be represented and manipulated in this context.
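The interplay of mined rules and expert rules can be sketched as follows; the items, thresholds and the "promote" rule are invented for the example, not taken from any project data:

```python
# Hedged sketch: a deductive business rule applied over mined association
# rules. Each mined rule is (antecedent, consequent, confidence).
mined = [
    ({"pasta"}, {"sauce"},  0.8),
    ({"wine"},  {"cheese"}, 0.6),
    ({"bread"}, {"milk"},   0.3),
]

high_margin = {"sauce", "cheese"}   # assumed domain (background) knowledge

def promote(rules, margin_items, min_conf=0.5):
    """Business rule: put on sale any item that, with high enough
    confidence, drives the purchase of a high-margin item."""
    return {item
            for ante, cons, conf in rules
            if conf >= min_conf and cons & margin_items
            for item in ante}

on_sale = promote(mined, high_margin)
```

In the logic-based setting the same rule would be a deductive clause querying the mined rules stored as facts; the point is that the expert knowledge stays declarative and separate from the mining step.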

3. Component integration. Our own work has been directed at proposing extensions of logic languages towards mediator languages, i.e. languages that support programming the glue necessary to put together different components, be they deductive components written in the logic language, data mining components, geographical information systems, information extraction agents, or WEB search engines. We expect to build applications around a component-oriented or, to use a fashionable terminology, an agent-oriented architecture. Part of our research work is aimed at extending computational logic in order to support agent-oriented features like interaction and communication.

4. Extensibility. One of the key characteristics of logic based languages is their ability to support meta-programming, that is, in this context, meta-logic. Meta-logic provides for the possibility of extending the language from within the language itself. We have already used this basic feature, and we plan to use it in the future for several purposes:

· representation of and reasoning about temporal knowledge

· representation of and reasoning about spatial knowledge

· representation of the results of applying data mining techniques and reasoning over them. This has already shown its usefulness in that it has allowed us to compare the results of different data mining techniques applied to the same data.
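The last point can be sketched in a few lines: once the outputs of two techniques are represented uniformly as facts, comparing them becomes a simple meta-level query. The technique names and records below are hypothetical:

```python
# Hedged sketch: results of two (hypothetical) mining techniques, stored as
# facts mapping record -> predicted class, compared at the meta level.
tree_says  = {"r1": "fraud", "r2": "ok",    "r3": "fraud"}
rules_say  = {"r1": "fraud", "r2": "fraud", "r3": "fraud"}

agree    = {r for r in tree_says if tree_says[r] == rules_say.get(r)}
disagree = set(tree_says) - agree   # records worth a closer look
```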

CURRENT STATE OF ACTIVITY

In this section we report on recent research activities in three areas related to the general aim of this project.

KDD Laboratory. The issues that have been investigated may be summarized as follows:

· Knowledge Discovery Support Environments, namely support for data mining in database engines, logic-based languages and interfaces for data mining;

· Experience in building and deploying data mining applications;

· Data mining system architectures, namely efficiency and scalability in data mining.

The objective of this research activity is to demonstrate how a suitable integration of deductive reasoning, such as that supported by logic database languages, and inductive reasoning, such as that supported by data mining tools, provides a viable solution to many high-level problems in knowledge intensive applications. The key to success in constructing effective data analysis and decision support systems is the ability to integrate the results of knowledge extraction with expert rules. A logic database framework is taken as the basis for constructing such integration. The adopted research methodology has started from case studies in market basket analysis and fraud detection, and it has distilled from the applications the design and architecture of a prototype inductive-deductive system for data analysis. Two relevant projects have been carried out in the last two years:

a) Datasift: in collaboration with CNUCE-CNR, Pisa, and COOP Toscana Lazio, a division of one of the major Italian supermarket companies. The project has delivered a prototype of an intelligent system for market basket analysis, which integrates the deductive capabilities of a logic-based database language, LDL++, developed at UCLA, with the inductive capabilities of diverse data mining algorithms and tools, notably association rules. An application to supermarket sales data has been developed, which supports business rules for the market analysts, such as measures of the effectiveness of a promotional campaign on purchasing behavior.

b) Fiscal Fraud Detection: in collaboration with CNUCE-CNR, two Italian Universities and the Ministry of Finance. The project has delivered a case study in building and deploying a data mining application in the field of fraud detection. In particular, a methodology for using classification and learning techniques has been developed, which supports the definition of auditing strategies for the detection of tax evasion. The audit planning strategy is aimed at reaching an optimal compromise between number (and cost) of audits and recovery of tax evasion from fraudulent taxpayers. As in the Datasift project, an extension of LDL++ with external classification tools has been used to formalize the knowledge discovery process.
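The audit-planning tradeoff just described can be sketched in a few lines; the scores, costs and field names below are invented for illustration, not taken from the project:

```python
# Hedged sketch of cost-sensitive audit planning: given a classifier's fraud
# scores, audit only where the expected recovery exceeds the audit cost.

def plan_audits(taxpayers, audit_cost):
    """taxpayers: list of (id, fraud_probability, potential_recovery).
    Rank by expected recovery and keep those worth more than an audit."""
    ranked = sorted(taxpayers, key=lambda t: t[1] * t[2], reverse=True)
    return [tid for tid, p, rec in ranked if p * rec > audit_cost]

candidates = [
    ("t1", 0.9, 10_000),   # expected gain 9000  -> audit
    ("t2", 0.2,  1_000),   # expected gain 200   -> skip
    ("t3", 0.7,  2_000),   # expected gain 1400  -> audit
]
audits = plan_audits(candidates, audit_cost=500)
```

In the project itself the classification scores came from external tools coupled to LDL++, and the selection policy was expressed as deductive rules rather than a procedural filter.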

Program composition and interaction. A large part of the research activities of the group has been devoted to the definition of well-founded methodologies for the disciplined development of software via a program composition approach. Indeed most modern computer systems consist of large numbers of software components that interact with one another, and increasing attention is being directed towards methodologies for combining and integrating existing components to develop large-scale systems.

The approach developed focuses on the definition of meta-languages that support the integration of separate software components, and employs the logic-based programming paradigm both for the component specification and for the meta-language definition. The approach is centered on the use of well-defined, meta-level operations for composing object logic programs, which form the kernel of the proposed meta-languages. The development of the program composition approach has involved different aspects ranging from the formal definition of a basic set of composition operations (semantics), to the implementation of the meta-languages (both compilation- and interpretation-oriented), to the application of the proposed meta-languages to different problems (from programming-in-the-large to artificial intelligence applications).

The paradigm shift from stand-alone, isolated systems to interacting, possibly distributed systems has made interaction one of the central issues both in the theory and in the practice of computer science. Our more recent activity has been devoted to developing a logic-based interpretation of the behavior of programs that may interact dynamically with a (partially known) external environment during their computation. The interactive behavior of logic programs has been tested by developing examples of WEB-based knowledge extractors.

Spatial and temporal reasoning. Temporal reasoning is at the heart of human activity and not surprisingly it has raised a lot of interest in computer science, be it in the form of temporal logics, temporal programming or temporal databases. Our recent work aimed at building a framework where temporal information can be naturally represented and handled, and, at the same time, knowledge can be separated and combined by means of meta-level composition operators.

We have defined a language, called MuTACLP, that extends the idea of program composition discussed above by allowing temporal annotations on atomic formulae. The annotations make time explicit, but avoid the proliferation of temporal variables and quantifiers typical of the first-order approach. In this way, the proposed language supports qualitative and quantitative temporal reasoning and allows one to represent definite, indefinite and periodic temporal information, and to work both with time points and time periods (time intervals).
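Outside MuTACLP itself, the flavor of annotated atoms can be sketched as follows: each fact carries its validity intervals, and a query asks whether the atom holds throughout a period or at some point in it, mirroring the "th"/"in" style of annotation. The facts and predicate names are invented:

```python
# Hedged sketch of temporally annotated facts: an atom is paired with the
# intervals over which it holds (here, years; half the point of MuTACLP is
# that such annotations replace explicit temporal variables in the clauses).
facts = {
    ("owns", "anna", "parcel7"): [(1990, 1995), (1998, 2003)],
}

def holds_th(atom, start, end):
    """True if the atom holds THroughout the whole period [start, end]."""
    return any(s <= start and end <= e for s, e in facts.get(atom, []))

def holds_in(atom, start, end):
    """True if the atom holds at some point IN the period [start, end]."""
    return any(s <= end and start <= e for s, e in facts.get(atom, []))
```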

As far as spatial data are concerned, we model them using constraints, a programming facility provided by MuTACLP. The idea is to triangulate a spatial object and to represent each triangle as a constraint. By composing such constraints we obtain the constraint associated with the spatial object. We then support the ordinary set-theoretic operators (union, intersection, difference, complement) and the topological relations (disjoint, meet, equal, overlap, inside, covers) among spatial objects.
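The triangulation idea can be illustrated concretely: a spatial object is a set of triangles, point membership is tested triangle by triangle, and set-theoretic composition of objects follows. The coordinates are invented, and plain geometry stands in for the constraint machinery:

```python
# Hedged sketch: a spatial object as a union of triangles. In MuTACLP each
# triangle would be a linear constraint; here we test membership directly.

def _sign(p, a, b):
    # signed area test: which side of segment a-b the point p lies on
    return (p[0] - b[0]) * (a[1] - b[1]) - (a[0] - b[0]) * (p[1] - b[1])

def in_triangle(p, tri):
    a, b, c = tri
    d1, d2, d3 = _sign(p, a, b), _sign(p, b, c), _sign(p, c, a)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)   # same side of all edges, or on one

def inside(p, obj):
    """The object holds p iff some triangle of its decomposition does."""
    return any(in_triangle(p, t) for t in obj)

# a unit square, triangulated into two triangles
square = [((0, 0), (1, 0), (1, 1)), ((0, 0), (1, 1), (0, 1))]
```

Union of two objects is simply the union of their triangle sets, which is why the composition operators of the language carry over so directly.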

SHORT TERM DEVELOPMENTS

In this section we describe the activities of the group in two ongoing projects:

1. Intelligent Agents: Interaction and Knowledge Acquisition, a project funded by the Italian University and Research Ministry, involving eight Universities and led by Franco Turini.

2. DeduGIS: Deductive constraint databases for intelligent Geographic Information Systems, an ESPRIT Working Group funded by the European Union, involving six research groups and a board of industrial partners from Italy, Germany and The Netherlands, and led by Fosca Giannotti, CNUCE, Pisa.

1. Intelligent Agents: Interaction and Knowledge Acquisition

What follows is a short description of the general aims of the project followed by a description of the tasks performed by this group in Pisa.

General aims of the project.

Given the extraordinary growth of the information sources that are available via computer networks, the problem of combining distributed and heterogeneous information sources is becoming critical. The information sources may include traditional databases, files, archives of texts, images and sounds, knowledge bases, programs, web sites and so on.

A promising approach to this problem is the realization of systems of agents for extracting knowledge and reasoning about it. In such systems, each agent is specialized for interacting with a specific kind of information source, and it is capable of interacting with other agents of the system in order to synthesize the answers to users' queries. Furthermore, in general, the collection of information occurs in the framework of cooperation processes that involve different actors located in different sites. Agent systems are also a promising technology for supporting cooperation, especially because of their awareness of the context in which they are operating.

The goal of this project is the study of the theoretical foundations, the realization of the implementation tools, and the experimentation on specific applications of agent systems for collecting and synthesizing information, and for supporting cooperation.

The role of our group

The contribution of the working group, in the field of agent architectures, concerns the extension of the language MedLan with the ability to coordinate agents. Currently MedLan is an extension of logic programming that provides support for mediation. MedLan agents are either logic programs that interface with external information sources (e.g. relational databases or geographic information systems) or mediators, that is, agents which merge the results of the information extraction agents in order to satisfy the queries of the user. We will design an extension of the computation model of MedLan to allow an agent to react to the external environment, modeled as a set of agents whose nature is not necessarily known when defining the single agent. Such an extension of the computation model implies an extension of the language, offering explicit mechanisms for communicating and synchronizing with other agents, in the spirit of the model of generative communication originally proposed in the language Linda.
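The Linda style of generative communication mentioned above can be sketched minimally: agents coordinate only through a shared tuple space, via operations to emit a tuple, read a matching one, and read-and-remove one. This single-threaded sketch uses invented tuple contents; a real coordination layer would add blocking and concurrency:

```python
# Hedged sketch of Linda-style coordination: out / rd / in over a tuple space.
class TupleSpace:
    def __init__(self):
        self.tuples = []

    def out(self, t):
        """Emit a tuple into the shared space."""
        self.tuples.append(t)

    def _match(self, pattern, t):
        # None in a pattern position acts as a wildcard
        return len(pattern) == len(t) and all(
            p is None or p == v for p, v in zip(pattern, t))

    def rd(self, pattern):
        """Non-destructive read of a matching tuple (None if absent)."""
        return next((t for t in self.tuples if self._match(pattern, t)), None)

    def in_(self, pattern):
        """Destructive read: remove and return a matching tuple."""
        t = self.rd(pattern)
        if t is not None:
            self.tuples.remove(t)
        return t

space = TupleSpace()
space.out(("query", "sales", 1999))     # a mediator posts a request
req = space.in_(("query", None, None))  # an extraction agent claims it
```

The agents never name each other, only the tuples: this decoupling is what makes the model attractive when the set of agents is not known in advance.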

In the subfield of data mining, the group has the goal of constructing an agent capable of integrating into the basic logic language the interaction with data mining algorithms, at two technological levels. At the higher level, data mining techniques provide tools and methods for the analysis of large amounts of data, based on the extraction of implicit information in the form of patterns or interesting rules that are potentially useful for the user, although not directly available.

At a lower level, logic languages for databases provide an adequate linguistic support for the use of data mining techniques and their integration with relational databases, since they can directly exploit an explicit representation of the knowledge (rules) extracted via inductive techniques. The declarative nature of the technology, which allows one to express what is needed rather than how it is obtained, leaving the control decisions to the system, is such that the combined use of inductive and deductive technology allows one to model the complex analyses necessary to extract and refine knowledge that is not directly available via basic data mining tools. Very often, for example, data can be characterized at multiple abstraction levels.

Launching a data mining process over a single abstraction level may yield little insight. Better results can instead be obtained by a method that progressively focuses on interesting features at different levels of abstraction.
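The progressive-focusing idea can be sketched over a toy product hierarchy: mine frequencies at the coarse category level first, then drill down only into the categories that proved interesting. The hierarchy, baskets and threshold are invented:

```python
# Hedged sketch of mining over multiple abstraction levels.
hierarchy = {"milk": "dairy", "cheese": "dairy", "wine": "drinks",
             "beer": "drinks", "pasta": "staples"}

baskets = [{"milk", "cheese"}, {"milk", "wine"}, {"cheese", "milk"},
           {"pasta"}]

def frequent(baskets, label_of, min_support):
    """Count each label at most once per basket; keep frequent labels."""
    counts = {}
    for basket in baskets:
        for label in {label_of(i) for i in basket}:
            counts[label] = counts.get(label, 0) + 1
    return {l for l, c in counts.items() if c >= min_support}

# step 1: coarse level -- which categories are frequent at all?
hot_categories = frequent(baskets, lambda i: hierarchy[i], 3)
# step 2: fine level, restricted to items of the surviving categories
hot_items = {i for i in frequent(baskets, lambda i: i, 3)
             if hierarchy[i] in hot_categories}
```

The restriction in step 2 is where the deductive layer earns its keep: the focusing condition is just a rule over the facts produced by the previous mining step.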

2. Dedugis

What follows is a short description of the general aims of the project followed by a description of the tasks performed by this group in Pisa.

General aims of the project.

The research community agrees that the current generation of GIS poorly matches important demands coming from the applications, such as:

· the ability to express spatio-temporal correlations by means of high-level mechanisms;

· the ability to integrate and transparently interact with heterogeneous information sources and tools for processing information, such as diverse problem solvers and reasoning methods.

DeduGIS intends to study the methods and tools that make logic-programming-based software the right glue technology for constructing knowledge-intensive GIS applications. The ultimate goal of the working group is to envisage a new generation of GIS's, characterized by enhanced capabilities to support:

· spatio-temporal reasoning,

· semantic integration of diverse data models.

In particular the working group will address and study:

· a methodology to conservatively extend current GIS technology;

· the design of a software component to be used in the development environment of GIS applications, which supports the above enhanced capabilities;

· application demonstrators developed using such new capabilities, in the areas of cadastral (real estate) data management and environmental monitoring.


The design of DeduGIS will be based upon the identification and the amalgamation of a repertoire of programming facilities, distilled from:

· deductive constraint databases, which merge deductive databases and constraint logic programming;

· logic-based mediator languages, which declaratively support the semantic integration of heterogeneous data.

The role of our group

The contribution of our group in DeduGIS is in the definition of the spatio-temporal logic-based formalism, and the design of the constraint mediator language.

LONG TERM DEVELOPMENTS

The long term goal of the group is the construction of a programming system, along with a number of its applications, which validates our current idea that a computational logic kernel can support a programming environment with the capabilities of:

· representation of domain knowledge

· spatio-temporal reasoning

· semantic integration of heterogeneous data sources

· interoperation with:

· data mining packages

· information extraction packages

· geographical information systems

· data analysis packages

To this goal we shall pursue a mixture of application-driven and technology-driven approaches. The applications will provide the needed requirements for the programming environment. As far as technology is concerned, we shall push computational logic to encompass all the features for supporting forms of reasoning (spatio-temporal reasoning, meta-reasoning over results of data mining, representation of domain knowledge, programming of mediators), while exploiting consolidated technology by making the kernel interoperable with tools for data management (databases, GIS's, data mining packages, information extraction packages, data analysis tools).

CURRICULA VITAE of Participants

Franco Turini

Franco Turini was born in 1949 in Italy. He graduated in Computer Science (Laurea in Scienze dell'Informazione) summa cum laude in 1973 from the University of Pisa. He is currently a full professor in the Department of Computer Science of the University of Pisa. In 1978-80 he was a visiting scientist at Carnegie Mellon University (Pittsburgh) and, afterwards, at the IBM Research Center in San Jose. In 1992/93 he was a visiting professor at the University of Utah. His research interests include programming language design, implementation, and formal semantics, especially in the field of functional and logic programming. In 1990-1993 he was responsible for an Italian national programme on "New Generation Languages" sponsored by the National Research Council. He has been coordinator of the "Meta- and non-monotonic reasoning" area of the BRA Esprit Action No 6810 "Compulog II". Recently, he co-edited with Krzysztof Apt a book entitled Meta-logics and Logic Programming, published by The MIT Press. He has served on many program committees, was program chairman of the workshop "Meta-programming in Logic 1994", and edited the proceedings issued by Springer-Verlag as LNCS 883 (1994). He is currently serving as Department Chairman.

Antonio Brogi

Antonio Brogi was born in 1964 in Italy. He holds a Ph.D. in Computer Science (University of Pisa, 1993) and is currently an associate professor at the Department of Computer Science at the University of Pisa. He has been visiting professor at U.C.L.A. (1994/95) and visiting scientist at Imperial College (1992). His research interests include programming language design, logic-based programming, and program interaction and coordination. He is participating in several international research and cooperation projects. He has served on several conference program committees and is currently a member of the editorial board of the journals Computer Languages and Information.

Paolo Mancarella

Paolo Mancarella was born in 1959 in Italy. He received his Ph.D. in Computer Science in 1988 at the University of Pisa, Italy. From 1988 he worked at the Department of Computer Science in Pisa as an assistant professor. In 1988-89 he was on sabbatical leave at Imperial College, London. Since 1992 he has been an associate professor at the University of Pisa, Italy. He was the coordinator of the project KIT011-LPKRR, Keep In Touch Cooperative Activity on Logic Programming in Knowledge Representation and Reasoning, in the period 1993-1997.

Dino Pedreschi

Dino Pedreschi was born in 1958 in Italy, and holds a Ph.D. in Computer Science from the University of Pisa, obtained in 1987. He is currently an associate professor at the Dipartimento di Informatica of the University of Pisa, serving as the coordinator of the undergraduate studies in CS (Diploma Universitario in Informatica). He has been a visiting scientist and professor at the University of Texas at Austin (1989/90), at CWI Amsterdam (1993) and at UCLA (1995). He has a long-standing collaboration with K. R. Apt (CWI) on verification methods for logic programming, and with C. Zaniolo (UCLA) and V. S. Subrahmanian (Univ. of Maryland) on various topics of logic in databases. He has been the coordinator of the working group Non-determinism in deductive databases, jointly sponsored by the European Union and the US National Science Foundation, involving four European and two US universities and research centers. He has been the co-editor of two books on the subject of the working group. His current research interests are in logic in databases, and particularly in data analysis, in deductive databases, in the integration of data mining and database querying, in spatio-temporal reasoning, and in formal methods for deductive computing.

Selected group publications

D. Aquilino, P. Asirelli, A. Formuso, C. Renso, F. Turini. Using MedLan to integrate geographical data. The Journal of Logic Programming, 1999.

A. Brogi, S. Contiero, and F. Turini. Programming by composing general logic programs. Journal of Logic and Computation, 9(1):7-24, 1999.

F. Giannotti, D. Pedreschi and C. Zaniolo. Semantics and expressive power of non deterministic constructs for deductive databases. Journal of Computer and Systems Sciences. To appear, 1999.

F. Giannotti, M. Nanni, G. Manco, D. Pedreschi and F. Turini. Integration of Deduction and Induction for Mining Supermarket Sales Data. In Proc. PADD'99, Practical Application of Data Discovery, Int. Workshop, London, 1999.

F. Giannotti, G. Manco, D. Pedreschi and F. Turini. Experiences with a logic-based knowledge discovery support environment. To be presented at 1999 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (SIGMOD'99 DMKD).

P. Mancarella, A. Raffaetà, F. Turini. Knowledge representation with multiple logical theories and time. Journal of Expt. Theor. Artif. Intell., 11:47-76, 1999.

A. Brogi and S. Contiero. A Program Specialiser for Meta-level Compositions of Logic Programs. New Generation Computing, 16(2):123-162, 1998.

A. Brogi and J.M. Jacquet. On the expressiveness of Linda-like concurrent languages. Electronic Notes in Theoretical Computer Science, 16(2):61-82, 1998.

F. Giannotti and D. Pedreschi. Datalog with non deterministic choice computes NDB-PTIME. Journal of Logic Programming, 35(1): 79-101, 1998.

D. Aquilino, P. Asirelli, C. Renso, F. Turini. Applying Restrictions Constraints to Deductive Databases. Annals of Mathematics and Artificial Intelligence, 19(1-2), 1997.

A. Brogi, V.S. Subrahmanian, and C. Zaniolo. The Logic of Totally and Partially Ordered Plans: A Deductive Database Approach. Annals of Mathematics and Artificial Intelligence, 19(1,2):27-58, 1997.

A. Brogi, E. Lamma, P. Mello, and P. Mancarella. A Unifying View for Logic Programming with Non-Monotonic Reasoning. Theoretical Computer Science, 184(1-2): 1-59, 1997.

P. Mancarella, A. Raffaetà, F. Turini. Time in a Multi-Theory Framework. In Proc. TIME97, Daytona Beach, USA, May 1997.

D. Pedreschi and V. S. Subrahmanian (editors). Non-determinism in Databases. Special issue of Annals of Mathematics and Artificial Intelligence, 19(1-2), 1997.

D. Aquilino, C. Renso, F. Turini. Towards Declarative GIS analysis. In Proc. Fourth ACM GIS Workshop, Rockville, Maryland, USA, November 1996.

P. Asirelli, C. Renso, F. Turini. Language Extensions for Semantic Integration of Deductive Databases. In Proc. LID96, Int. Workshop on Logic in Databases, LNCS vol 1154. S. Miniato, Pisa 1996.

D. Pedreschi and C. Zaniolo (editors). Logic in Databases. Springer-Verlag, Lecture Notes in Computer Science, N. 1154, 1996.

A. Brogi, F. Turini. Fully Abstract Compositional Semantics for an Algebra of Logic Programs, Theoretical Computer Science, 149(2):201-229, 1995.

A. Brogi, P. Mancarella, D. Pedreschi, F. Turini. Modular Logic Programming, ACM Toplas, 16(4):1361-1398, 1994.

K. R. Apt and D. Pedreschi. Reasoning about termination of Prolog programs. Information and Computation 106(1): 109-157, 1993.

R. Barbuti, P. Mancarella, D. Pedreschi and F. Turini. A Transformational Approach to Negation in Logic Programming. Journal of Logic Programming 8(1): 201-228, 1990.


