Pisa - Dipartimento di Informatica - Research Evaluation Exercise 1999


Architectures and Programming Tools
for
High Performance Computing



Proposer

Prof. Marco Vanneschi

Participants

University of Pisa, Department of Computer Science, staff:
Silvano Antonelli (90%), Fabrizio Baiardi (90%), Marco Danelutto (90%), Laura Ricci (90%), Marco Vanneschi (90%).

University of Pisa, Department of Computer Science, PhD students:
Marco Aldinucci (100%), Massimo Coppola (100%), Paolo Mori (100%), Andrea Zavanella (100%).

Collaborators

· ENS, Paris, France: R. Di Cosmo

· University of Passau, Germany: S. Gorlatch, C. Lengauer

· INRIA, Rocquencourt, France: X. Leroy

· Queen's University, Canada: D. Skillicorn

· University of Valencia, Spain: J. Duato

· University of Edinburgh, UK: M. Cole

· University of Oxford, UK, and Los Alamos Labs, USA: F. Petrini

· HP Labs: M. Jazayeri

· Several CNR institutes in the PQE2000 project

Keywords

high performance computing, parallel architectures, parallel computing models, skeletons, distributed shared memory, multithreading, Processors-In-Memory.

Title

Architectures and Programming Tools for High Performance Computing

Summary

Our research activity aims at studying the design of programming models, compiling tools, architectures and basic mechanisms (both hardware and software) that can be used to improve High Performance Computing (HPC) systems and applications.

Concerning the programming models and the compiling tools, we investigate the feasibility of designing expressive and efficient structured parallel programming environments. This activity led to the definition of a structured, skeleton-based, explicitly parallel programming language, and of a suitable framework of compiling tools for the language, delivering good performance on a variety of massively parallel machines. Current research on this track focuses on several topics, including: template-based compiling techniques, dynamic (data-flow based) implementation techniques, optimization strategies based on program rewriting and transformation using suitable cost calculus models, aspects related to the simultaneous exploitation of both control and data parallelism, support for structured access to shared objects, and evaluation of the architectural impact of the different kinds of parallelism exploitation patterns considered.

We also investigate the feasibility and the design of new optimization techniques for more traditional parallel programming models, such as those in the HPF (High Performance Fortran) track, based on the exploitation of information derived by static analysis techniques such as abstract interpretation.

Concerning the architecture and the basic hardware-software mechanisms for HPC, our research activity is focused on two distinct, but related, topics. On the one hand, we investigate the design and feasibility of implementing shared memory support on massively parallel architectures based on the Distributed Memory MIMD model (Distributed Virtual Shared Memory). On the other hand, we are currently investigating the impact of adopting different architectural models, based on non-conventional components such as intelligent memory modules (Processors In Memory, or PIM) and/or multithreaded processing elements, in composite architectures that include a general-purpose, message-passing NUMA section.

The work on parallel architectures and optimization techniques also exploits the results of a research activity on wormhole interconnection networks and their performance models for abstract and concrete parallel machines.

We also study the aspects related to the usage of networks of commodity computers as cost-effective parallel machines.

The main goal of the research on the above-mentioned topics is to derive methodologies and sound design techniques for parallel systems and applications. However, a significant part of our research is experimental: to give relevant results, most research issues need intensive experiments and prototyping activities, both at the system and at the programming tools level, along with simulation activities.

The participation, with a leadership role, in past and current national projects allowed us to verify the research results on real problems and to favor and coordinate the transfer of results into advanced products for several kinds of architectures and platforms (Transputer-based until the beginning of the 1990s, then commodity processor-based MPP machines and workstation clusters). The transfer of results into innovative industrial products is a current task of this group.

State of the Art and Trends

Currently, different computer architectures as well as programming models and support tools are considered in the field of HPC.

Concerning computer architectures, "classical" Massively Parallel Processor (MPP) architectures, symmetric multiprocessors, and networks of workstations are taken into account for general-purpose systems. Classical MPP architectures are mostly based on the interconnection of standard (commodity) processing elements via different kinds of interconnection networks (mainly k-ary n-cubes and fat trees) and on the usage of specialized hardware for fast processor-network interfaces (communication coprocessors, routing coprocessors). Symmetric multiprocessors, although not explicitly in the area of "massively" parallel processing, deliver high performance by exploiting superpipelined/superscalar processing elements sharing a global memory space, whose access is optimized through different cache-memory hierarchy optimization techniques. Finally, networks of PCs or workstations, interconnected by fast, standard networks (Fast Ethernet, Gigabit Ethernet, ATM networks), are often considered as cost-effective machines able to deliver respectable performance at a smaller cost than classical MPP machines.

Nowadays MPP machines deliver MegaFlops to GigaFlops in configurations including hundreds to thousands of processing elements. Some machines have demonstrated peak TeraFlops performance. The current trends in MPP development have PetaFlops peak performance as the ultimate target, and are focused on the study of new, faster and higher bandwidth interconnection networks (and related hardware support), on the usage of faster and faster processors, possibly implemented by exploiting superconductivity technology, and on the development of new processing elements, with hardware multithreading support and better memory hierarchies based on the adoption of computing resources at the memory side (Processors In Memory, or PIM). Novel techniques for the efficient management of memory hierarchies, shared objects and I/O are the critical issues for research in the area of next generation, general-purpose MPP machines.

Most current HPC/MPP machines are programmed using either imperative programming languages in conjunction with message passing libraries (C or Fortran with calls to MPI or PVM), or languages derived from classical imperative programming languages, such as Fortran, extended with annotations to express data parallel computations (HPF). Although this is a sort of de-facto standard for the programming environments of high performance computing machines, skeleton-based, structured parallel programming environments have recently been studied as a valid, viable alternative to HPF and to sequential languages plus communication libraries. Such programming languages/environments (Hence, P3L/SkIE, FortranS) are able to achieve performance comparable to that achieved by using sequential languages with communication library calls, but provide definitely higher expressive power and rapid prototyping features, leading to better software productivity and application portability.
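To make the skeleton idea concrete, the following is a minimal sketch in Python. It is not P3L or SkIE-CL syntax: the helper names seq, farm and pipe are hypothetical, chosen only for illustration, and the thread pool merely stands in for whatever parallel support a real environment would provide. It shows how a program is written purely as a composition of a few predefined parallel patterns, with the sequential code confined to the functions passed as parameters.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def seq(f):
    """Wrap a sequential function as a stage operating on a stream of items."""
    return lambda stream: [f(x) for x in stream]


def farm(f, workers=4):
    """Stream-parallel skeleton (hypothetical helper): independent stream
    items are processed by a pool of identical workers."""
    def run(stream):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(f, stream))
    return run


def pipe(*stages):
    """Pipeline skeleton (hypothetical helper): functional composition of
    stages. Stages are run one after the other here; a real implementation
    would overlap them on different processing elements."""
    def run(stream):
        return reduce(lambda data, stage: stage(data), stages, stream)
    return run


# The parallel structure is expressed only through skeleton nesting.
program = pipe(seq(lambda x: x + 1), farm(lambda x: x * x, workers=2))
result = program(range(5))
print(result)
```

Because the parallel structure is explicit and restricted to known patterns, a compiler is free to choose how each skeleton is mapped onto the target machine without changing the program text.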

The research efforts in the field of parallel programming models and tools are focused on the integration of different parallelism exploitation patterns, e.g. data and control parallelism, within the same framework (MPI/2, HPF/2, OpenMP), and on the study of new programming models based on languages with better support for parallel/concurrent activities (High Performance Java) than classical sequential programming languages.

Furthermore, new programming models and/or ways of expressing parallel computations are studied, both in the HPF track and in the structured/skeleton parallel programming framework, according to an approach that moves more choices and optimizations from the programmer to the compiling tools. Such optimizations can be performed by compiling tools that exploit a better knowledge of the parallel computation, made possible by recently developed abstract computing models that are in general associated with a formal cost model (e.g. BSP). Overall, the research activity is moving towards the development of programming tools delivering more and more performance on a wide range of parallel architectures, possibly requiring limited effort from the programmers when programs are moved across different platforms, but still characterized by intensive performance-directed optimizations either at compile time or at run time (the performance portability issue).
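As an illustration of how a formal cost model can drive such choices, the sketch below uses the standard BSP superstep cost, w + h*g + l, where w is the maximum local work, h the maximum number of words sent or received by any processor, g the per-word communication cost and l the barrier synchronization cost. The class and function names, and all the numeric values, are assumptions made up for the example; they do not describe any specific machine or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BSPMachine:
    g: float  # communication cost per word (in units of local operations)
    l: float  # barrier synchronization cost (same units)


def superstep_cost(w, h, machine):
    """Standard BSP cost of one superstep: w + h*g + l."""
    return w + h * machine.g + machine.l


def program_cost(supersteps, machine):
    """Cost of a program given as a list of (w, h) pairs, one per superstep."""
    return sum(superstep_cost(w, h, machine) for w, h in supersteps)


def best_variant(variants, machine):
    """Pick the name of the cheapest variant for the given machine."""
    return min(variants, key=lambda name: program_cost(variants[name], machine))


# Two hypothetical implementations of the same computation.
variants = {
    "coarse": [(1000, 200)],          # one superstep, heavy communication
    "fine": [(500, 20), (500, 20)],   # two supersteps, light communication
}

slow_barrier = BSPMachine(g=1, l=500)  # illustrative parameters only
slow_network = BSPMachine(g=4, l=10)   # illustrative parameters only

print(best_variant(variants, slow_barrier))
print(best_variant(variants, slow_network))
```

The same source program is thus mapped to different implementations on different machines, which is the essence of performance portability: the choice depends only on the machine parameters g and l, not on the program text.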

A key issue in trends for HPC systems is matching architecture and software technology. The approach in which the programming model is rigidly derived from the architectural model (or even from a specific architecture) has been, unfortunately, adopted for many years, leading to non-portable, non-modular, very expensive and error-prone parallel programs for specific SIMD and MIMD machines. Recently, the fundamental value of general-purpose HPC machines has been widely recognized. For these machines the correct approach consists in the following points:

a) the programming model must be able to efficiently exploit the new directions in computer architecture. For example, programming models that address the widening gap in performance between CPU and memory, or models able to support dynamic scheduling of parallel jobs in a multiuser environment;

b) at the same time, the new directions in computer architecture should be influenced by the general characteristics of the programming models. For example, provided that massively parallel programs can be expressed easily by the programming model, it is fundamental to realize new MPP general purpose architectures which are effectively able to exploit large amounts of fine grain parallelism;

c) a specific issue is the ability of the programming model and tools to support the heterogeneity of the underlying physical machine: future generation MPP computers will exploit more than one technology, typically a mix of commodity-based components and proprietary components, as well as several architectural styles in combination.

An issue of growing importance in matching parallel architectures and software technology is the set of problems caused by the increasing performance gap between processors and main memory. A general trend in so-called memory-oriented models for HPC systems seems to be common to the most interesting architectural models for the next generation: the exploitation of larger amounts of fine grain parallelism, both for calculations and for the support of system functionalities. Two main solutions, not mutually exclusive, can be identified:

· multithreading

· Processors-In-Memory (PIM)

These issues are studied and experimented with in important projects, namely the HTMT project in the US and, in Europe, our PQE2000 project.

In architectures based on multithreading, fine grain parallelism is a way to fully exploit the computational resources (functional units) of each processor, and at the same time to hide the latency for remote memory accesses or communications. It is very important to evaluate the multithreading power in the implementation of different forms of parallelism and their composition, in particular for massively parallel irregular/dynamic applications.
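The latency-hiding effect can be illustrated with a simple deterministic model that is commonly used for multithreaded processors; it is a textbook-style approximation, not a result of this group. Each thread is assumed to compute for a fixed run length, then stall for a fixed remote access latency, and each context switch is assumed to have a fixed cost. All parameter names and values below are assumptions for the example.

```python
from math import ceil


def utilization(threads, run_length, latency, switch_cost=0.0):
    """Fraction of time the processor does useful work.

    Below saturation, utilization grows linearly with the number of threads;
    once there are enough threads to cover the latency, it is limited only
    by the context switch overhead.
    """
    if threads < 1 or run_length <= 0:
        raise ValueError("need at least one thread and a positive run length")
    linear = threads * run_length / (run_length + switch_cost + latency)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


def threads_to_saturate(run_length, latency, switch_cost=0.0):
    """Smallest number of threads that fully hides the access latency."""
    return ceil(1 + latency / (run_length + switch_cost))


# Illustrative numbers: 10 cycles of work between remote accesses,
# 90 cycles of remote access latency, free context switches.
for n in (1, 5, 10, 20):
    print(n, utilization(n, run_length=10, latency=90))
print(threads_to_saturate(run_length=10, latency=90))
```

Under these assumptions a single thread keeps the processor busy only a tenth of the time, while ten threads are enough to hide the latency completely; adding more threads beyond that point brings no further gain.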

The PIM model consists in placing multiple processors and interconnect logic on a high density memory chip. The main purpose is "to execute in memory" the fine grain, highly parallel computations without incurring (or, better, while minimizing) the performance penalties introduced by the processor-memory interface of conventional computer structures. A single-chip PIM (containing from 10^1 to 10^3 simple RISC-like processors) is then a building block that can be used to form very compact MPP structures for PetaFlops computing, and to provide novel solutions that greatly increase the efficiency of system functionalities which, in the next generation, could represent unacceptable bottlenecks (e.g. memory hierarchy and shared objects management).

Relevant research activities at the Department

Prof. Vanneschi's group at the Department has been involved in different projects since the 1980s, concerning both the design of parallel and scalable machines and the design of parallel programming environments, tools and programming methodologies, and it is currently involved in different projects and activities related to HPC:

a) the PQE2000 project, a joint initiative of the main Italian research institutions - CNR (National Research Council), ENEA (National Agency for Energy, Environment and New Technologies), INFN (National Institute for Nuclear Physics) - and of Finmeccanica's QSW (Quadrics Supercomputers World Ltd) for the realization of innovative HPC general-purpose systems and their applications in industry, commerce and public services. Marco Vanneschi is the Scientific Director of the PQE2000 project. Over 15 R&D groups from the research institutions, universities and industry collaborate on the project. As a consequence of the technology emerging from PQE2000, a Research Programme on HPC has been defined by MURST (Ministry of University and of Scientific and Technological Research), aiming at the definition of a large European initiative in HPC technology and applications. PQE2000 has established important collaborations, in particular with the Ministry of Finance for the development of HPC tools to fight tax evasion. The research activities of PQE2000 include MPP architecture, software tools and environments, and applications in technical, transactional and new media areas. A research road map towards PetaFlops MPP architectures and programming environments has been established for the next four years and, at the same time, a first, innovative version of PQE2000 hardware and software technology has been realized by integrating stable basic products;

b) the MURST project Mosaico (Methodologies and Tools of High Performance Systems for Multimedia Applications), investigating the problems related to the implementation of high performance systems (distributed and/or parallel) able to efficiently support multimedia applications. Within this project, our group is mainly involved in the design of software tools and programming environments supporting high performance computing on workstation clusters;

c) the Galileo exchange project with the French partners E.N.S. (Paris) and INRIA (Rocquencourt), investigating the possibility of embedding skeleton programming models within an ML-like functional programming language (Ocaml by INRIA). Within this project our group participates in the development of a prototype Ocaml programming environment allowing the user to program parallel applications running on top of a Unix workstation network using a small set of predefined parallel skeletons;

d) the Vigoni exchange project with University of Passau, in which the skeleton based transformation rules developed in Passau and the skeleton language P3L developed in Pisa are joined in an interactive transformation system for skeleton programs.

The group is also involved in the activities of the local PhD school. The group coordinates the activities of PhD students interested in the development of complex, symbolic applications using parallel programming models (structured parallelism and data mining), in the design of new architectural models exploiting intelligent memory hierarchies (PIM and multithreading), in the exploitation of abstract machines and cost models in the optimization of parallel programs (BSP + skeletons), and in design methodologies and tools for shared memory (Distributed Virtual Shared Memory and irregular problems).

Overall, the main results recently achieved by the group members within this research framework concern:

1. the design of structured parallel programming languages based on skeletons (P3L, SkIE) and the implementation of suitable compiling tools and run time supports (P3L compiler, SkIE-CL programming environment) [DDO+92, BDP+95, BCD+97, BDP99, Van98b, Van98c]. Within this framework original implementation techniques and program transformations have been developed (template based compiling tools, cost calculus driven skeleton rewriting and transformation techniques, structured parallel application development methodology) [ACD98, AD99, CDF97, DDL98, OV95, ZP99].

SkIE (Skeleton-based Integrated Environment) is the programming environment of PQE2000. Through the coordination language SkIE-CL it is possible to design complex applications whose global structure is expressed by the combination of stream parallel and data parallel skeletons, and whose component modules can be written (or are existing modules that have been written) in a variety of standard "host" languages and tools (C, C++, Java, F77, F90, HPF) or in standard languages enriched with message-passing libraries. The integration of standard tools into a single parallel environment is fundamental to the design of complex, multidisciplinary applications. The SkIE-CL compiler performs "global" optimizations according to the cost model of the underlying machine: this makes it possible to achieve software reuse and performance portability across a variety of machine platforms. The ideas and the scientific coordination of SkIE are part of the work of this research group, while the implementation and engineering have been done by the industrial partner of PQE2000;

2. the architectural model of PQE2000, which is a hierarchically composite MPP architecture, where the outermost level has a NUMA (Non Uniform Memory Access) architecture. The NUMA characterization has been obtained from an underlying distributed memory architecture, realized by the industrial partner of PQE2000, with the addition of a very efficient Distributed Virtual Shared Memory support realized by this research group. In turn, each NUMA node contains a commodity SMP (Symmetric MultiProcessor) part, and a proprietary PIM (Processor-In-Memory) part which supports the implementation of instances of highly parallel data types. In the current implementation, for experimentation purposes, PIM is emulated by a SIMD machine plus proper software interfaces on the NUMA nodes [Van98a, Van98c, Van99]. All the coordination of the research transfer has been done by this group. Large configurations of PQE2000 machines, with parallel OS and SkIE programming environment, have been installed at the Research Labs of CNR in Naples and at the ENEA Labs in Rome. These HPC systems are intensively used by the PQE2000 groups to develop applications in the various areas of interest;

3. models to evaluate the communication performance of wormhole interconnection networks (k-ary n-cubes and fat trees), new routing and flow control strategies and their relationships with parallel programming models (BSP, skeletons) [PV96a, PV96b, PV96c, PV97a, PV97b, PV97c, PV97d, PV97e, PV97f, PD97, PV98a, PV98b, PV99];

4. the definition of static analysis tools that could be suitably used in a range of different cases related to the optimization of parallel programs. Within this track, we developed a systematic approach to exploit abstract interpretations, that can be used to derive different techniques optimizing parallel loop execution [BCR96, BR95, LL99];

5. the utilization of abstract machine models and of the related cost calculus to devise proper optimization strategies for structured parallel applications. The BSP model has been utilized to design a machine-independent technique to choose the "best" composition for data parallel modules. Moreover, a BSP optimal "multibroadcast" strategy has been proposed to replace standard "multicast" within data distribution phases. These results are to be considered a starting point in developing a fully portable implementation for a skeleton environment [ZP99, ZAV99, ADC98, AD99];

6. the efficient parallelisation of complex applications (e.g. data mining applications in the PQE2000 - Ministry of Finance collaboration) exploiting structured parallel programming methodologies [BCV99]. This has been one of the first cases in which the SkIE-PQE2000 environment has demonstrated its power in terms of programmability, software productivity, software reuse and performance portability: an existing code for association rules has been parallelized and fully tested and evaluated in a very short time (one month), with a reuse degree of over 95%, and optimized automatically both for an MPP machine (PQE2000) and a PC cluster (Mosaico).
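The idea of cost calculus driven skeleton rewriting mentioned among the results above can be sketched as follows. This is a deliberately simplified illustration in Python, not the cost calculus or the rewriting rules of P3L/SkIE: the classes Seq, Farm and Pipe, the service time formulas (which ignore communication, emitter and collector overheads) and the rewriting function are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import ceil
from typing import Tuple, Union


@dataclass(frozen=True)
class Seq:
    time: float  # time to process one stream item sequentially


@dataclass(frozen=True)
class Farm:
    worker: "Skel"
    workers: int


@dataclass(frozen=True)
class Pipe:
    stages: Tuple["Skel", ...]


Skel = Union[Seq, Farm, Pipe]


def service_time(s):
    """Simplified average time between two consecutive results."""
    if isinstance(s, Seq):
        return s.time
    if isinstance(s, Farm):
        return service_time(s.worker) / s.workers
    if isinstance(s, Pipe):
        return max(service_time(stage) for stage in s.stages)
    raise TypeError(f"unknown skeleton: {s!r}")


def resources(s):
    """Number of processing elements used by the skeleton tree."""
    if isinstance(s, Seq):
        return 1
    if isinstance(s, Farm):
        return s.workers * resources(s.worker)
    if isinstance(s, Pipe):
        return sum(resources(stage) for stage in s.stages)
    raise TypeError(f"unknown skeleton: {s!r}")


def farm_out_bottlenecks(p):
    """Rewrite a pipeline so that every sequential stage slower than the
    fastest stage is replaced by a farm with just enough workers."""
    target = min(service_time(stage) for stage in p.stages)
    balanced = []
    for stage in p.stages:
        if isinstance(stage, Seq) and stage.time > target:
            stage = Farm(stage, ceil(stage.time / target))
        balanced.append(stage)
    return Pipe(tuple(balanced))


original = Pipe((Seq(2.0), Seq(6.0)))
rewritten = farm_out_bottlenecks(original)
print(service_time(original), resources(original))
print(service_time(rewritten), resources(rewritten))
```

In this toy setting the rewriting trades two extra processing elements for a threefold reduction of the service time. A real tool would evaluate such alternatives with a cost model of the target machine before choosing one.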

Short term plans and expected results

Structured parallel programming models and related tools

We expect results on the following topics, in the next two years:

a) development of novel compiling algorithms/strategies for structured parallel programming models, exploiting much more dynamic behaviour than the current ones;

b) full integration of algorithms providing data parallel and control parallel optimization within the compiling tools of structured parallel programming languages;

c) integration and unification of new skeletons (derived from the large application testbed performed in the context of the PQE2000 project) and of shared objects features into the same structured parallel programming models, along with some preliminary optimization algorithms and strategies;

d) development of implementation and optimization strategies especially targeted at workstation cluster machines;

e) full integration of the skeleton parallel programming model in the framework of a common, sequential, functional programming language (ML, Ocaml).

New architectural models

Within the framework of PQE2000 we will initially evaluate the performance of a SIMD solution to PIM, using the SIMD emulation of a PIM component, in order to compare it with fine grain MIMD solutions to PIM for the next versions of the machine. The short term plan for composite architectures for PetaFlops computing includes the following issues:

a) interconnection structures able to efficiently support the cooperation between the NUMA section, possibly based on multithreaded processors, and the PIM section;

b) PIM architecture in terms of granularity of "active" locations, scheduling control strategies, multithreaded organization of PIM processing elements, and internal interconnection structure;

c) static and dynamic task distribution strategies between NUMA and PIM sections for massively parallel programs;

d) dynamic scheduling of parallel jobs in multiuser systems.

Shared memory on distributed memory machines

In the short term, we expect results on the shared memory implementation for distributed memory machines to be achieved through the evaluation and the performance tuning of the packages developed by our group, and by comparing them against other currently available packages. The main results expected from these evaluations are:

a) the performances that can be achieved on architectures ranging from workstation clusters to MPP architectures;

b) the evaluation of alternative consistency models for the shared data, from one fully equivalent to the sequential one to weaker models, according to the algorithm characteristics;

c) the evaluation of alternative compilation strategies for high level shared objects;

d) the definition of proper hardware/firmware supports for the package.

These issues, in all three tracks (programming tools, architectural models, shared memory), will be investigated according to a uniform hardware/software integrated approach. In particular, we will study the relationships between multithreaded-PIM based architectures, structured programming models and optimisation tools, and virtual distributed shared memory. The experiments will be performed on network computers (Mosaico) and on PQE2000 composite machines, as well as on fast prototypes derived from PQE2000 plus proper off-the-shelf components and interconnection structures.

Long term scenarios

Structured parallel programming models and related tools

We plan to exploit the experience gained in these years to try to actively participate in the definition of the new programming models required for the PetaFlops generation machines. In particular, we plan to address the following topics:

1) definition of a new parallel programming language providing the assessed features delivering high performance in the framework of both skeleton and HPF-like parallel programming languages;

2) definition of new implementation techniques for structured parallel programming languages, relaxing the static constraints related to the template implementation technique currently adopted and exploiting more dynamic, macro data-flow based, implementation techniques;

3) full exploitation of cost calculus directed (performance driven) program rewriting and optimization techniques both at compile time and at run time.

New architectural models

Long term research in matching parallel architectures and software technology will concentrate on new PIM architectures and novel ways to exploit the PIM section in a composite NUMA+PIM organization. The utilization of PIM as a fine grain coprocessor (as in the current phase of our research) is important for increasing performance of intensive calculation applications. However, PIM is an architectural concept that should be exploited much better: in particular, the support to system functionalities is a very promising area for new research tracks in parallel architecture and for the realization of innovative machines.

The previous sections have shown the importance of memory-oriented models for parallel architectures. The following issues can be identified in this area:

1) memory hierarchy management, including very fast support to data transfers between levels of the hierarchy, and compile-time optimizations in terms of block size, synchronization, prefetching and caching;

2) "smart" implementation of powerful operations on shared objects;

3) very efficient support for I/O transfers, possibly driven by compile-time analysis, and for pre- and post-processing on large data sets, possibly according to specific application requirements (e.g. data mining and data warehousing);

4) support to multithreading, i.e. how to provide the NUMA nodes of a composite NUMA+PIM machine support to locality exploitation of thread contexts.

Shared memory on distributed memory machines

We plan to achieve the following results, in the PQE2000 framework:

1) integration of the shared memory hardware/firmware support in the whole architecture;

2) extension of the shared memory mechanisms with reference to the PIM architectural model;

3) caching and replication strategies to speed up the operations and their relations with the underlying architecture;

4) integration of caching and replication strategies for the shared data with the compilation strategy to implement the high level concurrent programming languages.

The results in the three tracks will be evaluated on intensive application benchmarks (namely in data mining, multimedia servers, earth observation, scientific simulations) and, besides other research projects (Mosaico), exploited in the realization of a prototype of the new PQE2000 architecture and programming environment, with the purpose of favoring its transfer into advanced industrial products.

Short CVs

Staff

Marco Vanneschi

Marco Vanneschi graduated in Electronic Engineering at the University of Pisa in 1970. In 1973 he joined the University of Pisa as assistant professor in Computer Architecture. Since 1981 he has been full professor in Computer Architecture at the same Department. His research and teaching activity has been in the area of computer architecture at several levels of abstraction. Specific research activities have been undertaken in the areas of firmware and microprocessing, data-flow models, and parallel and scalable machines, in particular on parallel architectures, programming models and methodologies, and development and evaluation tools for high performance applications.

He has been the coordinator of the National Project on Parallel Architectures (1989-1994) of the National Research Council (CNR). He is a member of IFIP WG 10.3 on Parallel and Distributed Processing, and of the coordination group on Software Technology of the Parallel Processing Network of ERCIM. He has often been invited to take part in reviewing activities of the EU. He is the author of more than 150 articles published in international journals and conference proceedings, and of four books on basic computer architecture, advanced computer architecture, and parallel programming. He is the leader of the group for Architectures and Programming Tools for Parallel Computers at the Department of Computer Science in Pisa. Since 1995 he has been the Scientific Director of the PQE2000 Project, an initiative sponsored by CNR, INFN, ENEA and Finmeccanica for the realization of a new European technology for massively parallel processing at the research and at the industrial level. In 1997 he coordinated the preparation of a proposal for a National Programme on High Performance Computing of the Ministry for the University and for the Scientific and Technological Research (MURST).

Silvano Antonelli

Since 1985 he has been an associate professor at the Computer Science Department of the University of Pisa. He is currently the Vice President of "Centro Interdipartimentale di Servizi Informatici per l'Area Umanistica (CISIAU)". His research interests concern the computational complexity of static mapping and image processing algorithms.

Fabrizio Baiardi

Since February 1989 he has been an associate professor with the Department of Computer Science, University of Pisa. His main research interest is the definition of general purpose parallel architectures. In particular, his research is focused on the design and the evaluation of run time environments and of other programming tools for concurrent programming paradigms, and on the integration of hardware and firmware mechanisms in the run time environment of concurrent programming paradigms.

He has been responsible for the research activity of the Department of Computer Science in the CNR National Project on Parallel Architectures and, in the same program, he coordinated the activities on "Highly Parallel Architectures and Operating Systems". He currently coordinates the activities on the distributed shared memory subsystem in the PQE2000 project. He has published more than 40 papers on parallel and high performance systems.

Marco Danelutto

Marco Danelutto graduated in Computer Science at the University of Pisa in 1984. He received his PhD in Computer Science in 1990. He was a research associate at the Department of Computer Science of Pisa from 1992 to 1998. In 1998 he became associate professor. His main research interests concern (parallel) functional programming, structured parallel programming, implementation techniques for structured parallel programming models, optimization techniques based on program rewriting and transformation, network computing and hardware related techniques for fine grain parallelism exploitation. He has been one of the co-designers of the structured, skeleton-based parallel programming languages P3L and SkIE-CL, and actively participated in the development of the template-based compiling tools.

Currently he is involved in the Italian PQE2000 and MOSAICO projects, related to High Performance and Network and Mobile Computing topics, and in a Galileo project with Italian and French universities and research institutions, related to parallel functional programming models. He is the author of about 30 papers in journals and conference proceedings.

Laura Ricci

Laura Ricci graduated in Computer Science at the University of Pisa in April 1983. From April 1983 to September 1985 she worked on the development of a compiler for message passing languages, within the project "Mumicro", supported by the Italian National Research Council. From November 1985 to November 1989 she attended the Ph.D. courses in Computer Science, and she obtained the Ph.D. degree in September 1990 with the dissertation "Compilation of Logic Programs for Massively Parallel Systems". Since September 1992 she has been an Assistant Professor at the Department of Computer Science, University of Pisa, where she teaches courses on Computer Architectures and on Advanced Compilation Techniques. Her current research interests include software supports for parallel architectures, parallelizing compilers and the development of environments to teach parallel processing.

PhD Students

Marco Aldinucci

Marco Aldinucci graduated in Computer Science at the University of Pisa in 1997. He had a post-graduate research fellowship within the PQE2000 project in 1997. In 1997 he began his PhD studies at the University of Pisa, Dept. of Computer Science. His main research subjects are in the area of high-performance and parallel computing, in particular memory-oriented architectures, Processing-in-Memory architectures, structured parallel programming, and optimisation techniques based on program rewriting and transformation.

Massimo Coppola

Massimo Coppola graduated in Computer Science at the University of Pisa (April 1997). After graduation, he held a six-month post-graduate research fellowship within the PQE2000 project. He began his Ph.D. studies in November 1997. His research fields include parallel computer architectures, run-time support design, compilation and optimization algorithms for skeleton-structured parallel languages, parallel Data Mining (Association Rules and Classification) and Constraint Programming techniques applied to optimization problems related to structured parallel programming.

Paolo Mori

He graduated in Computer Science at the University of Pisa in April 1998. He held a fellowship of Consorzio Pisa Ricerche from July 1998 to March 1999. Since the beginning of 1999 he has been a PhD student in Computer Science at the University of Pisa. His research interests concern load balancing and locality in parallel algorithms for the solution of irregular problems.

Andrea Zavanella

In October 1993 he graduated in Computer Science at the University of Pisa. In 1994 he worked with a temporary grant at the Department of Computer Science. In 1995 he received a grant from the Consorzio Pisa Ricerche to work on the porting of the parallel programming system P3L. In 1996 he began his PhD studies on parallel computing under the supervision of Prof. Marco Vanneschi. His research interests are high level and portable parallel programming systems, skeletal programming and abstract models for parallel computation such as BSP and LogP. Since November 1998 he has also been working in a joint project with Prof. Murray Cole of the University of Edinburgh on nested data parallel computation, in the framework of a TMR grant (Marie Curie).

Recent and selected bibliography of group members

[ACD98] M. Aldinucci, M. Coppola, M. Danelutto, "Rewriting skeleton programs: how to evaluate the data-parallel stream-parallel tradeoff", International Workshop on Constructive Methods for Parallel Programming, 1998, Technical Report University of Passau, MIP-9805

[AD99] M. Aldinucci, M. Danelutto, "Stream parallel skeleton optimisation", Tech. Rep. MOSAICO/PISA-DI/1/99

[BCD+97] B. Bacci, B. Cantalupo, M. Danelutto, S. Orlando, D. Pasetto, S. Pelagatti, M. Vanneschi, "An environment for structured parallel programming". In Advances in High Performance Computing, Kluwer, The Netherlands, 219-234, 1997

[BDP92] B. Bacci, M. Danelutto and S. Pelagatti, "Resource Optimization via Structured Parallel Programming", Proceedings of Programming Environments for Massively Parallel Distributed Systems, K. M. Decker and R. M. Rehmann editors, Birkhauser, 1994

[BDP+95] B. Bacci, M. Danelutto, S. Orlando, S. Pelagatti, M. Vanneschi. "P3L: a structured parallel programming language and its structured support". Concurrency Practice and Experience, 7 (3), 225-255, May 1995

[BDP+99] B. Bacci, M. Danelutto, S. Pelagatti and M. Vanneschi, "SkIE: a heterogeneous environment for HPC applications", to appear in Parallel Computing, 1999

[BCV99] P. Becuzzi, M. Coppola, M. Vanneschi, "Mining of Association Rules in Very Large Databases: a Structured Parallel Approach", accepted at the Euro-Par '99 conference. To appear in LNCS.

[BJ93] F. Baiardi, M. Jazayeri, "P3M: a Virtual Machine Approach to Massively Parallel Computation", 1993 Int. Conf. on Parallel Computing, Chicago, USA, August 1993.

[BCR94] F. Baiardi, A. Candelieri, L. Ricci, "Congestion Prevention by Bounding in Distributed Memory Systems", World Transputer Conference, Sept. 1994

[BR95] F. Baiardi, L. Ricci, "A Static Analysis to Order Instructions of a Concurrent Program", Int. Conf. on Parallel and Distributed Technology and Applications, Athens, Georgia (USA), November 1995

[BCR96] F. Baiardi, A. Candelieri, L. Ricci, "Massively Parallel Execution of Logic Programs: A Static Approach", Int. Jour. of System Architectures, February 1996.

[BBM+98] F. Baiardi, P. Becuzzi, P. Mori, M. Paoli, "Load Balancing and Locality in Hierarchical N-Body Algorithms on Distributed Memory Architectures", Proceedings of the High Performance Network Computing Europe, Amsterdam, April 1998, Lecture Notes in Computer Science 1401: 284-293, Springer-Verlag Inc.

[BBG+99] F. Baiardi, C. Bernasconi, D. Guerri, L. Ricci, "Compiling Data Parallel Loops on Virtual Shared Memory Architectures", submitted to PDCS'99.

[CCL+99] G. Capretti, A. Cisternino, M.R. Laganà, L. Ricci, "A Concurrent Microworld", to be presented at ED-MEDIA99, Seattle, June 1999

[CLR98] G. Capretti, M.R. Laganà, L. Ricci, "A microworld to learn concurrency", International Conference SSCC'98, Durban, September 1998

[CLR99] G. Capretti, M.R. Laganà, L. Ricci, "Learning Concurrent Programming: A Constructionist Approach", to be presented at PaCT 1999, St. Petersburg, September 1999.

[CDF+97] S. Ciarpaglini, M. Danelutto, L. Folchi, C. Manconi and S. Pelagatti, "ANACLETO: a template-based P3L compiler", Proceedings of the PCW'97, Canberra, Australia

[DDO+92] M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti and M. Vanneschi, "A methodology for the development and support of massively parallel programs", Future Generation Computer Systems, North Holland, Vol. 8, No 1-3, pages 205-220, July 1992

[DDL+98] M. Danelutto, R. Di Cosmo, X. Leroy and S. Pelagatti, "Parallel Functional Programming with Skeletons: the OCAMLP3L experiment", Proceedings of the 1998 ACM Sigplan Workshop on ML, Baltimore, USA, 1998.

[LL99] Lari, L. Ricci, "Array Region Analysis by Abstract Interpretation Approaches", submitted to PACT 1999, Newport Beach, October 1999.

[OV95] S. Orlando, M. Vanneschi, "A synchronous model for distributed memory multiprocessor architectures". Proc. PARCO 1995, Springer Verlag, pp. 156-168.

[PV96a] F. Petrini, M. Vanneschi, "Minimal Adaptive Routing with Limited Injection on Toroidal k-ary n-cubes". In Supercomputing 96, Pittsburgh, PA, November 1996.

[PV96b] F. Petrini, M. Vanneschi, "Latency and Bandwidth Requirements of Massively Parallel Programs: FFT as a Case Study". In 2nd International Euro-Par Conference, Lyon, France, August 1996. Lecture Notes in Computer Science, Vol. 1123, pp. 307-312

[PV96c] F. Petrini, M. Vanneschi, "Minimal vs non Minimal Adaptive Routing on k-ary n-cubes". In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'96), volume I, Sunnyvale, CA, August 1996, pp. 505-516.

[PV97a] F. Petrini, M. Vanneschi, "Efficient Personalized Communication on Wormhole Networks". In 1997 International Conference on Parallel Architectures and Compilation Techniques , PACT'97, San Francisco, CA, Nov 1997.

[PV97b] F. Petrini and M. Vanneschi, "Efficient Total-Exchange in Wormhole-Routed Toroidal Cubes". In 3rd International Euro-Par Conference, Passau, Germany, August 1997.

[PV97c] F. Petrini and M. Vanneschi, "Network Performance under Physical Constraints". In Proceedings of the International Conference on Parallel Processing 1997, ICPP'97, Bloomingdale, IL, August 1997.

[PV97d] F. Petrini, M. Vanneschi, "SMART: a Simulator of Massive ARchitectures and Topologies". In International Conference on Parallel and Distributed Systems, Euro-PDS'97, Barcelona, Spain, June 1997.

[PV97e] F. Petrini, M. Vanneschi, "k-ary n-trees: High Performance Networks for Massively Parallel Architectures". In Proceedings of the 11th International Parallel Processing Symposium, IPPS'97, pages 87-93, Geneva, Switzerland, April 1997.

[PV97f] F. Petrini, M. Vanneschi, "Performance Analysis of Minimal Adaptive Wormhole Routing with Time-Dependent Deadlock Recovery". In Proceedings of the 11th International Parallel Processing Symposium, IPPS'97, pages 589-595, Geneva, Switzerland, April 1997.

[PD97] F. Petrini, J. Duato, P. Lopez, J.M. Martinez, "LIFE: a Limited Injection, Fully adaptive Recovery-Based Routing Algorithm", High Performance Computing 97 (HiPC'97), Bangalore, India, December 1997.

[PV98a] F. Petrini, M. Vanneschi, "Performance Analysis of Wormhole Routed k-ary n-trees". International Journal on Foundations of Computer Science, (10): 127-141, October 1998.

[PV98b] F. Petrini, M. Vanneschi, "Communication performance of Fat-trees and Toroidal Cubes". International Journal on System Sciences, (11): 327-340, November 1998.

[PV99] F. Petrini, M. Vanneschi, "Latency and bandwidth requirements of massively parallel programs: FFT as case study". To be published in Future Generation Computer Systems .

[Van98] M. Vanneschi, "Variable grain architectures for MPP computation and structured parallel programming". 3rd International Working Conference on Massively Parallel Programming Models, London, Nov. 12-14, 1997. Published as Massively Parallel Programming Models by IEEE Computer Society, 1998, pp. 132 - 140 (invited paper).

[Van98b] M. Vanneschi, "Heterogeneous HPC environments", 4th Int. Euro-Par Conference, Southampton, Sept. 1998, in D. Pritchard and J. Reeve (Eds.), Lecture Notes in Computer Science, vol. 1470, pp. 21-34 (invited paper).

[Van98c] M. Vanneschi, "PQE2000 Project: HPC Tools for Industrial Applications". IEEE Concurrency , IEEE Computer Society, October-December 1998, pp.68-73 (invited paper).

[Van99] M. Vanneschi, "Matching architecture and software technology for HPC systems". Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society, Madeira, February 1999, pp. 2-12 (invited paper).

[ZGB98] A. Zavanella and A. Giani and F. Baiardi, "On Dropping Niches in Parallel Niching Genetic Algorithms", Genetic Programming 1998: Proceedings of the Third Annual Conference , pp. 618-620, Morgan Kaufmann, 22-25 July 1998

[ZP99] A. Zavanella and S. Pelagatti "Using BSP to Optimize Data Distribution in Skeleton Programs", High Performance Computing and Networking 1999: Proceedings of the Conference , pp 317-327, Lecture Notes in Computer Science, April 1999

[Zav99] A. Zavanella, "Optimising Skeletal-Stream Parallelism on a BSP Computer", accepted at Euro-Par '99. To appear in LNCS.


