Abstract models for hierarchical parallel machines

- The first communication layer is at the node level, where data exchange and synchronization is comparable to the memory access time, and data sharing has a very low cost. The overhead for parallel computation can be quite low if threads are exploited.
- The second level is the (high-throughput) communication network used to build the cluster. The bandwidth and latency of the network is higher than that of the in-core memory. Some dedicated interconnection networks also exhibit a hierarchical structure (e.g. hypercube, butterfly or fat-tree networks). Refer to [1,2].
- A third level is the connection to other machines via a LAN or a WAN. On a local scale, file servers are the simplest example, but in a computational grid environment we may need to exploit several clusters as components of a larger machine.

To exploit at best the performance of such parallel architectures, we are forced to design algorithm that take into account both the memory hierarchy and the hierarchical nature of parallel costs.

Computation models should exhibit the right balance of abstraction and detail, to reflect the actual behavior of parallel algorithms while keeping the analysis tractable. In this perspective the parallel random access machine (PRAM) model has proven to be convenient for algorithm design, but mis-guiding with respect to computational costs. Several variants of the PRAM model have been devised with the aim of reconciling theoretical computational costs with real performance.

Useful parallel models should satisfy both efficiency and universality
requirements, so that computational costs are essentially unaffected
by the underlying architecture. This has lead to the concept of *parallel bridging model* (PBM), of which the BSP was a first example
[3]. Other PBMs are the CGM (an extension of the
BSP), LogP and QSM. [4] is a survey
about the development of architecture independent models for parallel
computation (including many PRAM extensions).

PBMs reward parallel slackness and aggregate communication patterns, so it is not surprising that they are closely related to external memory models. We describe some works that explore the relation among different PBMs (mostly BSP extensions) and with external/hierarchical memory computation models.

The original BSP as defined by Valiant [3] is described first. De La Torre and Kruskal [5] introduce the D-BSP model, which rewards locality of computation by adding hierarchical decomposition into smaller BSP-submachines.

Beran [6] compares the D-BSP and BSP models. It concludes that with a realistic set of parameters, there is a class of algorithms that runs asymptotically faster on the D-BSP model. For instance, sequential pipelined algorithms can be turned into parallel algorithms with bounded period on the D-BSP, but not on the BSP. Bilardi and others [7] also examine the D-BSP model concluding that it offers the same design advantages of BSP, but has higher effectiveness and portability over realistic parallel architectures.

Meyer auf der Heide and Wanka [8] investigate the relation among the BSP* and the D-BSP models. Both models aim at more realistic computational cost than the original model. Here it is also stated that the Paderborn BSP library (PUB) has additional features that allow the direct implementation of BSP* and D-BSP programs (block-wise communications, partitioning and subset synchronizations).

Sibeyn and Kaufmann [9] show that algorithms developed for the BSP model can easily be turned into efficient external memory algorithms, and there are cases in which optimality is preserved.

Dehne and others in [10] show that BSP-like (BSP*, CGM) algorithms can be efficiently transformed into sequential or parallel EM algorithms (provided that there is enough parallel slackness).

Summing up, there are two reasons for the interest in this kind of models

- good properties of cross-simulation have been shown for PBMs over sequential and parallel HM models
- parallel machines are increasingly becoming hierarchical, as we mentioned, and no model yet has been proposed that fully takes into account both memory and parallel hierarchy at the same time.

A useful model in this setting has to define the interaction among the two hierarchies. A large number of algorithms have been designed for the well known PDM model, which is a two level memory model that uses minimal hypothesis about the interconnection network [11]. Here we describe in more detail two works that aim at a general model of parallel, hierarchical computation.

Aggarwal and Plaxton [12] develop a general model of parallel HM computation, where a reduced number of primitives is used to exchange data. These primitives can be efficiently implemented both in a HM and in a range of parallel networks. The paper presents a sorting algorithm that matches or improves over previously known lower bounds for HM sorting.

On the other hand, Dehne and others in [13] add a two-level external memory to the hierarchical parallel model D-BSP, and apply it to develop a parallel distribution sweeping algorithm, which optimizes both communication and memory access locality. In this case the parallel model is more general, but the memory hierarchy is simpler.