To best exploit the performance of such parallel architectures, we are forced to design algorithms that take into account both the memory hierarchy and the hierarchical nature of parallel costs.
Computation models should exhibit the right balance of abstraction and detail, reflecting the actual behavior of parallel algorithms while keeping the analysis tractable. From this perspective, the parallel random access machine (PRAM) model has proven convenient for algorithm design, but misleading with respect to computational costs. Several variants of the PRAM model have been devised with the aim of reconciling theoretical computational costs with real performance.
Useful parallel models should satisfy both efficiency and universality requirements, so that computational costs are essentially unaffected by the underlying architecture. This has led to the concept of the parallel bridging model (PBM), of which the BSP was a first example . Other PBMs are the CGM (an extension of the BSP), LogP and QSM. The development of architecture-independent models for parallel computation (including many PRAM extensions) is surveyed in .
PBMs reward parallel slackness and aggregate communication patterns, so it is not surprising that they are closely related to external memory models. We describe some works that explore the relations among different PBMs (mostly BSP extensions) and their connection with external/hierarchical memory computation models.
The original BSP, as defined by Valiant , is described first. De La Torre and Kruskal  introduce the D-BSP model, which rewards locality of computation by adding a hierarchical decomposition into smaller BSP submachines.
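For reference, the cost of a BSP superstep in the common additive formulation, and its D-BSP refinement, can be sketched as follows (the notation is ours, restated from the standard definitions rather than taken from a specific cited paper):

```latex
% Cost of one BSP superstep: w = maximum local work, h = maximum number of
% messages sent or received by a processor, g = bandwidth gap,
% \ell = barrier synchronization latency
T_{\mathrm{superstep}} = w + g \cdot h + \ell
% In the D-BSP, a superstep confined to a submachine of 2^i processors is
% charged with cluster-specific parameters g_i \le g and \ell_i \le \ell;
% this is how the model rewards locality of computation:
T_{\mathrm{superstep}}^{(i)} = w + g_i \cdot h + \ell_i
```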
Beran  compares the D-BSP and BSP models, concluding that with a realistic set of parameters there is a class of algorithms that runs asymptotically faster on the D-BSP. For instance, sequential pipelined algorithms can be turned into parallel algorithms with bounded period on the D-BSP, but not on the BSP. Bilardi and others  also examine the D-BSP model, concluding that it offers the same design advantages as the BSP, but higher effectiveness and portability across realistic parallel architectures.
Meyer auf der Heide and Wanka  investigate the relation between the BSP* and D-BSP models, both of which aim at more realistic computational costs than the original model. They also note that the Paderborn BSP library (PUB) has additional features that allow the direct implementation of BSP* and D-BSP programs (block-wise communication, partitioning and subset synchronization).
Sibeyn and Kaufmann  show that algorithms developed for the BSP model can easily be turned into efficient external memory algorithms, and that there are cases in which optimality is preserved.
Dehne and others  show that BSP-like (BSP*, CGM) algorithms can be efficiently transformed into sequential or parallel EM algorithms, provided that there is enough parallel slackness.
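The core idea behind these transformations can be illustrated with a minimal sketch (our illustration, not code from the cited papers): the contexts of the virtual BSP processors are processed one at a time, superstep by superstep, with all messages delivered at the barrier. In a true external-memory simulation the contexts and message queues would reside on disk in blocks; plain in-memory lists stand in for them here.

```python
# Sketch: sequential simulation of a v-processor BSP program, superstep by
# superstep. Messages produced in one superstep are delivered only at the
# barrier, exactly as in the BSP semantics.

def simulate_bsp(num_procs, contexts, superstep, num_supersteps):
    """Run superstep(step, pid, ctx, inbox) -> list of (dest, msg) pairs
    for every virtual processor, exchanging messages at each barrier."""
    inboxes = [[] for _ in range(num_procs)]
    for step in range(num_supersteps):
        outboxes = [[] for _ in range(num_procs)]
        for pid in range(num_procs):        # one sequential pass per superstep
            for dest, msg in superstep(step, pid, contexts[pid], inboxes[pid]):
                outboxes[dest].append(msg)
        inboxes = outboxes                  # barrier: deliver all messages
    return contexts
```

For example, a two-superstep program in which every processor sends its local value to processor 0, which then sums them, runs unchanged under this sequential schedule; the cited transformations exploit exactly this property, mapping each pass over the processor contexts to a scan of external memory.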
Summing up, there are two reasons for the interest in this kind of model: real parallel architectures exhibit both a memory hierarchy and hierarchical parallel costs, and algorithms designed for PBMs translate naturally into efficient external memory algorithms.
A useful model in this setting has to define the interaction between the two hierarchies. A large number of algorithms have been designed for the well-known PDM model, a two-level memory model that makes minimal assumptions about the interconnection network . Here we describe in more detail two works that aim at a general model of parallel, hierarchical computation.
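For reference, the PDM is parameterized by the input size $N$, the internal memory size $M$, the block size $B$ and the number of independent disks $D$; its well-known sorting bound, restated here in standard notation for convenience, is:

```latex
% Number of parallel I/O operations needed to sort N elements in the PDM
\mathrm{Sort}(N) = \Theta\!\left( \frac{N}{DB} \, \log_{M/B} \frac{N}{B} \right)
```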
Aggarwal and Plaxton  develop a general model of parallel HM computation, in which a small set of primitives is used to exchange data. These primitives can be implemented efficiently both on a hierarchical memory and on a range of parallel networks. The paper presents a sorting algorithm that matches or improves on the best previously known bounds for HM sorting.
On the other hand, Dehne and others  add a two-level external memory to the hierarchical parallel model D-BSP, and apply it to develop a parallel distribution sweeping algorithm that optimizes both communication and memory access locality. In this case the parallel model is more general, but the memory hierarchy is simpler.