
There are several classes of parallel machines, including **distributed
memory machines** and **shared memory machines**.
Both types have independent processors that can run at the same
time. The machine pictured at left, for example, has over 2000 processors.
The
machine only reaches its advertised peak speed if all the processors are
executing concurrently (think of them as running in parallel lines - hence
the term "parallel computing").
The key difference between them is the memory subsystem. Distributed
memory machines give each processor a private memory that is inaccessible
to any other processor. Any coordination between processors has to be
accomplished by explicitly sending and receiving messages, which typically
takes tens to thousands of CPU cycles.
Shared memory machines allow all processors to access a global memory (or
several global memory modules). Information can be passed through this
memory, making programming easier.
However, making this global memory fast usually requires local caching of
data on the processors, so it is still a good idea to take advantage of
data locality.
Early languages for parallel machines reflected the underlying hardware
fairly strongly. Distributed memory machines typically forced the
programmer to identify both the sender and receiver of all messages, while
shared memory languages used barriers and signals to coordinate processors.
While this led to efficient programs, it also made programs non-portable.

HPF uses **data parallelism** to allow users to write
portable parallel programs. The main idea is that the parallelism comes
from independent operations on all (or most) of the elements of large data
structures. Since HPF is based on Fortran 90, the data structures are
usually arrays. The parallel operations are often array arithmetic,
although we'll see later that this can be generalized somewhat. Since
there are typically tens (perhaps hundreds) of processors and thousands (or
millions) of array elements, the elements have to be divided among the
processors. HPF does this through high-level data mapping directives. The
HPF compiler uses these hints from the user to decide which processor
should perform each elemental operation. Basically, the compiler
schedules the computation so that as many operands as possible are local
to the processor executing each operation.
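The scheduling idea can be sketched in a few lines. Here is a toy Python simulation (not HPF; all names are invented for illustration) of mapping array elements to processors block-wise and letting each "processor" update only the elements it owns:

```python
# Toy sketch (not HPF): divide the elements of a large array among
# processors and let each processor update only its own elements.

def block_owner(i, n, nprocs):
    """Processor that owns element i of an n-element array under a
    BLOCK-style mapping (equal contiguous chunks)."""
    chunk = (n + nprocs - 1) // nprocs   # ceiling division
    return i // chunk

def parallel_update(a, nprocs, f):
    """Simulate each processor applying f to the elements it owns."""
    n = len(a)
    result = list(a)
    for p in range(nprocs):                     # one pass per "processor"
        for i in range(n):
            if block_owner(i, n, nprocs) == p:  # owner-computes idea
                result[i] = f(a[i])
    return result

a = list(range(8))
print(parallel_update(a, 4, lambda x: x * 2))   # [0, 2, 4, 6, 8, 10, 12, 14]
```

Because the elemental operations are independent, the per-processor passes could run concurrently; the sequential loop over `p` just makes the partitioning visible.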

The next three sections of the tutorial look at the major pieces of HPF:

- Features inherited from Fortran 90
- Features that implement data parallelism
- Features that declare data mapping


- **Dynamic Memory Allocation**: The `ALLOCATE` statement can allocate memory (including arrays) from the heap. This allows the kind of structured memory management found in other modern languages. It helps HPF by making it easier to declare arrays to be their "natural" size (which in turn affects their mapping).
- **Array Operations**: Array assignment can update the values of entire arrays or (strided) rectangular sections of arrays. Similarly, array arithmetic makes it easy to add corresponding elements of two arrays.
- **New Intrinsics**: Intrinsic functions such as `SUM` and `MAXVAL` perform reductions on whole arrays. Other intrinsics perform data movement (`CSHIFT` for circular shift) and other useful operations (`MATMUL` for matrix multiplication).
- **Procedure Interfaces**: Explicit interfaces give the compiler more information about the arguments passed. They also enable new styles of parameter passing, including assumed-shape arrays, which do not require the array bounds to be passed by the user.

Fortran 90 also provides `POINTER` variables (more powerful than Pascal's
pointers, but more structured than C pointers) and record structures
(called "derived types").

- The `FORALL` statement, a generalization of element-wise array assignments
- `PURE` functions, which can be used within a `FORALL` to provide larger-grain operations
- The `INDEPENDENT` directive, which asserts that a loop can be executed in parallel

The `FORALL` statement was adapted from proposals made during
the Fortran 90 development process and from a popular feature of the
Connection Machine Fortran compiler.
While its form resembles a `DO` statement, its meaning is very
different. The `FORALL` statement

```
FORALL ( I = 2 : 4 )
  A(I) = A(I-1) + A(I+1)
  C(I) = B(I) * A(I+1)
END FORALL
```

executes as follows:

- Evaluate `A(I-1) + A(I+1)` for `I = 2`, `3`, and `4`. (These are the green ovals in the picture at left.)
- Assign the results to the corresponding `A(I)` locations. (Blue ovals.)
- Evaluate `B(I) * A(I+1)` for the three values of `I`. (Red ovals.)
- Assign the new results to the corresponding `C(I)` locations. (Yellow ovals.)
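These steps can be simulated directly. The following Python sketch (illustrative names only, not HPF) shows the evaluate-everything-first rule for the assignment to `A`, using index 0 as padding so the Fortran indices line up, and contrasts it with a sequential loop that reuses freshly assigned values:

```python
# Simulate A(I) = A(I-1) + A(I+1) for I = 2:4 under two semantics.
# Index 0 is padding; A(1)..A(5) are the Fortran elements.

def forall_update(a):
    """FORALL semantics: compute every right-hand side from the ORIGINAL
    values, then perform all the assignments."""
    b = list(a)
    rhs = {i: a[i-1] + a[i+1] for i in range(2, 5)}  # step 1: all RHS values
    for i in range(2, 5):                            # step 2: all assignments
        b[i] = rhs[i]
    return b

def do_update(a):
    """DO-loop semantics: each iteration sees the updates made by
    earlier iterations."""
    b = list(a)
    for i in range(2, 5):
        b[i] = b[i-1] + b[i+1]
    return b

a = [0, 1, 2, 3, 4, 5]
print(forall_update(a))   # [0, 1, 4, 6, 8, 5]
print(do_update(a))       # [0, 1, 4, 8, 13, 5]
```

The differing results for `A(3)` and `A(4)` are exactly why the two constructs have different dependence diagrams.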

The figure at left shows the dependence diagram for this
`FORALL`. Notice that, because there are no arrows across a
row, all operations in one row can be performed in parallel. For
example, the three red ovals could each execute on a separate processor.
Because of the arrows between levels, however, there may have to
be synchronizations between the computations and the actual assignments.
The `FORALL` example is different from the similar-looking

```
DO I = 1, 3
  A(I) = A(I-1) + A(I+1)
  C(I) = B(I) * A(I+1)
END DO
```

whose dependence diagram is shown at right. Although there are fewer
constraints in the `DO` loop, the ones that remain force the
iterations to execute in order.
One other aspect of `FORALL` is hard to see in the small
figures. Many of the worst-case dependences in the `FORALL` do
not occur in any particular program; these are marked in gray, rather than black. If a
compiler can detect this, it can produce more efficient parallel
code by avoiding the synchronization points.
(The same is true of a `DO` loop; that's how
vectorizing compilers work.)

If you are familiar with Fortran 90, you may notice that `FORALL`
has the same semantics as an array assignment (or a series of array
assignments). In fact, any array assignment can be
written as a `FORALL`; however, `FORALL` can express
some array subsections, like diagonals, that array assignments can't.
In practice, `FORALL` is a convenient way to express many
array update operations.
Some users find it easier to understand than array assignments,
especially assignments to array subsections; others feel exactly the
opposite.

One of the constraints on `FORALL` statements is that they can
only contain assignments. Conditionals, nested loops, and `CALL`s
are not allowed. Also, it would be unsafe for a statement in a `FORALL`
to invoke an arbitrary function - if the function had side effects, the
results could be completely messed up.
These are both serious limitations on `FORALL` functionality,
and would probably be unacceptable to programmers.
Therefore, HPF also introduces a new class of functions, called
`PURE` functions, which cannot have side effects.
The code inside a `PURE` function can have arbitrarily complex
control flow, including conditionals and loops. However, it cannot
assign to any global variables or dummy arguments, nor can it use
globals or dummies anywhere that their values might change (for
example, as a `DO` loop index).
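The essential property can be sketched outside Fortran. In this Python illustration (invented names, not HPF), a side-effect-free function gives the same results no matter what order its calls run in, which is exactly what a parallel `FORALL` needs; a function with side effects does not:

```python
# Why FORALL needs PURE functions: a side-effect-free function produces
# the same results regardless of evaluation order.

def pure_f(x):            # touches no global state: safe inside a FORALL
    return x * x + 1

calls = []
def impure_f(x):          # records call history: result depends on order
    calls.append(x)
    return x + len(calls)

data = [3, 1, 2]
fwd = [pure_f(x) for x in data]
rev = list(reversed([pure_f(x) for x in reversed(data)]))
print(fwd == rev)         # True: evaluation order is irrelevant

calls.clear()
fwd = [impure_f(x) for x in data]
calls.clear()
rev = list(reversed([impure_f(x) for x in reversed(data)]))
print(fwd == rev)         # False: the side effect makes order matter
```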

The `INDEPENDENT` directive takes a different approach to
specifying data parallelism. Rather than define a new statement with new
semantics, `INDEPENDENT` gives the compiler more information
about existing Fortran statements. In particular, `!HPF$ INDEPENDENT`
before a `DO` loop tells the compiler that the iterations of
the loop do not affect each other in any way. (The `!HPF$` prefix
means that `INDEPENDENT` is advice to the compiler, rather than a
first-class statement.) So the code block

```
!HPF$ INDEPENDENT
DO J = 1, 3
  A(J) = A( B(J) )
  C(J) = A(J) * B(A(J))
END DO
```

can be run with each iteration on a separate processor and no synchronization during the loop. The worst-case dependence diagram for this loop is shown at right. Again, it is possible for some of the dependence arcs to be missing in a particular case, but it is less common for compilers to try to take advantage of this. Instead, HPF compilers concentrate on partitioning the iterations between processors. A common heuristic for this is the "owner-computes rule," in which the processor that owns one of the left-hand sides takes responsibility for all computations in an iteration.
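The owner-computes rule itself is easy to sketch. In this hypothetical Python fragment (not HPF), the processor that owns `A(J)` under a `CYCLIC` mapping executes iteration `J`:

```python
# Sketch of the owner-computes rule for an INDEPENDENT loop: the
# processor owning the left-hand side A(J) executes iteration J.

def cyclic_owner(j, nprocs):
    return j % nprocs                # CYCLIC ownership of A(J)

def partition_iterations(n_iters, nprocs):
    """Return, for each processor, the iterations it would execute."""
    work = {p: [] for p in range(nprocs)}
    for j in range(n_iters):
        work[cyclic_owner(j, nprocs)].append(j)
    return work

print(partition_iterations(7, 3))
# {0: [0, 3, 6], 1: [1, 4], 2: [2, 5]}
```

Because the directive promises the iterations are independent, each processor's list can run without synchronizing with the others.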

`INDEPENDENT` is most useful for giving the compiler
application-specific knowledge that can be used to run a loop in parallel.
For example, the figure at left illustrates the code segment

```
!HPF$ INDEPENDENT, NEW(J)
DO I = 1, N
  DO J = IBEGIN(I), IEND(I)
    X(IBLUE(I)) = X(IBLUE(I)) + X(IRED(J))
  END DO
END DO
```

(The `NEW` clause means that `J` is local to the outer loop.)
The programmer is telling the compiler that the elements `X(IBLUE(I))`
and `X(IRED(J))` are distinct for all values of `I` and `J`.
Also, the `IBLUE` array cannot have any repeated values.
Essentially, the `IBLUE` and `IRED` arrays
have to partition the `X` array as shown in the lower left.
This is enough for the compiler to run the outer loop in parallel. It is
also information that the compiler could not discover on its
own; it depends on the input data (and, presumably, on the program's data
structures and algorithms).
If `IBLUE` and `IRED` don't correctly partition the
array (as shown on the lower right), then it is a programming error.
Unfortunately, it will be a difficult error to find.
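The assertion the programmer makes can be checked mechanically, something the compiler, by design, does not do. Here is a Python sketch with hypothetical index data (the function name and sample indices are invented):

```python
# Sketch: checking the assertion behind the INDEPENDENT directive.
# The written indices (IBLUE) and read indices (IRED) must reference
# disjoint sets of X's elements, and IBLUE must have no repeats;
# otherwise the "independent" iterations would actually conflict.

def independent_ok(iblue, ired):
    blue, red = set(iblue), set(ired)
    no_repeats = len(blue) == len(iblue)    # IBLUE has no duplicates
    disjoint = blue.isdisjoint(red)         # nothing both read and written
    return no_repeats and disjoint

print(independent_ok([0, 2, 4], [1, 3, 5]))   # True: a valid partition
print(independent_ok([0, 2, 4], [2, 3, 5]))   # False: X(2) read and written
```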

The HPF library and intrinsics are the final set of data-parallel features
in the language. These are operations that have three important properties:

- They are useful in scientific computation
- They have efficient implementations on parallel machines
- The efficient implementations are complex, subtle, or machine-dependent

Examples include reduction operators (`SUM` at the top of the
figure; actually a Fortran 90 intrinsic), prefix operators
(`SUM_PREFIX` in the middle; each element is the sum of the
elements before it), and scatter operators (`SUM_SCATTER` at
the bottom; arbitrary shuffling and combining of data). In addition, HPF
offers sorting functions in the library.
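The data flow of these three operations can be sketched in plain Python. These are sequential stand-ins for illustration, not the parallel library routines (and HPF's actual `SUM_PREFIX` is an inclusive running sum, shown as such here):

```python
# Sequential sketches of the three library operations' data flow.

def sum_reduce(a):                   # like SUM: one value for the array
    total = 0
    for x in a:
        total += x
    return total

def sum_prefix(a):                   # like SUM_PREFIX: running totals
    out, total = [], 0
    for x in a:
        total += x
        out.append(total)
    return out

def sum_scatter(values, base, idx):  # like SUM_SCATTER: combine into BASE
    out = list(base)
    for v, i in zip(values, idx):
        out[i] += v                  # several values may land on one slot
    return out

print(sum_reduce([1, 2, 3, 4]))                   # 10
print(sum_prefix([1, 2, 3, 4]))                   # [1, 3, 6, 10]
print(sum_scatter([1, 2, 3], [0, 0], [0, 1, 0]))  # [4, 2]
```

The scatter version is the one used for irregular meshes later in the tutorial: the index array routes each value to the element that accumulates it.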


- The `ALIGN` directive does fine-grain matching between separate arrays; when two array elements are aligned together, they will always be stored on the same processor.
- The `DISTRIBUTE` directive partitions an array among processors; this mapping "drags along" any other arrays aligned with it.

Many programs can skip `ALIGN` entirely and use `DISTRIBUTE`
for all their arrays.
The `DISTRIBUTE` directive partitions an array by specifying a
regular distribution pattern for each dimension.
The basic form of a `DISTRIBUTE` directive on a 2-dimensional
array is

```
!HPF$ DISTRIBUTE array_name(pattern1, pattern2)
```

The figure at left shows four possible distributions of a 12 by 12 array by giving elements on the same processor the same color. The
`BLOCK` pattern breaks the dimension into equal-sized blocks,
one per processor. For example, the upper left array in the figure uses
a `BLOCK` distribution on rows. The `CYCLIC` pattern is at the
other end of the spectrum - it spreads the elements one per processor,
wrapping around when it runs out of processors. The upper right array is
`CYCLIC` in the column dimension.
Dimensions that are not divided between processors are denoted by
`*`. Therefore, the declarations for the top two arrays are

```
!HPF$ DISTRIBUTE A(BLOCK,*)
!HPF$ DISTRIBUTE B(*,CYCLIC)
```

`CYCLIC` can also take a parameter `K`,
which causes it to dole out groups of `K` elements instead of single
elements. The lower right array therefore uses the pattern
`(CYCLIC(2),CYCLIC(3))`. The (extremely common) distribution at
lower left is `(BLOCK,BLOCK)`.
The purpose of choosing a distribution is to control the performance of the
code. Because of the owner-computes strategy of most HPF
compilers, the dimension(s) of an array with the most potential data
parallelism should be divided among processors (i.e. they should use a
pattern other than `*`). This ensures that operations that
can run in parallel will run in parallel.
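The three patterns reduce to simple ownership formulas for a single dimension. A Python sketch (illustrative names; `block` uses ceiling division so uneven sizes also work):

```python
# Which processor owns element i of an n-element dimension, for each
# of the three HPF distribution patterns.

def block(i, n, nprocs):
    size = (n + nprocs - 1) // nprocs   # contiguous chunks
    return i // size

def cyclic(i, nprocs):
    return i % nprocs                    # deal out one element at a time

def cyclic_k(i, nprocs, k):
    return (i // k) % nprocs             # deal out groups of k elements

n, p = 12, 4
print([block(i, n, p) for i in range(n)])     # [0,0,0,1,1,1,2,2,2,3,3,3]
print([cyclic(i, p) for i in range(n)])       # [0,1,2,3,0,1,2,3,0,1,2,3]
print([cyclic_k(i, p, 2) for i in range(n)])  # [0,0,1,1,2,2,3,3,0,0,1,1]
```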

To decide which pattern to use, you need to consider the data movement that
will be generated. The goal is to minimize the amount of data that has to
move between processors. (Even on shared memory machines, data may have
to move from one processor's cache to another's cache.) The figure at
right shows how to estimate this graphically. Consider which elements are
used together (for example, appear in the same expression). If the
elements are on the same processor, there should not be data movement; if
they are on different processors, at least one of them must move. The
figure uses color-coding for processors, and shows communications
as dark lines for several common subscript patterns. Although the
analysis can become tedious, the following rules of thumb cover many
programs:

- `BLOCK` distribution is good for nearest-neighbor communication, such as local stencils on regular meshes.
- If two arrays have size `N` and `c*N` (where `c` is a compile-time constant), `BLOCK` (for both arrays) works well for linear interpolation (e.g. `A(I) = B(I*c)`) between the arrays.
- `CYCLIC` (or `CYCLIC(K)`) is good for load balancing, but does not preserve data locality between nearest neighbors.
- `CYCLIC` creates data locality for long-stride accesses (e.g. `A(I) * A(I+64)`) if the stride is a multiple of the number of processors. (This occurs in practice in recursive doubling algorithms on machines with a power-of-2 number of processors.)
- Copying data between arrays with different distributions usually moves almost all the data.
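The first rule of thumb can be made concrete by counting, for a nearest-neighbor access like `A(I-1)`, how many element pairs straddle a processor boundary under each mapping. A Python sketch (invented helper name):

```python
# Estimate data movement for a nearest-neighbor access A(I-1):
# count the pairs (I-1, I) whose owners differ.

def comm_count(owner, n):
    return sum(1 for i in range(1, n) if owner(i - 1) != owner(i))

n, p = 16, 4
chunk = n // p
block_moves  = comm_count(lambda i: i // chunk, n)  # only at chunk borders
cyclic_moves = comm_count(lambda i: i % p, n)       # every neighbor pair

print(block_moves)    # 3  (one per internal block boundary)
print(cyclic_moves)   # 15 (all adjacent pairs cross processors)
```

This is the quantitative version of "BLOCK is good for stencils, CYCLIC is not."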

The `ALIGN` directive matches individual array elements to other
array elements or to slices of arrays.
The syntax is

```
!HPF$ ALIGN array1(i,j) WITH array2(i*a+b, j*c-d)
```

Given this directive, only `array2` can appear in a
`DISTRIBUTE` directive.
The "dummy arguments" on `array1` can be any integer variables,
and the expressions after `array2` can be any affine linear functions
of the dummies. In many programs, the linear functions are simply the
dummy arguments; that is, `ALIGN` uses an identity mapping
rather than a more complex one. `ALIGN` can also use
`*` to represent "all values in that dimension". This allows
replication of array elements across several processors.
The figure at left illustrates the directives

```
!HPF$ ALIGN X(I,J) WITH W(J,I)
!HPF$ ALIGN Y(K) WITH W(K,*)
!HPF$ ALIGN Z(L) WITH X(3,2*L-1)
```

Elements are mapped together if they contain the same symbol in the same color. (This is much easier to see in the full-size slide.) So,

`X` is aligned with the transpose of `W`;
`X(1,6)` and `W(6,1)` both contain a magenta square.
Each element of `Y` is aligned with an entire row of
`W`, so `Y(2)` contains six yellow symbols
corresponding to the six yellow elements in `W`'s second row.
`Z` is ultimately aligned to every other element in
`W`'s third column, that is, to three of the six colored
triangles. The lower two-thirds of the figure shows the effect on all
arrays of two distributions for `W`.
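The alignments in this example are just affine maps that compose. A Python sketch (invented function names) of where each `Z(L)` ultimately lands in `W`, following `Z(L) WITH X(3,2*L-1)` and then `X(I,J) WITH W(J,I)`:

```python
# Compose the two affine alignments from the example above.

def align_z_to_x(l):
    return (3, 2 * l - 1)        # Z(L) lives with X(3, 2L-1)

def align_x_to_w(i, j):
    return (j, i)                # X(I,J) lives with W(J,I) (transpose)

def align_z_to_w(l):             # alignments compose transitively
    return align_x_to_w(*align_z_to_x(l))

print([align_z_to_w(l) for l in (1, 2, 3)])   # [(1, 3), (3, 3), (5, 3)]
```

The results land on rows 1, 3, and 5 of `W`'s third column, matching the "every other element" description in the text.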

Two extensions of the above distribution mechanism are important to HPF.
The first is dynamic data mapping. For some arrays, a single distribution
is not appropriate for the entire program. For example, an efficient
2-dimensional FFT might distribute the array by rows to process the X
dimension, then distribute the array by columns to perform the Y-direction
operations. Another example is using runtime information (such as the
size of an array) to pick the appropriate mapping.
Recognizing this need, HPF provides "executable" versions of the
`ALIGN` and `DISTRIBUTE` directives, called
`REALIGN` and `REDISTRIBUTE`. The syntax and
semantics are the same as for the "declaration" directives, except that
the new forms apply from the time they are executed until another mapping
directive is encountered. That is, the "`RE`" forms have
dynamic scope while the normal forms have static scope. Users should be
aware, however, that `REALIGN` and `REDISTRIBUTE`
will generally cause heavy data movement when they are used.
The other extension to the basic mapping capabilities applies to dummy
arguments. There are several things that one might want to do with a
dummy's mapping:

- Remap the dummy to the procedure's preferred distribution. Some routines may be very inefficient if the data is not on the right processors.
- Leave the data where it is. Some routines are very flexible.
- Leave the data where it is, and tell the compiler where that is. Special-purpose routines may only need to accept one incoming distribution.


Gaussian elimination illustrates array
operations and load balancing for dense linear algebra. The algorithm
solves a system of linear equations by successively updating the matrix
that represents the system. The figure at left shows the pattern of these
updates: at each stage, the blue section of the array is
updated using its own values and the values of the red elements. The
updating operation itself is data-parallel, and thus a good choice for
HPF. There is no data locality in the computation, since each red element
updates several blue ones. However, since the amount of work shrinks at
each stage, it is important to choose a distribution that keeps as many
processors busy as possible. In this case, that is the `CYCLIC`
data distribution.
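The load-balance argument can be checked with a small count: halfway through the elimination, a `BLOCK` mapping leaves the low-numbered processors idle while `CYCLIC` keeps every processor busy. A Python sketch (illustrative, counting only the still-active columns):

```python
# At stage k of the elimination, only columns k+1..n-1 are still
# updated. Count remaining work per processor under two mappings.

def active_work(owner, n, k):
    """Elements still being updated at stage k, per processor."""
    counts = {}
    for j in range(k + 1, n):
        p = owner(j)
        counts[p] = counts.get(p, 0) + 1
    return counts

n, p, k = 16, 4, 8                 # halfway through the elimination
print(active_work(lambda j: j // (n // p), n, k))  # BLOCK: procs 0,1 idle
print(active_work(lambda j: j % p, n, k))          # CYCLIC: all 4 busy
```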

Jacobi iteration illustrates nearest-neighbor communication. The algorithm
solves a simple PDE by discretizing it on a regular mesh and approximating
the equation at each mesh point by a finite difference form. The resulting
set of equations is solved iteratively by updating each array element
using the stencil shown in the figure. Again, this is a classically
data-parallel operation. Unlike Gaussian elimination, however, there is
substantial data locality and no load imbalance to worry about. The
performance of such computations is optimized by HPF's `BLOCK`
distribution.

Conjugate gradient expands
Jacobi to a more realistic numerical method.
The new method uses the same discretization and finite difference
formulation as before, but formulates the problem as an error minimization.
This allows some sophisticated mathematics to be brought to bear on the
problem, leading to a much more efficient iterative scheme. Each
iteration of the conjugate gradient method requires a Jacobi-like
computation of the residual, two dot products, and several vector updates.
All these operations are data-parallel. The best data distribution is
determined by the residual computation, prompting the use of the `BLOCK`
distribution here. More complex problems could be solved by the conjugate
gradient method by replacing the residual routine and choosing a data
distribution appropriate to the new operator.

Finally, relaxation on an irregular mesh
shows what is (and what is not) supported in HPF for irregular
computations. The algorithm updates the nodes in a graph based on values
stored in the edges. The data parallelism is trickier here than in the
previous examples, since each node may be updated by several edges.
However, the HPF library routine `SUM_SCATTER` performs this
operation efficiently. The cost of the scatter will depend on how the
nodes and edges are mapped to processors. Conceptually, one would like
to place the nodes individually. However, this is not possible with HPF's
regular distributions. What is possible is to renumber the nodes so that
nodes near each other in the graph are also near each other in the array.
After this permutation, a simple `BLOCK` distribution will tend
to enhance data locality. Similarly, the edges can be reordered so they
land on the same processor as their endpoints. Such a reordered mesh is
shown in the figure at left. While this example shows that irregular
computations are possible in HPF, it is clear that the program is far from
elegant. Some of the extensions under development for HPF 2 address these
issues.


- Support for task (control) parallelism
- Support for irregular data distribution (and more complex data structures in general)
- Support for parallel I/O

- HPF requires a complex compiler to provide a good implementation
- Many new features of HPF will be even more difficult to implement than the current language


Last modified: April 30, 1996

Chuck Koelbel

chk@cs.rice.edu