OpenMP loop parallelism #
With a parallel region and identification of individual threads, we can actually parallelise loops “by hand”.
Suppose we wish to divide a loop approximately equally between all the threads, by assigning consecutive blocks of the loop iterations to consecutive threads.
Notice here that the number of iterations is not evenly divisible by the number of threads, so we’ve assigned one extra iteration to thread0.
The code that distributes a loop in this way looks something like the below.
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
static inline int min(int a, int b)
{
return a < b ? a : b;
}
int main(void)
{
const int N = 16;
int a[N];
for (int i = 0; i < N; i++) {
/* Sentinel for unhandled value */
a[i] = -1;
}
#pragma omp parallel default(none) shared(N, a)
{
int tid = omp_get_thread_num();
int nthread = omp_get_num_threads();
int chunk = N / nthread + ((N % nthread) > tid);
int start = tid * (N / nthread) + min(tid, N % nthread);
for (int i = start; i < start + chunk; i++) {
a[i] = tid;
}
}
for (int i = 0; i < N; i++) {
printf("a[%2d] = %2d\n", i, a[i]);
}
return 0;
}
Exercise
Run this code for a number of threads between 1 and 8. Convince yourself that it correctly allocates all the loop iterations!
UGH!
Fortunately, there is a better way1.
Worksharing constructs #
Suppose we are in a parallel region, to distribute the work in a loop
amongst the thread team, we use the #pragma omp for
directive.
void foo(/* Some arguments */)
{
#pragma omp parallel default(none) ...
{
...;
/* Loop to parallelise */
#pragma omp for
for (int i = 0; i < N; i++) {
...;
}
}
}
This directive does the job of dividing the loop iterations between the currently active threads. Without the directive, all threads in the team would execute all iterations.
Shorthand
This pattern of parallel region + loop is so common that there is a separate directive that merges the two
#pragma omp parallel for ...
for (...)
;
Is equivalent to
#pragma omp parallel ...
{
#pragma omp for
for (...)
;
}
When is loop parallelisation possible? #
For loop parallelism to be possible, these loops must obey a number of constraints, both in the form of the loop construct, and also in what the loop body contains. Formally, we need the loop to be of the form.
for (var = init; var logical_op end; incr_expr)
...
Where the logical_op
is one of <
, <=
, >
, or >=
and
incr_expr
is an increment expression like var = var + incr
(or
similar). We are also not allowed to modify var
in the loop body.
The major constraint on the loop body is that we cannot have dependencies between loop iterations. Conceptually, a parallel loop runs multiple iterations of the loop body statements in parallel. For this to be valid, we must be able to execute the statements in any order we like.
For example, this loop cannot be straightforwardly parallelised
for (size_t = 1; i < N; i++)
a[i] = a[i-1] + a[i];
Since the $i$th iteration depends on the result of the $i-1$th iteration.
Exercise
Write out the unrolled loop (unrolling by 4) and convince yourself that you can’t reorder the statements in the loop body while maintaining the same semantics.
This particular loop exhibits a read-after-write dependency, some times called a flow dependency. These are “true” dependencies and really inhibit parallelisation. There are also a number of other types, write-after-read (also called anti-dependencies), write-after-write (also called output dependencies), and read-after-read (not really dependencies). As usual, wikipedia has a good summary. For our purposes, read-after-write are the difficult ones to handle. The others can usually be refactored by introducing some temporary variables (as discussed in the linked wikipedia article). Typically, they then might reveal a read-after-write dependency.
The compiler will complain if your looping construct has the wrong form, however, it will not complain if your loop body is not suitable for parallelisation (e.g. it has data dependencies).
Exercise
Try compiling and running the bad loop in the code below
Do you always get the same results on different numbers of threads? What is wrong with the parallel loop?
Solution
We don’t see the same values printed independent of the number of threads, indicating that we did something wrong.
The reason is that the loop body modifies the iteration variable
i
. This is explicitly not allowed by the OpenMP standard (although the compiler does not complain), because if we modify the iteration variable, the compiler cannot figure out how to hand out the loops between threads.
#include <omp.h>
#include <stdio.h>
int main(void)
{
int N = 16;
int a[N];
for (int i = 0; i < N; i++) {
/* Sentinel for unhandled value */
a[i] = -1;
}
#pragma omp parallel for default(none) shared(a, N)
for (int i = 0; i < N; i++) {
a[i] = i;
i++;
}
for (int i = 0; i < N; i++) {
printf("a[%2d] = %2d\n", i, a[i]);
}
return 0;
}
Synchronisation #
There is no synchronisation at the start of a for
work-sharing
construct. By default, however, there is a synchronisation at the
end.
To remove the synchronisation at the end, we can use an additional
nowait
clause.
#pragma omp parallel
{
#pragma omp for
for (...) {
...;
} /* All threads synchronise here */
#pragma omp for nowait
for (...) {
...;
} /* No synchronisation in this case */
}
Doling out iterations #
The for
directive takes a number of additional clauses that allow us
to control how the loop iterations are divided between threads.
We do this with
#pragma omp for schedule(KIND[, CHUNKSIZE])
Where KIND
is one of
static
dynamic
guided
auto
runtime
and CHUNKSIZE
is an expression returning a positive integer.
If you don’t specify anything, this is equivalent to
#pragma omp parallel for schedule(static)
In most cases, the static schedule is the most useful. We’ll list the properties of all of the choices below.
static
schedules
#
schedule(static)
is a block schedule. We divide the total iterations
into (approximately) equal chunks, one for each thread in the team,
and assign the chunks in order to threads.
What if there are leftover iterations?
If specifying a chunk size (for example schedule(static, 3)
) then
we have a block cyclic schedule. Iterations are doled out to threads
in order chunksize
iterations at a time.
static
schedules deterministically allocate loop iterations to
threads. Two loops of the same length with the same static schedule
will dole out iterations in the same way.
dynamic
schedules
#
schedule(dynamic)
is a block cyclic schedule with a block size of
one. Iterations of the loop are doled out on a first-come-first-served
basis. That is as a thread becomes available, it is given the next
iteration.
schedule(dynamic, chunksize)
is the same, only now the block size is
specified by chunksize
.
dynamic
schedules do not provide deterministic allocation of
loop iterations to threads loop iterations to
threads. Two loops of the same length with the same dynamic schedule
will not necessarily dole out iterations in the same way.
guided
schedules
#
schedule(guided)
is a block cyclic schedule where the blocks start
out large and get exponentially smaller, until they reach a minimum
size of one iteration. The size of a block is proportional to the
number of remaining iterations divided by the number of threads.
schedule(guided, chunksize)
is the same, only now the minimum size
of the chunks is specified by chunksize
.
Chunks are doled out to threads on a first-come-first-served basis
(like the dynamic
schedule).
auto
and runtime
schedules
#
The auto
schedule leaves it up to the OpenMP runtime to decide how
to dole out iterations. It may implement some auto-tuning framework
where it measures performance of the loop with a given schedule and
then adjusts that in some clever way. Probably, the implementation
does not do anything clever.
The runtime
schedule allows you to specify the schedule to use at
runtime (rather than compile time) by setting the OMP_SCHEDULE
environment variable to a valid schedule string. For example
#pragma omp for schedule(runtime)
...
$ gcc -fopenmp -o code code.c
$ export OMP_SCHEDULE="static, 5"
$ OMP_NUM_THREADS=4 ./code # uses "static, 5" as the schedule
This is kind of pointless except for some small experiments because
there is only one value of the OMP_SCHEDULE
variable, so all
schedule(runtime)
loops must use the same value.
Exercise
You now know enough (probably more than enough) to attempt the OpenMP loop exercise.
Controlling execution in a parallel region #
Sometimes, we might want to serialise some part of the code in a
parallel region. OpenMP offers us two ways of doing this. If we don’t
care about which thread executes some code, we can use #pragma omp single
#pragma omp parallel
{
...; /* Some stuff in parallel */
#pragma omp single
{
...; /* Only one thread should do this */
} /* All threads synchronise here unless nowait is specified. */
}
For example, we could use this construct to read in an input file after some parallel setup on a single thread.
Alternatively, if we want thread0 to execute something, we can use
#pragma omp master
#pragma omp parallel
{
...; /* Some stuff in parallel */
#pragma omp master
{
...; /* Only thread0 does this */
} /* No synchronisation */
}
/* Kind of pointless, because it's equivalent to */
#pragma omp parallel
{
...;
if (omp_get_thread_num() == 0) {
...;
}
}
Summary #
OpenMP provides a number of constructs for distributing work (in the
form of loop iterations) between threads in a team. This is achieved
with the #pragma omp for
directive (inside an existing parallel
region).
Completely general for-loops are not allowed, because the runtime must be able to determine the number of iterations (and how to get from one iteration to the next) without executing the loop body.
There are a few ways of controlling how the iterations of the loop are
allocated to threads. These are called schedule
s. Generally a
static
schedule is best.
Next we’ll look at how threads can communicate with one another, so that we can do slightly more complicated things than just share loop iterations.
-
Counterpoint, if you write large libraries using OpenMP you will probably end up managing the loop parallelism and distribution by hand in this way anyway. ↩︎