Collectives

OpenMP collectives #

So far we’ve seen how we can create thread teams using #pragma omp parallel and distribute work in loops between members of the team by using #pragma omp for.

Now we’ll look at what we need to do when threads have to communicate with one another.

Reductions #

Remember that the OpenMP programming model allows communication between threads by using shared memory. If some piece of memory is shared in a parallel region then every thread in the team can both read and write to it.

Recall also that there is no synchronisation across accesses to shared memory. Let us consider what this means for computing a vector dot product

$$ a \cdot b = \sum_{i=1}^{N} a_i b_i $$

A serial loop to compute this might look like the following

dot = 0;
for (int i = 0; i < N; i++) {
  dot += a[i]*b[i];
}

A naive parallelisation would just annotate the loop with #pragma omp parallel for.

dot = 0;
#pragma omp parallel for default(none) shared(dot, a, b, N) schedule(static)
for (int i = 0; i < N; i++) {
  dot += a[i]*b[i];
}

However, this has a problem: the shared variable dot is updated by multiple threads, so there is a write race on its increment and we are not guaranteed that every update will be seen.
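
To see why, note that dot += a[i]*b[i] is not a single indivisible operation. Conceptually, each thread performs something like the following steps (an illustrative sketch, not code from the example):

int tmp = dot;       /* 1. read the current value of the shared variable */
tmp += a[i]*b[i];    /* 2. add this thread's contribution                */
dot = tmp;           /* 3. write back, silently discarding any update    */
                     /*    another thread made between steps 1 and 3     */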

I discuss some tools for detecting data races below.

We can test this out with the following code

openmp-snippets/reduction-race.c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int N = 1024;
  int *a = malloc(N * sizeof(*a));
  int *b = malloc(N * sizeof(*b));
  /* Initialise with some values */
  for (int i = 0; i < N; i++) {
    a[i] = i+1;
    b[i] = i*2;
  }
  /* Check */
  int dotabserial = 0;
  for (int i = 0; i < N; i++) {
    dotabserial += a[i]*b[i];
  }

  printf("  Serial a.b = %d\n", dotabserial);

  int dotabparallel = 0;
#pragma omp parallel default(none) shared(a, b, N, dotabparallel)
  {
#pragma omp for schedule(static)
    for (int i = 0; i < N; i++) {
      dotabparallel += a[i]*b[i];
    }
  }
  printf("Parallel a.b = %d\n", dotabparallel);
  free(a);
  free(b);
  return 0;
}

Exercise

This example code computes a dot product in serial and then in parallel. But the parallel version has race conditions.

Try running with different numbers of threads. Do you always get the correct answer in parallel? Do you always get the same wrong answer?

Solution
I, at least, don’t always get the same wrong answer (but I generally get the wrong answer).

The solution to this problem is to create partial sums on each thread, and then accumulate them in a thread-safe way. We could do this like so

openmp-snippets/reduction-hand.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
  const int N = 1024;
  int *a = malloc(N * sizeof(*a));
  int *b = malloc(N * sizeof(*b));
  /* Initialise with some values */
  for (int i = 0; i < N; i++) {
    a[i] = i+1;
    b[i] = i*2;
  }
  /* Check */
  int dotabserial = 0;
  for (int i = 0; i < N; i++) {
    dotabserial += a[i]*b[i];
  }

  printf("  Serial a.b = %d\n", dotabserial);

  int dotabparallel = 0;
  int *dotlocal = NULL;
#pragma omp parallel default(none) shared(a, b, N, dotabparallel, dotlocal)
  {
    int tid = omp_get_thread_num();
#pragma omp single
    dotlocal = calloc(omp_get_num_threads(), sizeof(*dotlocal));
    /* Implicit barrier at end of single is required so that dotlocal
       is defined for the loop */
#pragma omp for schedule(static)
    for (int i = 0; i < N; i++) {
      dotlocal[tid] += a[i]*b[i];
    }
    /* Implicit barrier at end of for is required here so that a
       thread finishing early does not update dotabparallel too soon. */
#pragma omp single nowait
    for (int i = 0; i < omp_get_num_threads(); i++) {
      dotabparallel += dotlocal[i];
    }
  }
  printf("Parallel a.b = %d\n", dotabparallel);
  free(dotlocal);
  free(a);
  free(b);
  return 0;
}

As the comments indicate, all the barriers are quite delicate.

Exercise

Run this code and check that it continues to give you the correct answer independent of the number of threads you use.

Now explore what happens if you accidentally get some of the barriers wrong.

  1. What happens if you add nowait to the single directive that allocates dotlocal?
  2. What happens if you add nowait to the for directive that computes the partial sums? (Remove the nowait from the single directive again!)
Solution

For 1., you probably get a segmentation fault: we need to wait for the array to be allocated before using it, so we must synchronise after the allocation.

For 2., you probably get the wrong answer. The reason is that if one thread finishes in the parallel loop before the others, it will immediately go ahead and start adding up the contributions, so it’ll pick up whatever happens to be the current value in the other entries in dotlocal (which will likely be incomplete).

Directives to the rescue #

We see that writing a collective reduction by hand is possible, but a bit tedious. OpenMP provides facilities to handle all of the gory details by adding an extra reduction clause to the for directive

#pragma omp for schedule(static) reduction(+:dot)
for (int i = 0; i < N; i++) {
  dot += a[i]*b[i];
}

This tells OpenMP that dot is a reduction variable, and that the combining operation is +. It now becomes the compiler’s job to generate appropriate code for privatising the partial reduction contributions from different threads and then combining them.

reduction clauses have some restrictions on the type of data they can combine. We need an associative binary operator (so that the combination can happen in any order) and an identity element for the operation (so that the compiler can generate the right initial value for the privatised reduction variables).
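
In symbols, if we write the combining operation as ⊕ with identity element e, these two requirements are

$$ (x \oplus y) \oplus z = x \oplus (y \oplus z) \quad \text{and} \quad x \oplus e = e \oplus x = x. $$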

Reductions are defined for builtin datatypes (such as int, double) and for the combining operations in the table below

Operation        Operator   Initialisation
Addition         +          0
Multiplication   *          1
Subtraction1     -          0
Minimum          min        Most positive number of given type
Maximum          max        Most negative number of given type
Logical and      &&         1
Logical or       ||         0
Bitwise and      &          ~0 (all bits on)
Bitwise or       |          0

With this, we can rewrite our dot product example

openmp-snippets/reduction-directive.c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int N = 1024;

  int *a = malloc(N * sizeof(*a));
  int *b = malloc(N * sizeof(*b));
  /* Initialise with some values */
  for (int i = 0; i < N; i++) {
    a[i] = (i+1) * (2 - (i % 5));
    b[i] = i*2;
  }
  /* Check */
  int dotabserial = 0;
  for (int i = 0; i < N; i++) {
    dotabserial += a[i]*b[i];
  }

  printf("  Serial a.b = %d\n", dotabserial);
  int dotabparallel = 0;
#pragma omp parallel for schedule(static) default(none) \
  shared(a, b, N) reduction(+:dotabparallel)
  for (int i = 0; i < N; i++) {
    dotabparallel += a[i]*b[i];
  }
  printf("Parallel a.b = %d\n", dotabparallel);
  free(a);
  free(b);
  return 0;
}

This is both more succinct, and potentially better optimised. For example, the compiler might implement a tree reduction (rather than the linear reduction I coded above).
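
To get a feel for what a tree reduction might look like, here is a minimal sketch (not part of the course code) that replaces the final single-threaded accumulation from reduction-hand.c with a pairwise combination of the per-thread partial sums:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
  const int N = 1024;
  int *a = malloc(N * sizeof(*a));
  int *b = malloc(N * sizeof(*b));
  for (int i = 0; i < N; i++) {
    a[i] = i+1;
    b[i] = i*2;
  }
  int dot = 0;
  int *partial = NULL;
#pragma omp parallel default(none) shared(a, b, N, dot, partial)
  {
    int tid = omp_get_thread_num();
    int nthread = omp_get_num_threads();
#pragma omp single
    partial = calloc(nthread, sizeof(*partial));
    /* Implicit barrier: partial is allocated before anyone touches it */
#pragma omp for schedule(static)
    for (int i = 0; i < N; i++) {
      partial[tid] += a[i]*b[i];
    }
    /* Implicit barrier: every partial sum is complete */
    /* Pairwise (tree) combination of the partial sums */
    for (int stride = 1; stride < nthread; stride *= 2) {
      if (tid % (2*stride) == 0 && tid + stride < nthread) {
        partial[tid] += partial[tid + stride];
      }
      /* Every thread runs the same number of rounds, so every thread
         reaches this barrier on each round */
#pragma omp barrier
    }
#pragma omp single nowait
    dot = partial[0];
  }
  printf("a.b = %d\n", dot);
  free(partial);
  free(a);
  free(b);
  return 0;
}

Each round halves the number of active threads, so the combination takes about log2(nthread) rounds instead of a single thread performing nthread additions.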

Where can you use reduction clauses? #

Reduction clauses are not restricted to loops. Suppose that you have a parallel region where the threads each do a single expensive operation and then combine the results; a reduction clause on the parallel directive is entirely appropriate here too.
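
For example, a minimal sketch along these lines, where expensive_operation is a hypothetical placeholder for some real per-thread work:

#include <stdio.h>
#include <omp.h>

/* Hypothetical stand-in for an expensive per-thread computation */
static double expensive_operation(int tid)
{
  return (double)(tid + 1);
}

int main(void)
{
  double result = 0;
#pragma omp parallel default(none) reduction(+:result)
  {
    /* Each thread does one expensive thing; OpenMP combines the
       per-thread results at the end of the region */
    result += expensive_operation(omp_get_thread_num());
  }
  printf("Combined result = %g\n", result);
  return 0;
}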

Inter-thread synchronisation #

Sometimes we’ll want to do something other than just compute a reduction. In these cases, if we want to pass information between threads, we need to synchronise the reads and writes to shared memory.

Barriers #

OpenMP has a number of constructs for this purpose. We’ve already implicitly seen one of them, namely barriers. In a parallel region, #pragma omp barrier can be used to synchronise all threads in the team.

#pragma omp parallel default(none) shared(a, b)
{
  int tid = omp_get_thread_num();
  int nthread = omp_get_num_threads();
  /* Produce some thread-specific data */
  a[tid] = some_computation(tid, ...);
  #pragma omp barrier
  /* Read data from a neighbouring thread */
  b[tid] = a[(tid+1)%nthread] + ...;
}

Without the barrier, there is no synchronisation between the writes to a and the reads when updating b. We would therefore likely get the wrong answer.

The barrier ensures that no thread attempts the update to b before all threads have arrived at the barrier (after updating a).

Either all threads in the team must encounter the barrier, or none of them; otherwise we get a deadlock.
openmp-snippets/bad-barrier.c
#include <omp.h>
#include <stdio.h>


int main(void)
{
  const int N = 16;
  int a[N];
  for (int i = 0; i < N; i++) {
    a[i] = -1;
  }
#pragma omp parallel default(none) shared(a, N)
  {
    int tid = omp_get_thread_num();
    int nthread = omp_get_num_threads();

    a[tid] = tid + nthread;

    if (tid % 2 == 0) {
      /* deadlock! */
      #pragma omp barrier
    }
    a[tid] = a[(tid + 1)%nthread];
  }
  for (int i = 0; i < N; i++) {
    printf("a[%2d] = %2d\n", i, a[i]);
  }
  return 0;
}

Exercise

Try this deadlock situation out with the above code.

Does it run successfully with one thread? What about two or three?

Hint
To terminate the hanging program, type Control-c at the commandline.
Solution
It should work fine with one thread, but not more than one.

Recall that barriers are often implicit in worksharing constructs. So if we are in a parallel region we do not need an explicit barrier between two #pragma omp for directives for synchronisation, because #pragma omp for already has an implicit barrier at the end of the loop.
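
For example, in the sketch below the implicit barrier at the end of the first loop guarantees that every entry of a has been written before any thread reads a neighbouring entry in the second loop:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int N = 1024;
  double *a = malloc(N * sizeof(*a));
  double *b = malloc(N * sizeof(*b));
#pragma omp parallel default(none) shared(a, b, N)
  {
#pragma omp for schedule(static)
    for (int i = 0; i < N; i++) {
      a[i] = 2.0*i;
    }
    /* Implicit barrier here: all of a is written before any thread
       starts the second loop */
#pragma omp for schedule(static)
    for (int i = 0; i < N; i++) {
      b[i] = a[i] + a[(i+1)%N];
    }
  }
  printf("b[0] = %g\n", b[0]);
  free(a);
  free(b);
  return 0;
}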

Critical sections and atomics #

Sometimes a barrier synchronisation is a little heavy-handed. For this, OpenMP provides two further constructs: one for protecting code and one for protecting variables.

The first is the critical section #pragma omp critical. This directive specifies that a region of code must be executed by only one thread at a time.

Here’s a trivial example, counting the number of threads in a parallel region (don’t do this, use omp_get_num_threads()).

int nthread = 0;
#pragma omp parallel default(none) shared(nthread)
{
  #pragma omp critical
  {
    nthread += 1;
  }
}

Critical sections are more useful if you’re parallelising over a shared data structure.

For example, consider a shared task stack of work from which we can pop work and push work. In pseudo-code, this looks approximately like the below.

stack = ...;              /* Create initial work */
while (1) {
  task = pop(stack);      /* Get the next work item */
  if (task == NULL) {
    break;                /* No work left, exit loop */
  }
  newtask = work(task);   /* Do work, potentially producing a new task */
  if (newtask != NULL) {
    push(newtask, stack); /* If there's a new task, add it to the stack */
  }
}

We can parallelise this with OpenMP, but we need to make sure there are no race conditions when pushing and popping from the shared stack data structure. This can be achieved with critical sections

stack = ...;
#pragma omp parallel default(none) shared(stack)
{
  while (1) {
#pragma omp critical (modifystack)
    {
      task = pop(stack);
    }
    if (task == NULL) {
      break;
    }
    newtask = work(task);
    if (newtask != NULL) {
#pragma omp critical (modifystack)
      push(newtask, stack);
    }
  }
}

Here we protect the modification of the stack by critical sections.

Points to note:

  1. I gave the critical sections an optional name (modifystack). All critical sections with the same name synchronise with one another; if no name is provided, the section synchronises with every other unnamed critical section (see the sketch below).
  2. We need the critical sections to have the same name for the push and pop because both of these sections modify the same shared data structure.
  3. This probably isn’t a very good implementation because threads might exit the loop too early (if they pop from an empty stack before new work is pushed by another thread).
The design of high-performance data structures for these kinds of irregular computations is actually an ongoing area of research. If you’re interested, look at some of the work that the Galois team are doing.
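
To illustrate the naming rules in point 1, here is a small sketch in which two unrelated counters are protected by differently named critical sections, so updates to one never wait on updates to the other:

#include <stdio.h>

int main(void)
{
  int apples = 0;
  int oranges = 0;
#pragma omp parallel default(none) shared(apples, oranges)
  {
    /* Differently named sections use different locks, so a thread
       updating apples never waits for a thread updating oranges.
       Unnamed critical sections would all share a single lock. */
#pragma omp critical (countapples)
    apples += 1;
#pragma omp critical (countoranges)
    oranges += 1;
  }
  printf("apples = %d, oranges = %d\n", apples, oranges);
  return 0;
}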

Exercise

Modify the reduction-hand.c example to use a critical section to ensure the result is always correct.

Solution
This was actually the topic of the synchronisation exercise, so see the solutions there.

Atomics #

Even finer-grained than critical sections are #pragma omp atomic directives. These can be used to protect (some) updates to shared variables.

int nthread = 0;
#pragma omp parallel default(none) shared(nthread)
{
  #pragma omp atomic
    nthread += 1;
}

An atomic directive protects a variable (rather than a region of code, as a critical section does). In particular, it protects the update of the variable on the left hand side of an assignment. The allowed form is one of

  1. x op= expr where op is one of +, -, *, /, &, ^, |, <<, or >>
  2. x++, ++x, x--, --x.

Note that the evaluation of expr in the first form is not protected, it is only the write to x that is protected.
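
As a concrete sketch (the binning rule here is made up purely for illustration), the atomic below protects only the update of the shared array element; computing which bin to update is ordinary, unprotected code:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  const int N = 1024;
  const int NBINS = 16;
  int *bins = calloc(NBINS, sizeof(*bins));
#pragma omp parallel for default(none) shared(bins, N, NBINS) schedule(static)
  for (int i = 0; i < N; i++) {
    int bin = (i*7) % NBINS; /* made-up binning rule: this part is
                                ordinary, unprotected code */
#pragma omp atomic
    bins[bin] += 1;          /* only this update of the shared array
                                element is made atomic */
  }
  for (int i = 0; i < NBINS; i++) {
    printf("bins[%2d] = %d\n", i, bins[i]);
  }
  free(bins);
  return 0;
}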

Exercise

We can use these synchronisation constructs to implement different approaches to the reduction example. The openmp exercise on reductions does this.

Tools for detecting data races #

We need to be careful when writing parallel code that we do not accidentally introduce race conditions that produce incorrect results. There are some tools available to help with this. On Linux-based systems you can use Helgrind, which is part of Valgrind.

Modern versions of GCC and Clang also offer a thread sanitizer mode, enabled with -fsanitize=thread. Again, this is transparently supported on Linux, but seemingly not on macOS.

Often, thinking hard is your best bet.

Sometimes these tools will report false positives, and you need to work a little bit to eliminate them. See this nice article for more information.

Exercise

Try compiling and running the reduction-race.c example using GCC with thread sanitizer enabled. Run with two threads; does it help you to understand the race condition?

On Hamilton load the gcc/9.3.0 module; on COSMA load the gnu_comp/10.2.0 module. You should then compile with

$ gcc -fopenmp -g -fsanitize=thread -o race reduction-race.c

The -g adds debug symbols so that we see line numbers in the error reports.
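
To run the result with two threads you can then use, for example

$ OMP_NUM_THREADS=2 ./race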

Solution

If I do this and then run with two threads, I see output like the following:

WARNING: ThreadSanitizer: data race (pid=8685)
  Read of size 4 at 0x7ffcc8e32e54 by thread T1:
    #0 main._omp_fn.0 /ddn/home/vtdb72/phys52015/code/openmp-snippets/reduction-race.c:27 (foo+0x400d35)
    #1 gomp_thread_start ../../../gcc-9.3.0/libgomp/team.c:123 (libgomp.so.1+0x19ec5)

  Previous write of size 4 at 0x7ffcc8e32e54 by main thread:
    #0 main._omp_fn.0 /ddn/home/vtdb72/phys52015/code/openmp-snippets/reduction-race.c:27 (foo+0x400d4f)
    #1 GOMP_parallel ../../../gcc-9.3.0/libgomp/parallel.c:171 (libgomp.so.1+0x10fc1)
    #2 __libc_start_main <null> (libc.so.6+0x22504)

  Location is stack of main thread.

  Location is global '<null>' at 0x000000000000 ([stack]+0x00000001fe54)

  Thread T1 (tid=8687, running) created by main thread at:
    #0 pthread_create ../../../../gcc-9.3.0/libsanitizer/tsan/tsan_interceptors.cc:964 (libtsan.so.0+0x2cd6b)
    #1 gomp_team_start ../../../gcc-9.3.0/libgomp/team.c:836 (libgomp.so.1+0x1a4e5)
    #2 __libc_start_main <null> (libc.so.6+0x22504)

This tells me that multiple threads had a write race on line 27, which is where the dotabparallel variable is updated.

Summary #

As well as straightforward loop parallelisation, OpenMP also provides a number of constructs to help with accumulation of results and synchronisation when updating shared variables. The biggest hammer is a barrier, but these can often be avoided in favour of more fine-grained directives. The most useful is probably the reduction clause for loops.


  1. Yes, subtraction isn’t associative, so it doesn’t give us a monoid. The behaviour of OpenMP is to treat this like +: sum all the partial results and then multiply by -1 at the end. ↩︎