What is OpenMP?

OpenMP is a standardised API for programming shared memory computers (and more recently GPUs) using threading as the programming paradigm. It supports both data-parallel shared memory programming (typically for parallelising loops) and task parallelism. We’ll see some examples later.

In recent years, it has also gained support for some vector-based parallelism.

Using OpenMP #

OpenMP is implemented as a set of extensions for C, C++, and Fortran. These extensions come in three parts:

  1. #pragma-based directives;
  2. runtime library routines;
  3. environment variables for controlling runtime behaviour.

OpenMP is an explicit model of parallel programming. It is your job, as the programmer, to decide where and how to employ parallelism.

Directives #

We already saw some directives when discussing vectorisation. In that case, we saw compiler-specific directives. Since OpenMP is a standard, the meaning of a directive is the same whichever compiler you choose [1].

All OpenMP directives start with #pragma omp. They are therefore ignored unless the code is compiled with a flag that enables OpenMP support:

-qopenmp (Intel compiler)
-fopenmp (GCC)

We can parallelise a loop like so:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  ...
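
As a slightly fuller sketch of the same idea, the function below applies the directive to a complete loop. The name axpy and its arguments are made up for illustration and do not come from these notes. Each iteration touches a different entry of y, so the iterations can safely be divided among threads. If the code is compiled without the OpenMP flag, the pragma is ignored and the loop simply runs serially.

```c
/* Hypothetical example: y <- y + alpha * x for vectors of length n. */
void axpy(int n, double alpha, const double *x, double *y)
{
  /* The iterations are independent, so they can be shared out among
   * the threads. The loop variable i is private to each thread. */
#pragma omp parallel for
  for (int i = 0; i < n; i++) {
    y[i] += alpha * x[i];
  }
}
```

Calling axpy(4, 2.0, x, y) with x = {1, 2, 3, 4} and y = {10, 20, 30, 40} leaves y = {12, 24, 36, 48}, regardless of how many threads are used.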

Library routines #

In addition to the directives, which are used to enable parallelism through annotation, the OpenMP standard also provides a number of runtime API calls. These allow threads to inspect the state of the program (for example, to ask which thread they are). To use them, we must include the C header file omp.h. All OpenMP API calls are prefixed with omp_.

#include <omp.h>

...
#pragma omp parallel
{
  /* Which thread am I? */
  int threadid = omp_get_thread_num();
  /* How many threads are currently executing */
  int nthread = omp_get_num_threads();
}

Unlike the pragmas, these runtime calls are only available when compiling with OpenMP enabled [2].
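
One common way to keep such code buildable without OpenMP is to test the _OPENMP preprocessor macro, which the standard says is defined when OpenMP compilation is enabled. The following is a minimal sketch of that approach. The wrapper names my_thread_id and my_num_threads are hypothetical and are not part of OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical wrapper: id of the calling thread, 0 in a serial build. */
int my_thread_id(void)
{
#ifdef _OPENMP
  return omp_get_thread_num();
#else
  return 0;
#endif
}

/* Hypothetical wrapper: size of the current team, 1 in a serial build. */
int my_num_threads(void)
{
#ifdef _OPENMP
  return omp_get_num_threads();
#else
  return 1;
#endif
}
```

Outside a parallel region, both builds report thread 0 in a team of 1. Inside a parallel region, only the OpenMP build reports more than one thread.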

Environment variables #

The primary variable is OMP_NUM_THREADS, which specifies the number of threads that should be available to the program when running.

By default, this is the number of threads in the team created when entering a parallel region.

hello/openmp.c
#include <stdio.h>
#include <omp.h>

int main(void)
{
  int nthread = omp_get_max_threads();
  int thread;
#pragma omp parallel private(thread) shared(nthread)
  {
    thread = omp_get_thread_num();
    printf("Hello, World! I am thread %d of %d\n", thread, nthread);
  }
  return 0;
}

$ icc -qopenmp -o hello openmp.c
$ export OMP_NUM_THREADS=1
$ ./hello
Hello, World! I am thread 0 of 1
$ export OMP_NUM_THREADS=8
$ ./hello
Hello, World! I am thread 5 of 8
Hello, World! I am thread 3 of 8
Hello, World! I am thread 6 of 8
Hello, World! I am thread 1 of 8
Hello, World! I am thread 0 of 8
Hello, World! I am thread 2 of 8
Hello, World! I am thread 4 of 8
Hello, World! I am thread 7 of 8

Exercise

Try this yourself. For instructions you can also see the Hello World exercise.

What do you observe if you never set OMP_NUM_THREADS?

Hint
Use unset OMP_NUM_THREADS to ensure there is no existing value of the variable.

The number of threads used by an OpenMP program if you do not set OMP_NUM_THREADS explicitly is up to the particular implementation.

I therefore recommend that you always set OMP_NUM_THREADS explicitly.

Parallel constructs #

The basic parallel construct is a parallel region. This is introduced with #pragma omp parallel:

/* Some serial code on thread0 */
#pragma omp parallel /* Extra threads created */
{
  /* This code is executed in parallel by all threads. */
  ...;
} /* Synchronisation waiting for all threads */
/* More serial code on thread0 */

The program begins by executing using a single thread (commonly termed the master thread, though we’ll use the phrase “thread0”). When a parallel region is encountered, a number of additional threads are created; together with thread0 they form a “team”. All threads in the team execute the code inside the parallel region. There is then a synchronisation point at the end of the region, after which thread0 continues execution of the next (serial) statements.

This is a fork/join programming model, and is therefore best analysed using the bulk synchronous parallel (BSP) abstract model.

Schematic of fork-join parallelism. Single-threaded execution above (with two regions of parallel tasks). Parallel execution with fork-join points marked below.

void foo(...)
{
  ...;
  #pragma omp parallel
  {
    parallel_code;
  } /* Synchronisation here */
  serial_code;
  #pragma omp parallel
  {
    more_parallel_code;
  } /* Synchronisation here */
}
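
As a concrete sketch of this fork/join pattern, the hypothetical function team_size below (not part of these notes) starts on thread0, forks a team, lets thread0 record how many threads are in the team, and returns that number after the join. The _OPENMP guard is only there so that the sketch also builds serially, in which case it reports a team of one.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: number of threads in the team that a parallel
 * region creates (1 if compiled without OpenMP). */
int team_size(void)
{
  int n = 1; /* Serial code: only thread0 is running here. */
#pragma omp parallel shared(n) /* Fork: extra threads created. */
  {
#ifdef _OPENMP
    /* Only thread0 writes to the shared n, so there is no data race. */
    if (omp_get_thread_num() == 0) {
      n = omp_get_num_threads();
    }
#endif
  } /* Join: synchronisation waiting for all threads. */
  return n; /* Serial code again, on thread0. */
}
```

Printing the result of team_size() for different values of OMP_NUM_THREADS is one way to see how many threads a parallel region actually receives.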

Data scoping: shared and private #

Inside a parallel region, any variables that are in scope can either be shared or private. All threads see the same (single) copy of any shared variables and can read and write to them. Private variables are individual to each thread: there is one copy of the variable per thread. These variables are not visible to any other threads in the region, and can thus only be read and written by their own thread.

We declare the visibility of variables by providing extra clauses to the parallel directive:

openmp-snippets/parallel-region.c
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
void foo(double *a, int N)
{
  int i;
#pragma omp parallel shared(a, N) private(i)
  {
    int j; /* This variable is local to the block (and hence private) */

    /* Each thread has its own copy of i. */
    i = omp_get_thread_num();
    j = i;
    if (i%2 == 0) {
      /* Fake "delay" of some threads. */
      usleep(10);
    }
    /* All threads write to the same a and read the same N. */
    if (j < N) a[j] = i;
  }
}

int main(int argc, char **argv)
{
  int N = 32;
  double a[32];

  for (int i = 0; i < N; i++) {
    a[i] = -1;
  }

  foo(a, N);

  for (int i = 0; i < N; i++) {
    if (a[i] > -1) {
      printf("a[%2d] = %g\n", i, a[i]);
    }
  }
}

Exercise

Compile and run this code with different numbers of threads.

Convince yourself you understand what is going on.

Default data scoping #

If we have a lot of variables that we want to access in the parallel region, it is slightly tedious to explicitly list them.

Consider, for example, the following snippet

double *a;
int N;
int i, j;
#pragma omp parallel shared(a) private(i)
{
  /* a is shared, i is private (explicitly) */
  /* N and j are also in scope, so they must be either shared or private. */
}

The OpenMP design board took the decision that there should be a default scoping rule for variables.

This was a terrible decision.

Neither choice is particularly satisfying, but the default, in the absence of specification, is that variables are shared.

This is bad, because if we write seemingly innocuous code, it can behave badly when run in parallel.

Try compiling and running this code, very similar to the parallel-region example we saw above

openmp-snippets/parallel-region-bad.c
#include <omp.h>
#include <stdio.h>
#include <unistd.h>
void foo(double *a, int N)
{
  int i;
#pragma omp parallel shared(a, N)
  {
    int j; /* This variable is local to the block (and hence private) */
    i = omp_get_thread_num();
    j = i;
    if (i%2 == 0) {
      /* Fake "delay" of some threads. */
      usleep(10);
    }
    /* All threads write to the same a and read the same N. */
    if (j < N) a[j] = i;
  }
}

int main(int argc, char **argv)
{
  int N = 32;
  double a[32];

  for (int i = 0; i < N; i++) {
    a[i] = -1;
  }

  foo(a, N);

  for (int i = 0; i < N; i++) {
    if (a[i] > -1) {
      printf("a[%2d] = %g\n", i, a[i]);
    }
  }
}

Exercise

Compile and run the code using 8 threads a number of times. What do you observe about the output?

Can you explain what is happening?

Hint
Think about the potential data races.
Solution

If I run this code on eight threads, I see:

$ OMP_NUM_THREADS=8 ./bad-region
a[ 0] = 7
a[ 1] = 1
a[ 2] = 7
a[ 3] = 3
a[ 4] = 7
a[ 5] = 5
a[ 6] = 6
a[ 7] = 7

The values sometimes change from run to run.

What is happening is that the i variable, which records the thread number in the parallel region, is shared (rather than being private). So by the time a thread gets to the point where it writes i into the output array, the value has probably been overwritten by another thread.

Fortunately, there is a way to negate this bad design decision: the default clause. If we write

#pragma omp parallel default(none)
{
}

then we must explicitly provide the scoping rules for all variables we use in the parallel region. This forces us to think about what the right scope should be.

My recommendation is to always use default(none) in parallel directives. It might start out tedious, but it will save you many subtle bugs!
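
As an illustration, here is one possible way to rewrite foo from parallel-region-bad.c with default(none). This is a sketch rather than the only fix, and the name foo_fixed is made up here. The thread number is now declared inside the block, so each thread has its own copy, and the compiler would reject any outer variable that is used without an explicit scope. The _OPENMP guard only exists so that the sketch also builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical rewrite of foo from parallel-region-bad.c. */
void foo_fixed(double *a, int N)
{
  /* With default(none), every variable from the enclosing scope that is
   * used in the region must be given an explicit scope. */
#pragma omp parallel default(none) shared(a, N)
  {
    /* Declared inside the block, so each thread has its own copy. */
    int i = 0;
#ifdef _OPENMP
    i = omp_get_thread_num();
#endif
    /* Each thread writes only to its own entry of the shared array. */
    if (i < N) a[i] = i;
  }
}
```

With this version, every entry that is written holds its own index, however the threads are scheduled.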

Values on entry #

Shared variables keep, inside the parallel region, the value they had outside it. Private variables are uninitialised on entry to the region.

openmp-snippets/uninitialised.c
#include <omp.h>
#include <stdlib.h>
#include <stdio.h>
int main(void)
{
  int b = 42;
  double *a = malloc(100*sizeof(*a));

#pragma omp parallel default(none) shared(a) private(b)
  {
    a[omp_get_thread_num()] = 2;

    printf("%d %d\n", omp_get_thread_num(), b);
  }
  free(a);
  return 0;
}

At least some compilers will warn about this.

$ gcc-10 -Wall -Wextra -o uninitialised uninitialised.c -fopenmp
uninitialised.c: In function 'main._omp_fn.0':
uninitialised.c:13:5: warning: 'b' is used uninitialized in this function [-Wuninitialized]
   13 |     printf("%d %d\n", omp_get_thread_num(), b);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
uninitialised.c:6:7: note: 'b' was declared here
    6 |   int b = 42;
      |       ^

Running the code produces some (possibly surprising) results.

$ OMP_NUM_THREADS=8 ./uninitialised
4 32642
1 32642
3 32642
2 32642
0 0
6 32642
7 32642
5 32642

Exercise

If you do this, do you always see the same nonsense values? Does it depend on the compiler?

Solution
I, at least, don’t always see the same values, although for me the copy on thread0 always seems to be zero.

If you really need a private variable that takes its initial value from the surrounding scope you can use the firstprivate clause. But it is rare that this is necessary.

int b = 23;
double *a = malloc(4*sizeof(*a));
#pragma omp parallel default(none) firstprivate(b) shared(a)
{
  int i = omp_get_thread_num();
  if (i < 4) {
    a[i] = b + i;
  }
}
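
The snippet above can be turned into a complete function along the following lines. This is a sketch under assumptions: the name fill_offsets and its arguments are hypothetical, and the _OPENMP guard is only there so that it also builds without OpenMP. Each thread starts with its own copy of b holding the value passed in, and changes to that copy are not seen by other threads.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: thread i stores b + i in a[i], for i < n. */
void fill_offsets(double *a, int n, int b)
{
#pragma omp parallel default(none) firstprivate(b) shared(a, n)
  {
    int i = 0;
#ifdef _OPENMP
    i = omp_get_thread_num();
#endif
    /* b is a per-thread copy initialised from the value on entry,
     * so this update does not interfere with other threads. */
    b += i;
    if (i < n) a[i] = b;
  }
}
```

Had b been declared private instead, the update b += i would read an uninitialised value.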

Summary #

OpenMP provides a directives + runtime library approach to shared memory parallelism using threads. It uses the fork/join model of parallel execution. Threads can diverge in control-flow, and can either share variables, or have their own copies.

With the ability to create threads and have a unique identifier for each thread, it is possible to program a lot of parallel patterns “by hand”, for example dividing loop iterations up between threads.
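
To make the “by hand” approach concrete, here is a minimal sketch of dividing a loop between threads using only a parallel region and the thread identifier. The helper names chunk_bounds and scale_by_hand are hypothetical, and the block partitioning used (contiguous chunks, with the first n % nthread threads taking one extra iteration) is just one reasonable choice. The _OPENMP guard is only there so that the sketch also builds without OpenMP.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: half-open range [*start, *end) of iterations
 * owned by thread tid when n iterations are split over nthread threads.
 * The first n % nthread threads get one extra iteration. */
void chunk_bounds(int n, int nthread, int tid, int *start, int *end)
{
  assert(nthread > 0 && tid >= 0 && tid < nthread);
  int base = n / nthread;
  int rem = n % nthread;
  *start = tid * base + (tid < rem ? tid : rem);
  *end = *start + base + (tid < rem ? 1 : 0);
}

/* Scale every entry of x by alpha, dividing the loop by hand. */
void scale_by_hand(int n, double alpha, double *x)
{
#pragma omp parallel default(none) shared(n, alpha, x)
  {
    int tid = 0, nthread = 1;
#ifdef _OPENMP
    tid = omp_get_thread_num();
    nthread = omp_get_num_threads();
#endif
    int start, end;
    chunk_bounds(n, nthread, tid, &start, &end);
    /* Each thread updates only its own contiguous chunk of x. */
    for (int i = start; i < end; i++) {
      x[i] *= alpha;
    }
  }
}
```

For example, with n = 10 and four threads the chunks are [0, 3), [3, 6), [6, 8) and [8, 10), so every iteration is executed exactly once.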

This is rather hard work, so OpenMP provides a number of additional directives that help with the parallelisation of loops. We will look at these next.


  1. Modulo bugs in the compiler’s implementation.

  2. The Intel compiler supports a “stub” library that you can use with -qopenmp-stubs if you want to compile in serial.