Parallelisation of a simple stencil #
We’ll be running these exercises on Hamilton or COSMA, so remind yourself of how to log in and transfer code if you need to.
Blurring an image #
One can blur or smooth the edges of an image by convolving the image with a normalised box kernel. Every output pixel \( g_{k, l} \) is created from the mean of the input image pixel \(f _{k, l}\) and its eight neighbours.
$$ g_{k,l} = \frac{1}{9} \begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1\\ 1&1&1 \end{bmatrix} * f_{k, l} $$
This can be implemented by a loop over every pixel of the image, accessing some a small stencil of data.
This computational pattern appears in both image processing and finite difference discretisations of partial differential equations (there is more on the computational aspects of this in COMP52315, and the numerics in COMP52215 if you’re interested).
The code and images are in the repository in the
code/blur_image/
subdirectory.
$ cd code/blur_image
$ ls
images openmp vec
The images
directory contains images in
PPM format for that will serve
as input to our program. We’re going to be working in the openmp
subdirectory.
$ cd openmp
$ ls
Makefile filters.c io.c main.c proto.h
There is some source code and a
Makefile
that provides a
recipe for how to build the executable. It is just a text file and can
be edited with your favourite text editor. By running make
you build
the executable.
Before we do this, we’ll have to load the right compiler modules. We’ll use the intel compiler for this exercise, since it produces better reports than gcc.
intel/xe_2018.2
gcc/9.3.0
intel_comp/2018
To confirm that everything works, run the code on one of the input images to blur it.
Parallelisation #
The code is not yet parallelised. You should parallelise the
blur_mean
function in filters.c
, using OpenMP.
What kind of parallelisation is appropriate here? What schedule should you use?
Solution
#pragma omp parallel for
with a static schedule.
code/blur_image/openmp/filters-solution.c
implements this scheme.
The code reports when it is running with OpenMP enabled. If you do not
see this, even after parallelisation, check that you enabled the
relevant compiler flags in the Makefile
.
Parallel scaling #
Having successfully parallelised the loop we’ll look at the parallel scaling of the problem.
Question
What type of scaling is the appropriate one to consider here?
Solution
Since the total amount of work is fixed, strong scaling is appropriate. We are interested in how fast we can produce the final image as we add more processes (to the same size problem).
Exercise
Investigate the parallel scaling by running the code in parallel using different numbers of threads
Hint
Remember that you control the number of threads by setting theOMP_NUM_THREADS
environment variable. Don’t forget to do this in your submission script.Use 1, 2, 4, 6, 8, 16, 24, 32, and 48 threads.Use 1, 2, 4, 6, 8, 16, and 32 threads.Produce a plot of your results comparing the observed speedup to an ideal (linear speedup). What do you observe?
To make a problem that runs for a reasonable amount of time you probably need to use the large sample image (
landscape.ppm
). You may also need to increase the size of the blur filter from the defaultn=1
(editmain.c
to do this).Solution
I used a static schedule. I get slightly different speedup behaviour with
n=1
ton=10
, which the graph below shows.To grab the timing data in a loop, I just did the following
for n in 1 2 4 6 8 16 24 32 48; do OMP_NUM_THREADS=$n ./blur ../images/landscape.ppm output.ppm; done | grep Blurring | cut -f 2 -d :
Then I manually copied and plotted with matplotlib.
Note that with $n=10$, the overall time is much longer than with $n=1$.
A Hamilton compute node has only 24 cores, so adding more than 24 threads does not help (indeed it harms). For small $n$, more than 8 threads does not really help. I think this is because the memory bandwidth is maxed out.
Exercise
Having produce a scaling plot for the loop schedule you selected, try repeating the experiments with different loop schedules. For example, if you used a
static
schedule, try with adynamic
orguided
schedule.What do you observe?
Can you explain your results thinking about whether the computational cost is variable depending on which pixel in the image you are blurring?
Solution
I run the main loop with
schedule(runtime)
to control the schedule and dofor schedule in static static,100 dynamic dynamic,100 guided guided,100; do echo $schedule for n in 1 2 4 6 8 16 24; do OMP_SCHEDULE=$schedule OMP_NUM_THREADS=$n ./blur ../images/landscape.ppm output.ppm; done | grep Blurring | cut -f 2 -d : done
This time I only ran with up to 24 threads, since we already determined that more than that number is not very helpful. I also only used $n=1$ for this case
I see the following scaling
This time it looks like the guided schedule is best, but this might be misleading, since each line is normalised to itself.
We can replot these data, producing the speedup curve relative to the best single-thread performance over all schedules. Let’s see what that looks like
In this case, it looks like the static schedule is now a bit worse. I am not sure exactly what is going on, and one would need to do more detailed investigation and profiling to find out.