One can blur or smooth the edges of an image by convolving the image
with a normalised box
kernel. Every output pixel \( g_{k, l} \) is created from the mean
of the input image pixel \( f_{k, l} \) and its eight neighbours.
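Written out for this 3 × 3 case, the mean is
\[
g_{k, l} = \frac{1}{9} \sum_{i=-1}^{1} \sum_{j=-1}^{1} f_{k+i, l+j}.
\]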
This can be implemented by a loop over every pixel of the image,
accessing a small stencil of data around each one.
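As a rough sketch (the function and variable names here are illustrative and do not match filters.c exactly, and boundary pixels are skipped for brevity), the serial loop for a greyscale image might look like this:

/* Sketch only: 3x3 mean blur of a w x h greyscale image f into g,
 * skipping the boundary pixels for simplicity. */
void blur_mean_sketch(int w, int h, const float *f, float *g)
{
  for (int k = 1; k < h - 1; k++) {
    for (int l = 1; l < w - 1; l++) {
      float sum = 0.0f;
      /* Accumulate the pixel and its eight neighbours. */
      for (int i = -1; i <= 1; i++) {
        for (int j = -1; j <= 1; j++) {
          sum += f[(k + i) * w + (l + j)];
        }
      }
      g[k * w + l] = sum / 9.0f;
    }
  }
}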
This computational pattern appears in both image processing and finite
difference discretisations of partial differential equations (there is
more on the computational aspects of this in
COMP52315, and the numerics in
COMP52215
if you’re interested).
The code and images are in the repository in the
code/blur_image/ subdirectory.
$ cd code/blur_image
$ ls
images openmp vec
The images directory contains images in
PPM format that will serve
as input to our program. We’re going to be working in the openmp subdirectory.
$ cd openmp
$ ls
Makefile filters.c io.c main.c proto.h
There is some source code and a
Makefile that provides a
recipe for building the executable. The Makefile is just a text file and can
be edited with your favourite text editor. Running make builds the
executable.
Before we do this, we’ll have to load the right compiler modules.
We’ll use the Intel compiler for this exercise, since it produces
better reports than gcc.
intel/xe_2018.2
gcc/9.3.0
intel_comp/2018
To confirm that everything works, run the code on one of the input
images to blur it.
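For example, after building with make (any of the supplied images will do; the large landscape image is used here):

$ make
$ ./blur ../images/landscape.ppm output.ppm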
The code is not yet parallelised. You should parallelise the
blur_mean function in filters.c, using OpenMP.
What kind of parallelisation is appropriate here? What schedule should
you use?
Solution
The main work is done in a for loop over the output pixels. Since the
output pixels are all independent, we can use a simple #pragma omp parallel for with a static schedule.
code/blur_image/openmp/filters-solution.c implements this scheme.
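In outline (the names here are illustrative rather than copied from filters-solution.c), the change is just a directive on the outer loop:

/* Each output pixel depends only on the read-only input image, so the
 * iterations are independent and can be shared statically. */
#pragma omp parallel for schedule(static)
for (int k = 1; k < h - 1; k++) {
  for (int l = 1; l < w - 1; l++) {
    /* ... 3x3 mean for pixel (k, l), as in the serial version ... */
  }
}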
The code reports when it is running with OpenMP enabled. If you do not
see this, even after parallelisation, check that you enabled the
relevant compiler flags in the Makefile.
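For the Intel compiler the relevant flag is -qopenmp (gcc uses -fopenmp), so the line to check in the Makefile looks something like the following sketch (the variable name in the provided Makefile may differ):

CFLAGS += -qopenmp   # use -fopenmp instead when building with gcc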
Having successfully parallelised the loop we’ll look at the parallel
scaling of the problem.
Question
What type of scaling is the appropriate one to consider here?
Solution
Since the total amount of work is fixed, strong scaling is appropriate. We are interested in how
fast we can produce the final image as we add more processes (to the
same size problem).
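Concretely, if \( T_p \) is the time to blur the image with \( p \) threads, we will plot the speedup
\[
S_p = \frac{T_1}{T_p},
\]
and compare it with the ideal (linear) value \( S_p = p \).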
Exercise
Investigate the parallel scaling by running the code in parallel using
different numbers of threads.
Hint
Remember that you control the number of threads by setting the
OMP_NUM_THREADS environment variable. Don’t forget to do this in
your submission script.
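For example, interactively (the same export line works in a submission script):

$ export OMP_NUM_THREADS=8
$ ./blur ../images/landscape.ppm output.ppm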
Use 1, 2, 4, 6, 8, 16, 24, 32, and 48 threads.
Produce a plot of your results comparing the observed speedup to the
ideal (linear) speedup. What do you observe?
To make a problem that runs for a reasonable amount of time you
probably need to use the large sample image (landscape.ppm). You may
also need to increase the size of the blur filter from the default
n=1 (edit main.c to do this).
Solution
I used a static schedule. I get slightly different speedup behaviour
with n=1 than with n=10, as the graph below shows.
To grab the timing data in a loop, I just did the following
for n in 1 2 4 6 8 16 24 32 48; do
  OMP_NUM_THREADS=$n ./blur ../images/landscape.ppm output.ppm
done | grep Blurring | cut -f 2 -d :
Then I manually copied and plotted with matplotlib.
Strong scaling (speedup) for the OpenMP parallel image blurring code.
Note that with $n=10$, the overall time is much longer than with
$n=1$.
A Hamilton compute node has only 24 cores, so adding more than 24
threads does not help (indeed it harms). For small $n$, more than 8
threads does not really help. I think this is because the memory
bandwidth is maxed out.
Exercise
Having produced a scaling plot for the loop schedule you selected, try repeating
the experiments with different loop schedules. For example, if you
used a static schedule, try a dynamic or guided schedule.
What do you observe?
Can you explain your results by thinking about whether the computational
cost varies depending on which pixel in the image you are
blurring?
Solution
I ran the main loop with schedule(runtime), so that the schedule can be
selected at runtime through the OMP_SCHEDULE environment variable (see
the sketch after the timing loop below), and collected the timings with
for schedule in static static,100 dynamic dynamic,100 guided guided,100; do
  echo $schedule
  for n in 1 2 4 6 8 16 24; do
    OMP_SCHEDULE=$schedule OMP_NUM_THREADS=$n ./blur ../images/landscape.ppm output.ppm
  done | grep Blurring | cut -f 2 -d :
done
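The only source change needed for this is the schedule clause on the directive (sketch, loop body elided):

#pragma omp parallel for schedule(runtime)
for (int k = 1; k < h - 1; k++) {
  /* ... blur every pixel in row k, as before ... */
}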
This time I only ran with up to 24 threads, since we already
determined that using more than that is not very helpful. I also only
used $n=1$ for this case.
I see the following scaling:
Strong scaling (speedup) for different schedules.
This time it looks like the guided schedule is best, but this might be
misleading, since each line is normalised to itself.
We can replot these data, producing the speedup curve relative to the
best single-thread performance over all schedules. Let’s see what that
looks like:
Strong scaling relative to best single-thread schedule (speedup) for different schedules.
In this case, it looks like the static schedule is now a bit worse. I
am not sure exactly what is going on, and one would need to do more
detailed investigation and profiling to find out.