Simple MPI parallelism #
In this exercise we’re going to compute an approximation to the value of π using a simple Monte Carlo method. We do this by noticing that if we randomly throw darts at a square, the fraction of them that land inside the inscribed circle approaches π/4.
Consider a square with side-length \(2r\) and an inscribed circle with radius \(r\).
The ratio of areas is
$$ \frac{A_\text{circle}}{A_\text{square}} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4}. $$
If we therefore draw \(X\) uniformly at random from the distribution \( \mathcal{U}(0, r) \times \mathcal{U}(0, r)\), then the probability that \(X\) is in the circle is
$$ p_\text{in} = \frac{\pi}{4}. $$
We can therefore approximate π by picking \(N_\text{total}\) random points and counting the number, \(N_\text{circle}\), that fall within the circle
$$ \pi_\text{numerical} = 4 \frac{N_\text{circle}}{N_\text{total}} $$
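For concreteness, here is a minimal serial sketch of this estimator. It is not the provided code (the function name estimate_pi and the use of the C library generator rand are just illustrative), but it shows the counting loop the formula above describes.

#include <stdio.h>
#include <stdlib.h>

/* Estimate pi by sampling points uniformly in the unit square
 * [0, 1] x [0, 1] and counting those that land inside the quarter
 * circle of radius 1 centred on the origin. */
double estimate_pi(long ntotal)
{
    long ncircle = 0;
    for (long i = 0; i < ntotal; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) {
            ncircle++;
        }
    }
    return 4.0 * (double)ncircle / (double)ntotal;
}

int main(int argc, char **argv)
{
    long ntotal = (argc > 1) ? atol(argv[1]) : 1000000;
    srand(42);
    printf("pi is approximately %g\n", estimate_pi(ntotal));
    return 0;
}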
Obtaining the code #
The code for this exercise is in the repository in the code/calculate_pi/ subdirectory.
Work from the repository main branch again and create a new branch for this exercise.
The code contains two subdirectories. We’ll be working in the serial subdirectory.
$ cd code/calculate_pi/serial
$ ls
Makefile calculate_pi.c main.c proto.h
Load the relevant Intel compiler modules and then build the code with make. It can be run with ./calc_pi N, where N is the total number of random points.
Convergence #
Exercise
Run the code for different choices of N and plot the error as a function of N. What relationship do you observe between the accuracy of the approximate result and N?
Hint
You can execute a sequence of commands in your batch script (just make sure to allocate enough time in the queue for them all), rather than running each one in its own job.
Solution
I get, as expected, an error that decreases approximately as \(1/\sqrt{N}\).
Parallelisation with MPI #
We’re now going to parallelise this computation with MPI. The first thing to do is to load the appropriate module to gain access to the MPI compilers.
$ module load intelmpi/intel/2018.2
$ module load intel_mpi/2018
Initialising MPI #
Remember that the first thing every MPI program should do is initialise MPI with MPI_Init, and the last thing it should do is finalise MPI with MPI_Finalize.
So do that in the main function (in main.c).
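Schematically (the actual main.c will contain more than this), the structure you are aiming for is:

#include <mpi.h>

int main(int argc, char **argv)
{
    /* Initialise MPI before doing anything else. */
    MPI_Init(&argc, &argv);

    /* ... the rest of the program goes here ... */

    /* Finalise MPI as the last thing we do. */
    MPI_Finalize();
    return 0;
}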
Exercise
Add code in main to determine the size of MPI_COMM_WORLD as well as the rank of the current process, and then print the results out on every process.
Now compile and then run the code with two processes using mpirun. Does what you observe make sense?
Solution
At this point, your code just runs the same calculations on multiple processes (with the same random numbers). So you should have seen multiple prints of the same output.
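A minimal sketch of the size/rank query, to go between MPI_Init and MPI_Finalize (the variable names are illustrative):

int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Every process prints its own rank and the total number of
 * processes in MPI_COMM_WORLD. */
printf("Hello from rank %d of %d\n", rank, size);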
Parallelising the random number generation #
So far all we’ve done is take a serial program and parallelise it by running the same calculations multiple times. This is not very useful.
The first thing to do is to ensure that the different processes use different random numbers. These are generated using the C standard library’s pseudo-random number generator. The initial state is seeded in calculate_pi.
Exercise
Modify the code so that the seed depends on which process is generating the random numbers.
Hint
The rank of a process is a unique identifier.
Run again on two processes: do you now see that the results are different depending on the process?
Solution
If you change the call srand(42) to srand(rank), then different processes will produce different random numbers. Now when running in parallel you should see (slightly) different results on the different processes.
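For example (a sketch; exactly where the seeding happens depends on how calculate_pi is structured):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Each process seeds the C library generator with its own rank, so
 * the processes draw different (though not statistically
 * independent) random numbers. */
srand((unsigned int)rank);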
Note: parallel random numbers
The approach we use here does not produce statistically uncorrelated random number streams. This does not really matter for our didactic purposes; it just means that the effective number of Monte Carlo samples is lower than what we specify.
If you need truly independent random number streams, then the different approaches described here give more information on how to achieve it.
Dividing the work #
Now that we have different processes using different seeds, we need to divide up the work such that we take N samples in total (rather than N samples on each process).
Modify the calculate_pi function such that the samples are (reasonably) evenly divided between all the processes. After this you’re producing a partial result on every process.
Finally, combine these partial results to produce the full result by summing the number of points found to be in the circle across all processes.
Hint
Remember that you wrote a function in the ring reduction exercise to add up partial values from all the processes.
Alternately, you may find the function MPI_Allreduce useful.
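As a sketch of how the pieces might fit together inside calculate_pi (assuming it has access to the total sample count N and a communicator comm; all variable names here are illustrative):

int size, rank;
MPI_Comm_size(comm, &size);
MPI_Comm_rank(comm, &rank);

/* Each process takes roughly N/size samples, with the remainder
 * handled one each by the lowest-numbered ranks. */
long nlocal = N / size + ((N % size) > rank);

/* Count the local hits. */
long ncircle_local = 0;
for (long i = 0; i < nlocal; i++) {
    double x = (double)rand() / RAND_MAX;
    double y = (double)rand() / RAND_MAX;
    if (x * x + y * y <= 1.0) {
        ncircle_local++;
    }
}

/* Sum the partial counts so that every process ends up with the
 * total number of points inside the circle. */
long ncircle_total = 0;
MPI_Allreduce(&ncircle_local, &ncircle_total, 1, MPI_LONG, MPI_SUM, comm);

double pi = 4.0 * (double)ncircle_total / (double)N;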
Question
Test your code running on 1, 2, 4, 8, and 16 cores. Produce a plot of the runtime as a function of the number of cores.
What observations can you make?
Solution
If you did not manage, or you want to compare with a different implementation, the directory code/calculate_pi/mpi contains a parallel implementation.
You should see close to ideal speedup: the runtime decreases roughly in inverse proportion to the number of processes.
When I do this, I observe the following plot
This is more or less expected: the only communication is a single reduction at the end, so the major bottleneck is how fast the random points can be generated.
Advice when writing MPI programs #
It is good practice when using MPI for every collective function to receive the communicator it operates on explicitly as an argument.
If you find yourself explicitly referring to one of the default communicators (e.g. MPI_COMM_WORLD) in a function, think about how to redesign the code so that you pass the communicator as an argument.
For example, rather than writing something like:
void divide_work(int N, int *nlocal)
{
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Hand out evenly sized chunks with the leftover
 * (N - (N/size)*size) going one each to the
 * lowest-numbered processes */
*nlocal = N / size + ((N % size) > rank);
}
Instead, prefer:
void divide_work(MPI_Comm comm, int N, int *nlocal)
{
int size, rank;
MPI_Comm_size(comm, &size);
MPI_Comm_rank(comm, &rank);
/* Hand out evenly sized chunks with the leftover
 * (N - (N/size)*size) going one each to the
 * lowest-numbered processes */
*nlocal = N / size + ((N % size) > rank);
}
This way, if your code is changed to use different communicators, it will work transparently.
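For example, a sketch of the corresponding call site (assuming N holds the total number of samples):

int nlocal;
/* MPI_COMM_WORLD is referenced only here, at the outermost level;
 * divide_work itself uses whatever communicator it is given. */
divide_work(MPI_COMM_WORLD, N, &nlocal);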