In this exercise we’re going to compute an approximation to the value
of π using a simple Monte Carlo method. We do this by noticing that if
we randomly throw darts at a square, the fraction that fall within the
inscribed circle approaches π/4.
Consider a square with side-length \(2r\) and an inscribed circle
with radius \(r\).
By symmetry it is enough to consider a single quadrant of the square.
If we draw \(X\) uniformly at random from the distribution
\( \mathcal{U}(0, r) \times \mathcal{U}(0, r)\), then the
probability that \(X\) is in the circle is
$$
p_\text{in} = \frac{\pi}{4}.
$$
We can therefore approximate π by picking \(N_\text{total}\) random
points and counting the number, \(N_\text{circle}\), that fall within
the circle.
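Since \(p_\text{in} = \pi/4\), the corresponding estimate is
$$
\pi \approx 4 \, \frac{N_\text{circle}}{N_\text{total}}.
$$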
The code for this exercise is in the repository in the
code/calculate_pi/ subdirectory.
Working from the repository
If you have committed your changes on branches for the previous
exercises, just check out the main branch again and create a new
branch for this exercise.
The code contains two subdirectories. We’ll be working in the serial
subdirectory.
$ cd code/calculate_pi/serial
$ ls
Makefile calculate_pi.c main.c proto.h
Load the relevant Intel compiler modules and then build the code with
make. It can be run with ./calc_pi N where N is the total number
of random points.
Run the code for different choices of N and plot the error as a
function of N.
What relationship do you observe between the accuracy of the
approximate result and N?
Hint
You can execute a sequence of commands in your batch script (just make
sure to allocate enough time in the queue for them all), rather than
running each one in its own job.
Solution
I get, as expected, approximately $1/\sqrt{N}$ convergence: the error
decreases in proportion to $1/\sqrt{N}$, so quadrupling $N$ roughly
halves the error.
Remember that the first thing every MPI program should do is to
initialise MPI with MPI_Init, and the final thing it should do is
finalise MPI with MPI_Finalize.
So do that in the main function (in main.c).
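As a sketch (the repository’s actual main.c will contain more, for
example argument handling and the call into calculate_pi), the
skeleton after this step might look like:

#include <mpi.h>

int main(int argc, char **argv)
{
    /* Initialise MPI before any other MPI call */
    MPI_Init(&argc, &argv);

    /* ... existing serial work goes here ... */

    /* Finalise MPI as the very last MPI-related step */
    MPI_Finalize();
    return 0;
}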
Exercise
Add code in main to determine the size of MPI_COMM_WORLD (with
MPI_Comm_size) as well as the rank of the current process (with
MPI_Comm_rank), and then print the results out on every process.
Now compile and then run the code with two processes using mpirun.
Does what you observe make sense?
Solution
At this point, your code just runs the same calculations on multiple
processes (with the same random numbers). So you should have seen
multiple prints of the same output.
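For reference, here is a minimal standalone sketch of the kind of
addition the exercise asks for (the variable names and message are
illustrative, not the repository’s code):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id: 0, ..., size-1 */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

With mpirun -n 2 you should see two such lines, one per process, in no
particular order.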
So far all we’ve done is take a serial program and parallelise it by
running the same calculations multiple times. This is not very useful.
The first thing to do is to ensure that the different processes use
different random numbers. These are generated using the C standard
library’s pseudo-random number generator. The initial state is
seeded in calculate_pi.
Exercise
Modify the code so that the seed depends on which process is
generating the random numbers.
Hint
The rank of a process is a unique identifier.
Run again on two processes; do you now see that the results are
different depending on the process?
Solution
If you change the call srand(42) to srand(rank), then different
processes will produce different random numbers. Now, when running in
parallel, you should see (slightly) different results on the different
processes.
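In context the change might look something like the following sketch
(not the repository’s exact code; here the rank is queried from
MPI_COMM_WORLD directly, although the note at the end of this page
argues for passing the communicator in explicitly):

#include <stdlib.h>
#include <mpi.h>

/* Seed the C library's PRNG differently on every process. */
void seed_rng(void)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Previously srand(42); now every rank gets its own seed. */
    srand((unsigned int)rank);
}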
Note: parallel random numbers
The approach we use here does not produce statistically uncorrelated
random number streams. This does not really matter for our didactic
purposes; it just means that the effective number of Monte Carlo
samples is lower than what we specify.
If you need truly independent random number streams, then the
different approaches described
here give more
information on how to achieve this.
Now that we have different processes using different seeds, we need to
divide up the work such that we take N samples in total (rather than
N samples on each process).
Exercise
Modify the calculate_pi function such that the samples are
(reasonably) evenly divided between all the processes. After this
you’re producing a partial result on every process.
Finally, combine these partial results to produce the full result by
summing the number of points found to be in the circle across all
processes.
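To make the structure concrete, here is a sketch (not the repository’s
code) of the kind of counting loop calculate_pi contains, now run with
only the local number of samples, nlocal, on each process:

#include <stdlib.h>

/* Count how many of nlocal uniformly random points in the unit square
 * fall inside the quarter circle of radius 1. */
long count_in_circle(long nlocal)
{
    long n_circle = 0;
    for (long i = 0; i < nlocal; i++) {
        double x = (double)rand() / RAND_MAX;  /* uniform in [0, 1] */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) {
            n_circle++;
        }
    }
    return n_circle;
}

Each process now holds only a partial count, which still needs to be
combined across all processes.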
Hint
Remember that you wrote a function in the ring reduction exercise to add up partial values from all the
processes.
Alternatively, you may find the function
MPI_Allreduce useful.
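As a sketch of the MPI_Allreduce option (illustrative names; the count
is assumed to be held in a long, and comm would be MPI_COMM_WORLD in
the simplest case):

#include <mpi.h>

/* Sum the per-process counts of points inside the circle so that
 * every process ends up with the global total. */
long combine_counts(MPI_Comm comm, long n_circle_local)
{
    long n_circle_total = 0;
    MPI_Allreduce(&n_circle_local, &n_circle_total, 1, MPI_LONG,
                  MPI_SUM, comm);
    return n_circle_total;
}

Every process can then form the estimate as
4.0 * n_circle_total / N, where N is the total number of samples.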
Question
Test your code running on 1, 2, 4, 8, and 16 cores. Produce a plot of
the runtime as a function of the number of cores.
What observations can you make?
Solution
If you did not manage to finish, or you want to compare with a
different implementation, the directory code/calculate_pi/mpi contains
a parallel implementation.
You should see that the runtime decreases almost in inverse proportion
to the number of processes: doubling the process count roughly halves
the runtime.
When I do this, I observe the following plot
Scaling of parallel calculation of $\pi$
This is expected, because there is essentially no communication (only
a single reduction at the end) and the major bottleneck is how fast
the random points can be generated.
It is good practice when using MPI that every function which performs
collective operations explicitly receives the communicator as an
argument.
If you find yourself explicitly referring to one of the default
communicators (e.g. MPI_COMM_WORLD) in a function, think about how
to redesign the code so that you pass the communicator as an argument.
For example, rather than writing something like:
void divide_work(int N, int *nlocal)
{
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Hand out evenly sized chunks with the leftover
     * (N - (N/size)*size) being split one-per-process among the
     * lower-numbered processes */
    *nlocal = N / size + ((N % size) > rank);
}
Instead prefer
void divide_work(MPI_Comm comm, int N, int *nlocal)
{
    int size, rank;
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);
    /* Hand out evenly sized chunks with the leftover
     * (N - (N/size)*size) being split one-per-process among the
     * lower-numbered processes */
    *nlocal = N / size + ((N % size) > rank);
}
This way, if your code is changed to use different communicators, it
will work transparently.