Simple MPI parallelism #
In this exercise we’re going to compute an approximation to the value of π using a simple Monte Carlo method. We do this by noticing that if we randomly throw darts at a square, the fraction of them that land inside the inscribed circle approaches π/4.
Consider a square with side-length \(2r\) and an inscribed circle with radius \(r\).
The ratio of areas is
$$ \frac{A_\text{circle}}{A_\text{square}} = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4}. $$
If we therefore draw \(X\) uniformly at random from the distribution \( \mathcal{U}(0, r) \times \mathcal{U}(0, r)\), then the probability that \(X\) is in the circle is
$$ p_\text{in} = \frac{\pi}{4}. $$
We can therefore approximate π by picking \(N_\text{total}\) random points and counting the number, \(N_\text{circle}\), that fall within the circle
$$ \pi_\text{numerical} = 4 \frac{N_\text{circle}}{N_\text{total}} $$
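For concreteness, here is a minimal serial sketch of this estimator. It is not the provided code (the function name estimate_pi and the use of the C library generator rand are just illustrative), but it shows the counting loop the formula above describes.

#include <stdio.h>
#include <stdlib.h>

/* Estimate pi by sampling points uniformly in the unit square
 * [0, 1] x [0, 1] and counting those that land inside the quarter
 * circle of radius 1 centred on the origin. */
double estimate_pi(long ntotal)
{
    long ncircle = 0;
    for (long i = 0; i < ntotal; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0) {
            ncircle++;
        }
    }
    return 4.0 * (double)ncircle / (double)ntotal;
}

int main(int argc, char **argv)
{
    long ntotal = (argc > 1) ? atol(argv[1]) : 1000000;
    srand(42);
    printf("pi is approximately %g\n", estimate_pi(ntotal));
    return 0;
}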
Obtaining the code #
The code for this exercise is in the repository in the code/calculate_pi/ subdirectory.
Work from the repository main branch again and create a new branch for this exercise.
The code contains two subdirectories. We’ll be working in the serial subdirectory.
$ cd code/calculate_pi/serial
$ ls
Makefile calculate_pi.c main.c proto.h
Load the relevant Intel compiler modules and then build the code with make. It can be run with ./calc_pi N, where N is the total number of random points.
Convergence #
Exercise
Run the code for different choices of N and plot the error as a function of N. What relationship do you observe between the accuracy of the approximate result and N?
Hint
You can execute a sequence of commands in your batch script (just make sure to allocate enough time in the queue for them all), rather than running each one in its own job.
Solution
I get, as expected, an error that decreases approximately as \(1/\sqrt{N}\).
Parallelisation with MPI #
We’re now going to parallelise this computation with MPI. The first thing to do is to load the appropriate module to gain access to the MPI compilers.
$ module load intelmpi/intel/2018.2
$ module load intel_mpi/2018
Initialising MPI #
Remember that the first thing every MPI program should do is initialise MPI with MPI_Init, and the last thing it should do is finalise MPI with MPI_Finalize.
So do that in the main function (in main.c).
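Schematically (the actual main.c will contain more than this), the structure you are aiming for is:

#include <mpi.h>

int main(int argc, char **argv)
{
    /* Initialise MPI before doing anything else. */
    MPI_Init(&argc, &argv);

    /* ... the rest of the program goes here ... */

    /* Finalise MPI as the last thing we do. */
    MPI_Finalize();
    return 0;
}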
Exercise
Add code in main to determine the size of MPI_COMM_WORLD as well as the rank of the current process, and then print the results out on every process.
Now compile and then run the code with two processes using mpirun. Does what you observe make sense?
Solution
At this point, your code just runs the same calculations on multiple processes (with the same random numbers). So you should have seen multiple prints of the same output.
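A minimal sketch of the size/rank query, to go between MPI_Init and MPI_Finalize (the variable names are illustrative):

int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Every process prints its own rank and the total number of
 * processes in MPI_COMM_WORLD. */
printf("Hello from rank %d of %d\n", rank, size);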
Parallelising the random number generation #
So far all we’ve done is take a serial program and parallelise it by running the same calculations multiple times. This is not very useful.
The first thing to do is to ensure that the different processes use different random numbers. These are generated using the C standard library’s pseudo-random number generator. The initial state is seeded in calculate_pi.
Exercise
Modify the code so that the seed depends on which process is generating the random numbers.
Hint
The rank of a process is a unique identifier.
Run again on two processes: do you now see that the results are different depending on the process?
Solution
If you change the call srand(42) to srand(rank), then different processes will produce different random numbers. Now when running in parallel you should see (slightly) different results on the different processes.
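For example (a sketch; exactly where the seeding happens depends on how calculate_pi is structured):

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Each process seeds the C library generator with its own rank, so
 * the processes draw different (though not statistically
 * independent) random numbers. */
srand((unsigned int)rank);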
Note: parallel random numbers
The approach we use here does not produce statistically uncorrelated random number streams. This does not really matter for our didactic purposes; it just means that the effective number of Monte Carlo samples is lower than what we specify.
If you need truly independent random number streams, then the different approaches described here give more information on how to achieve it.
Dividing the work #
Now that we have different processes using different seeds, we need to divide up the work such that we take N samples in total (rather than N samples on each process).
Modify the calculate_pi function such that the samples are (reasonably) evenly divided between all the processes. After this you’re producing a partial result on every process.
Finally, combine these partial results to produce the full result by summing the number of points found to be in the circle across all processes.
Hint
Remember that you wrote a function in the ring reduction exercise to add up partial values from all the processes.
Alternately, you may find the function MPI_Allreduce useful.
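As a sketch of how the pieces might fit together inside calculate_pi (assuming it has access to the total sample count N and a communicator comm; all variable names here are illustrative):

int size, rank;
MPI_Comm_size(comm, &size);
MPI_Comm_rank(comm, &rank);

/* Each process takes roughly N/size samples, with the remainder
 * handled one each by the lowest-numbered ranks. */
long nlocal = N / size + ((N % size) > rank);

/* Count the local hits. */
long ncircle_local = 0;
for (long i = 0; i < nlocal; i++) {
    double x = (double)rand() / RAND_MAX;
    double y = (double)rand() / RAND_MAX;
    if (x * x + y * y <= 1.0) {
        ncircle_local++;
    }
}

/* Sum the partial counts so that every process ends up with the
 * total number of points inside the circle. */
long ncircle_total = 0;
MPI_Allreduce(&ncircle_local, &ncircle_total, 1, MPI_LONG, MPI_SUM, comm);

double pi = 4.0 * (double)ncircle_total / (double)N;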
Question
Test your code running on 1, 2, 4, 8, and 16 cores. Produce a plot of the runtime as a function of the number of cores.
What observations can you make?
Solution
If you did not manage, or you want to compare with a different implementation, the directory code/calculate_pi/mpi contains a parallel implementation.
You should see close to ideal speedup: the runtime decreases roughly in inverse proportion to the number of processes.
When I do this, I observe the following plot
This is more or less expected: the only communication is a single reduction at the end, so the major bottleneck is how fast the random points can be generated.
Advice when writing MPI programs #
It is good practice when using MPI for every collective function to receive the communicator it operates on explicitly as an argument.
If you find yourself explicitly referring to one of the default communicators (e.g. MPI_COMM_WORLD) in a function, think about how to redesign the code so that you pass the communicator as an argument.
For example, rather than writing something like:
void divide_work(int N, int *nlocal)
{
int size, rank;
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
/* Hand out evenly sized chunks with the leftover
 * (N - (N/size)*size) going one each to the
 * lowest-numbered processes */
*nlocal = N / size + ((N % size) > rank);
}
Instead, prefer:
void divide_work(MPI_Comm comm, int N, int *nlocal)
{
int size, rank;
MPI_Comm_size(comm, &size);
MPI_Comm_rank(comm, &rank);
/* Hand out evenly sized chunks with the leftover
 * (N - (N/size)*size) going one each to the
 * lowest-numbered processes */
*nlocal = N / size + ((N % size) > rank);
}
This way, if your code is changed to use different communicators, it will work transparently.
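For example, a sketch of the corresponding call site (assuming N holds the total number of samples):

int nlocal;
/* MPI_COMM_WORLD is referenced only here, at the outermost level;
 * divide_work itself uses whatever communicator it is given. */
divide_work(MPI_COMM_WORLD, N, &nlocal);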