Measuring point-to-point message latency with ping-pong
In this exercise we will write a simple code that does a message
ping-pong: sending a message back and forth between two processes.
We can use this to measure both the latency and bandwidth of the
network on our supercomputer. These are both important measurements
when we're looking at potential parallel performance: they help us to
decide whether our code is running slowly because of our own bad
choices, or because of limitations in the hardware.
I provide a template, mpi/ping-pong/ping-pong.c, in the code/mpi/ping-pong
subdirectory of the repository, which you can compile
with mpicc. It takes one argument: the size of the message (in
bytes) to exchange.
You should implement the ping_pong function, which should send a
message of the given size from rank 0 to rank 1, after which rank 1
should send the same message back to rank 0. Ensure that the code also
works with more than two processes (all other ranks should just do
nothing).
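A minimal sketch of how this might look is below; the function signature and the use of an MPI_BYTE buffer are assumptions, since they depend on how the template is structured:

```c
#include <mpi.h>

/* Sketch: bounce a message of nbytes between ranks 0 and 1. */
void ping_pong(void *buffer, int nbytes, MPI_Comm comm)
{
  int rank;
  MPI_Comm_rank(comm, &rank);
  if (rank == 0) {
    MPI_Send(buffer, nbytes, MPI_BYTE, 1, 0, comm);
    MPI_Recv(buffer, nbytes, MPI_BYTE, 1, 0, comm, MPI_STATUS_IGNORE);
  } else if (rank == 1) {
    MPI_Recv(buffer, nbytes, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
    MPI_Send(buffer, nbytes, MPI_BYTE, 0, 0, comm);
  }
  /* Every other rank does nothing, so the code works for any
     number of processes >= 2. */
}
```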
Add timing around the ping_pong call to determine how long it takes
to send these messages.
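A sketch of the timing, assuming rank, buffer, and nbytes are set up as in the sketch above, could be:

```c
int nrepeats = 100;           /* repeat to smooth out timer noise */
double start = MPI_Wtime();
for (int i = 0; i < nrepeats; i++) {
  ping_pong(buffer, nbytes, MPI_COMM_WORLD);
}
double elapsed = (MPI_Wtime() - start) / nrepeats;
if (rank == 0) {
  printf("%d bytes: %g s per ping-pong\n", nbytes, elapsed); /* needs <stdio.h> */
}
```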
Run your code on the Hamilton compute nodes for a range of message
sizes from one byte to 64MB.
Produce a plot of the time to send a message as a function of message
size.
Using
numpy.polyfit
(or your favourite linear regression scheme), fit our proposed model
to your data.
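For reference, the model being fitted here is (presumably, from earlier in the course) the usual latency–bandwidth model $T(n) \approx \alpha + \beta n$, where $T(n)$ is the time to send a message of $n$ bytes, $\alpha$ is the latency, and $\beta$ is the inverse bandwidth. Since this is linear in $n$, a degree-one fit (for example numpy.polyfit(sizes, times, 1), with your arrays of message sizes and measured times) gives $\beta$ as the slope and $\alpha$ as the intercept.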
What values of $\alpha$ and $\beta$ do you get?
Solution
I provide a sample solution in
mpi/ping-pong/ping-pong-solution.c. When I run it on one node
using messages ranging from 1 byte to 16MB (in powers of 2), I get
something like the plot below.
Ping-pong time and model fit on Hamilton 6 (par6.q queue), the older hardware.
It looks like a piecewise linear model is right for this MPI
implementation. Between 512KB and 1MB, the time jumps up. This is
probably the point at which the implementation switches message
protocols (likely from an eager to a rendezvous protocol).
I get a latency of $\alpha_\text{intra} \approx 500\text{ns}$, and an inverse
bandwidth of $\beta \approx 4.5\times 10^{-10}\text{s/byte}$ or a
bandwidth of around $2\text{GB/s}$.
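As a quick sanity check on those numbers, the bandwidth is just the reciprocal of the fitted inverse bandwidth: $1/\beta \approx 1/(4.5\times 10^{-10}\,\text{s/byte}) \approx 2.2\times 10^{9}\,\text{bytes/s} \approx 2\,\text{GB/s}$.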
Question
Perform the same experiment, but this time, place the two processes on
different Hamilton compute nodes. Do you observe a difference in the
performance?
To do this, you'll need to write a SLURM batch script that specifies:

```
# Two nodes
#SBATCH --nodes=2
# One process per node
#SBATCH --ntasks-per-node=1
```
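For example, a complete batch script might look something like the following (the job name, time limit, and launcher invocation are assumptions; adapt them to however you normally run MPI jobs on Hamilton):

```sh
#!/bin/bash
#SBATCH --job-name=ping-pong
# Two nodes
#SBATCH --nodes=2
# One process per node
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:10:00   # assumed time limit

# The launcher is an assumption: use srun or mpirun, whichever your
# Hamilton MPI environment provides. The single argument is the
# message size in bytes (here 1MB).
mpirun -n 2 ./ping-pong 1048576
```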
Solution
If I do this, I see that the inter-node latency on Hamilton 6 is
pretty bad, although asymptotically it seems like the bandwidth is the
same as in the intra-node case. This time I ran out to messages of 64MB. The
slow message at 32MB appears to be repeatable, but I don’t understand
what is going on.
Notice that when going across nodes, the switch in protocol happens at
a lower size (it looks like 256KB, rather than 1MB).
The plot below also shows results for Hamilton 7, which performs somewhat
better.
Ping pong latency on Hamilton 6 and Hamilton 7.
If I fit our linear model to the inter-node data, I get
$\alpha_\text{inter} \approx 6\mu\text{s} \approx 10
\alpha_\text{intra}$. The inverse bandwidth is about the same,
resulting in a network bandwidth of around $2\text{GB/s}$.
Fitting the model to the Hamilton 7 data, the intra-node latency is
still around $500\text{ns}$, but now the asymptotic bandwidth is
around $6.5\text{GB/s}$, so the network is much better inside a node.
However, we see that between nodes, it performs in a similar manner to
the Hamilton 6 case.
This section is optional, but possibly interesting.
Solution
Solutions for this section are left as an exercise.
One thing that can affect the performance of real MPI codes is the
message latency, and in particular whether it varies from message to
message. Such variability might be caused by other processes that
happen to be using the network, by our own code, or by
operating-system-level noise. We'll see if we can observe any of it on
Hamilton.
Modify your code so that rather than just timing many ping-pong
iterations, it records the time for each of the many iterations
separately.
Use this information to compute the mean ping-pong time, along with
the standard deviation and the minimum and maximum times.
Hint
You can allocate an array for your timing data with
```c
int nrepeats = 1000; /* Or some appropriate number */
double *timing = malloc(nrepeats * sizeof(*timing));
...;
for (int i = 0; i < nrepeats; i++) {
  start = MPI_Wtime();
  ping_pong(...);
  end = MPI_Wtime();
  timing[i] = end - start;
}
...;
/* Compute statistics */
free(timing); /* Don't forget to release the memory! */
```
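From that array, the statistics can be computed with a couple of passes over the data, for example (a sketch, reusing the variable names from the hint above):

```c
double total = 0, tmin = timing[0], tmax = timing[0];
for (int i = 0; i < nrepeats; i++) {
  total += timing[i];
  if (timing[i] < tmin) tmin = timing[i];
  if (timing[i] > tmax) tmax = timing[i];
}
double mean = total / nrepeats;

/* Standard deviation needs a second pass once the mean is known. */
double sumsq = 0;
for (int i = 0; i < nrepeats; i++) {
  double diff = timing[i] - mean;
  sumsq += diff * diff;
}
double stddev = sqrt(sumsq / nrepeats);   /* needs #include <math.h> */
```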
Produce a plot of these data, using the standard deviation as error
bars and additionally showing the minimum and maximum times as
outliers.
Question
What, if any, variability do you observe?
Does it change if you move from a single compute node to two nodes?
Finally, we'll look at whether having more messages "in flight" at
once affects performance.
Rather than running with two processes, you should run with full
compute nodes (24 processes per node).
Modify your ping-pong code so that all ranks participate in pairwise
messaging.
Divide the processes into a “bottom” and “top” half. Suppose we are
using size processes in total. Processes with rank < size/2 are in
the “bottom” half, the remainder are in the “top” half.
A process in the bottom half should send a message to its matching
partner in the top half (rank + size/2); that process should then
return the message (to rank - size/2).
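A sketch of that pairing logic, reusing the assumed buffer and nbytes from before, might look like this (it assumes an even number of processes, which is the case when using full nodes):

```c
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

if (rank < size / 2) {
  /* "Bottom" half: send to the matching rank in the top half, then
     wait for the message to come back. */
  int partner = rank + size / 2;
  MPI_Send(buffer, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
  MPI_Recv(buffer, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
} else {
  /* "Top" half: receive from the matching rank in the bottom half,
     then return the message. */
  int partner = rank - size / 2;
  MPI_Recv(buffer, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
  MPI_Send(buffer, nbytes, MPI_BYTE, partner, 0, MPI_COMM_WORLD);
}
```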
Again measure time and variability, and produce a plot.