strace is Your Friend
strace is one of those tools that admins use on their systems to track down problems. Typically it is used to debug or troubleshoot problems in *nix operating systems. But strace is one of those tools that is multi-purpose. In this, my first blog, I'll show how you can use strace to get started on examining the IO pattern of your serial codes. In subsequent blogs, I'll show how how to use strace to examine the IO pattern of your MPI codes using the same basic methods.

With HPC codes, it's always better to have more information about the behavior of the code than too little. Knowing the behavior of your code means knowing how it scales, if it's memory bandwidth intensive, if it's floating point intensive, how integer intensive is it, how much message passing is done, what kind of messages are passed, what size messages are passed, and on and on. With this information you can then start to configure your cluster to match your code.

Configuring or designing a cluster to match your code is a very different approach than in the past. With previous high performance systems, you had to modify your code to match the machine. This meant that every time a new series came out, you had to modify and tune your code, which is a long and laborious process. But now, clusters give you so many options that you can tune the hardware and the software to match your code. However, to do this, you need to know your code(s) very well.

One aspect of knowing your code that many people either miss, skip, or forget, is the IO pattern. Many times the reason they don't know is that they are not quite sure how to determine the IO pattern and develop IO requirements from them. In some cases, they don't even know how different IO rates affect the run time of their code. In this and the next several blogs, I hope to give you some a basic tool and tips to examine your IO patterns and develop several basic metrics from them. The tool I will be using is strace.

I'm sure there are some people who already know how to use strace to examine the IO of a code, but for those who don't know how, let's walk through a simple example. In this column, I'm going to start with a serial code. Sure, it's a simple serial code, but I want to show you how you can use strace to examine the IO pattern of your code and also how you can derive some basic metrics. In the next blog, I'll show how to use what we have learned in this blog and apply it to MPI codes.


Simple Serial Example
The code below is a simple C code that just creates an array of floats and writes it to a file. Sure it's a really simple code, but I want to show you how to use strace to look at the IO pattern of the code.


#include <stdio.h>
#define BUFSIZE 100000
int main(int argc, int *argv[] )
{
int i;
float buf[BUFSIZE];
char filename[128];
FILE *myfile;

for (i=0; i < BUFSIZE; i++)
bufIdea = 2.5 + BUFSIZE * i;
sprintf(filename, "testfile");
myfile = fopen(filename, "w");
fwrite(buf, sizeof(float), BUFSIZE, myfile);
fclose(myfile);
}

I compiled the code using gcc on my home box which has a simple software RAID-1 device with a couple of 40GB Seagate drives and uses ext3. We will use strace to run the resulting binary. In general, strace is used to capture the function calls from the binary. This includes calls to IO functions such as open, llseek, write, read, and close. The command line I will use is,


strace -tt -o ./serial.strace ./serial


Notice that I named the binary serial.
I used two options with strace. The first option, -tt gives microsecond timings for all function calls. The second option, -o serial.strace, sends the strace output to a file called serial.strace. The strace output is too long for this blog, but I will show you a few lines from the output and how you can derive some metrics from it.


Analyzing the strace Output
The IO pattern for the code is really simple. It just opens a file, writes the floating data to the file and then closes the data. It's pretty simple, but in this case, there are actually two writes. The first one writes about 397KB of data and the second writes only about 2.7KB. So we have a reasonably large write followed by a very small write. More complicated codes will have different IO patterns (more on that in future blogs). Let's take a closer look at the IO function calls.

We can ignore most of the stuff in the output since we're really interested in the IO part of it. The relevant IO portion(s) of the output are below:


14:47:03.559980 open("testfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
14:47:03.560250 write(3, "\0\0 @@Q\303G\240PCHP|\222HPP\303HP$\364H(|\22I(\346*I"..., 397312) = 397312
14:47:03.562485 write(3, "\0\25\240N\r\30\240N\33\33\240N(\36\240N5!\240NB$\240N"..., 2688) = 2688
14:47:03.562613 close(3)



The first line that I've listed opens the file. The next two lines are important because they actually write out the data. You can find out what the function write does by looking in the man pages, but you may have to use the command man 2 write to get the correct man pages. The function write writes data to a file descriptor (basically a file). You can see the first few bits of data that are being written inside the quotations in the strace output. For this example, the data is binary so the data doesn't much sense. But if the data is text or non-binary data, then you might be able to read the first part of the data. strace does not list all of the data being written to the file. The very last number, after the equals sign, is the number of bytes actually written to the file descriptor. For this example, there are two writes. The first one writes 397,312 bytes and the second writes 2,688 bytes. The total of the fwrite functions is 400,000 bytes (as it should be).

The fwrite function calls in the strace output file also contain more useful information. The time that the particular function begins is printed before the function. I used the -tt option which gives the time including microseconds. If we look at the time for the next function (the next line in the output), we can take the difference between the times to get the elapsed time to perform the previous function. For this example, it is the amount of time it took to perform the fwrite function. Usually it's good enough to look at the seconds and microseconds to determine the amount of time for the write or read function. In this case the elapsed time to perform the first fwrite function is,


03.562485 - 03.560250 = 0.002235 seconds (2235 microseconds)


We also know that the amount of data is 397,312 bytes. So the resulting IO rate is,


397,312 bytes / 0.002235 seconds = 1.778x10^8 bytes/sec


This is the same as 178 MB/s assuming that a MB is 1,000,000 bytes.



The amount of time the second fwrite function took is,


03.562613 - 03.562485 seconds = 0.000128 seconds (128 microseconds)


The amount of data written is 2,688 bytes so the IO rate is,


2,688 bytes / 0.000128 seconds = 2.1x10^7 bytes/sec


This is the same as 21 MB/s (a bit slower). Now you can go through and compute the IO rate for all read and write functions in your strace output! (Doesn't this cry out for automation?).


Observations
Let's compare the IO rates for the two fwrite functions. The first fwrite writes out a reasonable amount of data - 397KB. But the second fwrite only writes about 2.7KB, a much smaller amount of data. Now compare the two IO rates. The first IO rate is about 178 MB/s and the second IO rate is only about 21 MB/s. Why is there such a big difference?

There are a number of factors that influence the IO rate. The latency of the drive where it seeks to a certain location on the disk, the actual speed that the drive can write the data on the disk, how fast the file system can translate the data into blocks for the drive controller, any llseek function calls that move the data location to a different point in the file. And believe it not, the amount of data written has an impact on the IO rate.

For very small amounts of data the amount of time to write the data is almost the same as the seek time or the latency of the drive. This is similar to sending messages across a network. That is, for very small packets, the time to send the data is about the same as the latency of the network.

Examining the time for the second fwrite, we can see that the time is very short and the amount of data is much much smaller than the first fwrite. We can observe that writing small amounts of data results in a very low IO rate. Consequently, we can conclude that it is good to avoid small writes if possible. This is because small amounts of data have low IO rates, which can slow down our code(s).


What Have We Learned?
I hope this blog has introduced you to using strace to examine the IO pattern of your code. For this simple example we found out that the resulting binary does a reasonably large write (397KB) followed by a very small write (2.7KB). The strace output can also be used to examine the IO rates of your code. For this simple serial code we found that the first fwrite function writes data at about 178 MB/s (caching effects included) and the second, much smaller fwrite, has an IO rate of 21 MB/s.

From these numbers it can be observed that it's a good idea to avoid small writes because the IO rate is so low. So if your code does a great deal of small writes, and there are many codes that do this, your overall IO throughput will be very low.

In the next blog I will show you how to use strace with an MPI code. I'll also continue to look at more complicated IO patterns. Until then take a few moments and use strace to look at the IO of you favorite codes. Just be careful because strace can produce a huge amount of data (and no, the drive manufacturers didn't pay me to write this blog).


Jeff -