High Performance Time Measurement in Linux

Posted: September 8, 2010 in Programming

Recently, I decided to take the MIT OCW Algorithms course, and I wanted to actually measure the performance of the various algorithms. So before I dove into it, I decided to come up with a setup for measuring the time taken. For this, we need high-precision time measurement. I have used the Read Time Stamp Counter (RDTSC) instruction, introduced with the Pentium processors, before. I had also heard about the High Precision Event Timer (HPET) introduced by Intel circa 2005. In this post we have a shootout between the two mechanisms.

The metrics we want to compare are

  • Resolution
  • Accuracy
  • Cost (in terms of CPU time)
  • Reliability

Before we get into the actual testing, let us understand how to use HPET and RDTSC. Here is how we use HPET, which we access through clock_gettime(), a POSIX standard API.

#include <time.h>

void TestHpet(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts); /* on older glibc you may need to link with -lrt */
}

And here is how we use the RDTSC instruction. With RDTSC, we read the number of CPU clock cycles from a counter (the Time Stamp Counter), which increments on every CPU clock. This does not directly translate to actual time: we first calibrate the number of CPU cycles per nanosecond and then divide the tick count by that calibrated value to get nanoseconds. Since it is not guaranteed that the TSC is synchronized across CPU cores, we bind our process to CPU 1 (I have a dual-core Intel T7500 CPU) to eliminate TSC mismatch between the two cores.

#define _GNU_SOURCE /* for sched_setaffinity() and the CPU_* macros */
#include <stdint.h> /* for uint64_t */
#include <time.h>   /* for struct timespec and clock_gettime() */
#include <sched.h>  /* for sched_setaffinity() */

/* assembly code to read the TSC */
static inline uint64_t RDTSC()
{
  unsigned int hi, lo;
  __asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t)hi << 32) | lo;
}

const int NANO_SECONDS_IN_SEC = 1000000000;
/* returns a static buffer of struct timespec with the time difference of ts1 and ts2
   ts1 is assumed to be greater than ts2 */
struct timespec *TimeSpecDiff(struct timespec *ts1, struct timespec *ts2)
{
  static struct timespec ts;
  ts.tv_sec = ts1->tv_sec - ts2->tv_sec;
  ts.tv_nsec = ts1->tv_nsec - ts2->tv_nsec;
  if (ts.tv_nsec < 0) {
    ts.tv_sec--;
    ts.tv_nsec += NANO_SECONDS_IN_SEC;
  }
  return &ts;
}

double g_TicksPerNanoSec;
static void CalibrateTicks()
{
  struct timespec begints, endts;
  uint64_t begin = 0, end = 0;
  clock_gettime(CLOCK_MONOTONIC, &begints);
  begin = RDTSC();
  volatile uint64_t i;
  for (i = 0; i < 1000000; i++); /* must be CPU intensive; volatile keeps the compiler from optimizing the loop away */
  end = RDTSC();
  clock_gettime(CLOCK_MONOTONIC, &endts);
  struct timespec *tmpts = TimeSpecDiff(&endts, &begints);
  uint64_t nsecElapsed = tmpts->tv_sec * 1000000000LL + tmpts->tv_nsec;
  g_TicksPerNanoSec = (double)(end - begin)/(double)nsecElapsed;
}

/* Call once before using RDTSC; has the side effect of binding the process to CPU 1 */
void InitRdtsc()
{
  cpu_set_t cpuMask;
  CPU_ZERO(&cpuMask);
  CPU_SET(1, &cpuMask); /* bind to CPU 1 */
  sched_setaffinity(0, sizeof(cpuMask), &cpuMask);
  CalibrateTicks();
}

void GetTimeSpec(struct timespec *ts, uint64_t nsecs)
{
  ts->tv_sec = nsecs / NANO_SECONDS_IN_SEC;
  ts->tv_nsec = nsecs % NANO_SECONDS_IN_SEC;
}

/* ts will be filled with time converted from TSC reading */
void GetRdtscTime(struct timespec *ts)
{
  GetTimeSpec(ts, RDTSC() / g_TicksPerNanoSec);
}
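
Putting it all together, here is a minimal sketch of how these helpers can be used to time a piece of code. The Workload() function is just a placeholder I have added for whatever you want to measure; it is not part of the measurement code itself.

#include <stdio.h>

/* placeholder for the code being measured */
static void Workload(void)
{
  volatile uint64_t sum = 0;
  uint64_t i;
  for (i = 0; i < 1000000; i++)
    sum += i;
}

int main(void)
{
  struct timespec begin, end, *diff;

  InitRdtsc(); /* binds the process to CPU 1 and calibrates ticks per nanosecond */

  GetRdtscTime(&begin);
  Workload();
  GetRdtscTime(&end);

  diff = TimeSpecDiff(&end, &begin);
  printf("Workload took %ld sec %ld nsec\n", (long)diff->tv_sec, diff->tv_nsec);
  return 0;
}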

Now back to our metrics. This is how each mechanism fares.

Resolution

The clock_gettime() API gives the result in a struct timespec, whose maximum granularity is one nanosecond. That is only what struct timespec can represent, though; the actual resolution varies with the implementation, and we can query it through the clock_getres() API. On my Dell XPS 1530 with an Intel Core 2 Duo T7500 CPU running Ubuntu 10.04, it reports a resolution of 1 nanosecond. The RDTSC instruction, on the other hand, can have a resolution of up to one CPU clock cycle. On my 2.2 GHz CPU that means a resolution of about 0.45 nanoseconds. Clearly RDTSC is the winner.
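
For reference, here is a small sketch of how that resolution can be queried via clock_getres(). The function name PrintClockRes and the output format are just for illustration.

#include <stdio.h>
#include <time.h>

void PrintClockRes(void)
{
  struct timespec res;
  clock_getres(CLOCK_MONOTONIC, &res); /* fills res with the clock's resolution */
  printf("CLOCK_MONOTONIC resolution: %ld sec %ld nsec\n", (long)res.tv_sec, res.tv_nsec);
}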

Accuracy

From my tests, both consistently gave the same results, agreeing with each other to within 5 nanoseconds. Since I have no other reference, I assume both are equally accurate. So no winner here.
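
The idea behind the comparison is simply to read both clocks back to back around the same piece of code and compare the elapsed times they report. A rough sketch follows (this is not the exact test code I used; note that TimeSpecDiff() returns a static buffer, so its result is copied before the second call).

void CompareClocks(void)
{
  struct timespec h1, h2, r1, r2, hpetDiff, rdtscDiff;

  clock_gettime(CLOCK_MONOTONIC, &h1);
  GetRdtscTime(&r1);
  /* ... code under test ... */
  clock_gettime(CLOCK_MONOTONIC, &h2);
  GetRdtscTime(&r2);

  hpetDiff  = *TimeSpecDiff(&h2, &h1); /* copy out of the static buffer */
  rdtscDiff = *TimeSpecDiff(&r2, &r1);
  /* in my tests the two differences agreed to within 5 nanoseconds */
}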

Cost

I ran a simple test case where I measured the time taken for 1 million calls each to HPET and RDTSC. Here is the result.

HPET : 1 sec 482 msec 188 usec 38 nsec
RDTSC: 0 sec 103 msec 311 usec 752 nsec

RDTSC is the clear winner in this case, being about 14 times cheaper than HPET.
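
The test itself boils down to something like the sketch below. This is a reconstruction of the idea rather than the exact code I ran; it times 1 million calls of each mechanism, uses clock_gettime() as the outer stopwatch for both loops, and assumes InitRdtsc() has already been called.

#include <stdio.h>

#define LOOP_COUNT 1000000

static void PrintElapsed(const char *name, struct timespec *begin, struct timespec *end)
{
  struct timespec *diff = TimeSpecDiff(end, begin);
  printf("%s: %ld sec %ld nsec\n", name, (long)diff->tv_sec, diff->tv_nsec);
}

void CostTest(void)
{
  struct timespec begin, end, scratch;
  int i;

  clock_gettime(CLOCK_MONOTONIC, &begin);
  for (i = 0; i < LOOP_COUNT; i++)
    clock_gettime(CLOCK_MONOTONIC, &scratch); /* HPET-backed call under test */
  clock_gettime(CLOCK_MONOTONIC, &end);
  PrintElapsed("HPET ", &begin, &end);

  clock_gettime(CLOCK_MONOTONIC, &begin);
  for (i = 0; i < LOOP_COUNT; i++)
    GetRdtscTime(&scratch);                   /* RDTSC-based call under test */
  clock_gettime(CLOCK_MONOTONIC, &end);
  PrintElapsed("RDTSC", &begin, &end);
}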

Reliability

A quick look at the Wikipedia entry for the Time Stamp Counter gives an idea of how unreliable RDTSC can be. Several factors affect it:

  • Multiple cores having different TSC values (we eliminated this by binding our process to 1 core)
  • CPU frequency scaling for power saving (we eliminated this by always being CPU intensive)
  • Hibernation of system will reset TSC value (we didn’t let our system hibernate)
  • Portability suffers because TSC behaviour varies across CPU implementations (we ran only on the same Intel CPU)

So for application programming, RDTSC seems to be quite unreliable. HPET, accessed through the POSIX-standard clock_gettime() API, is the clear winner.

Conclusion

Final score: RDTSC 2, HPET 1. But there is more to this. RDTSC definitely has reliability and portability issues and may not be very useful for regular application programming. I was affected by CPU frequency scaling during my own tests. In CalibrateTicks(), I initially used a sleep(1) to sleep for 1 second while calibrating the number of ticks per nanosecond. I got values ranging from 0.23 to 0.55 instead of 2.2 (or something very close to it, since my CPU runs at 2.2 GHz). Once I switched from sleep(1) to wasting CPU in a for loop, I got consistent readings of 2.198 ticks per nanosecond.

But RDTSC is 14 times cheaper than HPET. This can be useful for certain benchmarking exercises as long as one is aware of its pitfalls and is cautious.

Comments
  1. thanas says:

    Yeah, but clock_gettime is a syscall, meaning you have an overhead (as far as I know, the overhead induced by syscalls is not negligible). Maybe using asm directives and reading HPET directly would not give such a big difference in cost.

  2. scai says:

    On Linux you probably want to use CLOCK_MONOTONIC_RAW instead of CLOCK_MONOTONIC, requires Kernel 2.6.28 though.

  3. […] High Performance Time Measurement in Linux […]

  4. […] post I found was written by Aby Thankachan, an engineer from India who mentions the problems of multiple cores in modern CPU’s. This code is very easy to […]

  5. Andrea says:

    Hi,
    I stumbled over this old post while looking for a way to read the HPET on Linux – but it looks like it needs some corrections.

    1.) You say that reading HPET is a POSIX standard, but you are confusing two different things: clock_gettime(…) is a POSIX standard, but nowhere does it specify that HPET is its time source.
    In fact, on recent Linux kernels (RHEL 6) the time source is configurable, and by default the TSC is used – if it is found to be reliable (see below).

    So, if you used a recent cpu and kernel, you may actually have compared “raw” TSC with kernel-adjusted TSC 🙂

    2.) about the issues with TSC, recent CPUs and OSs fare much better:
    – the TSC is always running at a constant speed (the CPU’s “nominal” clock rate), independently of the actual clock rate (due to throttling)
    – reasonably recent Linux versions (tested on RHEL 5 and later, kernel 2.6.18 and later) synchronise the TSC of all the CPUs once they are brought up. Together with the constant rate (and a reliable clock on multi-socket systems) this makes sure that the TSC gives the same value even if the process migrates to a different CPU (core or socket)

    Bye,
    .Andrea

  6. Josema says:

    Hello, same question as Peter Senna:

    “Hi! Great code! Is it licensed under GPL or any similar license? Thank you!”

    I would like to ask for permission to test and use this code. Thanks in advance for your reply.

    Regards,

    Josema

  7. […] tried reading Time Stamp Counter to convert results to clock time but it doesn't work for me ex.1 […]

  8. AJ says:

    How do you use sleep to waste CPU for 1 second?

  9. Srujan says:

    What do you do to offset the effects of cache? How do you know that the TSC is simply not running out of cache?

  10. Saved as a favorite, I love your site!

  11. Daniel says:

    Great article, but in the InitRdtsc() function I needed to define the CPU mask
    cpu_set_t cpuMask;
    CPU_ZERO(&cpuMask);
    CPU_SET(2, &cpuMask);

  12. Do you have any video of that? I’d want to find out more details.
