High Performance Time Measurement in Linux

Posted: September 8, 2010 in Programming

Recently, I decided to take the MIT OCW Algorithms course, and I wanted to actually measure the performance of the various algorithms. So before I dove into it, I decided to come up with a setup for measuring the time taken. For this, we need high-precision time measurement. I have used the Read Time Stamp Counter (RDTSC) instruction, introduced with the Pentium processors, before. I had also heard about the High Precision Event Timer (HPET) introduced by Intel circa 2005. In this post we have a shootout between the two mechanisms.

The metrics we want to compare are

  • Resolution
  • Accuracy
  • Cost (in terms of CPU time)
  • Reliability

Before we get into the actual testing, let us understand how to use HPET and RDTSC. Here is how we use HPET, which we access through clock_gettime(), a POSIX standard API.

#include <time.h>

void TestHpet(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts); /* on older glibc you may need to link with -lrt */
}

And here is how we use the RDTSC instruction. With RDTSC, we read the number of CPU clock cycles from a counter (the Time Stamp Counter), which increments on every CPU clock. This does not directly translate to actual time: we first calibrate the number of CPU cycles per nanosecond and then divide the tick count by that calibrated value to get nanoseconds. Since it is not guaranteed that the TSC is synchronized across CPU cores, we bind our process to CPU 1 (I have a dual-core Intel T7500 CPU) to eliminate TSC mismatch between the two cores.

#define _GNU_SOURCE /* for sched_setaffinity() and the CPU_* macros */
#include <stdint.h> /* for uint64_t */
#include <time.h>   /* for struct timespec and clock_gettime() */
#include <sched.h>  /* for sched_setaffinity() */

/* assembly code to read the TSC */
static inline uint64_t RDTSC()
{
  unsigned int hi, lo;
  __asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t)hi << 32) | lo;
}

const int NANO_SECONDS_IN_SEC = 1000000000;
/* returns a static buffer of struct timespec with the time difference of ts1 and ts2
   ts1 is assumed to be greater than ts2 */
struct timespec *TimeSpecDiff(struct timespec *ts1, struct timespec *ts2)
{
  static struct timespec ts;
  ts.tv_sec = ts1->tv_sec - ts2->tv_sec;
  ts.tv_nsec = ts1->tv_nsec - ts2->tv_nsec;
  if (ts.tv_nsec < 0) {
    ts.tv_sec--;
    ts.tv_nsec += NANO_SECONDS_IN_SEC;
  }
  return &ts;
}

double g_TicksPerNanoSec;
static void CalibrateTicks()
{
  struct timespec begints, endts;
  uint64_t begin = 0, end = 0;
  clock_gettime(CLOCK_MONOTONIC, &begints);
  begin = RDTSC();
  volatile uint64_t i;
  for (i = 0; i < 1000000; i++); /* must be CPU intensive; volatile keeps the compiler from optimizing the loop away */
  end = RDTSC();
  clock_gettime(CLOCK_MONOTONIC, &endts);
  struct timespec *tmpts = TimeSpecDiff(&endts, &begints);
  uint64_t nsecElapsed = tmpts->tv_sec * 1000000000LL + tmpts->tv_nsec;
  g_TicksPerNanoSec = (double)(end - begin)/(double)nsecElapsed;
}

/* Call once before using RDTSC; has the side effect of binding the process to CPU 1 */
void InitRdtsc()
{
  cpu_set_t cpuMask;
  CPU_ZERO(&cpuMask);
  CPU_SET(1, &cpuMask); /* bind to CPU 1 */
  sched_setaffinity(0, sizeof(cpuMask), &cpuMask);
  CalibrateTicks();
}

void GetTimeSpec(struct timespec *ts, uint64_t nsecs)
{
  ts->tv_sec = nsecs / NANO_SECONDS_IN_SEC;
  ts->tv_nsec = nsecs % NANO_SECONDS_IN_SEC;
}

/* ts will be filled with time converted from TSC reading */
void GetRdtscTime(struct timespec *ts)
{
  GetTimeSpec(ts, RDTSC() / g_TicksPerNanoSec);
}
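
Putting it all together, here is a minimal sketch of how these helpers can be used to time a piece of code. The Workload() function is just a placeholder I have added for whatever you want to measure; it is not part of the measurement code itself.

#include <stdio.h>

/* placeholder for the code being measured */
static void Workload(void)
{
  volatile uint64_t sum = 0;
  uint64_t i;
  for (i = 0; i < 1000000; i++)
    sum += i;
}

int main(void)
{
  struct timespec begin, end, *diff;

  InitRdtsc(); /* binds the process to CPU 1 and calibrates ticks per nanosecond */

  GetRdtscTime(&begin);
  Workload();
  GetRdtscTime(&end);

  diff = TimeSpecDiff(&end, &begin);
  printf("Workload took %ld sec %ld nsec\n", (long)diff->tv_sec, diff->tv_nsec);
  return 0;
}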

Now back to our metrics. This is how each mechanism fares.

Resolution

The clock_gettime() API gives the result in a struct timespec, whose maximum granularity is one nanosecond. That is only what struct timespec can represent, though; the actual resolution varies with the implementation, and we can query it through the clock_getres() API. On my Dell XPS 1530 with an Intel Core 2 Duo T7500 CPU running Ubuntu 10.04, it reports a resolution of 1 nanosecond. The RDTSC instruction, on the other hand, can have a resolution of up to one CPU clock cycle. On my 2.2 GHz CPU that means a resolution of about 0.45 nanoseconds. Clearly RDTSC is the winner.
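
For reference, here is a small sketch of how that resolution can be queried via clock_getres(). The function name PrintClockRes and the output format are just for illustration.

#include <stdio.h>
#include <time.h>

void PrintClockRes(void)
{
  struct timespec res;
  clock_getres(CLOCK_MONOTONIC, &res); /* fills res with the clock's resolution */
  printf("CLOCK_MONOTONIC resolution: %ld sec %ld nsec\n", (long)res.tv_sec, res.tv_nsec);
}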

Accuracy

From my tests, both consistently gave the same results, agreeing with each other to within 5 nanoseconds. Since I have no other reference, I assume both are equally accurate. So no winner here.
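
The idea behind the comparison is simply to read both clocks back to back around the same piece of code and compare the elapsed times they report. A rough sketch follows (this is not the exact test code I used; note that TimeSpecDiff() returns a static buffer, so its result is copied before the second call).

void CompareClocks(void)
{
  struct timespec h1, h2, r1, r2, hpetDiff, rdtscDiff;

  clock_gettime(CLOCK_MONOTONIC, &h1);
  GetRdtscTime(&r1);
  /* ... code under test ... */
  clock_gettime(CLOCK_MONOTONIC, &h2);
  GetRdtscTime(&r2);

  hpetDiff  = *TimeSpecDiff(&h2, &h1); /* copy out of the static buffer */
  rdtscDiff = *TimeSpecDiff(&r2, &r1);
  /* in my tests the two differences agreed to within 5 nanoseconds */
}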

Cost

I ran a simple test case where I measured the time taken for 1 million calls each to HPET and RDTSC. Here is the result.

HPET : 1 sec 482 msec 188 usec 38 nsec
RDTSC: 0 sec 103 msec 311 usec 752 nsec

RDTSC is the clear winner in this case, being about 14 times cheaper than HPET.
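
The test itself boils down to something like the sketch below. This is a reconstruction of the idea rather than the exact code I ran; it times 1 million calls of each mechanism, uses clock_gettime() as the outer stopwatch for both loops, and assumes InitRdtsc() has already been called.

#include <stdio.h>

#define LOOP_COUNT 1000000

static void PrintElapsed(const char *name, struct timespec *begin, struct timespec *end)
{
  struct timespec *diff = TimeSpecDiff(end, begin);
  printf("%s: %ld sec %ld nsec\n", name, (long)diff->tv_sec, diff->tv_nsec);
}

void CostTest(void)
{
  struct timespec begin, end, scratch;
  int i;

  clock_gettime(CLOCK_MONOTONIC, &begin);
  for (i = 0; i < LOOP_COUNT; i++)
    clock_gettime(CLOCK_MONOTONIC, &scratch); /* HPET-backed call under test */
  clock_gettime(CLOCK_MONOTONIC, &end);
  PrintElapsed("HPET ", &begin, &end);

  clock_gettime(CLOCK_MONOTONIC, &begin);
  for (i = 0; i < LOOP_COUNT; i++)
    GetRdtscTime(&scratch);                   /* RDTSC-based call under test */
  clock_gettime(CLOCK_MONOTONIC, &end);
  PrintElapsed("RDTSC", &begin, &end);
}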

Reliability

A quick look at the Wikipedia entry for the Time Stamp Counter gives an idea of how unreliable RDTSC can be. Several factors affect it:

  • Multiple cores having different TSC values (we eliminated this by binding our process to 1 core)
  • CPU frequency scaling for power saving (we eliminated this by always being CPU intensive)
  • Hibernation of system will reset TSC value (we didn’t let our system hibernate)
  • Portability suffers because TSC behaviour varies across CPU implementations (we ran only on the same Intel CPU)

So for application programming, RDTSC seems to be quite unreliable. HPET, accessed through the POSIX-standard clock_gettime() API, is the clear winner.

Conclusion

Final score: RDTSC 2, HPET 1. But there is more to this. RDTSC definitely has reliability and portability issues and may not be very useful for regular application programming. I was affected by CPU frequency scaling during my own tests. In CalibrateTicks(), I initially used a sleep(1) to sleep for 1 second while calibrating the number of ticks per nanosecond. I got values ranging from 0.23 to 0.55 instead of 2.2 (or something very close to it, since my CPU runs at 2.2 GHz). Once I switched from sleep(1) to wasting CPU in a for loop, I got consistent readings of 2.198 ticks per nanosecond.

But RDTSC is 14 times cheaper than HPET. This can be useful for certain benchmarking exercises as long as one is aware of its pitfalls and is cautious.

Comments
  1. thanas says:

    Yeah, but clock_gettime is a syscall, meaning you have an overhead (as far as I know, the overhead induced by syscalls is not negligible). Maybe using asm directives and reading HPET directly would not give such a big difference in cost.

  2. scai says:

    On Linux you probably want to use CLOCK_MONOTONIC_RAW instead of CLOCK_MONOTONIC, requires Kernel 2.6.28 though.

  3. […] High Performance Time Measurement in Linux […]

  4. […] post I found was written by Aby Thankachan, an engineer from India who mentions the problems of multiple cores in modern CPU’s. This code is very easy to […]

  5. Andrea says:

    Hi,
    I stumbled over this old post while looking for a way to read the HPET on Linux – but it looks like it needs some corrections.

    1.) You say that reading HPET is a POSIX standard, but you are confusing two different things: clock_gettime(…) is a POSIX standard, but nowhere does it specify that HPET is its time source.
    In fact, on recent Linux kernels (RHEL 6) the time source is configurable, and by default the TSC is used – if it is found to be reliable (see below).

    So, if you used a recent cpu and kernel, you may actually have compared “raw” TSC with kernel-adjusted TSC 🙂

    2.) about the issues with TSC, recent CPUs and OSs fare much better:
    – the TSC is always running at a constant speed (the CPU’s “nominal” clock rate), independently of the actual clock rate (due to throttling)
    – reasonably recent Linux versions (tested on RHEL 5 and later, kernel 2.6.18 and later) synchronise the TSC of all the CPUs once they are brought up. Together with the constant rate (and a reliable clock on multi-socket systems) this makes sure that the TSC gives the same value even if the process migrates to a different CPU (core or socket)

    Bye,
    .Andrea

  6. Josema says:

    Hello, same question as Peter Senna:

    “Hi! Great code! Is it licensed under GPL or any similar license? Thank you!”

    I would like to ask for permission to test and use this code. Thanks in advance for your reply.

    Regards,

    Josema

  7. […] tried reading Time Stamp Counter to convert results to clock time but it doesn't work for me ex.1 […]

  8. AJ says:

    How do you use sleep to waste CPU for 1 second?

  9. Srujan says:

    What do you do to offset the effects of cache? How do you know that the TSC is simply not running out of cache?

  10. Saved as a favorite, I love your site!

  11. Daniel says:

    Great article, but in the InitRdtsc() function I needed to define the CPU mask
    cpu_set_t cpuMask;
    CPU_ZERO(&cpuMask);
    CPU_SET(2, &cpuMask);

  12. Do you have any video of that? I’d want to find out more details.
