Any reuse of this content must include the original link: http://oliveryang.net
- 1. Latency measurement in user space
- 2. Why use TSC?
- 3. Known TSC pitfalls
- 4. Conclusion
1. Latency measurement in user space
When user application developers work on performance-sensitive code, one common requirement is to do latency/time
measurement in their code. This kind of code could be temporary code for debugging, testing, or profiling purposes, or permanent
code that provides performance tracing data in production.
The Linux kernel provides the gettimeofday() and clock_gettime() system calls for high resolution time measurement in user
applications. The gettimeofday() call has microsecond resolution, while clock_gettime() has nanosecond resolution. However,
the major concern with these system calls is the additional performance cost of calling them.
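For example, a minimal latency measurement sketch with clock_gettime() might look like the code below (error handling omitted; CLOCK_MONOTONIC is chosen because it is not affected by wall clock adjustments),

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);

    /* put code you want to measure here */

    clock_gettime(CLOCK_MONOTONIC, &end);

    /* fold the (sec, nsec) pair into a single nanosecond value */
    long long ns = (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("latency: %lld ns\n", ns);
    return 0;
}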
In order to minimize the performance cost of the gettimeofday() and clock_gettime() system calls, the Linux kernel uses the
vsyscall (virtual system call) and VDSO (Virtual Dynamically linked Shared Object) mechanisms to avoid the cost
of switching from user to kernel mode. On x86, gettimeofday() and clock_gettime() get better performance
thanks to the vsyscalls kernel patch,
which avoids the context switch from user to kernel space. But some other architectures still have to follow the regular
system call code path. This is a hardware-dependent optimization.
For most TSC use cases, the vsyscall versions of gettimeofday() and clock_gettime() remove the major performance overheads:
they avoid the user/kernel context switch and try to use the rdtsc instruction to read the TSC register value directly.
Moreover, these system calls provide better portability and error handling. For example, on some platforms
where an undetectable TSC sync problem exists among multiple CPUs, the
gettimeofday() and clock_gettime() vsyscalls try to work around the problem.
2. Why use TSC?
Although the vsyscall implementations of gettimeofday() and clock_gettime() are faster than regular system calls, their cost
is still too high to meet the latency measurement requirements of some performance-sensitive applications.
The TSC (time stamp counter) provided by x86 processors is a high-resolution counter that can be read with a single
instruction (RDTSC). On Linux, this instruction can be executed from user space directly, which means user applications can
use one single instruction to get a fine-grained timestamp (nanosecond level) in a much faster way than with vsyscalls.
The following code is a typical implementation of an rdtsc() API in a user space application,
#include <stdint.h>

static uint64_t rdtsc(void)
{
    uint32_t hi, lo;

    /* rdtsc loads the low 32 bits into EAX and the high 32 bits into EDX */
    __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
The result of rdtsc is in CPU cycles, which can be converted to nanoseconds by a simple calculation,
ns = CPU cycles * (ns_per_sec / CPU freq)
The Linux kernel uses a more complex way to get better results,
/*
* Accelerators for sched_clock()
* convert from cycles(64bits) => nanoseconds (64bits)
* basic equation:
* ns = cycles / (freq / ns_per_sec)
* ns = cycles * (ns_per_sec / freq)
* ns = cycles * (10^9 / (cpu_khz * 10^3))
* ns = cycles * (10^6 / cpu_khz)
*
* Then we use scaling math (suggested by george@mvista.com) to get:
* ns = cycles * (10^6 * SC / cpu_khz) / SC
* ns = cycles * cyc2ns_scale / SC
*
* And since SC is a constant power of two, we can convert the div
* into a shift.
*
* We can use khz divisor instead of mhz to keep a better precision, since
* cyc2ns_scale is limited to 10^6 * 2^10, which fits in 32 bits.
* (mathieu.desnoyers@polymtl.ca)
*
* -johnstul@us.ibm.com "math is hard, lets go shopping!"
*/
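Following the same scaling idea, a user space cycle_2_ns() could be sketched as below. Note that cpu_khz here is an assumption: it has to be obtained elsewhere (e.g., parsed from /proc/cpuinfo or kernel logs), and the helper names are hypothetical,

#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10    /* SC = 2^10 */

static uint64_t cyc2ns_scale;

static void init_cyc2ns_scale(uint64_t cpu_khz)
{
    /* cyc2ns_scale = 10^6 * SC / cpu_khz */
    cyc2ns_scale = (1000000ULL << CYC2NS_SCALE_FACTOR) / cpu_khz;
}

static uint64_t cycle_2_ns(uint64_t cycles)
{
    /*
     * ns = cycles * cyc2ns_scale / SC, with the division turned into
     * a shift. The multiply can overflow for very large cycle counts;
     * the kernel avoids this with a 128-bit multiply.
     */
    return (cycles * cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
}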
Finally, the latency measurement code could be,
start = rdtsc();
/* put code you want to measure here */
end = rdtsc();
cycle = end - start;
latency = cycle_2_ns(cycle);
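One caveat: rdtsc is not a serializing instruction, so the CPU may reorder it relative to the code being measured. A common mitigation (a sketch, assuming GCC or Clang on x86) is to pair it with lfence; newer CPUs also offer the rdtscp instruction,

static uint64_t rdtsc_ordered(void)
{
    uint32_t hi, lo;

    /* lfence keeps earlier instructions from drifting past rdtsc */
    __asm__ volatile ("lfence\n\t"
                      "rdtsc" : "=a" (lo), "=d" (hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}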
In fact, the rdtsc implementations above are problematic and not encouraged by the Linux kernel.
The major reason is that the TSC mechanism is rather unreliable, and even the Linux kernel has had a hard time handling it.
That is why the Linux kernel does not provide an rdtsc API to user applications. However, the Linux kernel does not restrict the
rdtsc instruction to a privileged level, although x86 supports such a setup. That means nothing stops a
Linux application from reading the TSC directly with the above implementation, but such applications have to be prepared to handle some
strange TSC behaviors caused by some known pitfalls.
3. Known TSC pitfalls
3.1 TSC unstable hardware
3.1.1 CPU TSC capabilities
Intel CPUs have 3 sorts of TSC behaviors,
- Variant TSC
The first generation of TSC; its increment rate is impacted by CPU frequency changes.
This behavior exists on very old processors (P4).
- Constant TSC
The TSC increments at a constant rate even when the CPU frequency changes, but it may stop when the CPU enters a
deep C-state. Constant TSC is what pre-Nehalem processors support, and it is not as good as invariant TSC.
- Invariant TSC
The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior
moving forward. Invariant TSC only appears on Nehalem-and-later Intel processors.
See Intel 64 Architecture SDM Vol. 3A “17.12.1 Invariant TSC”.
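The invariant TSC capability can also be checked directly from user space. A minimal sketch (assuming GCC or Clang on x86) reads CPUID leaf 0x80000007, where EDX bit 8 reports the invariant TSC,

#include <stdio.h>
#include <cpuid.h>

static int has_invariant_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid() returns 0 if the leaf is not supported */
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return 0;
    return (edx >> 8) & 1;    /* EDX bit 8: invariant TSC */
}

int main(void)
{
    printf("invariant TSC: %s\n", has_invariant_tsc() ? "yes" : "no");
    return 0;
}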
Linux defines several CPU feature bits to describe these CPU differences,
- X86_FEATURE_TSC
The TSC is available in the CPU.
- X86_FEATURE_CONSTANT_TSC
The TSC increments at a constant rate regardless of CPU frequency changes.
- X86_FEATURE_NONSTOP_TSC
The TSC does not stop in deep C-states.
Both the CONSTANT_TSC and NONSTOP_TSC flags are set when the CPU has an invariant TSC.
Please refer to this kernel patch
for the implementation.
If the CPU does not have the “Invariant TSC” feature, TSC problems may occur when the kernel enables P-states or C-states,
also known as turbo boost, SpeedStep, or CPU power management features.
For example, if the NONSTOP_TSC feature is not detected by the Linux kernel, then when the CPU enters a deep C-state for power saving, the
Intel idle driver
will mark the TSC as unstable,
if (((mwait_cstate + 1) > 2) &&
    !boot_cpu_has(X86_FEATURE_NONSTOP_TSC))
    mark_tsc_unstable("TSC halts in idle"
            " states deeper than C2");
The ACPI CPU idle driver has similar logic that checks NONSTOP_TSC for deep C-states.
Use the command below on Linux to check these CPU capabilities,
$ cat /proc/cpuinfo | grep -E "constant_tsc|nonstop_tsc"
- X86_FEATURE_TSC_RELIABLE
A synthetic flag indicating that TSC sync checks should be skipped.
CPU feature bits can only indicate TSC stability on a UP (uniprocessor) system. On an SMP system, there is no explicit way
to ensure TSC reliability; the TSC sync test is the only way to test SMP TSC reliability.
However, some virtualization solutions do provide a good TSC sync mechanism. In order to handle false
positive test results there, VMware created a new synthetic
TSC_RELIABLE feature bit
in the Linux kernel to bypass TSC sync testing. This flag is also used by other kernel components to bypass TSC sync
testing. The command below can be used to check this new synthetic CPU feature,
$ cat /proc/cpuinfo | grep "tsc_reliable"
If this feature bit is set on the CPU, we should be able to trust the TSC source on the platform. But keep in
mind that software bugs in TSC handling could still cause problems.