I think you should check the MACHINE_CLEARS.SMC performance counter (part of the MACHINE_CLEARS event) of the CPU. It is available on Sandy Bridge, which is used in your Air powerbook, and also on your Xeon, which is Nehalem (search for “smc”). You can use oprofile, perf, or Intel’s VTune to find its value:
Machine Clears
Metric Description
Certain events require the entire pipeline to be cleared and restarted from just after the last retired instruction. This metric measures three such events: memory ordering violations, self-modifying code, and certain loads to illegal address ranges.
Possible Issues
A significant portion of execution time is spent handling machine clears. Examine the MACHINE_CLEARS events to determine the specific cause.
MACHINE_CLEARS Event Code: 0xC3
SMC Mask: 0x04
Self-modifying code (SMC) detected.
Number of self-modifying-code machine clears detected.
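As a rough sketch of how the event code and mask quoted above could be turned into something a profiler accepts: Linux perf takes raw events in the form `r<umask><event>` (hex). The helper name `perf_raw_event` is my own, and the 0xC3/0x04 values are simply the ones from the VTune description, so check them against the event list for your exact CPU model.

```python
def perf_raw_event(event_code: int, umask: int) -> str:
    """Build a perf raw-event string (r<umask><event>, hex) from an
    event select code and a unit mask. Hypothetical helper for illustration."""
    if not (0 <= event_code <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event code and umask must each fit in one byte")
    # perf packs the umask into bits 8-15 and the event select into bits 0-7.
    return f"r{(umask << 8) | event_code:04x}"


if __name__ == "__main__":
    # MACHINE_CLEARS (0xC3) with the SMC mask (0x04), as quoted above.
    print(perf_raw_event(0xC3, 0x04))
```

The resulting string can then be passed to perf, for example `perf stat -e r04c3 ./your_program` (where `./your_program` stands for your test binary).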
Intel also says this about SMC (http://software.intel.com/en-us/forums/topic/345561, linked from Intel Performance Bottleneck Analyzer’s taxonomy):
This event fires when self-modifying code is detected. This can be typically used by folks who do binary editing to force it to take certain path (e.g. hackers). This event counts the number of times that a program writes to a code section. Self-modifying code causes a severe penalty in all Intel 64 and IA-32 processors. The modified cache line is written back to the L2 and LLC caches. Also, the instructions would need to be re-loaded hence causing performance penalty.
I think you will see some such events. If there are any, then the CPU was able to detect the act of self-modifying the code and raised a “Machine Clear”, a full restart of the pipeline. The first stages are fetch, and they will ask the L2 cache for the new opcode. I’m very interested in the exact count of SMC events per execution of your code; this will give us some estimate of the latencies. (SMC is counted in units where 1 unit is assumed to be 1.5 CPU cycles; see B.6.2.6 of the Intel optimization manual.)
We can see that Intel says “restarted from just after the last retired instruction”, so I think the last retired instruction will be the mov, and your nops are already in the pipeline. But the SMC will be raised at the mov’s retirement, and it will kill everything in the pipeline, including the nops.
This SMC-induced pipeline restart is not cheap. Agner has some measurements in Optimizing_assembly.pdf, “17.10 Self-modifying code (All processors)” (I think any Core2/CoreiX is like PM here):
The penalty for executing a piece of code immediately after modifying it is approximately 19 clocks for P1, 31 for PMMX, and 150-300 for PPro, P2, P3, PM. The P4 will purge the entire trace cache after self-modifying code. The 80486 and earlier processors require a jump between the modifying and the modified code in order to flush the code cache.
…Self-modifying code is not considered good programming practice. It should be used only if
the gain in speed is substantial and the modified code is executed so many times that the
advantage outweighs the penalties for using self-modifying code.
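To get a feeling for the scale, here is a back-of-the-envelope sketch that multiplies a measured count of SMC machine clears by Agner's 150-300 clock range. The function name and the example count of 1000 are made up for illustration, and applying the PPro/P2/P3/PM range to newer cores is only my assumption from above, so treat the result as an order-of-magnitude estimate.

```python
def smc_penalty_cycles(smc_clears: int, low: int = 150, high: int = 300):
    """Return a (low, high) estimate of clocks lost to SMC machine clears.

    The default 150-300 clocks per clear is the PPro/P2/P3/PM figure from
    Agner's manual; using it for Core2/Core iX is an assumption.
    """
    if smc_clears < 0:
        raise ValueError("smc_clears must be non-negative")
    return smc_clears * low, smc_clears * high


if __name__ == "__main__":
    # Hypothetical count, e.g. as read from MACHINE_CLEARS.SMC for one run.
    lo, hi = smc_penalty_cycles(1000)
    print(f"estimated penalty: {lo}..{hi} clocks")
```

Comparing this range against the total cycle count of the run should show whether the machine clears can plausibly explain the slowdown you measured.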
Using different linear addresses to defeat the SMC detector was recommended here:
https://stackoverflow.com/a/10994728/196561 – I’ll try to find actual intel docume