Perf stats for doing "nothing"
Posted on February 11, 2012 with tags tech.
I’ve recently discovered the perf Linux tool. I had heard that oprofile was deprecated and that there was a new tool, and I made a note to try it sometime.
Updated: more languages, fixed typos, more details, some graphs. Apologies if this shows twice in your feed.
The problem with perf stats is that I hate bloat, or even perceived bloat. Even when it doesn’t affect me in any way, the concept of wasted cycles makes me really sad.
You can probably guess where this is going… I said, well, let’s see what perf says about a simple “null” program. Surely doing nothing should take just a small number of instructions, right?
Note: I think that perf also records kernel-side code, because the lowest I could get was about ~50K instructions for starting a null program written in assembler that doesn’t use libc and just executes the syscall asm instruction. However, these ~50K instructions become noise the moment you start using more high-level languages. Yes, this is expected, but I was still shocked. And there’s a lot of variation between languages I’d have expected to behave roughly identically.
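For reference, such a no-libc exit-only program for x86-64 Linux might look like the sketch below (this is my reconstruction, not necessarily the exact file from the repo):

```
$ cat null.s
        .text
        .globl _start
_start:                         # no libc, no CRT: the kernel starts us here
        movq    $60, %rax       # __NR_exit on x86-64
        xorq    %rdi, %rdi      # exit code 0
        syscall                 # never returns
$ as -o null.o null.s && ld -o null-asm null.o
```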
Again, this is not important in the real world. At all. These are just numbers, and the noise (due to the very short runtimes) probably has a large influence on the results. And I might have screwed up the measurements somehow.
Test setup
Each program was the equivalent of ‘exit 0’ in the appropriate form for the language. During the measurements, the machine was as idle as possible (single-user mode, measurements run at real-time priority, etc.). For compiled languages, -O2 was used. For scripts, a simple #!/path/to/interpreter shebang (without options, except in the case of Python, see below) was used. Each program/script was run 500 times (perf’s -r 500), and I checked that the variations were small (±0.80% on the metrics I used).
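Concretely, a run looked something like the sketch below (the exact real-time priority is my guess; perf stat without an explicit event list prints cycles, instructions, branches, branch misses, etc., along with the mean and relative deviation across the -r repetitions):

```
$ gcc -O2 -o null-c null.c
# 500 repetitions at real-time (SCHED_FIFO) priority; perf prints the
# mean of each counter and its variation across the runs
$ chrt -f 50 perf stat -r 500 ./null-c
```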
You can find all the programs I’ve used at http://git.k1024.org/perf-null.git/; the current tests are at tag perf-null-0.1.
The raw data for the below tables/graphs is at perf-null/log-4.
Results
Compiled languages
Language | Cycles | Instructions |
---|---|---|
asm | 63K | 51K |
c-dietlibc | 74K | 57K |
c-libc-static | 177K | 107K |
c-libc-shared | 506K | 300K |
c++-static | 178K | 107K |
c++-dynamic | 1,750K | 1,675K |
haskell-single | 2,229K | 1,338K |
haskell-threaded | 2,629K | 1,522K |
ocaml-bytecode | 3,271K | 2,741K |
ocaml-native | 1,042K | 666K |
Going from dietlibc to glibc doubles the number of instructions, and for glibc, going from static to dynamic linking almost triples it again (107K → 300K instructions). I didn’t manage to compile a program dynamically-linked against dietlibc.
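For illustration, the three C variants are just one trivial source built three ways; a sketch (dietlibc invoked via its diet compiler wrapper):

```
$ cat null.c
int main(void) { return 0; }
$ diet gcc -O2 -o c-dietlibc null.c        # dietlibc, statically linked
$ gcc -O2 -static -o c-libc-static null.c  # glibc, static
$ gcc -O2 -o c-libc-shared null.c          # glibc, dynamic (the default)
```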
C++ is interesting. Linked statically, it is in the same ballpark as C, but when linked dynamically, it executes an order of magnitude (!) more instructions. I would guess that the initialisation of the standard C++ library is complex?
Haskell, which has a GC and quite a complex runtime, executes slightly fewer instructions than C++, but uses more cycles. Not bad, given the capabilities of the runtime. The two versions of the Haskell program use the single-threaded and the multi-threaded runtime, respectively; there’s not much difference between them. A fully statically-linked Haskell binary (not usually recommended) goes below 1M instructions, but not by much.
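The two Haskell variants differ only in which GHC runtime gets linked in; my sketch of the program and build (not necessarily the repo’s exact flags):

```
$ cat Null.hs
main :: IO ()
main = return ()
$ ghc -O2 -o haskell-single Null.hs
$ ghc -O2 -threaded -o haskell-threaded Null.hs   # multi-threaded RTS
```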
OCaml is a very nice surprise. The bytecode runtime is a bit slow to start up, but the (native) compiled version is quite fast to start: only about 2× the instructions and cycles of C, for an advanced language. And twice as fast as Haskell ☺. Nice!
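Similarly for OCaml, the two table rows are just the two compiler drivers run over the same source (sketch):

```
$ cat null.ml
let () = exit 0
$ ocamlc   -o ocaml-bytecode null.ml   # bytecode, executed by ocamlrun
$ ocamlopt -o ocaml-native   null.ml   # compiled to machine code
```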
Shells
Language | Cycles | Instructions |
---|---|---|
dash | 766K | 469K |
bash | 1,680K | 1,044K |
mksh | 1,258K | 942K |
mksh-static | 504K | 322K |
So, dash takes ~470K instructions to start, which is way below the C++ count and a bit higher than the C one. Hence, I’d guess that dash is implemented in C ☺.
Next, bash is indeed slower on startup than dash, and by slightly more
than 2× (both instructions and cycles). So yes, switching /bin/sh
from bash to dash makes sense.
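On Debian, switching is a one-liner (assuming the dash package’s debconf question, which re-points the /bin/sh symlink):

```
$ sudo dpkg-reconfigure dash   # answer "yes" to dash providing /bin/sh
$ readlink /bin/sh             # should now print "dash"
```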
I wasn’t aware of mksh, so thanks for the comments. In its static variant, it is more efficient than dash, by about 1.5×. However, the dynamically-linked version doesn’t look so great (dash is also dynamically linked; I would guess a statically-linked dash would “beat” mksh-static).
Text processing
I’ve added perl here (even though it’s a ‘full’ language) just for comparison; it’s also in the next section.
Language | Cycles | Instructions |
---|---|---|
mawk | 849K | 514K |
gawk | 1,363K | 980K |
perl | 2,946K | 2,213K |
A normal spread. I knew that the reason mawk is Priority: required in Debian is that it’s faster than gawk, but I wouldn’t have guessed it’s almost twice as fast.
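My guess at the awk test script is below; note that awk needs -f in the shebang, and an empty program would sit waiting for stdin, so exiting from a BEGIN block is the natural ‘exit 0’ here:

```
$ cat null.mawk
#!/usr/bin/mawk -f
BEGIN { exit 0 }    # exit before awk tries to read any input
$ chmod +x null.mawk && perf stat -r 500 ./null.mawk
```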
Interpreted languages
Here is where the fun starts…
Language | Cycles | Instructions |
---|---|---|
lua 5.1 | 1,947K | 1,485K |
lua 5.2 | 1,724K | 1,335K |
lua jit | 1,209K | 803K |
perl | 2,946K | 2,213K |
tcl 8.4 | 5,011K | 4,552K |
tcl 8.5 | 6,888K | 6,022K |
tcl 8.6 | 8,196K | 7,236K |
ruby 1.8 | 7,013K | 6,128K |
ruby 1.9.3 | 35,870K | 35,022K |
python 2.6 -S | 11,752K | 10,247K |
python 2.7 -S | 11,438K | 10,198K |
python 3.2 -S | 29,003K | 27,409K |
pypy -S | 21,106K | 10,036K |
python 2.6 | 25,143K | 21,989K |
python 2.7 | 47,325K | 50,217K |
python 2.7 -O | 47,341K | 50,185K |
python 3.2 | 113,567K | 124,133K |
python 3.2 -O | 113,424K | 124,133K |
pypy | 90,779K | 68,455K |
The numbers here are not quite what I expected. There’s a huge delta between the fastest (hi Lua!) and the slowest (bye Python!).
I wasn’t familiar with Lua, so I tested it thanks to the comments. It is, I think, the only language here that actually improves from one version to the next (bonus points), and where the JIT version also makes it faster. To put it in context, LuaJIT starts faster than (dynamically-linked) C++.
Perl is the one that goes above C++’s instruction count, but not by much. From the point of view of the system, a Perl ‘hello world’ is only about 1.3×-1.6× slower than a C++ one. Not bad, not bad.
The next category is composed of TCL and Ruby, both of which had older versions 2-3× slower than Perl, but whose most recent versions are slower still. TCL slows down almost linearly across versions (5.0M, 6.9M, 8.2M cycles), but Ruby seems to have taken a significant step backwards: 1.9.3 is 5× slower than 1.8. I wonder why? As for TCL, I didn’t expect it to be slower to start than Perl; good to know.
The last category is Python. Oh my. If you run perf stat python -c 'pass' you get some unbelievable numbers, like 50M instructions to do, well, nothing. Yes, it has a GC, and yes, it imports modules at runtime, but still… On closer investigation, the site module and the imports it performs eat a lot of time. Running the simpler python -S brings it back to a more reasonable 10M instructions, which is in line with the other interpreted languages.
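This is easy to reproduce; -v also makes the interpreter log every import, so you can count what site pulls in (commands are illustrative):

```
$ perf stat -r 500 python2.7 -c 'pass'       # full startup, site.py and all
$ perf stat -r 500 python2.7 -S -c 'pass'    # skip the site module
$ python2.7 -v -c 'pass' 2>&1 | grep -c '^import'   # count startup imports
```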
However, even with -S taken into account, Python also slows down across versions: a tiny improvement from 2.6 to 2.7, but (like Ruby) a roughly 3× slowdown from 2.7 to 3.2. Trying the “optimised” version (-O) doesn’t help at all. Trying PyPy, which is based on Python 2.7, makes it around 2× slower to start (both with and without -S).
So among the interpreted languages, it seems only Lua is trying to improve; the rest are piling up bloat with every version. Note: I should have tried multiple Perl versions too.
Java
Java is in its own category; you can guess why ☺, right?
GCJ was version 4.6, whereas by java below I mean OpenJDK Runtime Environment (IcedTea6 1.11) (6b24-1.11-4).
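The Java test would have been along these lines (the class name and the exact gcj invocation are my guesses):

```
$ cat Null.java
public class Null {
    public static void main(String[] args) { }   // nothing to do
}
$ javac Null.java
$ perf stat -r 500 java -server Null
$ gcj-4.6 --main=Null -O2 -o null-gcj Null.java   # ahead-of-time native binary
$ perf stat -r 500 ./null-gcj
```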
Language | Cycles | Instructions |
---|---|---|
null-gcj | 97,156K | 74,576K |
java -jamvm | 85,535K | 80,102K |
java -server | 147,174K | 136,803K |
java -zero | 132,967K | 124,977K |
java -cacao | 229,799K | 205,312K |
Using gcj to compile to “native code” (not sure whether that’s native-native or something else) results in a binary that uses less than 100M cycles to start, but the jamvm VM is faster still (85M cycles). Not bad for Java! Python 3.2 is slower to start; yes, I think the world has gone crazy.
However, the other VMs are a few times slower: server (the default one) is ~150M cycles, and cacao is ~230M cycles. Wow.
The other thing about Java is that it was the only language tested that couldn’t nicely be put in a file that you just ‘exec’ (there is binfmt_misc, indeed, but that doesn’t allow different Java classes to use different Java VMs, so I don’t count it), as opposed to every single other thing I tested here. Someone didn’t grow up on Unix?
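The usual workaround is a trivial shell wrapper per program, which at least lets each script pick its own VM, at the cost of an extra shell startup; a sketch (names mine):

```
$ cat null-java
#!/bin/sh
# choose the VM per wrapper, something binfmt_misc can't do per class
exec java -server -cp "$(dirname "$0")" Null
$ chmod +x null-java && ./null-java
```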
Comparative analysis
Since there are almost four orders of magnitude of difference between the things tested here, a graph of raw cycles or instructions is not really useful. However, cycles per instruction, branch percentage, and mispredicted-branch percentage are. So, first, cycles per instruction:
PyPy jumps out of the graph here, with the top value of over 2 cycles/instruction. LuaJIT is also higher than plain Lua, so maybe there’s something to this (mostly joking; two data points don’t make a series). On the other hand, Python wins as best cycles/instruction (0.91). Lots of ILP, to get below 1?
Java gets, irrespective of VM, consistently near 1.0-1.1. C++ gets very different numbers between static linking (1.666) and dynamic linking (1.045), whereas C has basically identical numbers. mksh also has a difference between dynamic and static linking. Hmm…
Ruby, TCL and Python have consistent values across versions.
And that’s about what I can see from that graph. Next up, percentage of branches out of total instructions and percentage of branches missed:
Note that the two lines shouldn’t really be on the same graph; for the branch %, the 100% is the total instructions count, but for the branch miss %, the 100% is the total branch count. Anyway.
There are two low-value outliers:
- dynamically-linked C++ has a low branch percentage (17.46%) and a very low branch miss percentage (only 4.32%)
- gcj-compiled java has a very low branch miss percentage (only 2.82%!!!), even though it has a “regular” branch percentage (20.85%)
So it seems the gcj libraries are well optimised? I’m not familiar enough with this topic, but on the graph it does indeed stand out.
On the other end, mksh-static has a high branch miss percentage: 11.60%, clearly ahead of all the others; this might also be why it has a high cycles/instruction count, due to all the misprediction stalls. One has to wonder what confuses the branch predictor there.
I find it interesting that the overall branch percentage is very similar across languages, both when most of the cost is in the kernel (e.g. asm) and when the user-space cost heavily outweighs the kernel one (e.g. Java). The average is 20.85%, the minimum 17.46%, the maximum 22.93%, and the standard deviation (if I used gnumeric correctly) just 0.01, i.e. about one percentage point. This seems a bit suspicious to me ☺. On the other hand, the mispredicted-branch percentage varies much more: from a measly 2.82% to 11.60%, about a 4× difference.
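For a quick sanity check without a spreadsheet, the mean and (population) standard deviation can be recomputed with awk from a file holding one branch fraction per line (file name hypothetical):

```
$ awk '{ s += $1; ss += $1 * $1; n++ }
       END { m = s / n; printf "mean %.4f  stddev %.4f\n", m, sqrt(ss / n - m * m) }' \
      branch-pct.txt
```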
Summary
So to recap, counting just instructions:
- going from dietlibc to glibc: 2× increase
- going from statically-linked libc to dynamically-linked libc: ~3× on top of that
- going from C to C++: 5× increase
- C++ to Perl: 1.3×
- Perl to Ruby: 3×
- Ruby to Python (-S): 1.6×
- Python -S to regular Python: 5×
- Python to Java: 1×-2×, depending on version/runtime
- branch percentage (per total instructions) is quite consistent across all of the programs
Overall, you get roughly three orders of magnitude slower startup going from a plain C program using dietlibc to Python. And all that to do basically nothing.
On the other hand, I learned some interesting things while doing it, so it wasn’t quite for nothing ☺.