Perf stats for "doing nothing"

I've recently discovered the perf Linux tool. I had heard that oprofile was deprecated in favour of a new tool, and made a note to try it sometime.

Updated: more languages, fixed typos, more details, some graphs. Apologies if this shows twice in your feed.

The problem with running perf stat is that I hate bloat, or even perceived bloat. Even when it doesn't affect me in any way, the concept of wasted cycles makes me really sad.

You can probably guess where this is going… I said, well, let's see what perf says about a simple "null" program. Surely doing nothing should take just a small number of instructions, right?

Note: I think that perf also records kernel-side code, because the lowest I could get was about ~50K instructions for starting a null program in assembler that doesn't use libc and just executes the exit syscall. However, these ~50K instructions become noise the moment you start to use higher-level languages. Yes, this is expected, but I was still shocked. And there's a lot of variation between languages I'd expected to behave roughly identically.
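For reference, the kind of "null" assembler program I mean looks roughly like this (a sketch for x86-64 Linux; the actual sources are in the repository linked below and may differ in detail):

    # write out a libc-free null program and measure it (x86-64 Linux)
    cat > null.s <<'EOF'
    .globl _start
    _start:
        movq $60, %rax      # __NR_exit on x86-64
        xorq %rdi, %rdi     # exit status 0
        syscall
    EOF
    as -o null.o null.s     # assemble
    ld -o null null.o       # link, no libc involved
    perf stat ./null        # the counts include kernel-side work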

Again, this is not important in the real world. At all. They are just numbers, and the noise (due to the short runtimes) probably has a large influence on the results. And I might have screwed up the measurements somehow.

Test setup

Each program was the equivalent of 'exit 0' in the appropriate form for the language. During the measurements, the machine was as idle as possible (single-user mode, measurements run at real-time priority, etc.). For compiled languages, -O2 was used. For scripts, a simple #!/path/to/interpreter (without options, except in the case of Python, see below) was used. Each program/script was run 500 times (perf's -r 500) and I checked that the variations were small (±0.80% on the metrics I used).
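In concrete terms, each measurement was along these lines (a sketch, not the exact harness; the real-time priority and the CPU-0 pinning mentioned in the comments would map to chrt and taskset, and ./null-program stands for whichever test binary or script):

    # pin to one CPU, run at real-time (FIFO) priority,
    # and repeat the measurement 500 times
    taskset -c 0 chrt -f 50 perf stat -r 500 ./null-program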

You can find all the programs I've used at http://git.k1024.org/perf-null.git/; the current tests correspond to the tag perf-null-0.1.

The raw data for the tables/graphs below is at log-4.

Results

Compiled languages

Language           Cycles     Instructions
asm                   63K          51K
c-dietlibc            74K          57K
c-libc-static        177K         107K
c-libc-shared        506K         300K
c++-static           178K         107K
c++-dynamic        1,750K       1,675K
haskell-single     2,229K       1,338K
haskell-threaded   2,629K       1,522K
ocaml-bytecode     3,271K       2,741K
ocaml-native       1,042K         666K

Going from dietlibc to glibc doubles the number of instructions, and for glibc, going from static to dynamic linking roughly doubles it again. I didn't manage to build a program dynamically linked against dietlibc.
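For the curious, the three C variants differ only in how the same trivial source is built; roughly (a sketch, assuming dietlibc's diet wrapper is installed):

    # one exit(0) program, three libc/linking combinations
    printf 'int main(void) { return 0; }\n' > null.c
    diet gcc -O2 -o c-dietlibc null.c         # dietlibc, static
    gcc -O2 -static -o c-libc-static null.c   # glibc, static
    gcc -O2 -o c-libc-shared null.c           # glibc, dynamic (the default)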

C++ is interesting. Linked statically, it is in the same ballpark as C, but linked dynamically it executes an order of magnitude (!) more instructions. I would guess the initialisation of the standard C++ library is complex?

Haskell, which has a GC and quite a complex runtime, executes slightly fewer instructions than dynamic C++, but uses more cycles. Not bad, given the capabilities of the runtime. The two versions of the Haskell program use the single-threaded and the multi-threaded runtime respectively; there's not much difference. A fully statically-linked Haskell binary (not usually recommended) goes below 1M instructions, but not by much.
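The two Haskell variants differ only in which runtime they link against; roughly (a sketch):

    # GHC links the single-threaded runtime by default;
    # -threaded selects the multi-threaded one
    printf 'main :: IO ()\nmain = return ()\n' > null.hs
    ghc -O2 -o haskell-single null.hs
    ghc -O2 -threaded -o haskell-threaded null.hs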

OCaml is a very nice surprise. The bytecode runtime is a bit slow to start up, but the native-code compiled version is quite fast to start: only about 2× the instructions and cycles of C, for an advanced language. And twice as fast as Haskell ☺. Nice!
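(The two OCaml rows are simply the two standard compilers; a sketch:)

    # bytecode (runs under the ocamlrun interpreter) vs. native code
    printf 'let () = exit 0\n' > null.ml
    ocamlc   -o ocaml-bytecode null.ml
    ocamlopt -o ocaml-native   null.ml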

Shells

Language       Cycles     Instructions
dash             766K         469K
bash           1,680K       1,044K
mksh           1,258K         942K
mksh-static      504K         322K

So, dash takes ~470K instructions to start, which is way below the C++ count and a bit higher than the C one. Hence, I'd guess that dash is implemented in C ☺.

Next, bash is indeed slower on startup than dash, and by slightly more than 2× (both instructions and cycles). So yes, switching /bin/sh from bash to dash makes sense.

I wasn't aware of mksh, so thanks for the comments. In its static variant, it is more efficient than dash, by about 1.5×. However, the dynamically-linked version doesn't look as good (dash is also dynamically linked; I would guess a statically-linked dash would "beat" mksh-static).

Text processing

I've added perl here (even though it's a 'full' language) just for comparison; it's also in the next section.

Language     Cycles     Instructions
mawk           849K         514K
gawk         1,363K         980K
perl         2,946K       2,213K

A normal spread. I knew that the reason mawk is Priority: required in Debian is that it's faster than gawk, but I wouldn't have guessed it's almost twice as fast.

Interpreted languages

Here is where the fun starts…

Language         Cycles     Instructions
lua 5.1            1,947K       1,485K
lua 5.2            1,724K       1,335K
lua jit            1,209K         803K
perl               2,946K       2,213K
tcl 8.4            5,011K       4,552K
tcl 8.5            6,888K       6,022K
tcl 8.6            8,196K       7,236K
ruby 1.8           7,013K       6,128K
ruby 1.9.3        35,870K      35,022K
python 2.6 -S     11,752K      10,247K
python 2.7 -S     11,438K      10,198K
python 3.2 -S     29,003K      27,409K
pypy -S           21,106K      10,036K
python 2.6        25,143K      21,989K
python 2.7        47,325K      50,217K
python 2.7 -O     47,341K      50,185K
python 3.2       113,567K     124,133K
python 3.2 -O    113,424K     124,133K
pypy              90,779K      68,455K

The numbers here are not quite what I expected. There's a huge delta between the fastest (hi Lua!) and the slowest (bye Python!).

I wasn't familiar with Lua, so thanks to the comments I tested it. It is, I think, the only language here that actually improves from one version to the next (bonus points), and the JIT version is faster still. To put that in context: LuaJIT starts faster than dynamically-linked C++.

Perl is the first to go above C++'s instruction count, but not by much. From the point of view of the system, a Perl 'hello world' is only about 1.3×-1.6× slower than a C++ one. Not bad, not bad.

The next category is composed of TCL and Ruby, whose older versions were 2-3× slower than Perl, but whose most recent versions are slower still. TCL has an almost constant slowdown across versions (5M, 6.9M, 8.2M cycles), but Ruby seems to have taken a significant step backwards: 1.9.3 is 5× slower than 1.8. I wonder why? As for TCL, I didn't expect it to be slower to start up than Perl; good to know.

The last category is Python. Oh my. If you run perf stat python -c 'pass' you get some unbelievable numbers, like 50M instructions to do, well, nothing. Yes, it has a GC, yes, it imports modules at runtime, but still… On closer investigation, the site module and the imports it performs eat a lot of time. Running the simpler python -S brings it back to a more reasonable 10M instructions, in line with the other interpreted languages.
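The comparison is easy to reproduce; something like (a sketch):

    # default startup imports the site module ...
    perf stat -r 500 python2.7 -c 'pass'
    # ... while -S skips it, for a roughly 5x lower instruction count
    perf stat -r 500 python2.7 -S -c 'pass'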

However, even with -S taken into account, Python also slows down across versions: a tiny improvement from 2.6 to 2.7, but (like Ruby) a ~3× slowdown from 2.7 to 3.2. Trying the “optimised” version (-O) doesn't help at all. pypy, which is based on Python 2.7, is around 2× slower to start than CPython 2.7 (both with and without -S).

So among the interpreted languages, it seems only Lua is trying to improve; the rest are piling up bloat with every version. Note: I should have tried multiple Perl versions too.

Java

Java is in its own category; you can guess why ☺, right?

GCJ was version 4.6, whereas by java below I mean OpenJDK Runtime Environment (IcedTea6 1.11) (6b24-1.11-4).

Language         Cycles     Instructions
null-gcj        97,156K      74,576K
java -jamvm     85,535K      80,102K
java -server   147,174K     136,803K
java -zero     132,967K     124,977K
java -cacao    229,799K     205,312K

Using gcj to compile to “native code” (I'm not sure whether that's truly native or something else) results in a binary that needs less than 100M cycles to start, and the jamvm VM is even faster than that (85M cycles). Not bad for Java! Python 3.2 is slower to start; yes, I think the world has gone crazy.
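(For reference, the gcj build is roughly the following sketch; the class name here is my invention:)

    # compile Java source ahead-of-time to a native binary;
    # --main names the class whose main() becomes the entry point
    gcj -O2 --main=Null -o null-gcj Null.java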

However, the other VMs are a few times slower: server (the default one) is ~150M cycles, and cacao is ~230M cycles. Wow.

The other thing about Java is that, as opposed to every single other thing I tested here, it was the only one that couldn't be put nicely in a file that you just 'exec' (there is binfmt_misc, true, but that doesn't allow different Java classes to use different Java VMs, so I don't count it). Someone didn't grow up on Unix?
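(For completeness, the binfmt_misc route I'm dismissing looks roughly like this; a sketch following the pattern in the kernel documentation, where javawrapper is a helper script you'd have to provide. Note that it hard-codes a single VM for all class files, which is exactly my complaint:)

    # register an executable format for Java class files,
    # keyed on their magic bytes 0xCAFEBABE (run as root)
    echo ':CLASS:M::\xca\xfe\xba\xbe::/usr/local/bin/javawrapper:' \
        > /proc/sys/fs/binfmt_misc/register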

Comparative analysis

Since there are almost 4 orders of magnitude difference between the things tested here, a graph of cycles or instructions is not really useful. However, cycles/instruction, branch percentage and mispredicted-branch percentage can be. First, cycles/instruction:

Cycles/instruction

PyPy jumps out of the graph here, with the top value of over 2 cycles/instruction. LuaJIT is also higher than non-JIT Lua, so maybe there's something to this (mostly joking, two data points don't make a series). On the other hand, Python wins as best cycles/instruction (0.91). Lots of instruction-level parallelism, to get below 1?

Java, irrespective of VM, sits consistently near 1.0-1.1. C++ gets very different numbers for static linking (1.666) and dynamic linking (1.045), whereas C has basically identical numbers for both. mksh also differs between its dynamic and static variants. Hmm…

Ruby, TCL and Python have consistent values across versions.

And that's about what I can see from that graph. Next up, the percentage of branches out of the total instructions and the percentage of branches mispredicted:

Branch statistics

Note that the two lines shouldn't really be on the same graph: for the branch %, 100% is the total instruction count, but for the branch miss %, 100% is the total branch count. Anyway.
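To spell out the arithmetic behind the two graphs, and how the raw counters are gathered (a sketch; these are the standard perf event names):

    # one run collects all four counters behind both graphs
    perf stat -e cycles,instructions,branches,branch-misses ./null
    # branch %      = branches      / instructions * 100
    # branch miss % = branch-misses / branches     * 100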

There are two low-value outliers:

  • dynamically-linked C++ has a low branch percentage (17.46%) and a very low branch miss percentage (only 4.32%)
  • gcj-compiled Java has a very low branch miss percentage (only 2.82%!!!), even though it has a “regular” branch percentage (20.85%)

So it seems the gcj libraries are well optimised? I'm not familiar enough with this topic, but on the graph it does indeed stand out.

At the other end, mksh-static has a high branch miss percentage: 11.60%, clearly ahead of all the others. This might be why it has a high cycles/instruction count, due to all the stalls from misprediction; one has to wonder what confuses the branch predictor.

I find it interesting that the overall branch percentage is very similar across languages, both when most of the cost is in the kernel (e.g. asm) and when the user-space cost heavily outweighs the kernel (e.g. Java). The average is 20.85%, the minimum 17.46%, the maximum 22.93%; the standard deviation (if I used gnumeric correctly) is just 0.01, i.e. about one percentage point. This seems a bit suspicious to me ☺. The mispredicted-branch percentage, on the other hand, varies much more: from a measly 2.82% to 11.60%, a 4× difference.

Summary

So to recap, counting just instructions:

  • going from dietlibc to glibc: 2× increase
  • going from statically-linked libc to dynamically-linked libc: doubles it again
  • going from C to C++ (both dynamically linked): 5× increase
  • C++ to Perl: 1.3×
  • Perl to Ruby 1.8: 3×
  • Ruby 1.8 to Python 2.7 (-S): 1.6×
  • Python -S to regular Python: 5×
  • Python to Java: 1×-2×, depending on version/runtime
  • the branch percentage (per total instructions) is quite consistent across all of the programs

Overall, you get roughly three orders of magnitude slower startup between a plain C program using dietlibc and Python. And all, to do basically nothing.

On the other hand, I learned some interesting things while doing it, so it wasn't quite for nothing ☺.

I think you've forgotten a zero in your second table, as bash has only 174K instructions, while the text says that it has twice the cycles of dash (which would make 1,740K quite accurate).
Comment by Anonymous late Saturday evening, February 11th, 2012

Please include lua in your table :)
I think it would fall snugly between dash and bash among the high-scorers, and so surely deserves a mention.

Comment by Anonymous Saturday night, February 11th, 2012
mksh would also fall between dash and bash I think; this is interesting as it suggests that dash still has the faster startup time, but mksh provides a fully interactive shell.
Comment by Anonymous Saturday night, February 11th, 2012

Hi, that's really interesting.

Care to share the source code of all the tiny programs you created to measure this? (Yeah I know, not much work to recreate, but still...)

Oh, and I'd really love to know numbers for

php -n -r 'exit(0);'

as well :)

Comment by Anonymous Saturday night, February 11th, 2012

Oh, and why is perf such a pain to install?

$ perf
bash: perf: command not found
$ aptitude search perf
garbage
google for perf
$ aptitude install linux-base
$ aptitue installl linux-tools-2.6.32
$ perf
Error: perfcounter syscall returned with -1 (Function not implemented)
Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

I'm giving up on this machine :|

Comment by Anonymous Saturday night, February 11th, 2012

Found a new winner! Or is this cheating?

$ env - perf stat /bin/mksh-static -c ''
...
    712996 cycles
    351826 instructions

This beat dash by almost 2x on my system.

Comment by Anonymous Saturday night, February 11th, 2012

Hi all,

I didn't expect so many comments :)

I'll try to add lua, php and mksh. As to bash, yes, the number is missing a zero indeed, good catch.

Comment by Iustin Pop late Saturday night, February 12th, 2012

… I was going to suggest mksh (as I suspect mksh-static beats all others on startup time, and mksh beats all but ksh93 and is about the same as dash on regular script execution time), but people beat me to it while I was busy taking over the world^WDebian even more… ;-)

Funny. But yes, please do include more languages and variants (think tcl8.4 tcl8.5 jimtcl, things like that).

//mirabilos

Comment by Anonymous terribly early Sunday morning, February 12th, 2012

For what it’s worth, /bin/sh on my Debian/m68k VMs is mksh-static (and, except for the problem Wouter found, which was fixed in 40.2-2 during DebConf, works unproblematically).

Sorry for posting twice. It’s terribly early Sunday morning.

//mirabilos

Comment by Anonymous terribly early Sunday morning, February 12th, 2012
Just pushed a new version; I hope I addressed the comments. And note: mksh-static versus dash is not a fair comparison :)
Comment by Iustin Pop late Sunday afternoon, February 12th, 2012

Ah, but who cares about fair? There is no dash-static (but bash-static seems to be packaged, although separately). And mksh can do a lot more ;-)

FWIW: I’m working with klibc upstream to get mksh-static linked against that. The numbers are amazing, on i386 the size is about 130K static, 125K dynamic (dietlibc is 185K, and that is compiled with -DMKSH_SMALL). For comparison, mksh-static on kfreebsd-i386 is 700K thanks to glibc. Still faster startup, though.

Oh: dietlibc cannot do dynamic right now. Sorry about that. While it may be possible on i386, pkg-dietlibc decided against it, to get it consistent across architectures; it’s broken on most.

The C++ case is intriguing. I assume if you add echoing “Hello, World!\n” in two variants – the language's fastest (things like using write(2) in C) or most natural (fprintf in C, iostreams in C++, etc.) – you'd also get interesting numbers.

Having something to do instead of just exiting would pull in more code (for compiled languages) and initialise at least the I/O and possibly the variable and function-call codepaths, which you'd practically always have. (mksh does initialise almost everything always; while there's not much to do, this saves on conditional execution later.)

I knew Python was slow, but that slow? õÕ

gcj surprises me. Native code is indeed the equivalent of “gcc -o foo foo.c”, but it pulls in parts of the Java™ standard library. And the compiler, since the language can do reflection (but shouldn’t it only pull that in when that feature is actually used?).

Vertically shaded lines would have been nice in the graphs. It’s hard to follow, especially as the writing is rotated and small.

The branch thing about mksh-static may be caused by -DMKSH_SMALL, which not only asks the compiler for smaller code via -Os (diet -Os gcc), but also disables inlining and replaces several macros with function calls. Hrm. That may very well not be good with today's compilers, whose -Os combined with superior optimisation passes can shave off more. I think I will take this lesson and split -DMKSH_SMALL into two (functionality and code structure) and disable the code-structure one on e.g. Debian/gcc. Let me express my thanks for doing those graphs. (As you can see, mksh code normally performs well better than dash and GNU bash. Or zsh… no idea how ksh93 would fare; I had a look at its source code once and understood absolutely zero.)

“I find it interesting that the overall branch count is very similar across languages” – mostly because kernel and userspace use a mostly homogeneous compiler (or at least compiler technology). If you were to add various compilers into the mix (Debian/i386 has several gcc versions, Fabrice Bellard's tcc, TenDRA (not Multiarch-ready, sadly), llvm+Clang, and used to have llvm-gcc I think; there's SUNWcc for Linux-i386 and the commercial Intel compiler (both would probably not find their headers on a system with multiarch'd includes, so I suggest using a lenny chroot for them), and pcc, which also needs multiarch patches)… but then you'd have to compile everything yourself, multiple times. I don't ask that, just offering a possible explanation.

//mirabilos

Comment by Anonymous late Sunday evening, February 12th, 2012

Hi! Seems you had plenty more fun with this today. Really great work; despite the simple nature of the test, this was a great introduction to the perf tool, and the results will surely lead to many further discussions (already the case of dash vs. mksh, increasing bloat of scripting languages, the awesomeness of Lua...).

Rather than make this project any more complex than it already is, what about contributing to the Computer Language Benchmark Game: http://shootout.alioth.debian.org/

That also compares based on code size, memory usage and CPU load. Maybe it could be extended to use the 'perf stat' output. And these new, previously-unthought-of test cases of 'do nothing' and 'hello world' perhaps could be added as new benchmarks (how did anyone miss these?)

Comment by Anonymous in the wee hours of Sunday night, February 13th, 2012

Hi, nice article, but I wonder: do you use Excel for your graphs? Can't be a tech article ..

Keep tryin'!

Best regards

Comment by Anonymous early Monday morning, February 13th, 2012
According to the perf wiki, events are measured at both kernel and user levels.
Comment by Anonymous early Monday morning, February 13th, 2012
The reason I did the graphs in gnumeric (not Excel) was simply that in gnuplot I don't know how to nicely use an x axis consisting only of labels (taken from a data file) instead of numbers. Simple as that :)
Comment by Iustin Pop late Monday evening, February 13th, 2012

Mirabilos, thanks a lot for your insightful comment. If my weekend fun resulted in an optimisation in mksh (build or code or…), then it was definitely worth the time.

I agree with the explanation for the branch %, but I still find it a bit surprising. Maybe even for "big" programs the contribution of the kernel is significant, hence it skews the result? I only now learned about the ':u' event specification for perf; I'll probably spend a bit of time re-doing the stats to see how that changes things.
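(For reference, the ':u' modifier restricts a counter to user space; a sketch:)

    # count user-space-only events, excluding the kernel's share
    perf stat -e cycles:u,instructions:u,branches:u,branch-misses:u ./null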

Finally, sorry for the bad graphs. I'll know better next time!

Comment by Iustin Pop late Monday evening, February 13th, 2012

I'm guessing the reason Haskell manages a better instruction count than dynamic C++ is that Haskell modules (in this case the runtime) are statically linked, which could allow better code optimisation. Sure, this is pure speculation, but it might be worth investigating. Also, the Haskell single- vs. multi-threaded data seems pretty neat.

Also, I think it would be interesting (I'd say mandatory) to include the architecture/hardware specs on which the tests were done. It's relevant, since the cycles/instruction ratio depends greatly on the cache configuration and the number of CPUs, not to mention the way cache maintenance is done in the OS. It would also be a good pretext to see how well a given toolchain (in this case gcc) performs on a given architecture. And since you're probably running the tests on a non-realtime OS, the results (in this case the number of cycles) vary with the number of programs running in the background and such. That's probably the main reason an empty assembly program runs for such a big number of cycles in the first place: running the scheduler alone might lead to that overhead.

Comment by Lucian Monday night, February 13th, 2012

hi,

thanks for the nice article. From now on I'm going to use Lua if I need a scripting language to do nothing! ;)

greetings, ben

Comment by Anonymous in the wee hours of Monday night, February 14th, 2012

Haskell can profile itself somewhat using +RTS -sstderr. Running the null program that way, it says it allocated 47,496 bytes on the heap, copied 1,376 bytes during GC, and spent 4.7% of its time in GC. You seemed to think having a GC would be a significant contributor, but based on this, I don't think it is for Haskell.

ltrace is another amusing thing to look at with null programs. If you ltrace perl, you'll find it doing inane things like strlen("1") repeatedly and some other really weird stuff:

strncmp("DISPLAY=:0.0", "NoNe  SuCh", 10)        = -1

When I look at python's repeated strlen and memcpy of apparently every symbol in its standard library, I want to cry.

Also, BTW, there was a talk at LCA2012 called "Bloat How and Why UNIX Grew Up and Out" that's on-topic.

Comment by joey terribly early Tuesday morning, February 14th, 2012

Hi Joey - yes, I'm familiar with the Haskell RTS's own stats. My point was more along the lines of: initialising the GC structures and loading the GC code is more work than not having to do it at all; not that the GC itself does lots of work for a null program. So I was trying to look at the startup cost of having a GC (and then doing nothing).

I didn't think to look at why Python is so slow, thanks for the ltrace thing. I'm just scared now at the thought of having to migrate our project to Python 3…

Thanks for the LCA info, I'll try to look that up.

Comment by Iustin Pop Tuesday evening, February 14th, 2012

Lucian - ack on the static compilation of Haskell modules. I tried to build the program dynamically as well (ghc -dynamic), but it seems that ghc 7.0.4 in unstable doesn't ship base in dynamic form, so I wasn't able to.

I wanted to include the HW and SW versions in more detail, but I forgot. I'll probably move this to a separate page (to not spam planet.debian.org every time I update it), but in the meantime:

  • cpu: AMD Phenom(tm) II X4 940 Processor (only 4 HW perf counters, that's why I gathered only 4 metrics)
  • lots of free memory during tests, so there shouldn't have been any memory pressure
  • all tests were limited to CPU 0 (for repeatability)

Hope this helps!

Comment by Iustin Pop Tuesday evening, February 14th, 2012

Hello,

Could you try JavaScript too? For example run a script with node.js. Also it would be interesting to know in numbers how much CoffeeScript adds to JavaScript and node.js.

Comment by Anonymous late Friday evening, February 24th, 2012