source/blog/posts/2010/07/13-performance-wtf.rst

   1 Title: Performance WTF
   2 Tags: en, d, dgc, performance, wtf, gc, voronoi, make, memory layout
   3
   4 How do I start describing this problem? Let's try to do it in chronological
   5 order...
   6
   7
   8 Introduction
   9 ------------
  10
  11 I've collected a bunch of little programs to use as a benchmark suite for the
  12 `garbage collector`__ for `my thesis`__. I was running only a few manually each
  13 time I've made a change to the GC to see how things were going (I didn't want to
  14 make changes that degrade the performance). A little tired of this (and missing
  15 the point of having several tests using just a few), I've decided to build
  16 a Makefile__ to compile the programs, run the tests and generate some graphs
  17 with the timings to compare the performance against the current D__ GC__
  18 (Tango__ really).
  19
  20 __ http://www.llucax.com.ar/proj/dgc/index.html
  21 __ http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29
  22 __ http://en.wikipedia.org/wiki/Makefile
  23 __ http://digitalmars.com/d/index.html
  24 __ http://digitalmars.com/d/1.0/garbage.html
  25 __ http://www.dsource.org/projects/tango
  26
  27
  28 The Problem
  29 -----------
  30
  31 When done, I noticed a particular test that was notably slower in my
  32 implementation (it went from ~3 seconds to ~5 seconds). Here is the result (see
  33 the voronoi__ test, if you can read the labels, there is some overlapping
  34 because my effort to improve the graph was truncated by this issue :).
  35
  36 __ http://codepad.org/xGDCS3KO
  37
  38 .. image:: ##POST_URL##/13-time.png
  39    :target: ##POST_URL##/13-time.svg
  40    :align: center
  41    :width: 512
  42    :height: 384
  43
  44 But I didn't recall it being that way when running the test manually. So I ran
  45 the test manually again, and it took ~3 seconds, not ~5. So I started to dig
  46 where the difference came from. You'll be surprised by my findings, **the
  47 difference came from executing the tests inside the Makefile!**
  48
  49 Yes, take a look at this (please note that I've removed all output from the
  50 ``voronoi`` program, the only change I've made)::
  51
  52    $ /usr/bin/time -f%e ./voronoi -n 30000
  53    3.10
  54    $ echo 'all:' > Makefile
  55    $ echo -e '\t$C' >> Makefile
  56    $ make C="/usr/bin/time -f%e ./voronoi -n 30000"
  57    /usr/bin/time -f%e ./voronoi -n 30000
  58    5.11
  59    $
  60
  61 This is not just one isolated run, I've tried hundreds of runs and the results
  62 are reproducible and stable.
  63
  64
  65 Further Investigation
  66 ---------------------
  67
  68 I don't remember exactly how I started, but early enough, noticing that the
  69 Tango's basic__ GC didn't suffered from that problem, and being my GC based on
  70 that one, I bisected__ my repository to see what was introducing such behaviour.
  71 The offending patch was `removing the difference between committed and
  72 uncommitted pages in pools`__. I can see that this patch could do more harm
  73 than good now (I didn't tried the benchmark when I did that change I think),
  74 because more pages are looped when working with pools, but I can't see how this
  75 would affect only the program when it's executed by Make!!!
  76
  77 __ http://www.dsource.org/projects/tango/browser/tags/releases/0.99.9/tango/core/rt/gc/basic
  78 __ http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
  79 __ http://pastebin.com/1y8si1sc
  80
  81 I had a patch that made thing really nasty but not a clue **why** they were
  82 nasty. I've tried **everything**. First, the obvious: use nice__ and ionice__
  83 (just in case) to see if I was just being unlucky with the system load (very
  84 unlikely since I did hundreds of runs in different moments, but still). No
  85 change.
  86
  87 __ http://opengroup.org/onlinepubs/007908775/xcu/nice.html
  88 __ http://linux.die.net/man/1/ionice
  89
  90 I've tried running it on another box. Mine is a Quad-Core, so I've tried the
  91 Dual-Core from work and I had the same problem, only the timing difference were
  92 a little smaller (about ~4.4 seconds), so I thought it might be something to do
  93 to with the multi-cores, so I've tried it in a single core, but the problem was
  94 the same (~10.5 seconds inside make, ~7 outside). I've tried with taskset__ in
  95 the multi-core boxes too. I've tried putting all the CPUs with the performance
  96 governor using cpufreq-set__ too, but didn't help.
  97
  98 __ http://linux.die.net/man/1/taskset
  99 __ http://www.kernel.org/pub/linux/utils/kernel/cpufreq/cpufreq-set.html
 100
 101 Since I'm using DMD__, which works only in 32 bits `for now`__, and since my
 102 box, and the box at work are both 64 bits, I suspected from that too, but the
 103 old AMD is 32 bits and I see the problem there too.
 104
 105 __ http://www.dsource.org/projects/dmd
 106 __ http://www.llucax.com.ar/blog/blog/post/-41d8f86f
 107
 108 I've tried valgrind__ + callgrind__ + kcachegrind__ but it seems like valgrind
 109 emulation is not affected by whatever difference is when the program is ran
 110 inside make because the results for the run inside__ and outside__ make were
 111 almost identical.
 112
 113 __ http://valgrind.org/
 114 __ http://valgrind.org/info/tools.html#callgrind
 115 __ http://kcachegrind.sourceforge.net/cgi-bin/show.cgi/KcacheGrindIndex
 116 __ ##POST_URL##/13-callgrind.out.cdgc-9fac8d7-make
 117 __ ##POST_URL##/13-callgrind.out.cdgc-9fac8d7-nomake
 118
 119 I've tried env__ -i, just in case some weird environment variable was making the
 120 difference, but nothing.
 121
 122 __ http://linux.die.net/man/1/env
 123
 124 I've tried strace__ too, to see if I spotted anything weird, and I saw a couple
 125 of weird things (like the addresses returned by mmap__ being suspiciously very
 126 different), but nothing too concrete (but I think inspecting the strace results
 127 more thoughtfully might be one of the most fertile paths to follow). I took
 128 a look at the timings of the syscalls and there was nothing taking too much
 129 time, most of the time is spent in the programs calculations.
 130
 131 __ http://en.wikipedia.org/wiki/Strace
 132 __ http://www.opengroup.org/onlinepubs/000095399/functions/mmap.html
 133
 134
 135 So I'm really lost here. I still have no idea where the difference could come
 136 from, and I guess I'll have to run the tests from a separate shell script
 137 instead of directly inside make because of this. I'll ask to the make developers
 138 about this, my only guess is that maybe make is doing some trickery with the
 139 scheduler of something like that for the -j option. And I'll take a look to the
 140 offending patch too, to see if the performance was really degraded and maybe
 141 I'll revert it if it does, no matter what happen with this issue.
 142
 143
 144 If you have any ideas on what could be going on, **anything**, please let me
 145 know (in a comment of via e-mail__). Thanks :)
 146
 147 __ mailto: luca@llucax.com.ar
 148
 149
 150 .. admonition:: Update
 151
 152    I've post__\ ed this to the `Make mailing list`__, but unfortunately didn't
 153    got any useful answer. Thanks anyway to all the people that replied with nice
 154    suggestions!
 155
 156 __ http://thread.gmane.org/gmane.comp.gnu.make.general/8974
 157 __ http://lists.gnu.org/mailman/listinfo/help-make
 158
 159
 160 .. admonition:: Update
 161
 162    Thanks Alb for the investigation, that was a 1/4kg of ice-cream well earned
 163    =P
 164
 165    A couple of notes about his findings. An easy way to trigger this behaviour
 166    is using the command setarch__, the option ``-L`` changes the memory layout
 167    to ``ADDR_COMPAT_LAYOUT``, see the `commit that introduced the new layout`__
 168    for more details.
 169
 170    The call to ``setrlimit(RLIMIT_STACK,  RLIM_INFINITY)`` by Make (which has
 171    a reason__) triggers that behaviour too because the new layout can't have an
 172    *unlimited* stack, so using ulimit__ (``ulimit -s unlimited``) causes the
 173    same behaviour.
 174
 175    The same way, if you type ulimit -s 8192 ./voronoi as a command in
 176    a Makefile, the effect is reverted and the command behaves as outside the
 177    Makefile.
 178
 179    Part of the mystery is solved, but a question remains: why the test is so
 180    address-space-layout dependant? It smells like a GC bug (present in the basic
 181    GC too, as other tests I've done show the same odd behaviour, less visibly,
 182    but still, probably because of the removal of the distinction between
 183    committed and uncommitted memory patch).
 184
 185 __ http://linux.die.net/man/8/setarch
 186 __ http://lwn.net/Articles/90311/
 187 __ http://thread.gmane.org/gmane.comp.gnu.make.bugs/4730
 188 __ http://www.opengroup.org/onlinepubs/000095399/utilities/ulimit.html
 189
 190
 191 .. admonition:: Update
 192
 193    Last update, I promise! =)
 194
 195    I think I know what is adding the extra variance when the memory layout is
 196    randomized: **false pointers**.
 197
 198    Since the GC is conservative__, data is usually misinterpreted as pointers.
 199    It seems that are address spaces that makes much more likely that simple data
 200    is misinterpreted as a valid pointer, at least for the voronoi test. This is
 201    consistent with other tests. Tests with random data notably increases their
 202    variance among runs and are pretty stable when the memory layout is not
 203    randomized.
 204
 205    I'll try to give the patch to `integrate precise heap scanning`__ a try, and
 206    see if it improves things.
 207
 208    What remains a mystery is what happened with the committed memory
 209    distinction, now I can't reproduce the results. I made so many measures and
 210    changes, that maybe I just got lost in a bad measure (for example, with the
 211    CPU using the *ondemand* governor). I've tried again the tests with and
 212    without that change and the results are pretty the same (a little better for
 213    the case with the distinction, but a really tiny difference indeed).
 214
 215    Well, that's all for now, I'll give this post a rest =)
 216
 217 __ http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29#Precise_vs._conservative_and_internal_pointers
 218 __ http://d.puremagic.com/issues/show_bug.cgi?id=3463
 219
 220
 221 .. admonition:: Update
 222
 223    Don't believe me, ever! =P
 224
 225    I just wanted to say that's is confirmed, the high variance in the timings
 226    when heap randomization is used is because of *false pointers*. See this
 227    comment__ for more details.
 228
 229 __ http://d.puremagic.com/issues/show_bug.cgi?id=3463#c45
 230
 231
 232 .. vim: set et sw=3 sts=3 :