Tags: en, d, dgc, performance, wtf, gc, voronoi, make, memory layout

How do I start describing this problem? Let's try to do it in chronological
order.

I've collected a bunch of little programs to use as a benchmark suite for the
`garbage collector`__ for `my thesis`__. I was running only a few of them
manually each time I made a change to the GC, to see how things were going
(I didn't want to make changes that degraded the performance). A little tired
of this (and missing the point of having several tests by using just a few),
I decided to build a Makefile__ to compile the programs, run the tests and
generate some graphs with the timings, to compare the performance against the
current D__ GC__ implementation in Tango__.

__ http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29
__ /proj/dgc/index.html
__ http://en.wikipedia.org/wiki/Makefile
__ http://digitalmars.com/d/index.html
__ http://digitalmars.com/d/1.0/garbage.html
__ http://www.dsource.org/projects/tango

When it was done, I noticed that one particular test was notably slower with my
implementation (it went from ~3 seconds to ~5 seconds). Here is the result (see
the voronoi__ test; if you can read the labels, there is some overlapping,
because my effort to improve the graph was cut short by this very issue :).

__ http://codepad.org/xGDCS3KO

.. image:: ##POST_URL##/13-time.png
   :target: ##POST_URL##/13-time.svg

But I didn't recall it being that way when running the test manually. So I ran
the test manually again, and it took ~3 seconds, not ~5. So I started to dig
into where the difference came from. You'll be surprised by my findings: **the
difference came from executing the tests inside the Makefile!**

Yes, take a look at this (please note that removing all output from the
``voronoi`` program is the only change I've made)::

   $ /usr/bin/time -f%e ./voronoi -n 30000
   $ echo 'all:' > Makefile
   $ echo -e '\t$C' >> Makefile
   $ make C="/usr/bin/time -f%e ./voronoi -n 30000"
   /usr/bin/time -f%e ./voronoi -n 30000
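
For reference, those two ``echo`` commands produce this minimal Makefile (the
recipe line starts with a tab, and ``$C`` expands to whatever command is passed
on the ``make`` command line)::

   all:
   	$C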

This is not just one isolated run; I've done hundreds of runs and the results
are reproducible and stable.

I don't remember exactly how I started, but early on I noticed that Tango's
basic__ GC didn't suffer from this problem, and since my GC is based on that
one, I bisected__ my repository to see what was introducing such behaviour.
The offending patch was `removing the difference between committed and
uncommitted pages in pools`__. I can see now that this patch could do more harm
than good (I don't think I tried the benchmark when I made that change),
because more pages are looped over when working with pools, but I can't see how
this would affect the program only when it's executed by Make!!!

__ http://www.dsource.org/projects/tango/browser/tags/releases/0.99.9/tango/core/rt/gc/basic
__ http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html
__ http://pastebin.com/1y8si1sc
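
Just to illustrate how the bisection goes, it was something along these lines
(the revision names and the ``slow.sh`` helper script are hypothetical, not the
exact ones I used)::

   # hypothetical sketch of the bisection
   git bisect start
   git bisect bad HEAD        # current revision: ~5 seconds inside make
   git bisect good basic-gc   # revision still matching Tango's basic GC: ~3s
   # slow.sh rebuilds voronoi, times it inside make and exits non-zero
   # when the run takes noticeably longer than the ~3s baseline
   git bisect run ./slow.sh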

I had a patch that made things really nasty, but not a clue about **why** they
were nasty. I've tried **everything**. First, the obvious: using nice__ and
ionice__ (just in case) to see if I was just being unlucky with the system load
(very unlikely, since I did hundreds of runs at different moments, but still).
No luck.

__ http://opengroup.org/onlinepubs/007908775/xcu/nice.html
__ http://linux.die.net/man/1/ionice
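
Those runs looked roughly like this (both priority changes need root, hence
the ``sudo``)::

   # max CPU priority and real-time I/O class, to rule out system load
   sudo nice -n -20 ionice -c1 /usr/bin/time -f%e ./voronoi -n 30000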

I've tried running it on another box. Mine is a quad-core, so I tried the
dual-core at work, and I had the same problem there; only the timing difference
was a little smaller (about ~4.4 seconds). So I thought it might have something
to do with multiple cores, and tried it on a single-core box, but the problem
was the same (~10.5 seconds inside Make, ~7 outside). I've tried taskset__ on
the multi-core boxes too. I've also tried putting all the CPUs on the
*performance* governor using cpufreq-set__, but it didn't help.

__ http://linux.die.net/man/1/taskset
__ http://www.kernel.org/pub/linux/utils/kernel/cpufreq/cpufreq-set.html
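
The pinning and governor changes were along these lines (assuming 4 CPUs,
numbered 0 to 3)::

   # pin the benchmark to a single CPU
   taskset -c 0 /usr/bin/time -f%e ./voronoi -n 30000
   # force every CPU to run at full speed
   for c in 0 1 2 3; do sudo cpufreq-set -c $c -g performance; done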

Since I'm using DMD__, which produces only 32-bit binaries `for now`__, and
since both my box and the box at work are 64 bits, I suspected that too, but
the old AMD box is 32 bits and I saw the problem there as well.

__ http://www.dsource.org/projects/dmd
__ /blog/blog/post/-41d8f86f

I've tried valgrind__ + callgrind__ + kcachegrind__, but it seems that
valgrind's emulation is not affected by whatever is different when the program
is run inside Make, because the results for the runs inside__ and outside__
Make were virtually identical.

__ http://valgrind.org/
__ http://valgrind.org/info/tools.html#callgrind
__ http://kcachegrind.sourceforge.net/cgi-bin/show.cgi/KcacheGrindIndex
__ ##POST_URL##/13-callgrind.out.cdgc-9fac8d7-make
__ ##POST_URL##/13-callgrind.out.cdgc-9fac8d7-nomake
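
The profiling runs were essentially these (``--callgrind-out-file`` just gives
the two profiles predictable names)::

   # profile outside and inside make, then inspect both in kcachegrind
   valgrind --tool=callgrind --callgrind-out-file=cg.nomake ./voronoi -n 30000
   make C="valgrind --tool=callgrind --callgrind-out-file=cg.make ./voronoi -n 30000"
   kcachegrind cg.nomake   # and cg.make, in a second instance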

I've tried env__ ``-i``, just in case some weird environment variable was
making the difference, but nothing.

__ http://linux.die.net/man/1/env
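
That test was simply::

   # run with a completely empty environment, outside and inside make
   env -i /usr/bin/time -f%e ./voronoi -n 30000
   env -i make C="/usr/bin/time -f%e ./voronoi -n 30000"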

I've tried strace__ too, to see if I could spot anything weird, and I saw
a couple of weird things (like the addresses returned by mmap__ being
suspiciously different between the two runs), but nothing too concrete (though
I think inspecting the strace results more thoroughly might be one of the most
fertile paths to follow). I took a look at the timings of the syscalls and
there was nothing taking too much time; most of the time is spent in the
program's calculations.

__ http://en.wikipedia.org/wiki/Strace
__ http://www.opengroup.org/onlinepubs/000095399/functions/mmap.html
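
The strace runs were something like this (``-c`` prints the per-syscall time
summary I mention above)::

   # full trace with timestamps and per-call duration, following children
   strace -f -tt -T -o voronoi.trace ./voronoi -n 30000
   # aggregated time spent in each syscall
   strace -c ./voronoi -n 30000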

So I'm really lost here. I still have no idea where the difference could come
from, and I guess I'll have to run the tests from a separate shell script
instead of directly inside Make because of this. I'll ask the Make developers
about it; my only guess is that maybe Make is doing some trickery with the
scheduler, or something like that, for the ``-j`` option. And I'll take a look
at the offending patch too, to see if the performance was really degraded, and
maybe revert it if it was, no matter what happens with this issue.

If you have any ideas about what could be going on, **anything**, please let me
know (in a comment or via e-mail__). Thanks :)

__ mailto:luca@llucax.com

.. admonition:: Update

   I've post__\ ed this to the `Make mailing list`__, but unfortunately didn't
   get any useful answer. Thanks anyway to all the people that replied with
   nice suggestions.

   __ http://thread.gmane.org/gmane.comp.gnu.make.general/8974
   __ http://lists.gnu.org/mailman/listinfo/help-make

.. admonition:: Update

   Thanks Alb for the investigation; that was 1/4 kg of ice-cream well earned.

   A couple of notes about his findings. An easy way to trigger this behaviour
   is using the setarch__ command: its ``-L`` option changes the memory layout
   to ``ADDR_COMPAT_LAYOUT`` (see the `commit that introduced the new
   layout`__).
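
   For example (assuming the 32-bit ``voronoi`` binary), the slow behaviour
   can be reproduced outside Make with::

      # force the legacy memory layout (ADDR_COMPAT_LAYOUT)
      setarch i386 -L /usr/bin/time -f%e ./voronoi -n 30000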

   The call to ``setrlimit(RLIMIT_STACK, RLIM_INFINITY)`` made by Make (which
   has a reason__) triggers that behaviour too, because the new layout can't
   have an *unlimited* stack, so using ulimit__ (``ulimit -s unlimited``)
   causes the same effect.
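
   So the effect can be triggered from a plain shell too, no Make involved::

      # an unlimited stack forces the compat layout, reproducing the slow runs
      bash -c 'ulimit -s unlimited; /usr/bin/time -f%e ./voronoi -n 30000'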

   In the same way, if you use ``ulimit -s 8192; ./voronoi`` as a command in
   a Makefile, the effect is reverted and the command behaves as it does
   outside the Makefile.
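
   In Makefile form (the recipe line starts with a tab)::

      # restoring a bounded stack inside the rule reverts Make's setrlimit
      all:
      	ulimit -s 8192; ./voronoi -n 30000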

   Part of the mystery is solved, but one question remains: why is the test so
   address-space-layout dependent? It smells like a GC bug (present in the
   basic GC too, since other tests I've done show the same odd behaviour, less
   visibly but still there; it's probably more visible in my GC because of the
   patch removing the distinction between committed and uncommitted memory).

   __ http://linux.die.net/man/8/setarch
   __ http://lwn.net/Articles/90311/
   __ http://thread.gmane.org/gmane.comp.gnu.make.bugs/4730
   __ http://www.opengroup.org/onlinepubs/000095399/utilities/ulimit.html

.. admonition:: Update

   Last update, I promise! =)

   I think I know what is adding the extra variance when the memory layout is
   randomized: **false pointers**.

   Since the GC is conservative__, plain data is often misinterpreted as
   pointers. It seems there are address-space layouts that make it much more
   likely for simple data to be misinterpreted as a valid pointer, at least
   for the voronoi test. This is consistent with other tests: tests with
   random data notably increase their variance between runs, and are pretty
   stable when the memory layout is not randomized.
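
   A simple way to see this effect is to compare the timing spread with and
   without address-space randomization (``setarch -R`` sets
   ``ADDR_NO_RANDOMIZE``)::

      # with randomization (the default): timings vary a lot between runs
      for i in $(seq 10); do /usr/bin/time -f%e ./voronoi -n 30000; done
      # without randomization: timings are pretty stable
      for i in $(seq 10); do
         setarch i386 -R /usr/bin/time -f%e ./voronoi -n 30000
      done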

   I'll give the patch to `integrate precise heap scanning`__ a try, and see
   if it improves things.

   What remains a mystery is what happened with the committed memory
   distinction; now I can't reproduce the results. I made so many measurements
   and changes that maybe I just got lost in a bad measurement (for example,
   one taken with the CPU using the *ondemand* governor). I've run the tests
   again with and without that change, and the results are pretty much the
   same (a little better with the distinction, but a really tiny difference
   indeed).

   Well, that's all for now, I'll give this post a rest =)

   __ http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29#Precise_vs._conservative_and_internal_pointers
   __ http://d.puremagic.com/issues/show_bug.cgi?id=3463

.. admonition:: Update

   Don't believe me, ever! =P

   I just wanted to say that it's confirmed: the high variance in the timings
   when heap randomization is used is because of *false pointers*. See this
   comment__ for more details.

   __ http://d.puremagic.com/issues/show_bug.cgi?id=3463#c45

.. vim: set et sw=3 sts=3 :