2 Tags: en, d, dgc, naive, gc, debug, gdb, dil, asm, statistics, benchmark, dgcbench
4 .. warning:: Long post ahead =)
6 As I said before, `debug is hell`__ in D_, at least if you're using
7 a compiler that doesn't write proper debug information and you're writing
8 a garbage collector. But you have to do it when things go wrong. And things
11 __ /blog/blog/post/-288428a1
12 .. _D: http://www.digitalmars.com/d/
14 This is a small chronicle about how I managed to debug a weird problem =)
16 I had my `Naive GC`_ working and getting `good stats`__ with some small
17 micro-benchmarks_, so I said *let's benchmark something real*. There is almost
18 no real D_ applications out there, suitable for an automated GC benchmark at
19 least [#benchapp]_. Dil_ looked like a good candidate so I said *let's use Dil
20 in the benchmark suite!*.
22 .. _`Naive GC`: https://git.llucax.com/w/software/dgc/naive.git
23 __ /blog/blog/post/37e6cb2a
24 .. _micro-benchmarks: https://git.llucax.com/w/software/dgc/dgcbench.git
25 .. _Dil: http://code.google.com/p/dil
27 And I did. But Dil_ didn't work as I expected. Even when running it without
28 arguments, in which case a nice help message like this should be displayed::
31 Copyright (c) 2007-2008 by Aziz Köksal. Licensed under the GPL3.
45 Type 'dil help <subcommand>' for more help on a particular subcommand.
47 Compiled with Digital Mars D v1.041 on Sat Aug 29 18:04:34 2009.
51 Generate an XML or HTML document from a D source file.
53 dil gen file.d [Options]
56 --syntax : generate tags for the syntax tree
57 --xml : use XML format (default)
58 --html : use HTML format
61 dil gen Parser.d --html --syntax > Parser.html
63 Which it isn't even a valid Dil_ command (it looks like a *dead* string in some
64 ``data/lang_??.d`` files__).
66 __ http://github.com/azizk/dil/blob/e284797bb48769a67f4b3d2d61f3bc0b4fcf5011/data/lang_en.d#L77
68 I ran Valgrind_ on it and detected a suspicious invalid read of size 4 when
69 reading the last byte of a 13 bytes long class instance. I thought maybe the
70 compiler was assuming the GC allocated block with size multiples of the word
71 size, so I made `gc_malloc()`_ `allocate multiples of the word size`__, but
72 nothing happened. Then I thought that maybe the memory blocks should be aligned
73 to a multiple of a word, so I made `gc_malloc()`_ `align the data portion of
74 the cell to a multiple of a word`__, but nothing.
76 .. _Valgrind: http://valgrind.org/
77 .. _`gc_malloc()`: https://git.llucax.com/w/software/dgc/naive.git/blob/2d8639409b4749afd92266347f20b99da80e14c9:/gc/gc.d#l502
78 __ https://git.llucax.com/w/software/dgc/naive.git/commitdiff/39707a88bed116fbda8c590c9b6ff9193f30f58f
79 __ https://git.llucax.com/w/software/dgc/naive.git/commitdiff/39707a88bed116fbda8c590c9b6ff9193f30f58f
81 Since Valgrind_ only detected that problem, which was at the static constructor
82 of the module ``tango.io.Console``, I though it might be a Tango_ bug, so
83 I `report`__\ ed it. But it wasn't Tango_\ 's fault. The invalid read looked
84 like a DMD 1.042 bug; DMD 1.041 didn't have that problem, but my collector
85 still failed to run Dil_\ . So I was back to zero.
87 .. _Tango: http://www.dsource.org/projects/tango
88 __ http://www.dsource.org/projects/tango/ticket/1734
90 I tried the `Tango stub collector`__ and it worked, so I tried mine disabling
91 the collections, and it worked too. So finally I could narrow the problem to
92 the collection phase (which isn't much, but it's something). The first thing
93 I could think it could be wrong in a collection is that cells still in use are
94 swept as if they were unused, so I then disabled the sweep phase only, and it
97 __ http://www.dsource.org/projects/tango/browser/tags/releases/0.99.8/lib/gc/stub/gc.d
99 So, everything pointer to prematurely freed cells. But why my collector was
100 freeing cells prematurely being so, so simple? I reviewed the code a couple of
101 times and couldn't find anything evidently wrong. To confirm my theory and with
102 the hope of getting some extra info, I decided to `write a weird pattern`__ in
103 the swept cells and then check if that pattern was intact when giving them back
104 to the mutator (the `basic GC`__ can do that too if compiled with
105 ``-debug=MEMSTOMP``). That would confirm that the swept memory were still in
108 __ https://git.llucax.com/w/software/dgc/naive.git/commitdiff/bc664a95db5c7bab194f2e64cf96b6f092c5e60c
109 __ http://www.dsource.org/projects/tango/browser/tags/releases/0.99.8/lib/gc/basic
111 The I tried this modified GC with *memory stomp* with my micro-benchmarks_ and
112 they worked just fine, so I started to doubt again that it was my GC's problem.
113 But since those benchmarks didn't use much of the GC API, I thought maybe Dil_
114 was using some strange features of making some assumptions that were only true
115 for the current implementation, so I asked Aziz Köksal (Dil_ creator) and he
116 pointed me to some `portion of code`__ that allocated memory from the C heap,
117 overriding the operators ``new`` and ``delete`` for the ``Token`` struct. There
118 is a bug__ in Dil_ there, because apparently that struct store pointers to the
119 GC heap but it's not registered as a root, so it looks like a good candidate.
121 __ http://github.com/azizk/dil/blob/e284797bb48769a67f4b3d2d61f3bc0b4fcf5011/src/dil/lexer/Token.d#L262
122 __ http://code.google.com/p/dil/issues/detail?id=11
124 So I commented out the overridden ``new`` and ``delete`` operators, so the
125 regular GC-based operators were used. But I still got nothing, the wrong help
126 message were printed again. Then I saw that Dil_ was manually freeing memory
127 using ``delete``. So I decided to make my `gc_free()`_ implementation a ``NOP``
128 to let the GC take over of **all** memory management... And *finally* all
129 [#all]_ worked out fine! =)
131 So, the problem should be either my `gc_free()`_ implementation (which is
132 really simple) or a Dil_ bug.
134 .. _`gc_free()`: https://git.llucax.com/w/software/dgc/naive.git/blob/2d8639409b4749afd92266347f20b99da80e14c9:/gc/gc.d#l661
136 In order to get some extra information on where the problem is, I changed the
137 `Cell.alloc()`_ implementation to use mmap_ to `allocate whole pages`__, one
138 for the cell's header, and one or more for the cell *data*. This way, could
139 easily mprotect_ the cell *data* when the cell was swept (and un\ -mprotect_\
140 ing them when they were give back to the program) in order to make Dil_
141 segfault exactly where the freed memory was used.
143 .. _`Cell.alloc()`: https://git.llucax.com/w/software/dgc/naive.git/blob/346f04b3e1026ded5af3ac95a483b2c3d05a6d60:/gc/cell.d#l86
144 .. _mmap: http://www.kernel.org/doc/man-pages/online/pages/man2/mmap.2.html
145 __ https://git.llucax.com/w/software/dgc/naive.git/commitdiff/cc637a89be9b51945246bd65fa15080464a90aca
146 .. _mprotect: http://www.kernel.org/doc/man-pages/online/pages/man2/mprotect.2.html
148 I ran Dil_ using strace_ and this is what happened::
151 (a) write(1, "Cell.alloc(80)\n", 15) = 15
152 (b) mmap2(NULL, 8192, PROT_READ|PROT_WRITE, ...) = 0xb7a2e000
154 (c) mprotect(0xb7911000, 4096, PROT_NONE) = 0
155 mprotect(0xb7913000, 4096, PROT_NONE) = 0
157 mprotect(0xb7a2b000, 4096, PROT_NONE) = 0
158 mprotect(0xb7a2d000, 4096, PROT_NONE) = 0
159 (d) mprotect(0xb7a2f000, 4096, PROT_NONE) = 0
160 mprotect(0xb7a43000, 4096, PROT_NONE) = 0
161 mprotect(0xb7a3d000, 4096, PROT_NONE) = 0
163 mprotect(0xb7a6b000, 4096, PROT_NONE) = 0
164 (e) mprotect(0xb7a73000, 4096, PROT_NONE) = 0
165 (f) mprotect(0xb7a73000, 4096, PROT_READ|PROT_WRITE) = 0
166 mprotect(0xb7a6b000, 4096, PROT_READ|PROT_WRITE) = 0
168 mprotect(0xb7a3f000, 4096, PROT_READ|PROT_WRITE) = 0
169 (g) mprotect(0xb7a3d000, 4096, PROT_READ|PROT_WRITE) = 0
170 --- SIGSEGV (Segmentation fault) @ 0 (0) ---
171 +++ killed by SIGSEGV (core dumped) +++
173 .. _strace: http://en.wikipedia.org/wiki/Strace
175 **(a)** is a debug print, showing the size of the `gc_malloc()`_ call that got
176 the address ``0xb7a2e000``. The mmap_ **(b)** is ``8192`` bytes in size because
177 I allocate a page for the cell header (for internal GC information) and another
178 separated page for the data (so I can only mprotect_ the data page and keep the
179 header page read/write); that allocation asked for a new fresh couple of pages
180 to the OS (that's why you see a mmap_).
182 From **(c)** to **(e)** you can see a sequence of several mprotect_, that are
183 cells being swept by a collection (protecting the cells against read/write so
184 if the mutator tries to touch them, a ``SIGSEGV`` is on the way).
186 From **(f)** to **(g)** you can see another sequence of mprotect_, this time
187 giving the mutator permission to touch that pages, so that's `gc_malloc()`_
188 recycling the recently swept cells.
190 **(d)** shows the cell allocated in **(a)** being swept. Why the address is not
191 the same (this time is ``0xb7a2f000`` instead of ``0xb7a2e000``)? Because, as
192 you remember, the first page is used for the cell header, so the data should be
193 at ``0xb7a2e000 + 4096``, which is exactly ``0xb7a2f000``, the start of the
194 memory block that the sweep phase (and `gc_free()`_ for that matter) was
197 Finally we see the program getting his nice ``SIGSEGV`` and dumping a nice
198 little core_ for touching what it shouldn't.
200 .. _core: http://en.wikipedia.org/wiki/Core_dump
202 Then I opened the core_ with GDB_ and did something like this [#demangle]_::
204 Program terminated with signal 11, Segmentation fault.
205 (a) #0 0x08079a96 in getDispatchFunction ()
207 (b) $1 = (void (*)()) 0x8079a96 <getDispatchFunction+30>
208 (gdb) disassemble $pc
209 Dump of assembler code for function
211 0x08079a78 <getDispatchFunction+0>: push %ebp
212 0x08079a79 <getDispatchFunction+1>: mov %esp,%ebp
213 0x08079a7b <getDispatchFunction+3>: sub $0x8,%esp
214 0x08079a7e <getDispatchFunction+6>: push %ebx
215 0x08079a7f <getDispatchFunction+7>: push %esi
216 0x08079a80 <getDispatchFunction+8>: mov %eax,-0x4(%ebp)
217 0x08079a83 <getDispatchFunction+11>: mov -0x4(%ebp),%eax
218 0x08079a86 <getDispatchFunction+14>: call 0x80bccb4 <objectInvariant>
219 0x08079a8b <getDispatchFunction+19>: push $0xb9
220 0x08079a90 <getDispatchFunction+24>: mov 0x8(%ebp),%edx
221 0x08079a93 <getDispatchFunction+27>: add $0xa,%edx
222 (c) 0x08079a96 <getDispatchFunction+30>: movzwl (%edx),%ecx
227 .. _GDB: http://www.gnu.org/software/gdb/
229 First, in **(a)**, GDB_ tells where the program received the ``SIGSEGV``. In
230 **(b)** I print the program counter register to get a more readable hint on
231 where the program segfaulted. It was at ``getDispatchFunction+30``, so
232 I disassemble that function to see that the ``SIGSEGV`` was received when doing
233 ``movzwl (%edx),%ecx`` (moving the contents of the ``ECX`` register to the
234 memory pointed to by the address in the register ``EDX``) at **(c)**. In
235 **(d)** I get the value of the ``EDX`` register, and it's ``0xb7a2f000``. Do
236 you remember that value? Is the data address for the cell at ``0xb7a2e000``,
237 the one that was recently swept (and mprotect_\ ed). That's not good for
240 This is the offending method (at `dil/src/ast/Visitor.d`__)::
242 Node function(Visitor, Node) getDispatchFunction()(Node n)
244 return cast(Node function(Visitor, Node))dispatch_vtable[n.kind];
247 __ http://github.com/azizk/dil/blob/e284797bb48769a67f4b3d2d61f3bc0b4fcf5011/src/dil/ast/Visitor.d
249 Since I can't get any useful information from GDB_ (I can't even get a proper
250 backtrace [#backtrace]_) except for the mangled function name (because the
251 wrong debug information produced by DMD), I had to split that function into
252 smaller functions to confirm that the problem was in ``n.kind`` (I guess
253 I could figure that out by eating some more assembly, but I'm not *that* well
254 trained at eating asm yet =). This means that the ``Node`` instance ``n`` is
255 the one prematurely freed.
257 This is particularly weird, because it looks like the node is being swept, not
258 prematurely freed using an explicit ``delete``. So it seems like the GC is
259 missing some roots (or there are non-aligned pointers or weird stuff like
260 that). The fact that this works fine with the Tango_ basic collector is
261 intriguing too. One thing I can come up with to explain why it works in the
262 basic collector is because it makes a lot less collections than the `naive GC`
263 (the latter is really lame =). So maybe the *rootless* object becomes really
264 free before the basic collector has a chance to run a collection and because of
265 that the problem is never detected.
267 I spent over 10 days now investigating this issue (of course this is not near
268 a full-time job for me so I can only dedicate a couple of days a week to this
269 =), and I still can't find a clear cause for this problem, but I'm a little
270 inclined towards a Dil_ bug, so I report__\ ed one =). So we'll see how this
271 evolves; for now I'll just make `gc_free()`_ a NOP to continue my testing...
273 __ http://code.google.com/p/dil/issues/detail?id=12
276 .. [#benchapp] Please let me know if you have any working, real, Tango-based
277 D_ application suitable for GC benchmarks (i.e., using the GC and easily
278 scriptable to run it automatically).
280 .. [#all] *all* being running Dil_ without arguments to get the right help
283 .. [#demangle] I have shortened the name of the functions because they were
284 huge, cryptic, mangled names =). The real name of ``getDispatchFunction`` is
285 ``_D3dil3ast7Visitor7Visitor25__T19getDispatchFunctionZ19getDispatchFunctionMFC3dil3ast4Node4NodeZPFC3dil3ast7Visitor7VisitorC3dil3ast4Node4NodeZC3dil3ast4Node4Node``
286 (is not much better when demangled: ``class dil.ast.Node.Node function(class
287 dil.ast.Visitor.Visitor, class dil.ast.Node.Node)*
288 dil.ast.Visitor.Visitor.getDispatchFunction!().getDispatchFunction(class
289 dil.ast.Node.Node)`` =). The real name of ``objectInvariant`` is
290 ``D9invariant12_d_invariantFC6ObjectZv`` and has no demagled name that
291 I know of, but I guessed is the Object class invariant.
293 .. [#backtrace] Here is what I get from GDB_::
296 #0 0x08079a96 in getDispatchFunction ()
297 #1 0xb78d5000 in ?? ()
298 #2 0xb789d000 in ?? ()
299 Backtrace stopped: previous frame inner to this frame (corrupt stack?)
301 (function name unmangled and shortened for readbility)
304 .. vim: set et sw=4 sts=4 :