1 Title: Recursive vs. iterative marking
2 Tags: en, d, gc, dgc, cdgc, mark, recursive, iterative, performance, benchmark
4 After a `small (but important) step`__ towards making the D__ GC__ truly
5 concurrent__ (which is my main goal), I've been exploring the possibility of
6 making the mark phase recursive instead of iterative (as it currently is).
8 __ https://llucax.com.ar/blog/blog/post/-4c9dd5b5
9 __ http://www.digitalmars.com/d/
10 __ http://en.wikipedia.org/wiki/Garbage_collection_%28computer_science%29
11 __ http://www.memorymanagement.org/glossary/c.html#concurrent.garbage.collection
13 The motivation is that the iterative algorithm makes several passes through the
14 **entire heap** (it doesn't need to do the **full job** on each pass, it
15 processes only the newly reachable nodes found in the previous iteration, but to
16 look for that new reachable node it does have to iterate over the entire heap).
17 The number of passes is the same as the connectivity graph depth, the best case
18 is where all the heap is reachable through the root set, and the worse is when
19 the heap is a single linked list. The recursive algorithm, on the other hand,
20 needs only a single pass but, of course, it has the problem of potentially
21 consuming a lot of stack space (again, the recurse depth is the same as the
22 connectivity graph depth), so it's not paradise either.
24 To see how much of a problem is the recurse depth in reality, first I've
25 implemented a fully recursive algorithm, and I found **it is** a real problem,
26 since I had segmentation faults because the (8MiB by default in Linux) stack
27 overflows. So I've implemented an hybrid approach, setting a (configurable)
28 maximum recurse depth for the marking phase. If the maximum depth is reached,
29 the recursion is stopped and nodes that should be scanned deeply than that are
30 queued to scanned in the next iteration.
32 Here are some results showing how the total run time is affected by the maximum
35 .. image:: ##POST_URL##/29-recursive-dil.png
38 .. image:: ##POST_URL##/29-recursive-voronoi.png
41 The red dot is how the pure iterative algorithm currently performs (it's placed
42 arbitrarily in the plot, as the X-axis doesn't make sense for it).
44 The results are not very conclusive. Even when the hybrid approach performs
45 better for both Dil__ and Voronoi__ when the maximum depth is bigger than 75, the
46 better depth is program specific. Both have its worse case when depth is 0,
47 which makes sense, because is paying the extra complexity of the hybrid
48 algorithm with using its power. As soon as we leave the 0 depth, a big drop is
49 seen, for Voronoi big enough to outperform the purely iterative algorithm, but
50 not for Dil, which matches it near 60 and clearly outperforms it at 100.
52 __ http://code.google.com/p/dil/
53 __ http://codepad.org/xGDCS3KO
55 As usual, Voronoi challenges all logic, as the best depth is **31** (it was
56 a consistent result among several runs). Between 20 and 50 there is not much
57 variation (except for the magic number **31**) but when going beyond that, it
58 worsen slowly but constantly as the depth is increased.
60 Note that the plots might make the performance improvement look a little bigger
61 than it really is. The best case scenario the gain is 7.5% for Voronoi and 3%
62 for Dil (which is probably better measure for the real world). If I had to
63 choose a default, I'll probably go with 100 because is where both get
64 a performance gain and is still a small enough number to ensure no segmentation
65 faults due to stack exhaustion is caused (only) by the recursiveness of the mark
66 phase (I guess a value of 1000 would be reasonable too, but I'm a little scared
67 of causing inexplicable, magical, mystery segfaults to users). Anyway, for
68 a value of 100, the performance gain is about 1% and 3.5% for Dil and Voronoi
71 So I'm not really sure if I should merge this change or not. In the best case
72 scenarios (which requires a work from the user to search for the better depth
73 for its program), the performance gain is not exactly **huge** and for
74 a reasonable default value is so little that I'm not convinced the extra
75 complexity of the change (because it makes the marking algorithm a little more
78 Feel free to leave your opinion (I would even appreciate it if you do :).
81 .. vim: set et sw=3 sts=3 :