Title: Understanding the current GC, conclusion
Tags: en, d, dgc, understanding the current gc, druntime, gc, mark-sweep, conclusion, book

Now that I know the implementation details of the current GC fairly deeply,
I can compare it to the techniques presented in the `GC Book`_.

Since most of the literature speaks in terms of `the tri-colour abstraction`_,
this is a good time to translate how it maps to the D_ GC implementation.

As we all remember__, each cell (bin) in D_ has several bits associated with
it. Only 3 are interesting in this case: ``mark``, ``scan`` and ``free``
(``freebits``).

__ /blog/blog/tag/understanding%20the%20current%20gc

So, how can we translate these bits into `the tri-colour abstraction`_?

Cells that were marked and scanned (they have no more pointers to follow) are
coloured black. In D_ these cells have the bits ``mark=1``, ``scan=0`` and
``free=0``.

Cells that have been marked, but still have pointers to follow in them, are
coloured grey. In D_ these cells have the bits ``mark=1``, ``scan=1`` and
``free=0``.

Cells that have not been visited at all are coloured white (all cells should
be coloured white before the marking starts). In D_ these cells have the bit
``mark=0``.

The ``scan`` bit is not important in this case (but in D_ it should be
0 because the scan bits are cleared before the mark phase starts). The
``free`` bit is used for the cells in the free list. They are marked, before
the other cells get marked, with the bits ``mark=1`` and ``free=1``. This way
the cells in the free list don't get scanned (``mark=1``, ``scan=0``) and are
not confused with **black** cells (``free=1``), so they can be kept in the
free list after the mark phase is done. I think this is only necessary
because the free list is regenerated.
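
Just to make the mapping concrete, here is a tiny sketch of how a cell's
colour could be derived from these three bits (a hypothetical helper written
for this post, not code from druntime)::

    enum Colour { white, grey, black, freeListed }

    // Derive the tri-colour of a cell from its bits, following the
    // mapping described above.
    Colour colour(bool mark, bool scan, bool free)
    {
        if (!mark)
            return Colour.white;        // not visited yet
        if (free)
            return Colour.freeListed;   // in the free list, skipped by the scan
        if (scan)
            return Colour.grey;         // marked, still has pointers to follow
        return Colour.black;            // marked and already scanned
    }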

Here is a summary of the improvements proposed by the `GC Book`_, how the
current GC is implemented with regard to those improvements, and what
optimization opportunities can be considered.

The simplest version of the marking algorithm is recursive::

    mark(cell)
        if not cell.marked
            cell.marked = true
            for child in cell.children
                mark(child)

The problem here is, of course, stack overflow for very deep heap graphs (and
the stack space the recursion needs).

The book proposes using a marking stack instead, and several ways to handle
stack overflow, but all of these are only useful for relieving the symptom;
they are not a real cure.
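
For illustration, marking with an explicit mark stack could look roughly like
this (a sketch with made-up ``Cell``/``children`` names, not the D_ GC's
actual code)::

    struct Cell
    {
        bool marked;
        Cell*[] children;   // pointers found when scanning the cell
    }

    // Iterative marking with an explicit mark stack: no recursion, but the
    // stack can still grow as deep as the longest path in the heap graph.
    void mark(Cell*[] roots)
    {
        Cell*[] stack = roots.dup;
        while (stack.length)
        {
            Cell* cell = stack[$ - 1];
            stack = stack[0 .. $ - 1];
            if (cell is null || cell.marked)
                continue;
            cell.marked = true;
            foreach (child; cell.children)
                stack ~= child;         // push the children to visit later
        }
    }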

As a real cure, pointer reversal is proposed. The idea is to use the very same
pointers to store the *mark stack*. This is constant in space and needs only
one pass through the heap, so it's a very tempting approach. The downside is
increased complexity and probably worse cache behaviour (the writes to the
heap dirty the entire heap, and this can kill the cache).
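
To give an idea of what pointer reversal looks like, here is a sketch of the
classic Deutsch-Schorr-Waite algorithm for cells with just two pointers (only
to illustrate the technique; real cells have arbitrary layouts, which is part
of the added complexity)::

    struct Node
    {
        bool marked;
        bool flag;      // true while the right child is being visited
        Node* left;
        Node* right;
    }

    // Pointer-reversal marking: the path back to the root is stored in the
    // heap pointers themselves, so no extra stack is needed.
    void markReversed(Node* root)
    {
        Node* prev = null;
        Node* cur = root;
        while (true)
        {
            // Advance: mark and follow left pointers, reversing them.
            while (cur !is null && !cur.marked)
            {
                cur.marked = true;
                Node* next = cur.left;
                cur.left = prev;
                prev = cur;
                cur = next;
            }
            // Retreat: restore the right pointers of fully visited nodes.
            while (prev !is null && prev.flag)
            {
                prev.flag = false;
                Node* next = prev.right;
                prev.right = cur;
                cur = prev;
                prev = next;
            }
            if (prev is null)
                break;
            // Switch: done with the left child, now visit the right one.
            prev.flag = true;
            Node* next = prev.left;
            prev.left = cur;
            cur = prev.right;
            prev.right = next;
        }
    }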

.. admonition:: Current implementation

    The D_ GC implementation does none of this. Instead, it completes the
    mark phase by traversing the heap (well, not really the heap, only the
    bit sets) in several passes, until no more data to scan can be found
    (all cells are painted black or white). While the original algorithm
    only needs one pass through the heap, this one needs several. It trades
    space (and the complexity of stack overflow handling) for time.

.. admonition:: Optimization opportunities

    This seems like a fair trade-off, but alternatives can be explored.
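
Very roughly, the multi-pass marking described in the first note could be
sketched like this (one boolean per cell instead of real bitmaps, and
a made-up ``pointersOf`` table standing in for scanning the cell's memory)::

    // Simplified multi-pass marking over mark/scan "bit sets". It assumes
    // the root set has already been painted grey (mark and scan set).
    void markAll(size_t ncells, bool[] mark, bool[] scan,
                 size_t[][] pointersOf)
    {
        bool changes = true;
        while (changes)                    // one pass per iteration
        {
            changes = false;
            foreach (i; 0 .. ncells)
            {
                if (!scan[i])
                    continue;              // not grey, nothing to do
                scan[i] = false;           // this cell becomes black
                foreach (target; pointersOf[i])
                {
                    if (!mark[target])     // a white cell: paint it grey
                    {
                        mark[target] = true;
                        scan[target] = true;
                        changes = true;
                    }
                }
            }
        }
    }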

The simplest mark-sweep algorithm stores the mark bits in the cells
themselves. This can be very bad for the cache, because a full traversal of
the entire heap has to be done. As an optimization, bitmaps can be used
instead: they are much smaller and much more likely to fit in the cache, so
marking can be greatly improved by using them.
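
As a rough illustration of the idea, here is a simplified bitmap covering
a pool of memory at the same one-bit-per-16-bytes granularity the current GC
uses (``MarkBits`` is a made-up name; the real implementation is ``GCBits``)::

    // One bit per 16-byte chunk: a 4096-byte page needs 4096 / 16 = 256
    // bits (32 bytes) per bitmap.
    struct MarkBits
    {
        enum chunk = 16;                        // bytes covered by one bit
        enum bitsPerWord = size_t.sizeof * 8;

        void* base;                             // start of the covered memory
        size_t[] words;                         // backing storage

        this(void* base, size_t poolSize)
        {
            this.base = base;
            immutable nbits = poolSize / chunk;
            words = new size_t[(nbits + bitsPerWord - 1) / bitsPerWord];
        }

        // Index of the bit that covers a given address.
        size_t bitOf(const void* p) const
        {
            return (cast(size_t) p - cast(size_t) base) / chunk;
        }

        bool test(const void* p) const
        {
            immutable i = bitOf(p);
            return ((words[i / bitsPerWord] >> (i % bitsPerWord)) & 1) != 0;
        }

        void set(const void* p)
        {
            immutable i = bitOf(p);
            words[i / bitsPerWord] |= size_t(1) << (i % bitsPerWord);
        }
    }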

.. admonition:: Current implementation

    The current implementation uses bitmaps for the mark, scan, free and
    other bits. The bitmap implementation is ``GCBits`` and it is
    a general-purpose approach.

    The bitmap stores one bit for each 16-byte chunk, no matter what cell
    size (``Bins``, or bin size) is used. This means that 4096/16 = 256 bits
    (32 bytes) are used for each bitmap for every page in the GC heap. With
    5 bitmaps (mark, scan, freebits, finals and noscan), the total space per
    page is 160 bytes. This is a 4% space overhead in bits alone.

    This wastes some space for larger cells.

.. admonition:: Optimization opportunities

    The space overhead of the bitmaps seems to be fairly small, but each
    byte counts for the mark phase because of the cache. A heap of 64 MiB
    uses 2.5 MiB in bitmaps. Modern processors come with about that much
    cache, and a program using 64 MiB doesn't seem very rare, so we are
    pushing the limits here if we want our bitmaps to fit in the cache to
    speed up the mark phase.

    I think there is a little room for improvement here. A big object, let's
    say 8 MiB long, uses about 320 KiB of memory for bitmaps it doesn't need
    (see the quick check after this note). I think some specialized bitmaps
    could be used for large objects, for instance, to minimize the bitmap
    space overhead.

    There is some overlap in the bits too. ``mark=0`` with ``scan=1`` can
    never happen, for instance. I think it should be possible to use that
    combination for ``freebits`` and get rid of an entire bitmap.
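
Just to double-check the overhead figures mentioned in the note above, here
is a quick back-of-the-envelope calculation (the constants are assumptions
matching the description, not taken from druntime)::

    enum pageSize = 4096;       // bytes per page
    enum chunk    = 16;         // bytes covered by one bit
    enum nBitmaps = 5;          // mark, scan, freebits, finals, noscan

    // 4096 / 16 = 256 bits = 32 bytes per bitmap, 160 bytes per page.
    enum bytesPerPage = pageSize / chunk / 8 * nBitmaps;
    static assert(bytesPerPage == 160);

    // A 64 MiB heap has 16384 pages -> 2.5 MiB of bitmaps.
    static assert(64 * 1024 * 1024 / pageSize * bytesPerPage == 2_621_440);

    // A single 8 MiB object spans 2048 pages -> 320 KiB of bitmaps.
    static assert(8 * 1024 * 1024 / pageSize * bytesPerPage == 320 * 1024);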

The sweep phase is generally done right after the mark phase. Since the
collection is normally triggered by an allocation, this can be a little
disruptive for the thread that made that allocation, which has to absorb all
the sweeping work.

An alternative is to do the sweeping incrementally, that is, lazily. Instead
of finding all the white cells and linking them to the free list immediately,
this is done at each allocation: if there are no free cells in the free list,
a little sweeping is done until new space can be found.

This can help minimize pauses for the allocating thread.
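
A toy model of the idea (made-up names, cells reduced to indices; not the
D_ GC's code)::

    // Lazy sweeping: the mark phase has already run, and moving unmarked
    // (white) cells to the free list is deferred until an allocation
    // actually needs them.
    struct LazyHeap
    {
        bool[] marked;          // result of the mark phase, one flag per cell
        size_t[] freeList;      // cells ready to be handed out
        size_t sweepCursor;     // next cell that still has to be swept

        // Sweep just enough cells to put at least one on the free list.
        private void sweepSome()
        {
            while (freeList.length == 0 && sweepCursor < marked.length)
            {
                if (!marked[sweepCursor])
                    freeList ~= sweepCursor;    // white cell: reclaim it
                ++sweepCursor;
            }
        }

        // Returns a free cell, or size_t.max if the heap is exhausted
        // (a real GC would trigger a new collection here).
        size_t allocate()
        {
            if (freeList.length == 0)
                sweepSome();                    // the lazy part
            if (freeList.length == 0)
                return size_t.max;
            auto cell = freeList[$ - 1];
            freeList = freeList[0 .. $ - 1];
            return cell;
        }
    }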

.. admonition:: Current implementation

    The current implementation does eager sweeping.

.. admonition:: Optimization opportunities

    The sweep phase can be made lazy. The only disadvantage I see (well,
    besides the extra complexity) is that it could make the heap more likely
    to become fragmented, because consecutive requests are not necessarily
    served from the same page (a ``free()`` call can add new cells from
    another page to the free list), making the heap more sparse (which can
    be bad for the cache too). But I think this is only possible if
    ``free()`` is called explicitly, and that should be fairly rare in
    a garbage-collected system, so I guess it could be worth trying.

    Lazy sweeping helps the cache too, because in the sweep phase you might
    trigger cache misses when linking cells to the free list. When sweeping
    lazily, the cache miss is delayed until it's really necessary (it will
    happen anyway when you allocate the free cell).

Even though the current GC is fairly optimized, there is plenty of room for
improvement, even while preserving the original overall design.

.. _`GC Book`: http://www.cs.kent.ac.uk/people/staff/rej/gcbook/
.. _`the tri-colour abstraction`: http://en.wikipedia.org/wiki/Garbage_collection_(computer_science)#Tri-colour_marking
.. _D: http://www.digitalmars.com/d/

.. vim: set et sw=4 sts=4 :