source/blog/posts/2010/09/09-truly-concurrent-gc-using-eager-allocation.rst

   1 Title: Truly concurrent GC using eager allocation
   2 Tags: en, d, dgc, cdgc, eager allocation, concurrent, fork
   3
   4 Finally, I got the first version of CDGC__ with truly concurrent garbage
   5 collection, in the sense that all the threads of the *mutator* (the program
   6 itself) can run in parallel with the collector (well, only the mark phase to be
   7 honest :).
   8
   9 __ http://llucax.com.ar/blog/blog/tag/cdgc?sort=+date
  10
  11 You might want to read a `previous post`__ about CDGC where I achieved some sort
  12 of concurrency by making only the *stop-the-world* time very short, but the
  13 thread that triggered the collection (and any other thread needing **any** GC
  14 service) had to wait until the collection finishes. The thread that triggered
  15 the collection needed to wait for the collection to finish to fulfill the memory
  16 allocation request (it was triggered because the memory was exhausted), while
  17 any other thread needing any GC service needed to acquire the global GC lock
  18 (damn global GC lock!).
  19
  20 __ http://llucax.com.ar/blog/blog/post/-4c9dd5b5
  21
  22 To avoid this issue, I took a simple approach that I call *eager allocation*,
  23 consisting on spawn the mark phase concurrently but allocating a new memory pool
  24 to be able to fulfill the memory request instantly. Doing so, not only the
  25 thread that triggered the collection can keep going without waiting the
  26 collection to finish, the global GC lock is released and any other thread can
  27 use any GC service, and even allocate more memory, since a new pool was
  28 allocated.
  29
  30 If the memory is exhausted again before the collection finishes, a new pool is
  31 allocated, so everything can keep running. The obvious (bad) consequence of this
  32 is potential memory bloat. Since the memory usage is minimized from time to
  33 time, this effect should not be too harmful though, but let's see the results,
  34 there are plenty of things to analyze from them (a lot not even related to
  35 concurrency).
  36
  37 First, a couple of comments about the plots:
  38
  39 * Times of Dil are multiplied by a factor of 0.1 in **all** the plots, times of
  40   rnddata are too, but only in the pause time and stop-the-world plots. This is
  41   only to make the plots more readable.
  42 * The unreadable labels rotated 45 degrees say: **stw**, **fork** and **ea**.
  43   Those stand for *Stop-the-world* (the basic collector), *fork only*
  44   (concurrent but without eager allocation) and *eager allocation* respectively.
  45   You can click on the images to see a little more readable SVG version.
  46 * The plots are for one CPU-only because using more CPUs doesn't change much
  47   (for these plots).
  48 * The times were taken from a **single** run, unlike the total run time plots
  49   I usually post. Since a single run have multiple collections, the information
  50   about min, max, average and standard deviation still applies for the single
  51   run.
  52 * *Stop-the-world* time is the time no mutator thread can run. This is not
  53   related to the global GC lock, is time the threads are really really paused
  54   (this is even necessary for the forking GC to take a snapshot of threads CPU
  55   registers and stacks). So, the time no mutator thread can do any useful work
  56   might be much bigger than this time, because the GC lock. This time is what
  57   I call *Pause* time. The maximum pause time is probably the most important
  58   variable for a GC that tries to minimize pauses, like this one. Is the maximum
  59   time a program will stay totally unresponsive (important for a server, a GUI
  60   application, a game or any interactive application).
  61
  62
  63 .. image:: ##POST_URL##/09-truly-concurrent-gc/stw-1cpu.png
  64    :target: ##POST_URL##/09-truly-concurrent-gc/stw-1cpu.svg
  65    :align: center
  66    :alt: Stop-the-world time for 1 CPU
  67
  68 The *stop-the-world* time is reduced so much that you can hardly see the times
  69 of the **fork** and **ea** configuration. It's reduced in all tests by a big margin,
  70 except for **mcore** and the **bigarr**. For the former it was even increased
  71 a little, for the later it was reduced but very little (but only for the *ea**
  72 configuration, so it might be a bad measure). This is really measuring the
  73 Linux__ `fork()`__ time. When the program manages so little data that the mark
  74 phase itself is so fast that's faster than a fork(), this is what happens. The
  75 good news is, the pause times are small enough for those cases, so no harm is
  76 done (except from adding a little more total run time to the program).
  77
  78 __ http://en.wikipedia.org/wiki/Linux
  79 __ http://linux.die.net/man/2/fork
  80
  81 Note the Dil maximum *stop-the-world* time, it's 0.2 seconds, looks pretty big,
  82 uh?  Well, now remember that this time was multiplied by 0.1, the real maximum
  83 *stop-the-world* for Dil is **2 seconds**, and remember this is the **minimum**
  84 amount of time the program is unresponsive! Thank god it's not an interactive
  85 application :)
  86
  87
  88 Time to take a look to the **real** pause time:
  89
  90 .. image:: ##POST_URL##/09-truly-concurrent-gc/pause-1cpu.png
  91    :target: ##POST_URL##/09-truly-concurrent-gc/pause-1cpu.svg
  92    :align: center
  93    :alt: Pause time for 1 CPU
  94
  95 OK, this is a little more confusing... The only strong pattern is that pause
  96 time is not changed (much) between the **swt** and **fork** configurations. This
  97 seems to make sense, as both configurations must wait for the whole collection
  98 to finish (I really don't know what's happening with the **bh** test).
  99
 100 For most tests (7), the pause time is much smaller for the **ea** configuration,
 101 3 tests have much bigger times for it, one is bigger but similar (again
 102 **mcore**) and then is the weird case of **bh**. The 7 tests where the time is
 103 reduced are the ones that seems to make sense, that's what I was looking for, so
 104 let's see what's happening with the remaining 3, and for that, let's take a look
 105 at the amount of memory the program is using, to see if the memory bloat of
 106 allocating extra pools is significant.
 107
 108 ======= ======= ======= =======
 109  Test    Maximum heap size (MB)
 110 ------- -----------------------
 111 Program   stw      ea    ea/stw
 112 ======= ======= ======= =======
 113 dil         216     250    1.16
 114 rnddata     181     181    1
 115 voronoi      16      30    1.88
 116 tree          7     114   16.3
 117 bh           80      80    1
 118 mcore        30      38    1.27
 119 bisort       30      30    1
 120 bigarr       11     223   20.3
 121 em3d         63      63    1
 122 sbtree       11     122   11.1
 123 tsp          63      63    1
 124 split        39      39    1
 125 ======= ======= ======= =======
 126
 127 See any relations between the plot and the table? I do. It looks like some
 128 programs are not being able to minimize the memory usage, and because of that,
 129 the sweep phase (which still have to run in a mutator thread, taking the global
 130 GC lock) is taking ages. An easy to try approach is to trigger the minimization
 131 of the memory usage not only at when big objects are allocated (like it is now),
 132 but that could lead to more mmap()/munmap()s than necessary. And there still
 133 problems with pools that are kept alive because a very small object is still
 134 alive, which is not solved by this.
 135
 136 So I think a more long term solution would be to introduce what I call *early
 137 collection* too. Meaning, trigger a collection **before** the memory is
 138 exhausted. That would be the next step in the CDGC.
 139
 140
 141 Finally, let's take a look at the total run time of the test programs using the
 142 basic GC and CDGC with concurrent marking and eager allocation. This time, let's
 143 see what happens with 2 CPUs (and 25 runs):
 144
 145 .. image:: ##POST_URL##/09-truly-concurrent-gc/time-2cpu.png
 146    :target: ##POST_URL##/09-truly-concurrent-gc/time-2cpu.svg
 147    :align: center
 148    :alt: Total run time for 2 CPUs (25 runs)
 149
 150 **Wow!** It looks like this is getting really juicy (with exceptions, as usual
 151 :)! **Dil** time is reduced to about 1/3, **voronoi** is reduced to 1/10!!!
 152 Split and mcore have both their time considerably reduced, but that's because
 153 another small optimization (unrelated to what we are seeing today), so forget
 154 about those two. Same for rnddata, which is reduced because of precise heap
 155 scanning. But other tests increased its runtime, most notably **bigarr** takes
 156 almost double the time. Looking at the maximum heap size table, one can find
 157 some answers for this too. Another ugly side of *early allocation*.
 158
 159 For completeness, let's see what happens with the number of collections
 160 triggered during the program's life. Here is the previous table with this new
 161 data added:
 162
 163 ======= ======= ======= ======= ======= ======= =======
 164  Test    Maximum heap size (MB)  Number of collections
 165 ------- ----------------------- -----------------------
 166 Program   stw      ea    ea/stw   stw      ea    ea/stw
 167 ======= ======= ======= ======= ======= ======= =======
 168 dil         216     250    1.16      62      50    0.81
 169 rnddata     181     181    1         28      28    1
 170 voronoi      16      30    1.88      79      14    0.18
 171 tree          7     114   16.3      204      32    0.16
 172 bh           80      80    1         27      27    1
 173 mcore        30      38    1.27      18      14    0.78
 174 bisort       30      30    1         10      10    1
 175 bigarr       11     223   20.3      305      40    0.13
 176 em3d         63      63    1         14      14    1
 177 sbtree       11     122   11.1      110      33    0.3
 178 tsp          63      63    1         14      14    1
 179 split        39      39    1          7       7    1
 180 ======= ======= ======= ======= ======= ======= =======
 181
 182 See how the number of collections is practically reduced proportionally to the
 183 increase of the heap size. When the increase in size explodes, even when the
 184 number of collections is greatly reduced, the sweep time take over and the total
 185 run time is increased. Specially in those tests where the program is almost only
 186 using the GC (as in sbtree and bigarr). That's why I like the most Dil and
 187 voronoi as key tests, they do quite a lot of real work beside asking for memory
 188 or using other GC services.
 189
 190 This confirms that the performance gain is not strictly related to the added
 191 concurrency, but because of a nice (finally! :) side-effect of eager allocation:
 192 removing some pressure from the GC by increasing the heap size a little (Dil
 193 gets 3x boost in run time for as little as 1.16x of memory usage; voronoi gets
 194 10x at the expense of almost doubling the heap, I think both are good
 195 trade-offs). This shows another weak point of the GC, sometimes the HEAP is way
 196 too tight, triggering a lot of collections, which leads to a lot of GC run time
 197 overhead. Nothing is done right now to keep a good heap occupancy ratio.
 198
 199 But is there any real speed (in total run time terms) improvement because of the
 200 added concurrency? Let's see the run time for 1 CPU:
 201
 202 .. image:: ##POST_URL##/09-truly-concurrent-gc/time-1cpu.png
 203    :target: ##POST_URL##/09-truly-concurrent-gc/time-1cpu.svg
 204    :align: center
 205    :alt: Total run time for 1 CPU (25 runs)
 206
 207 It looks like there is, specially for my two favourite tests: both Dil and
 208 voronoi get a **30% speed boost**! That's not bad, not bad at all...
 209
 210
 211 If you want to try it, the repository__ has been updated with this last changes
 212 :). If you do, please let me know how it went.
 213
 214 __ http://git.llucax.com.ar/w/software/dgc/cdgc.git
 215
 216
 217 .. vim: set et sw=3 sts=3 :