1 Title: Truly concurrent GC using eager allocation
2 Tags: en, d, dgc, cdgc, eager allocation, concurrent, fork
4 Finally, I got the first version of CDGC__ with truly concurrent garbage
5 collection, in the sense that all the threads of the *mutator* (the program
6 itself) can run in parallel with the collector (well, only the mark phase to be
9 __ https://llucax.com.ar/blog/blog/tag/cdgc?sort=+date
11 You might want to read a `previous post`__ about CDGC where I achieved some sort
12 of concurrency by making only the *stop-the-world* time very short, but the
13 thread that triggered the collection (and any other thread needing **any** GC
14 service) had to wait until the collection finishes. The thread that triggered
15 the collection needed to wait for the collection to finish to fulfill the memory
16 allocation request (it was triggered because the memory was exhausted), while
17 any other thread needing any GC service needed to acquire the global GC lock
18 (damn global GC lock!).
20 __ https://llucax.com.ar/blog/blog/post/-4c9dd5b5
22 To avoid this issue, I took a simple approach that I call *eager allocation*,
23 consisting on spawn the mark phase concurrently but allocating a new memory pool
24 to be able to fulfill the memory request instantly. Doing so, not only the
25 thread that triggered the collection can keep going without waiting the
26 collection to finish, the global GC lock is released and any other thread can
27 use any GC service, and even allocate more memory, since a new pool was
30 If the memory is exhausted again before the collection finishes, a new pool is
31 allocated, so everything can keep running. The obvious (bad) consequence of this
32 is potential memory bloat. Since the memory usage is minimized from time to
33 time, this effect should not be too harmful though, but let's see the results,
34 there are plenty of things to analyze from them (a lot not even related to
37 First, a couple of comments about the plots:
39 * Times of Dil are multiplied by a factor of 0.1 in **all** the plots, times of
40 rnddata are too, but only in the pause time and stop-the-world plots. This is
41 only to make the plots more readable.
42 * The unreadable labels rotated 45 degrees say: **stw**, **fork** and **ea**.
43 Those stand for *Stop-the-world* (the basic collector), *fork only*
44 (concurrent but without eager allocation) and *eager allocation* respectively.
45 You can click on the images to see a little more readable SVG version.
46 * The plots are for one CPU-only because using more CPUs doesn't change much
48 * The times were taken from a **single** run, unlike the total run time plots
49 I usually post. Since a single run have multiple collections, the information
50 about min, max, average and standard deviation still applies for the single
52 * *Stop-the-world* time is the time no mutator thread can run. This is not
53 related to the global GC lock, is time the threads are really really paused
54 (this is even necessary for the forking GC to take a snapshot of threads CPU
55 registers and stacks). So, the time no mutator thread can do any useful work
56 might be much bigger than this time, because the GC lock. This time is what
57 I call *Pause* time. The maximum pause time is probably the most important
58 variable for a GC that tries to minimize pauses, like this one. Is the maximum
59 time a program will stay totally unresponsive (important for a server, a GUI
60 application, a game or any interactive application).
63 .. image:: ##POST_URL##/09-truly-concurrent-gc/stw-1cpu.png
64 :target: ##POST_URL##/09-truly-concurrent-gc/stw-1cpu.svg
66 :alt: Stop-the-world time for 1 CPU
68 The *stop-the-world* time is reduced so much that you can hardly see the times
69 of the **fork** and **ea** configuration. It's reduced in all tests by a big margin,
70 except for **mcore** and the **bigarr**. For the former it was even increased
71 a little, for the later it was reduced but very little (but only for the *ea**
72 configuration, so it might be a bad measure). This is really measuring the
73 Linux__ `fork()`__ time. When the program manages so little data that the mark
74 phase itself is so fast that's faster than a fork(), this is what happens. The
75 good news is, the pause times are small enough for those cases, so no harm is
76 done (except from adding a little more total run time to the program).
78 __ http://en.wikipedia.org/wiki/Linux
79 __ http://linux.die.net/man/2/fork
81 Note the Dil maximum *stop-the-world* time, it's 0.2 seconds, looks pretty big,
82 uh? Well, now remember that this time was multiplied by 0.1, the real maximum
83 *stop-the-world* for Dil is **2 seconds**, and remember this is the **minimum**
84 amount of time the program is unresponsive! Thank god it's not an interactive
88 Time to take a look to the **real** pause time:
90 .. image:: ##POST_URL##/09-truly-concurrent-gc/pause-1cpu.png
91 :target: ##POST_URL##/09-truly-concurrent-gc/pause-1cpu.svg
93 :alt: Pause time for 1 CPU
95 OK, this is a little more confusing... The only strong pattern is that pause
96 time is not changed (much) between the **swt** and **fork** configurations. This
97 seems to make sense, as both configurations must wait for the whole collection
98 to finish (I really don't know what's happening with the **bh** test).
100 For most tests (7), the pause time is much smaller for the **ea** configuration,
101 3 tests have much bigger times for it, one is bigger but similar (again
102 **mcore**) and then is the weird case of **bh**. The 7 tests where the time is
103 reduced are the ones that seems to make sense, that's what I was looking for, so
104 let's see what's happening with the remaining 3, and for that, let's take a look
105 at the amount of memory the program is using, to see if the memory bloat of
106 allocating extra pools is significant.
108 ======= ======= ======= =======
109 Test Maximum heap size (MB)
110 ------- -----------------------
111 Program stw ea ea/stw
112 ======= ======= ======= =======
125 ======= ======= ======= =======
127 See any relations between the plot and the table? I do. It looks like some
128 programs are not being able to minimize the memory usage, and because of that,
129 the sweep phase (which still have to run in a mutator thread, taking the global
130 GC lock) is taking ages. An easy to try approach is to trigger the minimization
131 of the memory usage not only at when big objects are allocated (like it is now),
132 but that could lead to more mmap()/munmap()s than necessary. And there still
133 problems with pools that are kept alive because a very small object is still
134 alive, which is not solved by this.
136 So I think a more long term solution would be to introduce what I call *early
137 collection* too. Meaning, trigger a collection **before** the memory is
138 exhausted. That would be the next step in the CDGC.
141 Finally, let's take a look at the total run time of the test programs using the
142 basic GC and CDGC with concurrent marking and eager allocation. This time, let's
143 see what happens with 2 CPUs (and 25 runs):
145 .. image:: ##POST_URL##/09-truly-concurrent-gc/time-2cpu.png
146 :target: ##POST_URL##/09-truly-concurrent-gc/time-2cpu.svg
148 :alt: Total run time for 2 CPUs (25 runs)
150 **Wow!** It looks like this is getting really juicy (with exceptions, as usual
151 :)! **Dil** time is reduced to about 1/3, **voronoi** is reduced to 1/10!!!
152 Split and mcore have both their time considerably reduced, but that's because
153 another small optimization (unrelated to what we are seeing today), so forget
154 about those two. Same for rnddata, which is reduced because of precise heap
155 scanning. But other tests increased its runtime, most notably **bigarr** takes
156 almost double the time. Looking at the maximum heap size table, one can find
157 some answers for this too. Another ugly side of *early allocation*.
159 For completeness, let's see what happens with the number of collections
160 triggered during the program's life. Here is the previous table with this new
163 ======= ======= ======= ======= ======= ======= =======
164 Test Maximum heap size (MB) Number of collections
165 ------- ----------------------- -----------------------
166 Program stw ea ea/stw stw ea ea/stw
167 ======= ======= ======= ======= ======= ======= =======
168 dil 216 250 1.16 62 50 0.81
169 rnddata 181 181 1 28 28 1
170 voronoi 16 30 1.88 79 14 0.18
171 tree 7 114 16.3 204 32 0.16
173 mcore 30 38 1.27 18 14 0.78
174 bisort 30 30 1 10 10 1
175 bigarr 11 223 20.3 305 40 0.13
177 sbtree 11 122 11.1 110 33 0.3
180 ======= ======= ======= ======= ======= ======= =======
182 See how the number of collections is practically reduced proportionally to the
183 increase of the heap size. When the increase in size explodes, even when the
184 number of collections is greatly reduced, the sweep time take over and the total
185 run time is increased. Specially in those tests where the program is almost only
186 using the GC (as in sbtree and bigarr). That's why I like the most Dil and
187 voronoi as key tests, they do quite a lot of real work beside asking for memory
188 or using other GC services.
190 This confirms that the performance gain is not strictly related to the added
191 concurrency, but because of a nice (finally! :) side-effect of eager allocation:
192 removing some pressure from the GC by increasing the heap size a little (Dil
193 gets 3x boost in run time for as little as 1.16x of memory usage; voronoi gets
194 10x at the expense of almost doubling the heap, I think both are good
195 trade-offs). This shows another weak point of the GC, sometimes the HEAP is way
196 too tight, triggering a lot of collections, which leads to a lot of GC run time
197 overhead. Nothing is done right now to keep a good heap occupancy ratio.
199 But is there any real speed (in total run time terms) improvement because of the
200 added concurrency? Let's see the run time for 1 CPU:
202 .. image:: ##POST_URL##/09-truly-concurrent-gc/time-1cpu.png
203 :target: ##POST_URL##/09-truly-concurrent-gc/time-1cpu.svg
205 :alt: Total run time for 1 CPU (25 runs)
207 It looks like there is, specially for my two favourite tests: both Dil and
208 voronoi get a **30% speed boost**! That's not bad, not bad at all...
211 If you want to try it, the repository__ has been updated with this last changes
212 :). If you do, please let me know how it went.
214 __ https://git.llucax.com.ar/w/software/dgc/cdgc.git
217 .. vim: set et sw=3 sts=3 :