1 Title: Memory allocation patterns
2 Tags: en, d, gc, dgc, cdgc, memory, allocation, pattern, dil, benchmark
6 Tango__ 0.99.9 has a bug__ in its runtime, which sometimes makes the GC scan
7 memory that should not be scanned. It only affects Dil and Voronoi programs,
8 but in a significant way. The tests in this post are done using a patched
9 runtime, with the bug fixed.
11 __ http://www.dsource.org/projects/tango
12 __ http://www.dsource.org/projects/tango/ticket/1971
15 .. admonition:: Update
17 The results for the unpublished programs are now available__. You can find
18 the graphic results, the detailed summary and the source code for all the
19 programs (except dil, which can be downloaded from its home site).
21 __ ##POST_URL##/14-memory-allocation/
24 After seeing some weird behaviours and how different benchmarks are more or less
25 affected by changes like `memory addresses returned by the OS`__ or by
26 `different ways to store the type information pointer`__, I decided to gather
27 some information about how much and what kind of memory are requested by the
30 __ http://www.llucax.com.ar/blog/blog/post/-7a56a111
31 __ http://www.llucax.com.ar/blog/blog/post/1490c03e
33 I used the information provided by the ``malloc_stats_file`` CDGC__ option, and
36 __ http://www.llucax.com.ar/blog/blog/post/-2c067531
38 The analysis is done on the allocations requested by the program (calls to
39 ``gc_malloc()``) and contrasting that with the real memory allocated by the GC.
40 Note that only the GC heap memory (that is, memory dedicated to the program,
41 which the GC scans in the collections) is counted (internal GC memory used for
44 Also note that in this post I generally refer to object meaning a block of
45 memory, it doesn't mean they are actually instance of a class or anything.
46 Finally bear in mind that all the figures shown here are the sum of all the
47 allocations done in the life of a program. If the collected data says a program
48 requested 1GiB of memory, that doesn't mean the program had a residency of 1GiB,
49 the program could had a working set of a few KiB and recycled memory like hell.
51 When analyzing the real memory allocated by the GC, there are two modes being
52 analyzed, one is the classic conservative mode and the other is the precise mode
53 (as it is in the original patch, storing the type information pointer at the end
54 of the blocks). So the idea here is to measure two major things:
56 * The amount of memory wasted by the GC because of `how it arranges memory`__ as
57 fixed-size blocks (*bins*) and large objects that uses whole pages.
58 * The extra amount of memory wasted by the GC when using precise mode because it
59 stores the type information pointer at the end of the blocks.
61 __ http://www.llucax.com.ar/blog/blog/post/250bf643
63 I've selected a few representative benchmarks. Here are the results:
69 This is a translation__ by `Leonardo Maffi`__ from the `Olden Benchmark`__ that
70 does a `Barnes–Hut simulation`__. The program is CPU intensive an does a lot of
71 allocation of about 5 different small objects.
73 __ http://www.fantascienza.net/leonardo/js/dolden_bh.zip
74 __ http://www.fantascienza.net/leonardo/
75 __ http://www.irisa.fr/caps/people/truong/M2COct99/Benchmarks/Olden/Welcome.html
76 __ http://en.wikipedia.org/wiki/Barnes%E2%80%93Hut_simulation
78 Here is a graphic summary of the allocation requests and real allocated memory
79 for a run with ``-b 4000``:
84 .. image:: ##POST_URL##/14-memory-allocation/bh.rq.tot.png
86 .. image:: ##POST_URL##/14-memory-allocation/bh.rq.bin.png
88 .. image:: ##POST_URL##/14-memory-allocation/bh.ws.tot.png
90 .. image:: ##POST_URL##/14-memory-allocation/bh.ws.bin.png
92 We can easily see here how the space wasted by the GC memory organization is
93 significant (about 15% wasted), and how the type information pointer is adding
94 an even more significant overhead (about 36% of the memory is wasted). This
95 means that this program will be 15% more subject to false pointers (and will
96 have to scan some extra memory too, but fortunately the majority of the memory
97 doesn't need to be scanned) than it should in conservative mode and that the
98 precise mode makes things 25% worse.
100 You can also see how the extra overhead in the precise mode is because some
101 objects that should fit in a 16 bin now need a 32 bytes bin to hold the extra
102 pointer. See how there were no waste at all in the conservative mode for objects
103 that should fit a 16 bytes bin. **117MiB are wasted because of that**.
105 Here is a more detailed (but textual) summary of the memory requested and
108 .. include:: ##POST_DIR##/14-memory-allocation/bh.txt
111 bigarr allocation pattern
112 -------------------------
114 This is a extremely simple program that just allocate a big array of
115 small-medium objects (all of the same size) `I found in the D NG`__.
117 __ http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=54084
119 Here is the graphic summary:
124 .. image:: ##POST_URL##/14-memory-allocation/bigarr.rq.tot.png
126 .. image:: ##POST_URL##/14-memory-allocation/bigarr.rq.bin.png
128 .. image:: ##POST_URL##/14-memory-allocation/bigarr.ws.tot.png
130 .. image:: ##POST_URL##/14-memory-allocation/bigarr.ws.bin.png
132 The only interesting part of this test is how many space is wasted because of
133 the memory organization, which in this case goes up to 30% for the conservative
134 mode (and have no change for the precise mode).
136 Here is the detailed summary:
138 .. include:: ##POST_DIR##/14-memory-allocation/bigarr.txt
141 mcore allocation pattern
142 ------------------------
144 This is program that test the contention produced by the GC when appending to
145 (thread-specific) arrays in several threads concurrently (again, `found at the
146 D NG`__). For this analysis the concurrency doesn't play any role though, is
147 just a program that do a lot of appending to a few arrays.
149 __ http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=103563
151 Here are the graphic results:
156 .. image:: ##POST_URL##/14-memory-allocation/mcore.rq.tot.png
158 .. image:: ##POST_URL##/14-memory-allocation/mcore.rq.bin.png
160 .. image:: ##POST_URL##/14-memory-allocation/mcore.ws.tot.png
162 .. image:: ##POST_URL##/14-memory-allocation/mcore.ws.bin.png
164 This is the most boring of the examples, as everything works as expected =)
166 You can clearly see how the arrays grow, passing through each bin size and
167 finally becoming big objects which take most of the allocated space. Almost
168 nothing need to be scanned (they are int arrays), and practically there is no
169 waste. That's a good decision by the array allocation algorithm, which seems to
170 exploit the bin sizes to the maximum. Since almost all the data is doesn't need
171 to be scanned, there is no need to store the type information pointers, so there
172 is no waste either for the precise mode (the story would be totally different if
173 the arrays were of objects that should be scanned, as probably each array
174 allocation would waste about 50% of the memory to store the type information
177 Here is the detailed summary:
179 .. include:: ##POST_DIR##/14-memory-allocation/mcore.txt
182 voronoi allocation pattern
183 --------------------------
185 This is one of my favourites, because is always problematic. It "computes the
186 `voronoi diagram`__ of a set of points recursively on the tree" and is also
187 taken from the Olden Benchmark and translated__ by Leonardo Maffi to D.
189 __ http://en.wikipedia.org/wiki/Voronoi_diagram
190 __ http://codepad.org/xGDCS3KO
192 Here are the graphic results for a run with ``-n 30000``:
197 .. image:: ##POST_URL##/14-memory-allocation/voronoi.rq.tot.png
199 .. image:: ##POST_URL##/14-memory-allocation/voronoi.rq.bin.png
201 .. image:: ##POST_URL##/14-memory-allocation/voronoi.ws.tot.png
203 .. image:: ##POST_URL##/14-memory-allocation/voronoi.ws.bin.png
205 This have a little from all the previous examples. Practically all the heap
206 should be scanned (as in **bigarr**), it wastes a considerably portion of the
207 heap because of the fixed-size blocks (as all but **mcore**), it wastes just
208 a very little more because of type information (as all but **bh**) but that
209 waste comes from objects that should fit in a 16 bytes bin but is stored in a 32
210 bytes bin instead (as in **bh**).
212 Maybe that's why it's problematic, it touches a little mostly all the GC flaws.
214 Here is the detailed summary:
216 .. include:: ##POST_DIR##/14-memory-allocation/voronoi.txt
219 Dil allocation pattern
220 ----------------------
222 Finally, this is by far my favourite, the only real-life program, and the most
223 colorful example (literally =).
225 Dil__ is a D compiler, and as such, it works a lot with strings, a lot of big
226 chunks of memory, a lot of small objects, it has it **all**! String manipulation
227 stress the GC a lot, because it uses objects (blocks) of all possible sizes
228 ever, specially extremely small objects (less than 8 bytes, even a lot of blocks
229 of just **one byte**!).
231 __ http://code.google.com/p/dil/
234 Here are the results of a run of Dil to generate Tango documentation (around 555
235 source files are processed):
240 .. image:: ##POST_URL##/14-memory-allocation/dil.rq.tot.png
242 .. image:: ##POST_URL##/14-memory-allocation/dil.rq.bin.png
244 .. image:: ##POST_URL##/14-memory-allocation/dil.ws.tot.png
246 .. image:: ##POST_URL##/14-memory-allocation/dil.ws.bin.png
248 Didn't I say it was colorful?
250 This is like the voronoi but taken to the extreme, it really have it all, it
251 allocates all types of objects in significant quantities, it wastes a lot of
252 memory (23%) and much more when used in precise mode (33%).
254 Here is the detailed summary:
256 .. include:: ##POST_DIR##/14-memory-allocation/dil.txt
262 I've analyzed other small fabricated benchmarks, but all of them had results
263 very similar to the ones shown here.
265 I think the overallocation problem is more serious than what one might think at
266 first sight. Bear in mind this is not GC overhead, is not because of internal GC
267 data. Is memory the GC or the mutator cannot use. Is memory wasted because of
268 fragmentation (planned fragmentation, but fragmentation at least). And I don't
269 think this is the worse problem. The worse problem is, this memory will need to
270 be scanned in most cases (Dil needs to scan 70% of the total memory requested),
271 and maybe the worse of all is that is subject to false pointer. A false pointer
272 to a memory location that is not actually being used by the program will keep
273 the block alive! If is a large object (several pages) that could be pretty
276 This problems can be addressed in several ways. One is mitigate the problem by
277 checking (when type information is available) what portions of the memory is
278 really used and what is wasted, and don't keep things alive when they are only
279 pointed to wasted memory. This is not free though, it will consume more CPU
280 cycles so the solution could be worse than the problem.
282 I think it worth experimenting with other heap organizations, for example,
283 I would experiment with one free list for object size instead of
284 pre-fixed-sizes. I would even experiment with a free list for each **type** when
285 type information is available, that would save a lot of space (internal GC
286 space) when storing type information. Some specialization for strings could be
289 Unfortunately I don't think I'll have the time to do this, at least for the
290 thesis, but I think is a very rich and interesting ground to experiment.
293 .. vim: set et sw=3 sts=3 :