source/blog/posts/2010/08/14-memory-allocation-patterns.rst

   1 Title: Memory allocation patterns
   2 Tags: en, d, gc, dgc, cdgc, memory, allocation, pattern, dil, benchmark
   3
   4 .. note::
   5
   6    Tango__ 0.99.9 has a bug__ in its runtime, which sometimes makes the GC scan
   7    memory that should not be scanned. It only affects Dil and Voronoi programs,
   8    but in a significant way. The tests in this post are done using a patched
   9    runtime, with the bug fixed.
  10
  11 __ http://www.dsource.org/projects/tango
  12 __ http://www.dsource.org/projects/tango/ticket/1971
  13
  14
  15 .. admonition:: Update
  16
  17    The results for the unpublished programs are now available__. You can find
  18    the graphic results, the detailed summary and the source code for all the
  19    programs (except dil, which can be downloaded from its home site).
  20
  21 __ ##POST_URL##/14-memory-allocation/
  22
  23
  24 After seeing some weird behaviours and how different benchmarks are more or less
  25 affected by changes like `memory addresses returned by the OS`__ or by
  26 `different ways to store the type information pointer`__, I decided to gather
  27 some information about how much and what kind of memory are requested by the
  28 different benchmarks.
  29
  30 __ http://www.llucax.com.ar/blog/blog/post/-7a56a111
  31 __ http://www.llucax.com.ar/blog/blog/post/1490c03e
  32
  33 I used the information provided by the ``malloc_stats_file`` CDGC__ option, and
  34 generated some stats.
  35
  36 __ http://www.llucax.com.ar/blog/blog/post/-2c067531
  37
  38 The analysis is done on the allocations requested by the program (calls to
  39 ``gc_malloc()``) and contrasting that with the real memory allocated by the GC.
  40 Note that only the GC heap memory (that is, memory dedicated to the program,
  41 which the GC scans in the collections) is counted (internal GC memory used for
  42 bookkeeping is not).
  43
  44 Also note that in this post I generally refer to object meaning a block of
  45 memory, it doesn't mean they are actually instance of a class or anything.
  46 Finally bear in mind that all the figures shown here are the sum of all the
  47 allocations done in the life of a program. If the collected data says a program
  48 requested 1GiB of memory, that doesn't mean the program had a residency of 1GiB,
  49 the program could had a working set of a few KiB and recycled memory like hell.
  50
  51 When analyzing the real memory allocated by the GC, there are two modes being
  52 analyzed, one is the classic conservative mode and the other is the precise mode
  53 (as it is in the original patch, storing the type information pointer at the end
  54 of the blocks). So the idea here is to measure two major things:
  55
  56 * The amount of memory wasted by the GC because of `how it arranges memory`__ as
  57   fixed-size blocks (*bins*) and large objects that uses whole pages.
  58 * The extra amount of memory wasted by the GC when using precise mode because it
  59   stores the type information pointer at the end of the blocks.
  60
  61 __ http://www.llucax.com.ar/blog/blog/post/250bf643
  62
  63 I've selected a few representative benchmarks. Here are the results:
  64
  65
  66 bh allocation pattern
  67 ---------------------
  68
  69 This is a translation__ by `Leonardo Maffi`__ from the `Olden Benchmark`__ that
  70 does a `Barnes–Hut simulation`__.  The program is CPU intensive an does a lot of
  71 allocation of about 5 different small objects.
  72
  73 __ http://www.fantascienza.net/leonardo/js/dolden_bh.zip
  74 __ http://www.fantascienza.net/leonardo/
  75 __ http://www.irisa.fr/caps/people/truong/M2COct99/Benchmarks/Olden/Welcome.html
  76 __ http://en.wikipedia.org/wiki/Barnes%E2%80%93Hut_simulation
  77
  78 Here is a graphic summary of the allocation requests and real allocated memory
  79 for a run with ``-b 4000``:
  80
  81 .. container::
  82    :align: center
  83
  84    .. image:: ##POST_URL##/14-memory-allocation/bh.rq.tot.png
  85
  86    .. image:: ##POST_URL##/14-memory-allocation/bh.rq.bin.png
  87
  88    .. image:: ##POST_URL##/14-memory-allocation/bh.ws.tot.png
  89
  90    .. image:: ##POST_URL##/14-memory-allocation/bh.ws.bin.png
  91
  92 We can easily see here how the space wasted by the GC memory organization is
  93 significant (about 15% wasted), and how the type information pointer is adding
  94 an even more significant overhead (about 36% of the memory is wasted). This
  95 means that this program will be 15% more subject to false pointers (and will
  96 have to scan some extra memory too, but fortunately the majority of the memory
  97 doesn't need to be scanned) than it should in conservative mode and that the
  98 precise mode makes things 25% worse.
  99
 100 You can also see how the extra overhead in the precise mode is because some
 101 objects that should fit in a 16 bin now need a 32 bytes bin to hold the extra
 102 pointer. See how there were no waste at all in the conservative mode for objects
 103 that should fit a 16 bytes bin. **117MiB are wasted because of that**.
 104
 105 Here is a more detailed (but textual) summary of the memory requested and
 106 allocated:
 107
 108 .. include:: ##POST_DIR##/14-memory-allocation/bh.txt
 109
 110
 111 bigarr allocation pattern
 112 -------------------------
 113
 114 This is a extremely simple program that just allocate a big array of
 115 small-medium objects (all of the same size) `I found in the D NG`__.
 116
 117 __ http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=54084
 118
 119 Here is the graphic summary:
 120
 121 .. container::
 122    :align: center
 123
 124    .. image:: ##POST_URL##/14-memory-allocation/bigarr.rq.tot.png
 125
 126    .. image:: ##POST_URL##/14-memory-allocation/bigarr.rq.bin.png
 127
 128    .. image:: ##POST_URL##/14-memory-allocation/bigarr.ws.tot.png
 129
 130    .. image:: ##POST_URL##/14-memory-allocation/bigarr.ws.bin.png
 131
 132 The only interesting part of this test is how many space is wasted because of
 133 the memory organization, which in this case goes up to 30% for the conservative
 134 mode (and have no change for the precise mode).
 135
 136 Here is the detailed summary:
 137
 138 .. include:: ##POST_DIR##/14-memory-allocation/bigarr.txt
 139
 140
 141 mcore allocation pattern
 142 ------------------------
 143
 144 This is program that test the contention produced by the GC when appending to
 145 (thread-specific) arrays in several threads concurrently (again, `found at the
 146 D NG`__). For this analysis the concurrency doesn't play any role though, is
 147 just a program that do a lot of appending to a few arrays.
 148
 149 __ http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=103563
 150
 151 Here are the graphic results:
 152
 153 .. container::
 154    :align: center
 155
 156    .. image:: ##POST_URL##/14-memory-allocation/mcore.rq.tot.png
 157
 158    .. image:: ##POST_URL##/14-memory-allocation/mcore.rq.bin.png
 159
 160    .. image:: ##POST_URL##/14-memory-allocation/mcore.ws.tot.png
 161
 162    .. image:: ##POST_URL##/14-memory-allocation/mcore.ws.bin.png
 163
 164 This is the most boring of the examples, as everything works as expected =)
 165
 166 You can clearly see how the arrays grow, passing through each bin size and
 167 finally becoming big objects which take most of the allocated space. Almost
 168 nothing need to be scanned (they are int arrays), and practically there is no
 169 waste. That's a good decision by the array allocation algorithm, which seems to
 170 exploit the bin sizes to the maximum. Since almost all the data is doesn't need
 171 to be scanned, there is no need to store the type information pointers, so there
 172 is no waste either for the precise mode (the story would be totally different if
 173 the arrays were of objects that should be scanned, as probably each array
 174 allocation would waste about 50% of the memory to store the type information
 175 pointer).
 176
 177 Here is the detailed summary:
 178
 179 .. include:: ##POST_DIR##/14-memory-allocation/mcore.txt
 180
 181
 182 voronoi allocation pattern
 183 --------------------------
 184
 185 This is one of my favourites, because is always problematic. It "computes the
 186 `voronoi diagram`__ of a set of points recursively on the tree" and is also
 187 taken from the Olden Benchmark and translated__ by Leonardo Maffi to D.
 188
 189 __ http://en.wikipedia.org/wiki/Voronoi_diagram
 190 __ http://codepad.org/xGDCS3KO
 191
 192 Here are the graphic results for a run with ``-n 30000``:
 193
 194 .. container::
 195    :align: center
 196
 197    .. image:: ##POST_URL##/14-memory-allocation/voronoi.rq.tot.png
 198
 199    .. image:: ##POST_URL##/14-memory-allocation/voronoi.rq.bin.png
 200
 201    .. image:: ##POST_URL##/14-memory-allocation/voronoi.ws.tot.png
 202
 203    .. image:: ##POST_URL##/14-memory-allocation/voronoi.ws.bin.png
 204
 205 This have a little from all the previous examples. Practically all the heap
 206 should be scanned (as in **bigarr**), it wastes a considerably portion of the
 207 heap because of the fixed-size blocks (as all but **mcore**), it wastes just
 208 a very little more because of type information (as all but **bh**) but that
 209 waste comes from objects that should fit in a 16 bytes bin but is stored in a 32
 210 bytes bin instead (as in **bh**).
 211
 212 Maybe that's why it's problematic, it touches a little mostly all the GC flaws.
 213
 214 Here is the detailed summary:
 215
 216 .. include:: ##POST_DIR##/14-memory-allocation/voronoi.txt
 217
 218
 219 Dil allocation pattern
 220 ----------------------
 221
 222 Finally, this is by far my favourite, the only real-life program, and the most
 223 colorful example (literally =).
 224
 225 Dil__ is a D compiler, and as such, it works a lot with strings, a lot of big
 226 chunks of memory, a lot of small objects, it has it **all**! String manipulation
 227 stress the GC a lot, because it uses objects (blocks) of all possible sizes
 228 ever, specially extremely small objects (less than 8 bytes, even a lot of blocks
 229 of just **one byte**!).
 230
 231 __ http://code.google.com/p/dil/
 232
 233
 234 Here are the results of a run of Dil to generate Tango documentation (around 555
 235 source files are processed):
 236
 237 .. container::
 238    :align: center
 239
 240    .. image:: ##POST_URL##/14-memory-allocation/dil.rq.tot.png
 241
 242    .. image:: ##POST_URL##/14-memory-allocation/dil.rq.bin.png
 243
 244    .. image:: ##POST_URL##/14-memory-allocation/dil.ws.tot.png
 245
 246    .. image:: ##POST_URL##/14-memory-allocation/dil.ws.bin.png
 247
 248 Didn't I say it was colorful?
 249
 250 This is like the voronoi but taken to the extreme, it really have it all, it
 251 allocates all types of objects in significant quantities, it wastes a lot of
 252 memory (23%) and much more when used in precise mode (33%).
 253
 254 Here is the detailed summary:
 255
 256 .. include:: ##POST_DIR##/14-memory-allocation/dil.txt
 257
 258
 259 Conclusion
 260 ----------
 261
 262 I've analyzed other small fabricated benchmarks, but all of them had results
 263 very similar to the ones shown here.
 264
 265 I think the overallocation problem is more serious than what one might think at
 266 first sight. Bear in mind this is not GC overhead, is not because of internal GC
 267 data. Is memory the GC or the mutator cannot use. Is memory wasted because of
 268 fragmentation (planned fragmentation, but fragmentation at least). And I don't
 269 think this is the worse problem. The worse problem is, this memory will need to
 270 be scanned in most cases (Dil needs to scan 70% of the total memory requested),
 271 and maybe the worse of all is that is subject to false pointer. A false pointer
 272 to a memory location that is not actually being used by the program will keep
 273 the block alive! If is a large object (several pages) that could be pretty
 274 nasty.
 275
 276 This problems can be addressed in several ways. One is mitigate the problem by
 277 checking (when type information is available) what portions of the memory is
 278 really used and what is wasted, and don't keep things alive when they are only
 279 pointed to wasted memory. This is not free though, it will consume more CPU
 280 cycles so the solution could be worse than the problem.
 281
 282 I think it worth experimenting with other heap organizations, for example,
 283 I would experiment with one free list for object size instead of
 284 pre-fixed-sizes. I would even experiment with a free list for each **type** when
 285 type information is available, that would save a lot of space (internal GC
 286 space) when storing type information. Some specialization for strings could be
 287 useful too.
 288
 289 Unfortunately I don't think I'll have the time to do this, at least for the
 290 thesis, but I think is a very rich and interesting ground to experiment.
 291
 292
 293 .. vim: set et sw=3 sts=3 :