source/blog/posts/2010/08/14-memory-allocation-patterns.rst

   1 Title: Memory allocation patterns
   2 Tags: en, d, gc, dgc, cdgc, memory, allocation, pattern, dil, benchmark
   3
   4 .. note::
   5
   6    Tango__ 0.99.9 has a bug__ in its runtime, which sometimes makes the GC scan
   7    memory that should not be scanned. It only affects Dil and Voronoi programs,
   8    but in a significant way. The tests in this post are done using a patched
   9    runtime, with the bug fixed.
  10
  11 __ http://www.dsource.org/projects/tango
  12 __ http://www.dsource.org/projects/tango/ticket/1971
  13
  14
  15 .. admonition:: Update
  16
  17    The results for the unpublished programs are now available__. You can find
  18    the graphic results, the detailed summary and the source code for all the
  19    programs (except dil, which can be downloaded from its home site).
  20
  21 __ ##POST_URL##/14-memory-allocation/
  22
  23
  24 After seeing some weird behaviours and how different benchmarks are more or less
  25 affected by changes like `memory addresses returned by the OS`__ or by
  26 `different ways to store the type information pointer`__, I decided to gather
  27 some information about how much and what kind of memory are requested by the
  28 different benchmarks.
  29
  30 __ /blog/blog/post/-7a56a111
  31 __ /blog/blog/post/1490c03e
  32
  33 I used the information provided by the ``malloc_stats_file`` CDGC__ option, and
  34 generated some stats.
  35
  36 __ /blog/blog/post/-2c067531
  37
  38 The analysis is done on the allocations requested by the program (calls to
  39 ``gc_malloc()``) and contrasting that with the real memory allocated by the GC.
  40 Note that only the GC heap memory (that is, memory dedicated to the program,
  41 which the GC scans in the collections) is counted (internal GC memory used for
  42 bookkeeping is not).
  43
  44 Also note that in this post I generally refer to object meaning a block of
  45 memory, it doesn't mean they are actually instance of a class or anything.
  46 Finally bear in mind that all the figures shown here are the sum of all the
  47 allocations done in the life of a program. If the collected data says a program
  48 requested 1GiB of memory, that doesn't mean the program had a residency of 1GiB,
  49 the program could had a working set of a few KiB and recycled memory like hell.
  50
  51 When analyzing the real memory allocated by the GC, there are two modes being
  52 analyzed, one is the classic conservative mode and the other is the precise mode
  53 (as it is in the original patch, storing the type information pointer at the end
  54 of the blocks). So the idea here is to measure two major things:
  55
  56 * The amount of memory wasted by the GC because of `how it arranges memory`__ as
  57   fixed-size blocks (*bins*) and large objects that uses whole pages.
  58 * The extra amount of memory wasted by the GC when using precise mode because it
  59   stores the type information pointer at the end of the blocks.
  60
  61 __ /blog/blog/post/250bf643
  62
  63 I've selected a few representative benchmarks. Here are the results:
  64
  65
  66 bh allocation pattern
  67 ---------------------
  68
  69 This is a translation__ by `Leonardo Maffi`__ from the `Olden Benchmark`__ that
  70 does a `Barnes–Hut simulation`__.  The program is CPU intensive an does a lot of
  71 allocation of about 5 different small objects.
  72
  73 __ http://www.fantascienza.net/leonardo/js/dolden_bh.zip
  74 __ http://www.fantascienza.net/leonardo/
  75 __ http://www.irisa.fr/caps/people/truong/M2COct99/Benchmarks/Olden/Welcome.html
  76 __ http://en.wikipedia.org/wiki/Barnes%E2%80%93Hut_simulation
  77
  78 Here is a graphic summary of the allocation requests and real allocated memory
  79 for a run with ``-b 4000``:
  80
  81 .. container:: center
  82
  83    .. image:: ##POST_URL##/14-memory-allocation/bh.rq.tot.png
  84
  85    .. image:: ##POST_URL##/14-memory-allocation/bh.rq.bin.png
  86
  87    .. image:: ##POST_URL##/14-memory-allocation/bh.ws.tot.png
  88
  89    .. image:: ##POST_URL##/14-memory-allocation/bh.ws.bin.png
  90
  91 We can easily see here how the space wasted by the GC memory organization is
  92 significant (about 15% wasted), and how the type information pointer is adding
  93 an even more significant overhead (about 36% of the memory is wasted). This
  94 means that this program will be 15% more subject to false pointers (and will
  95 have to scan some extra memory too, but fortunately the majority of the memory
  96 doesn't need to be scanned) than it should in conservative mode and that the
  97 precise mode makes things 25% worse.
  98
  99 You can also see how the extra overhead in the precise mode is because some
 100 objects that should fit in a 16 bin now need a 32 bytes bin to hold the extra
 101 pointer. See how there were no waste at all in the conservative mode for objects
 102 that should fit a 16 bytes bin. **117MiB are wasted because of that**.
 103
 104 Here is a more detailed (but textual) summary of the memory requested and
 105 allocated:
 106
 107 .. include:: ##POST_DIR##/14-memory-allocation/bh.txt
 108
 109
 110 bigarr allocation pattern
 111 -------------------------
 112
 113 This is a extremely simple program that just allocate a big array of
 114 small-medium objects (all of the same size) `I found in the D NG`__.
 115
 116 __ http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=54084
 117
 118 Here is the graphic summary:
 119
 120 .. container:: center
 121
 122    .. image:: ##POST_URL##/14-memory-allocation/bigarr.rq.tot.png
 123
 124    .. image:: ##POST_URL##/14-memory-allocation/bigarr.rq.bin.png
 125
 126    .. image:: ##POST_URL##/14-memory-allocation/bigarr.ws.tot.png
 127
 128    .. image:: ##POST_URL##/14-memory-allocation/bigarr.ws.bin.png
 129
 130 The only interesting part of this test is how many space is wasted because of
 131 the memory organization, which in this case goes up to 30% for the conservative
 132 mode (and have no change for the precise mode).
 133
 134 Here is the detailed summary:
 135
 136 .. include:: ##POST_DIR##/14-memory-allocation/bigarr.txt
 137
 138
 139 mcore allocation pattern
 140 ------------------------
 141
 142 This is program that test the contention produced by the GC when appending to
 143 (thread-specific) arrays in several threads concurrently (again, `found at the
 144 D NG`__). For this analysis the concurrency doesn't play any role though, is
 145 just a program that do a lot of appending to a few arrays.
 146
 147 __ http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.D&article_id=103563
 148
 149 Here are the graphic results:
 150
 151 .. container:: center
 152
 153    .. image:: ##POST_URL##/14-memory-allocation/mcore.rq.tot.png
 154
 155    .. image:: ##POST_URL##/14-memory-allocation/mcore.rq.bin.png
 156
 157    .. image:: ##POST_URL##/14-memory-allocation/mcore.ws.tot.png
 158
 159    .. image:: ##POST_URL##/14-memory-allocation/mcore.ws.bin.png
 160
 161 This is the most boring of the examples, as everything works as expected =)
 162
 163 You can clearly see how the arrays grow, passing through each bin size and
 164 finally becoming big objects which take most of the allocated space. Almost
 165 nothing need to be scanned (they are int arrays), and practically there is no
 166 waste. That's a good decision by the array allocation algorithm, which seems to
 167 exploit the bin sizes to the maximum. Since almost all the data is doesn't need
 168 to be scanned, there is no need to store the type information pointers, so there
 169 is no waste either for the precise mode (the story would be totally different if
 170 the arrays were of objects that should be scanned, as probably each array
 171 allocation would waste about 50% of the memory to store the type information
 172 pointer).
 173
 174 Here is the detailed summary:
 175
 176 .. include:: ##POST_DIR##/14-memory-allocation/mcore.txt
 177
 178
 179 voronoi allocation pattern
 180 --------------------------
 181
 182 This is one of my favourites, because is always problematic. It "computes the
 183 `voronoi diagram`__ of a set of points recursively on the tree" and is also
 184 taken from the Olden Benchmark and translated__ by Leonardo Maffi to D.
 185
 186 __ http://en.wikipedia.org/wiki/Voronoi_diagram
 187 __ http://codepad.org/xGDCS3KO
 188
 189 Here are the graphic results for a run with ``-n 30000``:
 190
 191 .. container:: center
 192
 193    .. image:: ##POST_URL##/14-memory-allocation/voronoi.rq.tot.png
 194
 195    .. image:: ##POST_URL##/14-memory-allocation/voronoi.rq.bin.png
 196
 197    .. image:: ##POST_URL##/14-memory-allocation/voronoi.ws.tot.png
 198
 199    .. image:: ##POST_URL##/14-memory-allocation/voronoi.ws.bin.png
 200
 201 This have a little from all the previous examples. Practically all the heap
 202 should be scanned (as in **bigarr**), it wastes a considerably portion of the
 203 heap because of the fixed-size blocks (as all but **mcore**), it wastes just
 204 a very little more because of type information (as all but **bh**) but that
 205 waste comes from objects that should fit in a 16 bytes bin but is stored in a 32
 206 bytes bin instead (as in **bh**).
 207
 208 Maybe that's why it's problematic, it touches a little mostly all the GC flaws.
 209
 210 Here is the detailed summary:
 211
 212 .. include:: ##POST_DIR##/14-memory-allocation/voronoi.txt
 213
 214
 215 Dil allocation pattern
 216 ----------------------
 217
 218 Finally, this is by far my favourite, the only real-life program, and the most
 219 colorful example (literally =).
 220
 221 Dil__ is a D compiler, and as such, it works a lot with strings, a lot of big
 222 chunks of memory, a lot of small objects, it has it **all**! String manipulation
 223 stress the GC a lot, because it uses objects (blocks) of all possible sizes
 224 ever, specially extremely small objects (less than 8 bytes, even a lot of blocks
 225 of just **one byte**!).
 226
 227 __ http://code.google.com/p/dil/
 228
 229
 230 Here are the results of a run of Dil to generate Tango documentation (around 555
 231 source files are processed):
 232
 233 .. container:: center
 234
 235    .. image:: ##POST_URL##/14-memory-allocation/dil.rq.tot.png
 236
 237    .. image:: ##POST_URL##/14-memory-allocation/dil.rq.bin.png
 238
 239    .. image:: ##POST_URL##/14-memory-allocation/dil.ws.tot.png
 240
 241    .. image:: ##POST_URL##/14-memory-allocation/dil.ws.bin.png
 242
 243 Didn't I say it was colorful?
 244
 245 This is like the voronoi but taken to the extreme, it really have it all, it
 246 allocates all types of objects in significant quantities, it wastes a lot of
 247 memory (23%) and much more when used in precise mode (33%).
 248
 249 Here is the detailed summary:
 250
 251 .. include:: ##POST_DIR##/14-memory-allocation/dil.txt
 252
 253
 254 Conclusion
 255 ----------
 256
 257 I've analyzed other small fabricated benchmarks, but all of them had results
 258 very similar to the ones shown here.
 259
 260 I think the overallocation problem is more serious than what one might think at
 261 first sight. Bear in mind this is not GC overhead, is not because of internal GC
 262 data. Is memory the GC or the mutator cannot use. Is memory wasted because of
 263 fragmentation (planned fragmentation, but fragmentation at least). And I don't
 264 think this is the worse problem. The worse problem is, this memory will need to
 265 be scanned in most cases (Dil needs to scan 70% of the total memory requested),
 266 and maybe the worse of all is that is subject to false pointer. A false pointer
 267 to a memory location that is not actually being used by the program will keep
 268 the block alive! If is a large object (several pages) that could be pretty
 269 nasty.
 270
 271 This problems can be addressed in several ways. One is mitigate the problem by
 272 checking (when type information is available) what portions of the memory is
 273 really used and what is wasted, and don't keep things alive when they are only
 274 pointed to wasted memory. This is not free though, it will consume more CPU
 275 cycles so the solution could be worse than the problem.
 276
 277 I think it worth experimenting with other heap organizations, for example,
 278 I would experiment with one free list for object size instead of
 279 pre-fixed-sizes. I would even experiment with a free list for each **type** when
 280 type information is available, that would save a lot of space (internal GC
 281 space) when storing type information. Some specialization for strings could be
 282 useful too.
 283
 284 Unfortunately I don't think I'll have the time to do this, at least for the
 285 thesis, but I think is a very rich and interesting ground to experiment.
 286
 287
 288 .. vim: set et sw=3 sts=3 :