27

I've read in many places (heck, I've even written so myself) that garbage collection could (theoretically) be faster than manual memory management.

However, showing is a lot harder to come by than telling.
I have never actually seen any piece of code that demonstrates this effect in action.

Does anyone have (or know where I can find) code that demonstrates this performance advantage?

user541686
  • 6
    the problem with GC is that most implementations are not deterministic so 2 runs can have vastly different results, not to mention it's hard to isolate the right variables to compare – ratchet freak Jul 04 '13 at 19:49
  • @ratchetfreak: If you know of any examples that are only faster (say) 70% of the time, that's fine with me too. There must be some way to compare the two, in terms of throughput at least (latency probably wouldn't work). – user541686 Jul 04 '13 at 19:56
  • Are you measuring machine performance or developer performance? Not having to think as hard about memory management could significantly reduce a developer's "days to delivered product" if the performance specs permit. – Dan Pichelman Jul 04 '13 at 20:12
  • @DanPichelman: Machine performance. Examples of the claims I've seen are here. – user541686 Jul 04 '13 at 20:13
  • 3
    Well, this is a bit tricky because you could always manually do whatever gives the GC an edge over what you did manually. Perhaps it's better to restrict this to "standard" manual memory management tools (malloc()/free(), owned pointers, shared pointers with refcount, weak pointers, no custom allocators)? Or, if you permit custom allocators (which may be more realistic or less realistic, depending on what kind of programmer you assume), put restrictions on the effort put into those allocators. Otherwise, the manual strategy "copy what the GC does in this case" is always at least as fast as GC. –  Jul 04 '13 at 20:41
  • @delnan: I don't see how a "copy what the GC does" strategy is possible, though. A GC looks through the stack, static data segments, etc. in known locations to find references to the object graph's roots, but that's impossible in a native language like C++ because there's no way to discover references like that, unless you make your own compiler (but then your code is restricted to that compiler). – user541686 Jul 04 '13 at 20:52
  • 1
    By "copy what the GC does" I didn't mean "build your own GC" (though note that this is theoretically possible in C++11 and beyond, which introduces optional support for a GC). I meant, as I've worded it earlier in the same comment, "do what gives the GC an edge over what you did manually". For example, if Cheney-like compaction helps this application a lot, you might manually implement a similar allocation + compaction scheme, with custom smart pointers to handle pointer fixup. Also, with techniques like a shadow stack, you can do root finding in C or C++, at the expense of extra work. –  Jul 04 '13 at 21:15
  • @delnan: Oh, I see what you mean now, that's a great point, thanks for bringing it up! – user541686 Jul 04 '13 at 21:22
  • It reminds me of the argument that JIT compiled languages can in theory be faster than native languages. Except they never are. You always give something up by moving to a higher level of abstractions, which is what GC is. It's the no free lunch principle. – Guy Sirton Jul 04 '13 at 21:28
  • @GuySirton s/native/AOT compiled/. Also, yes and no. In the JIT-vs-AOT case, it's the AOT compiler writers' skill vs the JIT compiler writers' skill. In this case, it's the GC writers' skill vs the skill of who manages the memory, which is rarely a highly qualified expert who has worked on making it fast for years. Not that this necessarily changes the outcome... –  Jul 04 '13 at 21:38
  • @GuySirton There are many real cases where programs in Java have outperformed programs in C++. http://keithlea.com/javabench/ contains some trivial examples. Of course in all such cases, the C++ program can be improved. But if you exclude cases based on that, we wind up in a "No True Scotsman" type fallacy where we are only allowed to compare perfectly optimized C++ programs with anything else. – btilly Jul 04 '13 at 22:16
  • @btilly: http://benchmarksgame.alioth.debian.org/ http://stackoverflow.com/questions/145110/c-performance-vs-java-c http://readwrite.com/2011/06/06/cpp-go-java-scala-performance-benchmark#awesm=~oaFvXDN4yyXkAt . Really my statement stands, you always give something up when you use higher abstractions. Trivially, GC can never be faster because whatever happens during GC that gives you performance can be mimicked using manual management but the opposite is not true. – Guy Sirton Jul 04 '13 at 23:30
  • @GuySirton: I can't reproduce the keithlea.com/javabench results. I just tried out the heapsort implementation, and even when comparing the output of my old C++ compiler (Visual C++ 13.10.4035) with the one from JRE 7, C++ beats Java quite noticeably. If you can reproduce any of them let me know which one and I'll try that one. – user541686 Jul 04 '13 at 23:39
  • @GuySirton: Then again, they don't even seem to be benchmarking GCs in the first place -- they seem to just be comparing C++ to Java, with preallocated storage... – user541686 Jul 04 '13 at 23:42
  • @Mehrdad: That was btilly's link. – Guy Sirton Jul 04 '13 at 23:43
  • Oops sorry my bad. @btilly should read my comment then. – user541686 Jul 04 '13 at 23:48
  • One can always make manual memory management as efficient as automatic garbage collection, probably with lower constant factors. The problem is the engineering cost: GC invisibly handles all the bookkeeping and special cases for you; if you manage memory by hand, you pretty much set your implementation in stone -- no algorithmic optimisation for you! In practice, the relative costs of automatic GC are small. – Rafe Jul 05 '13 at 00:00
  • @GuySirton Your link matches what I said. There are real programs whose implementation in a higher language can beat the implementation in a lower one. Yes, in theory you can win in the lower language. Doing so is not always easy. – btilly Jul 05 '13 at 02:10
  • @Mehrdad I have not tried to reproduce numbers. I've seen enough people claim specific examples that I believe the principle. Heck, the blog I pointed to demonstrates it with C++ vs C#. And I've personally done it with Perl vs C! (The C that I was replacing was definitely "not high quality".) – btilly Jul 05 '13 at 02:13
  • @Rafe I disbelieve your claim of great algorithmic benefits to managed memory. Starting from a base of using RAII and tricks like std::shared_ptr you have about as much freedom to make algorithmic changes to your program as you do with managed memory. You'll need to dance carefully around circular references. You'll need to do more work and be more careful. But you can do it. And it usually isn't that much harder. (Until you over-optimize. Then you're hosed.) – btilly Jul 05 '13 at 02:23
  • A final note. While some programs could be sped up by porting to a managed memory model, manual memory can do tricks that managed memory can't touch. For example I have a program with close to 1 million objects, each of which has a list of things associated with it. I put all of the lists, in order, in an arena. If any list gets too big I reallocate the whole arena. This gives me excellent utilization of CPU cache (3x speedup when I did it). A GC that traced my code enough to realize this was a good idea would be too slow because of the tracing! – btilly Jul 05 '13 at 02:34
  • @btilly -- My claim concerning algorithmic optimisation opportunities comes from the fact that, with managed code, you make NO commitments to how your memory is allocated. As soon as you DO make such commitments, which is unavoidable with manual memory management, you can't make changes to your algorithm -- or even non-trivial changes to your implementation -- without also changing your memory management strategy. That is a lot of error-prone work which may well perform worse! – Rafe Jul 05 '13 at 04:56
  • @btilly -- I do not disagree that manual memory management has its place. I would, however, argue that that place is a very small, specialised place. For the most part, in most circumstances, it just ain't worth the effort. – Rafe Jul 05 '13 at 04:57
  • @Rafe On algorithmic opportunities, my sense is that with unmanaged code the effort of changing the algorithm is greater than with managed... but by a multiple similar to the effort of choosing unmanaged in the first place. (Bug opportunities are greater as well, part of unmanaged life.) So unmanaged is not a fundamental algorithmic optimization barrier. If you disagree, that should be a different conversation, with concrete examples. – btilly Jul 05 '13 at 14:35
  • @Rafe And on the place of manual memory management, we agree. In the last 5 years I've spent perhaps 5% of my time working with unmanaged code for performance reasons, and I've spent over 80% of it inside of SQL, Perl and Python. Your mileage almost certainly varies. Among most of my co-workers that percentage of unmanaged is very much on the high side. But I know people who spend most of their time on unmanaged code, and I'm very glad that they do so. (Hard real time code driving a rocket does not want GC pauses. Really.) – btilly Jul 05 '13 at 14:41
  • @Rafe So know what you're doing, and why you're doing it. Only do the hard stuff if you know why you're doing it and can prove that it is necessary. But if you've proven it (which in my case has always included writing a failed prototype first), do not hesitate to do what you need to do. – btilly Jul 05 '13 at 14:44
  • @btilly -- Hi, I thought I made that clear when I wrote, "I do not disagree that manual memory management has its place." Whenever you make a commitment to a particular implementation choice (such as manual memory management), you increase (dramatically) the amount of effort it takes to change that decision. I can't see how this is controversial. – Rafe Jul 08 '13 at 00:45
  • 1
    @Ike: It's okay. See why I asked the question though? That was the entire point of my question -- people come up with all sorts of explanations that should make sense but everyone stumbles when you ask them to provide a demonstration that proves what they say is correct in practice. The entire point of this question was to once and for all show that this can actually happen in practice. – user541686 Jan 06 '16 at 04:24
  • @Mehrdad Very much and very guilty! I've been studying the Java garbage collector lately a whole lot, trying to kind of reverse engineer how it works and implement a similar scheme in C++ (but something we can opt into). It's why I got all excited and kind of lost track of the whole point of the question. –  Jan 06 '16 at 04:28
  • @Mehrdad If you don't mind a crude explanation of that article, even in C++ we can potentially allocate memory a whole lot faster if we just allocated it in a straight sequential fashion using pooled, contiguous memory chained together. Now all complex alloc techniques required to generalize go away as well as page faults per chunk, the only prob is that we can't free any variable-sized chunk individually... that is, unless we had some deferred process that could copy the memory elsewhere using a more expensive strategy and did that in a separate thread. That's how eden alloc works. –  Jan 06 '16 at 04:35
  • @Mehrdad It's actually more expensive if you consider total CPU time.. but it's cheaper in terms of not stalling the thread allocating the memory by allowing it to allocate using the cheapest allocation technique possible. The expensive work to allocate individual chunks to be freed is then deferred to another thread. So its basic speed comes from just deferring the expensive stuff for later in a background thread. C++ would pay those full costs upfront in the same thread allocating the memory. –  Jan 06 '16 at 04:38
  • @Ike : Deferring to a background thread is fine but then you also have to take into account how long (in real time, i.e. in seconds) the background thread is actively running on the CPU as well. Not doing so would be cheating since a manual scheme that doesn't use that CPU core would have been able to get higher performance by using that core. Also, since this is a practical question, I'm not looking for obscure edge cases such as custom C++ allocators that are specialized for one use case. We're just using the standard allocators in each implementation, not attempting to bypass them. – user541686 Jan 06 '16 at 04:47
  • @Mehrdad Yeah -- in terms of real time GC has a very strong potential edge when they use this kind of eden space strategy, as it's balancing the load across threads, allowing the thread allocating to use a much cheaper allocation strategy than malloc, e.g. Chen had to kind of "cheat" and reach for the pool to beat it -- so C and C++ can potentially beat GC, but not with "normal/average" code (though I feel weird about this part since in my field it's pretty much mandatory to use preallocated mem pools in many cases). I've been looking to do something similar to what the Java GC does in... –  Jan 06 '16 at 09:41
  • @Mehrdad C and C++ for things like classes with pimpls and polymorphic classes captured through a base pointer, since this kind of multithreaded GC approach offers the best potential in those cases to allocate quickly and actually get a little better locality of ref. Tree structures are still often best tackled by a fixed alloc or sequential alloc, but pimpls and polymorphic classes can really benefit from this kind of "defer the expensive allocation/deallocation to another thread" strategy on hardware that benefits from multiple threads. –  Jan 06 '16 at 09:43
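
The eden-space allocation the last few comments describe can be sketched in C++. The following is a minimal, hypothetical illustration (the class and member names are invented for this sketch, not taken from any real GC): allocation is just a pointer bump within a chunk, and nothing is freed per-object. A real generational collector would additionally evacuate live objects, possibly on a background thread, before retiring a chunk; that deferred cost is not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an eden-style "bump" allocator. Allocation is a
// pointer increment within the current chunk; objects are never freed
// individually. When a chunk fills up, a fresh one is started.
class BumpArena {
public:
    explicit BumpArena(std::size_t chunkSize) : chunkSize_(chunkSize) {
        addChunk();
    }

    // Assumes n <= chunkSize; alignment defaults to the strictest
    // fundamental alignment.
    void* allocate(std::size_t n,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(n <= chunkSize_);
        std::uintptr_t a = static_cast<std::uintptr_t>(align);
        std::uintptr_t p = (cursor_ + a - 1) & ~(a - 1);
        if (p + n > end_) {           // current chunk exhausted
            addChunk();
            p = (cursor_ + a - 1) & ~(a - 1);
        }
        cursor_ = p + n;              // the "bump": one add, no free list
        return reinterpret_cast<void*>(p);
    }

    std::size_t chunkCount() const { return chunks_.size(); }

private:
    void addChunk() {
        chunks_.emplace_back(chunkSize_);
        cursor_ = reinterpret_cast<std::uintptr_t>(chunks_.back().data());
        end_ = cursor_ + chunkSize_;
    }

    std::size_t chunkSize_;
    std::vector<std::vector<unsigned char>> chunks_;
    std::uintptr_t cursor_ = 0, end_ = 0;
};
```

The per-allocation cost here is a couple of arithmetic operations, versus a general-purpose malloc's free-list or size-class bookkeeping; the expensive reclamation work that the comments attribute to the GC's background thread is exactly what this sketch leaves out.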
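
btilly's "lists in one arena" trick from the comments above can also be sketched. This is a hypothetical minimal version (type and member names invented for illustration): each object's list is a contiguous slice of one big buffer, so traversing a list is a sequential scan rather than a per-node pointer chase, which is where the cache-friendliness comes from. Growing a list past its slice would mean rebuilding the whole arena, exactly the trade-off his comment describes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of an arena-of-lists layout. All per-object lists
// live back to back in one contiguous buffer, and each object keeps only
// a [begin, end) slice into it.
struct ObjectLists {
    std::vector<int> arena;  // every element of every list, in order
    std::vector<std::pair<std::size_t, std::size_t>> slices;

    // Build from per-object lists, laying them out back to back.
    explicit ObjectLists(const std::vector<std::vector<int>>& lists) {
        for (const auto& l : lists) {
            std::size_t begin = arena.size();
            arena.insert(arena.end(), l.begin(), l.end());
            slices.emplace_back(begin, arena.size());
        }
    }

    long long sumList(std::size_t obj) const {
        long long s = 0;
        for (std::size_t i = slices[obj].first; i < slices[obj].second; ++i)
            s += arena[i];  // contiguous reads: cache lines fully used
        return s;
    }
};
```

A per-node `std::list` or GC-managed linked structure scatters nodes across the heap; here every traversal touches adjacent memory, which is the kind of layout decision a tracing GC has no way to infer from the program.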

10 Answers

28

See Link and follow all of the links to see Rico Mariani vs Raymond Chen (both very competent programmers at Microsoft) duel it out. Raymond would improve the unmanaged version, and Rico would respond by optimizing the same thing in the managed one.

With essentially zero optimization effort, the managed versions started off many times faster than the manual one. Eventually the manual version beat the managed one, but only by optimizing to a level that most programmers would not want to go to. In all versions, the memory usage of the manual version was significantly better than the managed one's.

Glorfindel
btilly
  • +1 for citing an actual example with code :) although proper use of C++ constructs (such as swap) isn't that hard, and would probably get you there quite easily performance-wise... – user541686 Jul 04 '13 at 21:11
  • 5
    You may be able to outdo Raymond Chen on performance. I am confident that I can't unless he's out of it due to being sick, I'm working many times harder, and I got lucky. I don't know why he didn't choose the solution you would have chosen. I'm sure he had reasons for it. – btilly Jul 04 '13 at 21:56
  • I copied Raymond's code here, and to compare, I wrote my own version here. The ZIP file that contains the text file is here. On my computer, mine runs in 14 ms and Raymond's runs in 21 ms. Unless I did something wrong (which is possible), his 215-line code is 50% slower than my 48-line implementation, even without using memory-mapped files or custom memory pools (which he did use). Mine is half as long as the C# version. Did I do it wrong, or do you observe the same thing? – user541686 Jul 05 '13 at 08:39
  • 1
    @Mehrdad Pulling out an old copy of gcc on this laptop I can report that neither your code nor his will compile, let alone do anything with it. The fact that I'm not on Windows likely explains that. But let's assume that your numbers and code are correct. Do they perform the same on a decade old compiler and computer? (Look at when the blog was written.) Maybe, maybe not. Let's suppose that they are, that he (being a C programmer) did not know how to use C++ properly, etc. What are we left with? – btilly Jul 05 '13 at 14:29
  • 1
    We are left with a reasonable C++ program which can be translated into managed memory and sped up. But where the C++ version can be optimized and sped up farther. Which is what we all are in agreement is the general pattern that always happens when managed code is faster than unmanaged. However we still have a concrete example of reasonable code from a good programmer that was faster in a managed version. – btilly Jul 05 '13 at 14:31
  • This link is now dead. Does anyone have an updated reference? – Alex Reinking Jun 15 '20 at 04:13
  • @AlexReinking The wayback machine didn't archive it. I don't have a copy. Sorry. – btilly Jun 15 '20 at 18:50
5

The rule of thumb is that there are no free lunches.

GC takes away the headache of manual memory management and reduces the probability of making mistakes. There are some situations where some particular GC strategy is the optimal solution for the problem, in which case you'll pay no penalty for using it. But there are others where other solutions will be faster. Since you can always simulate higher abstractions from a lower level, but not the other way around, you can effectively prove that there is no way higher abstractions can be faster than lower ones in the general case.

GC is a special case of manual memory management

It may be a lot more work, or more error-prone, to get better performance manually, but that's a different story.

Guy Sirton
  • 1
    That makes no sense to me. To give you a couple of concrete examples: 1) the allocators and write barriers in production GCs are hand-written assembler because C is too inefficient so how will you beat that from C, and 2) tail call elimination is an example of an optimisation done in high-level (functional) languages that is not done by the C compiler and, therefore, cannot be done in C. Stack walking is another example of something done below the level of C by high-level languages. – J D Jan 27 '16 at 00:03
  • 3
    1) I'd have to see the specific code to comment, but if the hand-written allocators/barriers in assembler are faster, then use hand-written assembler. Not sure what that has to do with GC. 2) Take a look here: http://stackoverflow.com/a/9814654/441099 - the point is not whether some non-GC language can do tail recursion elimination for you. The point is that you can transform your code to be as fast or faster. Whether the compiler of some specific language can do this for you automatically is a matter of convenience. In a low enough abstraction you can always do this yourself if you wish. – Guy Sirton Jan 27 '16 at 00:34
  • 1
    That tail call example in C only works for the special case of a function calling itself. C cannot handle the general case of functions tail calling each other. Dropping to assembler and assuming infinite time for development is a Turing tarpit. – J D Jan 27 '16 at 00:54
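
The tail-call point in this exchange can be sketched concretely. A hedged illustration (function names invented for the example): a function whose recursive call is in tail position can always be rewritten by hand into a loop, so for that special case the elimination is a convenience rather than a capability gap; standard C and C++ guarantee no such rewrite for separate functions tail-calling each other, which is J D's counterpoint.

```cpp
#include <cassert>
#include <cstdint>

// Accumulator-style factorial: the recursive call is in tail position.
// A compiler MAY turn it into a jump, but C++ does not require it.
std::uint64_t factRec(std::uint64_t n, std::uint64_t acc = 1) {
    if (n <= 1) return acc;
    return factRec(n - 1, acc * n);  // tail call
}

// The same computation after the manual tail-call-to-loop transformation:
// guaranteed constant stack space, no compiler support needed.
std::uint64_t factLoop(std::uint64_t n) {
    std::uint64_t acc = 1;
    for (; n > 1; --n) acc *= n;
    return acc;
}
```

For two functions tail-calling each other, no equally mechanical by-hand rewrite exists in standard C/C++ short of fusing them into one loop with an explicit state variable, or accepting possible stack growth.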