27

I've read in many places (heck, I've even written so myself) that garbage collection could (theoretically) be faster than manual memory management.

However, showing is a lot harder to come by than telling.
I have never actually seen any piece of code that demonstrates this effect in action.

Does anyone have (or know where I can find) code that demonstrates this performance advantage?

user541686
  • 6
    the problem with GC is that most implementations are not deterministic so 2 runs can have vastly different results, not to mention it's hard to isolate the right variables to compare – ratchet freak Jul 04 '13 at 19:49
  • @ratchetfreak: If you know of any examples that are only faster (say) 70% of the time, that's fine with me too. There must be some way to compare the two, in terms of throughput at least (latency probably wouldn't work). – user541686 Jul 04 '13 at 19:56
  • Are you measuring machine performance or developer performance? Not having to think as hard about memory management could significantly reduce a developer's "days to delivered product" if the performance specs permit. – Dan Pichelman Jul 04 '13 at 20:12
  • @DanPichelman: Machine performance. Examples of the claims I've seen are here. – user541686 Jul 04 '13 at 20:13
  • 3
    Well, this is a bit tricky because you could always manually do whatever gives the GC an edge over what you did manually. Perhaps it's better to restrict this to "standard" manual memory management tools (malloc()/free(), owned pointers, shared pointers with refcount, weak pointers, no custom allocators)? Or, if you permit custom allocators (which may be more realistic or less realistic, depending on what kind of programmer you assume), put restrictions on the effort put into those allocators. Otherwise, the manual strategy "copy what the GC does in this case" is always at least as fast as GC. –  Jul 04 '13 at 20:41
  • @delnan: I don't see how a "copy what the GC does" strategy is possible, though. A GC looks through the stack, static data segments, etc. in known locations to find references to the object graph's roots, but that's impossible in a native language like C++ because there's no way to discover references like that, unless you make your own compiler (but then your code is restricted to that compiler). – user541686 Jul 04 '13 at 20:52
  • 1
    By "copy what the GC does" I didn't mean "build your own GC" (though note that this is theoretically possible in C++11 and beyond, which introduces optional support for a GC). I meant, as I've worded it earlier in the same comment, "do what gives the GC an edge over what you did manually". For example, if Cheney-like compaction helps this application a lot, you might manually implement a similar allocation + compaction scheme, with custom smart pointers to handle pointer fixup. Also, with techniques like a shadow stack, you can do root finding in C or C++, at the expense of extra work. –  Jul 04 '13 at 21:15
  • @delnan: Oh, I see what you mean now, that's a great point, thanks for bringing it up! – user541686 Jul 04 '13 at 21:22
  • It reminds me of the argument that JIT compiled languages can in theory be faster than native languages. Except they never are. You always give something up by moving to a higher level of abstractions, which is what GC is. It's the no free lunch principle. – Guy Sirton Jul 04 '13 at 21:28
  • @GuySirton s/native/AOT compiled/. Also, yes and no. In the JIT-vs-AOT case, it's the AOT compiler writers' skill vs the JIT compiler writers' skill. In this case, it's the GC writers' skill vs the skill of who manages the memory, which is rarely a highly qualified expert who has worked on making it fast for years. Not that this necessarily changes the outcome... –  Jul 04 '13 at 21:38
  • @GuySirton There are many real cases where programs in Java have outperformed programs in C++. http://keithlea.com/javabench/ contains some trivial examples. Of course in all such cases, the C++ program can be improved. But if you exclude cases based on that, we wind up in a "No True Scotsman" type fallacy where we are only allowed to compare perfectly optimized C++ programs with anything else. – btilly Jul 04 '13 at 22:16
  • @btilly: http://benchmarksgame.alioth.debian.org/ http://stackoverflow.com/questions/145110/c-performance-vs-java-c http://readwrite.com/2011/06/06/cpp-go-java-scala-performance-benchmark#awesm=~oaFvXDN4yyXkAt . Really my statement stands, you always give something up when you use higher abstractions. Trivially, GC can never be faster because whatever happens during GC that gives you performance can be mimicked using manual management but the opposite is not true. – Guy Sirton Jul 04 '13 at 23:30
  • @GuySirton: I can't reproduce the keithlea.com/javabench results. I just tried out the heapsort implementation, and even when comparing the output of my old C++ compiler (Visual C++ 13.10.4035) with the one from JRE 7, C++ beats Java quite noticeably. If you can reproduce any of them let me know which one and I'll try that one. – user541686 Jul 04 '13 at 23:39
  • @GuySirton: Then again, they don't even seem to be benchmarking GCs in the first place -- they seem to just be comparing C++ to Java, with preallocated storage... – user541686 Jul 04 '13 at 23:42
  • @Mehrdad: That was btilly's link. – Guy Sirton Jul 04 '13 at 23:43
  • Oops sorry my bad. @btilly should read my comment then. – user541686 Jul 04 '13 at 23:48
  • One can always make manual memory management as efficient as automatic garbage collection, probably with lower constant factors. The problem is the engineering cost: GC invisibly handles all the bookkeeping and special cases for you; if you manage memory by hand, you pretty much set your implementation in stone -- no algorithmic optimisation for you! In practice, the relative costs of automatic GC are small. – Rafe Jul 05 '13 at 00:00
  • @GuySirton Your link matches what I said. There are real programs whose implementation in a higher language can beat the implementation in a lower one. Yes, in theory you can win in the lower language. Doing so is not always easy. – btilly Jul 05 '13 at 02:10
  • @Mehrdad I have not tried to reproduce numbers. I've seen enough people claim specific examples that I believe the principle. Heck, the blog I pointed to demonstrates it with C++ vs C#. And I've personally done it with Perl vs C! (The C that I was replacing was definitely "not high quality".) – btilly Jul 05 '13 at 02:13
  • @Rafe I disbelieve your claim of great algorithmic benefits to managed memory. Starting from a base of using RAII and tricks like std::shared_ptr you have about as much freedom to make algorithmic changes to your program as you do with managed memory. You'll need to dance carefully around circular references. You'll need to do more work and be more careful. But you can do it. And it usually isn't that much harder. (Until you over-optimize. Then you're hosed.) – btilly Jul 05 '13 at 02:23
  • A final note. While some programs could be sped up by porting to a managed memory model, manual memory can do tricks that managed memory can't touch. For example I have a program with close to 1 million objects, each of which has a list of things associated with it. I put all of the lists, in order, in an arena. If any list gets too big I reallocate the whole arena. This gives me excellent utilization of CPU cache (3x speedup when I did it). A GC that traced my code enough to realize this was a good idea would be too slow because of the tracing! – btilly Jul 05 '13 at 02:34
  • @btilly -- My claim concerning algorithmic optimisation opportunities comes from the fact that, with managed code, you make NO commitments to how your memory is allocated. As soon as you DO make such commitments, which is unavoidable with manual memory management, you can't make changes to your algorithm -- or even non-trivial changes to your implementation -- without also changing your memory management strategy. That is a lot of error-prone work which may well perform worse! – Rafe Jul 05 '13 at 04:56
  • @btilly -- I do not disagree that manual memory management has its place. I would, however, argue that that place is a very small, specialised place. For the most part, in most circumstances, it just ain't worth the effort. – Rafe Jul 05 '13 at 04:57
  • @Rafe On algorithmic opportunities, my sense is that with unmanaged code the effort of changing the algorithm is greater than with managed... but by a multiple similar to the effort of choosing unmanaged in the first place. (Bug opportunities are greater as well, part of unmanaged life.) So unmanaged is not a fundamental algorithmic optimization barrier. If you disagree, that should be a different conversation, with concrete examples. – btilly Jul 05 '13 at 14:35
  • @Rafe And on the place of manual memory management, we agree. In the last 5 years I've spent perhaps 5% of my time working with unmanaged code for performance reasons, and I've spent over 80% of it inside of SQL, Perl and Python. Your mileage almost certainly varies. Among most of my co-workers that percentage of unmanaged is very much on the high side. But I know people who spend most of their time on unmanaged code, and I'm very glad that they do so. (Hard real time code driving a rocket does not want GC pauses. Really.) – btilly Jul 05 '13 at 14:41
  • @Rafe So know what you're doing, and why you're doing it. Only do the hard stuff if you know why you're doing it and can prove that it is necessary. But if you've proven it (which in my case has always included writing a failed prototype first), do not hesitate to do what you need to do. – btilly Jul 05 '13 at 14:44
  • @btilly -- Hi, I thought I made that clear when I wrote, "I do not disagree that manual memory management has its place." Whenever you make a commitment to a particular implementation choice (such as manual memory management), you increase (dramatically) the amount of effort it takes to change that decision. I can't see how this is controversial. – Rafe Jul 08 '13 at 00:45
  • 1
    @Ike: It's okay. See why I asked the question though? That was the entire point of my question -- people come up with all sorts of explanations that should make sense but everyone stumbles when you ask them to provide a demonstration that proves what they say is correct in practice. The entire point of this question was to once and for all show that this can actually happen in practice. – user541686 Jan 06 '16 at 04:24
  • @Mehrdad Very much and very guilty! I've been studying the Java garbage collector lately a whole lot, trying to kind of reverse engineer how it works and implement a similar scheme in C++ (but something we can opt into). It's why I got all excited and kind of lost track of the whole point of the question. –  Jan 06 '16 at 04:28
  • @Mehrdad If you don't mind a crude explanation of that article, even in C++ we can potentially allocate memory a whole lot faster if we just allocated it in a straight sequential fashion using pooled, contiguous memory chained together. Now all complex alloc techniques required to generalize go away as well as page faults per chunk, the only prob is that we can't free any variable-sized chunk individually... that is, unless we had some deferred process that could copy the memory elsewhere using a more expensive strategy and did that in a separate thread. That's how eden alloc works. –  Jan 06 '16 at 04:35
  • @Mehrdad It's actually more expensive if you consider total CPU time.. but it's cheaper in terms of not stalling the thread allocating the memory by allowing it to allocate using the cheapest allocation technique possible. The expensive work to allocate individual chunks to be freed is then deferred to another thread. So its basic speed comes from just deferring the expensive stuff for later in a background thread. C++ would pay those full costs upfront in the same thread allocating the memory. –  Jan 06 '16 at 04:38
  • @Ike : Deferring to a background thread is fine but then you also have to take into account how long (in real time, i.e. in seconds) the background thread is actively running on the CPU as well. Not doing so would be cheating since a manual scheme that doesn't use that CPU core would have been able to get higher performance by using that core. Also, since this is a practical question, I'm not looking for obscure edge cases such as custom C++ allocators that are specialized for one use case. We're just using the standard allocators in each implementation, not attempting to bypass them. – user541686 Jan 06 '16 at 04:47
  • @Mehrdad Yeah -- in terms of real time GC has a very strong potential edge when they use this kind of eden space strategy, as it's balancing the load across threads, allowing the thread allocating to use a much cheaper allocation strategy than malloc, e.g. Chen had to kind of "cheat" and reach for the pool to beat it -- so C and C++ can potentially beat GC, but not with "normal/average" code (though I feel weird about this part since in my field it's pretty much mandatory to use preallocated mem pools in many cases). I've been looking to do something similar to what the Java GC does in... –  Jan 06 '16 at 09:41
  • @Mehrdad C and C++ for things like classes with pimpls and polymorphic classes captured through a base pointer, since this kind of multithreaded GC approach offers the best potential in those cases to allocate quickly and actually get a little better locality of ref. Tree structures are still often best tackled by a fixed alloc or sequential alloc, but pimpls and polymorphic classes can really benefit from this kind of "defer the expensive allocation/deallocation to another thread" strategy on hardware that benefits from multiple threads. –  Jan 06 '16 at 09:43
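
The eden-space allocation the last few comments describe can be sketched in C++. The following is a minimal, hypothetical illustration (the class and member names are invented for this sketch, not taken from any real GC): allocation is just a pointer bump within a chunk, and nothing is freed per-object. A real generational collector would additionally evacuate live objects, possibly on a background thread, before retiring a chunk; that deferred cost is not modeled here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an eden-style "bump" allocator. Allocation is a
// pointer increment within the current chunk; objects are never freed
// individually. When a chunk fills up, a fresh one is started.
class BumpArena {
public:
    explicit BumpArena(std::size_t chunkSize) : chunkSize_(chunkSize) {
        addChunk();
    }

    // Assumes n <= chunkSize; alignment defaults to the strictest
    // fundamental alignment.
    void* allocate(std::size_t n,
                   std::size_t align = alignof(std::max_align_t)) {
        assert(n <= chunkSize_);
        std::uintptr_t a = static_cast<std::uintptr_t>(align);
        std::uintptr_t p = (cursor_ + a - 1) & ~(a - 1);
        if (p + n > end_) {           // current chunk exhausted
            addChunk();
            p = (cursor_ + a - 1) & ~(a - 1);
        }
        cursor_ = p + n;              // the "bump": one add, no free list
        return reinterpret_cast<void*>(p);
    }

    std::size_t chunkCount() const { return chunks_.size(); }

private:
    void addChunk() {
        chunks_.emplace_back(chunkSize_);
        cursor_ = reinterpret_cast<std::uintptr_t>(chunks_.back().data());
        end_ = cursor_ + chunkSize_;
    }

    std::size_t chunkSize_;
    std::vector<std::vector<unsigned char>> chunks_;
    std::uintptr_t cursor_ = 0, end_ = 0;
};
```

The per-allocation cost here is a couple of arithmetic operations, versus a general-purpose malloc's free-list or size-class bookkeeping; the expensive reclamation work that the comments attribute to the GC's background thread is exactly what this sketch leaves out.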
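
btilly's "lists in one arena" trick from the comments above can also be sketched. This is a hypothetical minimal version (type and member names invented for illustration): each object's list is a contiguous slice of one big buffer, so traversing a list is a sequential scan rather than a per-node pointer chase, which is where the cache-friendliness comes from. Growing a list past its slice would mean rebuilding the whole arena, exactly the trade-off his comment describes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of an arena-of-lists layout. All per-object lists
// live back to back in one contiguous buffer, and each object keeps only
// a [begin, end) slice into it.
struct ObjectLists {
    std::vector<int> arena;  // every element of every list, in order
    std::vector<std::pair<std::size_t, std::size_t>> slices;

    // Build from per-object lists, laying them out back to back.
    explicit ObjectLists(const std::vector<std::vector<int>>& lists) {
        for (const auto& l : lists) {
            std::size_t begin = arena.size();
            arena.insert(arena.end(), l.begin(), l.end());
            slices.emplace_back(begin, arena.size());
        }
    }

    long long sumList(std::size_t obj) const {
        long long s = 0;
        for (std::size_t i = slices[obj].first; i < slices[obj].second; ++i)
            s += arena[i];  // contiguous reads: cache lines fully used
        return s;
    }
};
```

A per-node `std::list` or GC-managed linked structure scatters nodes across the heap; here every traversal touches adjacent memory, which is the kind of layout decision a tracing GC has no way to infer from the program.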

10 Answers

28

See Link and follow all of the links to see Rico Mariani vs Raymond Chen (both very competent programmers at Microsoft) duel it out. Raymond would improve the unmanaged version, and Rico would respond by optimizing the same thing in the managed one.

With essentially zero optimization effort, the managed versions started off many times faster than the manual one. Eventually the manual version beat the managed one, but only by optimizing to a level that most programmers would not want to go to. In all versions, the memory usage of the manual version was significantly better than the managed one's.

Glorfindel
btilly
  • +1 for citing an actual example with code :) although proper use of C++ constructs (such as swap) isn't that hard, and would probably get you there quite easily performance-wise... – user541686 Jul 04 '13 at 21:11
  • 5
    You may be able to outdo Raymond Chen on performance. I am confident that I can't unless he's out of it due to being sick, I'm working many times harder, and I got lucky. I don't know why he didn't choose the solution you would have chosen. I'm sure he had reasons for it. – btilly Jul 04 '13 at 21:56
  • I copied Raymond's code here, and to compare, I wrote my own version here. The ZIP file that contains the text file is here. On my computer, mine runs in 14 ms and Raymond's runs in 21 ms. Unless I did something wrong (which is possible), his 215-line code is 50% slower than my 48-line implementation, even without using memory-mapped files or custom memory pools (which he did use). Mine is half as long as the C# version. Did I do it wrong, or do you observe the same thing? – user541686 Jul 05 '13 at 08:39
  • 1
    @Mehrdad Pulling out an old copy of gcc on this laptop I can report that neither your code nor his will compile, let alone do anything with it. The fact that I'm not on Windows likely explains that. But let's assume that your numbers and code are correct. Do they perform the same on a decade old compiler and computer? (Look at when the blog was written.) Maybe, maybe not. Let's suppose that they are, that he (being a C programmer) did not know how to use C++ properly, etc. What are we left with? – btilly Jul 05 '13 at 14:29
  • 1
    We are left with a reasonable C++ program which can be translated into managed memory and sped up. But where the C++ version can be optimized and sped up farther. Which is what we all are in agreement is the general pattern that always happens when managed code is faster than unmanaged. However we still have a concrete example of reasonable code from a good programmer that was faster in a managed version. – btilly Jul 05 '13 at 14:31
  • This link is now dead. Does anyone have an updated reference? – Alex Reinking Jun 15 '20 at 04:13
  • @AlexReinking The wayback machine didn't archive it. I don't have a copy. Sorry. – btilly Jun 15 '20 at 18:50
5

The rule of thumb is that there are no free lunches.

GC takes away the headache of manual memory management and reduces the probability of making mistakes. There are some situations where some particular GC strategy is the optimal solution for the problem, in which case you'll pay no penalty for using it. But there are others where other solutions will be faster. Since you can always simulate higher abstractions from a lower level, but not the other way around, you can effectively prove that there is no way higher abstractions can be faster than lower ones in the general case.

GC is a special case of manual memory management

It may be a lot more work, or more error-prone, to get better performance manually, but that's a different story.

Guy Sirton
  • 1
    That makes no sense to me. To give you a couple of concrete examples: 1) the allocators and write barriers in production GCs are hand-written assembler because C is too inefficient so how will you beat that from C, and 2) tail call elimination is an example of an optimisation done in high-level (functional) languages that is not done by the C compiler and, therefore, cannot be done in C. Stack walking is another example of something done below the level of C by high-level languages. – J D Jan 27 '16 at 00:03
  • 3
    1) I'd have to see the specific code to comment, but if the hand-written allocators/barriers in assembler are faster, then use hand-written assembler. Not sure what that has to do with GC. 2) Take a look here: http://stackoverflow.com/a/9814654/441099 - the point is not whether some non-GC language can do tail recursion elimination for you. The point is that you can transform your code to be as fast or faster. Whether the compiler of some specific language can do this for you automatically is a matter of convenience. In a low enough abstraction you can always do this yourself if you wish. – Guy Sirton Jan 27 '16 at 00:34
  • 1
    That tail call example in C only works for the special case of a function calling itself. C cannot handle the general case of functions tail calling each other. Dropping to assembler and assuming infinite time for development is a Turing tarpit. – J D Jan 27 '16 at 00:54
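
The tail-call point in this exchange can be sketched concretely. A hedged illustration (function names invented for the example): a function whose recursive call is in tail position can always be rewritten by hand into a loop, so for that special case the elimination is a convenience rather than a capability gap; standard C and C++ guarantee no such rewrite for separate functions tail-calling each other, which is J D's counterpoint.

```cpp
#include <cassert>
#include <cstdint>

// Accumulator-style factorial: the recursive call is in tail position.
// A compiler MAY turn it into a jump, but C++ does not require it.
std::uint64_t factRec(std::uint64_t n, std::uint64_t acc = 1) {
    if (n <= 1) return acc;
    return factRec(n - 1, acc * n);  // tail call
}

// The same computation after the manual tail-call-to-loop transformation:
// guaranteed constant stack space, no compiler support needed.
std::uint64_t factLoop(std::uint64_t n) {
    std::uint64_t acc = 1;
    for (; n > 1; --n) acc *= n;
    return acc;
}
```

For two functions tail-calling each other, no equally mechanical by-hand rewrite exists in standard C/C++ short of fusing them into one loop with an explicit state variable, or accepting possible stack growth.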