When does bugfixing become overkill, if ever?

Question

Imagine you are creating a video player in JavaScript. This video player loops the user's video repeatedly. Each time a new loop begins the player runs a recursive function that calls itself N times, N being the number of times the video has looped, and because of that the browser will trigger a too much recursion RangeError at some time.

Probably no one will use the loop feature that much. Your application will never throw this error, not even if the user left the application looping for a week, but it still exists. Solving the problem will require you to redesign the way looping works in your application, which will take a considerable amount of time. What do you do? Why?

Fix the bug
Leave the bug

Shouldn't you only fix bugs people will stumble in? When does bugfixing become overkill, if it ever does?

Btw, since video playing is an asynchronous task you will not get a stack overflow from a recursive approach at all. So for this particular example: no, the problem doesn't even exist :-D — Bergi, Oct 17 '16 at 13:47
You are starting on an wrong assumption that this is even a bug. — PlasmaHH, Oct 17 '16 at 14:49
@PlasmaHH I'm using this hypothetical scenario to explain my question. If the bug exists or not it doesn't matter at all — Tiago Marinho, Oct 17 '16 at 14:53
@TiagoMarinho: the point I am trying to make is: sometimes its just the right thing to do to define such a scenario as the intended behaviour. — PlasmaHH, Oct 17 '16 at 14:54
You start with assumptions, make very sure those assumptions are right, because hey your video player I use for my customers will give me headaches because I'm running a station in kiosk mode for weeks on end, defeating your use case. At the very least document the behaviour. — Pieter B, Oct 17 '16 at 16:44
Why on Earth would you run such a loop using recursion in the first place? You might not want to fix the bug, but you sure ought to reconsider your design process :-) — jamesqf, Oct 17 '16 at 17:09
Most commercial software development won't ever fix a bug in a piece of released software that does not affect a customer (i.e. is not reported by a customer but found in-house). There's simply no budget for such fixes. — tofro, Oct 17 '16 at 17:27
This seems more like a business question. You have to prioritize based on the cost-to-fix, and the impact/frequency of the bug. — Casey Kuball, Oct 17 '16 at 17:28
As a developer, you document the issue as a bug report and let management decide when to fix it. — Simon Richter, Oct 19 '16 at 10:53
I'd hate to be the developer that has to troubleshoot this problem when a customer has a glitch and only .03 seconds of video comes through so you're looping many times a second causing a very strange early failure. The cost of finding and fixing at this point is staggering, would have been comparatively free to just fix the bug in the first place--just a few man-days of dev work vs man-weeks of dealing with customers, debugging, analyzing, etc... and you would have ended up with better code! — Bill K, Oct 19 '16 at 21:47
can i just point out that 2^53 is quite a large number. Even with a 1 second video it would take 104,249,991,374 days of 24/7 video playing for the bug to appear. It was already pointed out that the async nature means the bug doesn't really exist, but even if the bug did exist, the user's screen would stop working long before anyone saw this bug actually happen. Actually, after 285,616,414 years it's not clear human civilisation will still exist... — David Meister, Oct 20 '16 at 13:46
@DavidMeister that's exactly the point: Is it overkill to fix a bug that is so unlikely to be triggered? Others have taken the example literally and pointed out that such a bug could be triggered more often if the user's video lasts like 0.01 seconds, missing the whole point of the question. But you get the idea. — Tiago Marinho, Oct 20 '16 at 14:02
And yes, I may have exaggerated the example case a bit, since 2^53 loops is unreachable. Nothing is going to last that long anyway so the program will crash for something else first. — Tiago Marinho, Oct 20 '16 at 14:06
Just gonna throw this out here: not all bugs are necessarily difficult to fix. For instance, there is a thing called tail recursion which makes recursion compile into loops. I know this is just an example, but it is an important thing to consider: sometimes fixing the most obscure and unreachable bug leads to the development of something fare more useful in other areas. Would I go and rewrite the entire video player? Absolutely not. However, if hypothetically tail recursion optimization didn't exist yet, then solving this bug might lead to creating tail recursion optimization. — user64742, Oct 21 '16 at 02:22
tl;dr don't bugfix excessively if it costs you money or time, but if you have the time and you just want to fix a difficult bug, then you might as well try. You might find out something awesome in the process that you can share with other people in similar situations. Obviously I'm not referring to proprietary things. I simply mean if you find out something useful for programming in general by solving a really nasty bug, then it might be something you may or may choose to share. But hey, now your code lacks that nasty bug everyone else's video player has! — user64742, Oct 21 '16 at 02:25
@TiagoMarinho no, I don't get the idea. This example is not a bug unless you're willing to classify all software as a bug. Any software that you run for 2^53 * 0.1 seconds would destroy any hardware that it runs on through simple wear and tear. We're talking 20 million years of constant usage after all. The problem is that if we're willing to concede that the example is in fact "a bug" then there is nothing that is not a bug. It's like saying that UUIDs are "a bug" because there is a chance of collision... — David Meister, Oct 21 '16 at 14:53
@TiagoMarinho the problem here is that any sensible definition of a "bug" or "defect" comes with an idea of both risk and impact. If either the risk or impact are literally zero within the lifetimes of literally everyone who will ever come in contact with the software then it isn't a defect at all. Now, we can say that there are things that are bugs with a almost-but-not-quite-zero risk/impact, and that is fair, but this example is nowhere near that scale. The thing is, once we are talking about something with measurable risk/impact, the answer to the question becomes self evident... — David Meister, Oct 21 '16 at 15:01
@TiagoMarinho perhaps a difficulty with the question is the assumption that any code that could be theoretically reached and throws an error or otherwise halts/impairs the system must be "a bug" but this is not the case. Consider the example of creating a random number generator without modulo bias. To fix this very real and potentially quite serious bug (a predictably biased prng) you must introduce a non-zero chance that your code will hang indefinitely - http://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator — David Meister, Oct 21 '16 at 15:12
@TiagoMarinho but the recursive approach is also not a bug in javascript due to the async nature of the operation... — David Meister, Oct 22 '16 at 00:23

score 165 · Accepted Answer · answered Oct 17 '16 at 05:36

165

You have to be pragmatic.

If the error is unlikely to be triggered in the real world and the cost to fix is high, I doubt many people would consider it a good use of resources to fix. On that basis I'd say leave it but ensure the hack is documented for you or your successor in a few months (see last paragraph).

That said, you should use this issue as a "learning experience" and the next time you do looping do not use a recursive loop unnecessarily.

Also, be prepared for that bug report. You'd be amazed how good end users are at pushing against the boundaries and uncovering defects. If it does become an issue for end users, you're going to have to fix it - then you'll be glad you documented the hack.

answered Oct 17 '16 at 05:36

mcottle

6,142
2
25
27

1

Note that the video-player and it's looping feature are completely hypothetical, hahah. But yeah, I agree. – Tiago Marinho Oct 17 '16 at 05:49
122

Totally agree with "You'd be amazed how good end users are at pushing against the boundaries and uncovering defects." – Spotted Oct 17 '16 at 07:05
77

End users are in no way restricted by what you think is a reasonable use of your software. There will be users who want to loop a video forever. It's a feature that your software provides, so they will use it. – gnasher729 Oct 17 '16 at 09:01
37

@gnasher729 "10-hour XXXX" videos on Youtube is a good identifier that indeed, some people just want to loop something forever. – Chris Cirefice Oct 17 '16 at 13:16
24

Another problem: If your software is popular, then someone encounters a bug that indeed happens in a rare situation only, posts it on the internet, and suddenly everyone and their dog says "this software is rubbish, it crashes if I loop a video for a day". Or a competitor uses it to demonstrate how easy it is to crash your application. – gnasher729 Oct 17 '16 at 13:19
2

Could have ended the answer after the first sentence. – Ant P Oct 17 '16 at 15:40
1

This is the answer. Stick it in the bug tracker and fix it when you can. – Oct 17 '16 at 20:59
4

Emphasis on the last paragraph. Did you know that MacOS Classic would crash if it received 32,768 consecutive "mouse press" events without an intervening "mouse release" event? – Mark Oct 17 '16 at 21:20
1

@Mark That is absolute proof of "how good end users are at pushing against the boundaries and uncovering defects" I know I'm amazed... – Jerry Jeremiah Oct 18 '16 at 01:20
3

Such code could easily by installed on some embedded machine in a mall or maybe consumer-grade fridge displays and cause freezes (pun not intended) after a week of non-stop looping. – user1306322 Oct 18 '16 at 13:30
5

I agree this is the best answer but pjc50's answer below should be considered as well. Ask yourself: What is the worst thing that could happen? Suppose your player was used to display some kind of safety warning at the entrance to a facility. The player crashes and the warning ceases, then someone is hurt or even killed because they never got the warning. Be sure to include a very clear disclaimer that the software should not be used for mission critical applications. Also be sure to HANDLE the error condition and not just allow it to crash. – O.M.Y. Oct 18 '16 at 14:00
2

Realize that this actually is NOT pragmatic to ignore in this case. True, edge cases are generally things you document and ignore largely, but this is NOT an edge case in the truest sense. Is it rare? Maybe, but if the application is a video looper, it is a huge problem if there is a bug in the video looper! There is literally a point in which the application crashes during the desired use. Edge cases are cases where the user is using it for an unintended purpose, not using it longer than intended. If this was real, it should be something fixed immediately, because it is the focus of the app. – EvSunWoodard Oct 19 '16 at 19:08

score 80 · Answer 2 · answered Oct 17 '16 at 15:40

80

There was a similar bug in Windows 95 that caused computers to crash after 49.7 days. It was only noticed some years after release, since very few Win95 systems stayed up that long anyway. So there's one point: bugs may be rendered irrelevant by other, more important bugs.

What you have to do is a risk assessment for the program as a whole and an impact assessment for individual bugs.

Is this software on a security boundary?
If so, can this bug result in an exploit?
Is this software "mission critical" to its intended users? (See the list of things the Java EULA bans you from using it for)
Can the bug result in data loss? Financial loss? Reputational loss?
How likely is this bug to occur? (You've included this in your scenario)

And so on. This affects bug triage, the process of deciding which bugs to fix. Pretty much all shipping software has very long lists of minor bugs which have not yet been deemed important enough to fix.

answered Oct 17 '16 at 15:40

pjc50

13,377
1
31
35

2

I also recall the (hardware) bug in some Intel CPUs where a specific floating point value went all wrong. – Oct 17 '16 at 16:11
5

@WilliamKappler https://en.wikipedia.org/wiki/Pentium_FDIV_bug is what I believe you are referring to. Was up for a year before anybody noticed it. – Jeutnarg Oct 17 '16 at 20:20
@Jeutnarg I thought it was more recent than that, but I don't remember the details. I could be conflating a few different bugs. – Oct 17 '16 at 20:38
@william Maybe you're thinking of the TSX transactional memory bug which resulted in the instructions being disabled in a microcode update. – Jeffrey Bosboom Oct 17 '16 at 21:51
1

That 49.7 day bug cost Microsoft dearly in reputation. – gnasher729 Oct 17 '16 at 22:51
10

@gnasher729 - Not really, they were already at the bottom and still digging :) Most people had to re-install Win 95 more frequently than 49.7 days IIRC. – mcottle Oct 18 '16 at 03:15
1

@mcottle Silly. One, this has nothing to do with reinstalls. Two, of course a home system wasn't designed to run 24/7. Windows 95+ had to make a lot of cuts to make things barely work on the typical home computer. Did your Commodore 64 run for 60 days? Or your AtariST? Or your Amstrad CPC? Or your Amiga? Do you think Unixes of the time survived that long? Apple IIe could, but that was a business machine as well - it's not like Windows NT had the problem. If you hosted a server on Windows 95, you were an idiot (or you were doing it for fun :)). Windows 95 won the home computer market. – Luaan Oct 18 '16 at 08:09
4

@Luaan The comment was intended as a lighthearted dig at M$, hence the smiley after the first sentence. They were behind the eightball with '95 because it came out very late in 95 (probably because having Win95 released in 1996 would have been a bad look), half baked (Remember the USB BSOD?) and inclined to become unstable and require regular reinstalls hence my second sentence - which never mentioned running a server on Windows 95, I don't know where you got THAT from (flashbacks?). The second release CD improved matters but the initial release of '95 was a doozy. – mcottle Oct 18 '16 at 08:26
5

TBH I think it was the "Windows for Warships" fiasco that did more reputational damage ( http://archive.wired.com/science/discoveries/news/1998/07/13987 ) and that was using NT. Unix machines of that time could manage multi-year uptimes, even using (very early) versions of Linux. All the home computers were also capable of high uptime, although rarely used that way. I saw BBC micros embedded in educational exhibits a decade after they were obsolete. – pjc50 Oct 18 '16 at 08:59
@Luaan: yes, Unix/Linux desktop and server systems of the time did routinely manage uptimes many times longer than that. You're right about other systems like the Atari ST, though. IDK if I ever left mine on for that long, but given the lack of memory protection in the ST / STe, I think I usually ended up rebooting more often than that (since I played lots of games and used multiple TSR programs that could interact in weird ways and leave the system unstable). – Peter Cordes Oct 18 '16 at 20:46
https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran is another classic example of rounding errors building up when something was left running longer than expected. – armb Oct 19 '16 at 15:51
1

@PeterCordes Although I loved my Atari 1040 STF I was never going to mistake it for MilSpec. It once locked up running Cubase in the time it took me to eat a meal. The drum loop that couldn't be killed - without a hard reset - which got me running back to Steinberg Pro 24. Never had the guts to gig it. IIRC I had to boot it up using a 3.5" floppy. Can't remember if that was the O/S or just my config. Awesome value for the era and the paper white monitor was great. A good option for impoverished students that couldn't afford a Mac. – mcottle Oct 20 '16 at 05:29
@mcottle: IIRC, the whole OS was in ROM, but everyone had their favourite terminate-and-stay-resident 3rd-party software on a boot disk. My dad had a Mega4 STe (4MB RAM vs. 1MB in the ST 1040, and 16MHz CPU vs. the 8MHz in the ST), with an 80MB hard drive, which was pretty awesome. When I was first learning C, before I got a PC to Linux, I was using gcc on the Atari. I taught myself assembly language mostly because the compile times were so high (while I was tweaking a Mandelbrot program I typed in from a book, to make it run faster), before realizing you could beat the compiler :P. – Peter Cordes Oct 20 '16 at 06:52

score 33 · Answer 3 · answered Oct 17 '16 at 15:32

The other answers are already very good, and I know your example is just an example, but I want to point out a big part of this process that hasn't been discussed yet:

You need to identify your assumptions, and then test those assumptions against corner cases.

Looking at your example, I see a couple assumptions:

The recursive approach will eventually cause an error.
Nobody will see this error because videos take too long to play to reach the stack limit.

Other people have discussed the first assumption, but look at the second assumption: what if my video is only a fraction of a second long?

And sure, maybe that's not a very common use case. But are you really sure that nobody will upload a very short video? You're assuming that videos are a minimum duration, and you probably didn't even realize you were assuming anything! Could this assumption cause any other bugs in other places in your application?

Unidentified assumptions are a huge source of bugs.

Like I said, I know that your example is just an example, but this process of identifying your assumptions (which is often harder than it sounds) and then thinking of exceptions to those assumptions is a huge factor in deciding where to spend your time.

So if you find yourself thinking "I shouldn't have to program around this, since it will never happen" then you should take some time to really examine that assumption. You'll often think of corner cases that might be more common than you originally thought.

That being said, there is a point where this becomes an exercise in futility. You probably don't care if your JavaScript application works perfectly on a TI-89 calculator, so spending any amount of time on that is just wasted.

The other answers have already covered this, but coming up with that line between "this is important" and "this is a waste of time" is not an exact science, and it depends on a lot of factors that can be completely different from one person or company to another.

But a huge part of that process is first identifying your assumptions and then trying to recognize exceptions to those assumptions.

Very good point Kevin. Note my comment on the selected answer above that focuses on the analysis question What's the worst thing that could happen? — O.M.Y., Oct 18 '16 at 14:08
Another assumption here is that an ever-growing stack will only lead to problems when it reaches an overflow size. In fact, the stack can be a normal resource this bug is constantly leaking. The whole browser could become slower and slower by tiny bits on each iterat^H^H^H^H^H^Hrecursion. — Alfe, Oct 19 '16 at 12:31

Vladimir Stokic · Answer 4 · 2016-10-17T12:22:14.460

I would recommend that you read the following paper:

Dependability and Its Threats: A Taxonomy

Among other things, it describes various types of faults that can occur in your program. What you described is called a dormant fault, and in this paper it is described like this:

A fault is active when it produces an error, otherwise it is dormant. An active fault is either a) an internal fault that was previously dormant and that has been activated by the computation process or environmental conditions, or b) an external fault. Fault activation is the application of an input (the activation pattern) to a component that causes a dormant fault to become active. Most internal faults cycle between their dormant and active states

Having described this, it all boils down to a cost-benefit ratio. The cost would consist of three parameters:

How often would the issue present itself?
What would the consequences be?
How much it bothers you personally?

The first two are crucial. If it is some bug that would manifest itself once in a blue moon and/or nobody cares for it, or have a perfectly good and practical workaround, then you can safely document it as a known issue and move on to some more challenging and more important tasks. However, if the bug would cause some money transaction to fail, or interrupt a long registration process, thus frustrating the end user, then you have to act upon it. The third parameter is something I strongly advise against. In the words of Vito Corleone:

It's not personal. It's business.

If you are a professional, leave the emotions aside and act optimally. However, if the application you are writing is a hobby of yours, then you are emotionally involved, and the third parameter is as valid as any in terms of deciding whether to fix a bug or not.

'It's not personal. It's business' is by Michael I guess, not Vito. (You'd be amazed how good end users are at pushing against the boundaries and uncovering defects) — 384X21, Oct 19 '16 at 13:48
Actually, it is by Vito, if you read the book. Even in the movie, it is Tom Hagen that says that first when arguing with Sonny about whether they should go to the mattresses, and only after that does Michael first says the famous quote: "It's not personal, Sonny. It's strictly business...". But Hagen learned that from Vito. — Vladimir Stokic, Oct 19 '16 at 17:59

score 11 · Answer 5 · answered Oct 17 '16 at 14:34

That bug only stays undiscovered until the day someone puts your player on a lobby screen running a company presentation 24/7. So it's still a bug.

The answer to What do you do? is really a business decision, not an engineering one:

If the bug only impacts 1% of your users, and your player lacks support for a feature required by another 20%, the choice is obvious. Document the bug, then carry on.
If the bugfix is on your todo list, it's often better to fix it before you start adding new features. You'll get the benefits of zero-defect software development process, and you won't lose much time since it's on your list anyway.

score 5 · Answer 6 · edited Oct 17 '16 at 19:41

5

There are actually three errors in the situation you describe:

The lack of a process to evaluate all logged errors (you did log the error in your ticket/backlog/whatever system you have in place, right?) to determine whether it should be fixed or not. This is a management decision.
The lack of skills in your team that leads to the use of faulty solutions like this. This is urgent to have this addressed to avoid future problems. (Start learning from your mistakes.)
The problem that the video may stop displaying after a very long time.

Of the three errors only (3) might not need to be fixed.

edited Oct 17 '16 at 19:41

Scott Weldon

119

answered Oct 17 '16 at 09:56

Bent

2,576
1
14
18

Thanks for pointing out the 2nd-order problems. Too many people only treat the symptom, and the cause keeps right on creating more symptoms. – jaxter Oct 21 '16 at 14:50

score 5 · Answer 7 · answered Oct 17 '16 at 11:58

5

Expecially in big companies (or big projects) there's a very pragmatic way to establish what to do.

If the cost of the fixing is greater than the return that the fix will bring then keep the bug. Viceversa if the fix will return more than its cost then fix the bug.

In your sample scenario it depends on how much users you expect to lose vs how much user you will gain if you develop new features instead of correcting that expensive bug.

answered Oct 17 '16 at 11:58

JoulinRouge

678

6

The ROI for fixing a bug is seldom easy to evaluate - you generally have to rely on your judgment. – Ant P Oct 17 '16 at 15:42
The return that the fix will bring is mostly reputation which is almost impossible to quantify. If I am the only one that even knows that there might be a bug and then in a year or two I switch jobs and the new company is thinking of embedding a video player in their product (possibly selling millions of units) would I recommend using this one? – Jerry Jeremiah Oct 18 '16 at 01:26
@JerryJeremiah if the bug prevents a business process from running it's not about reputation, it depends on the business process' importance. And in every case and every policy you apply to correct bugs or not you have to make a subjective evaluation based on your experience and knowledge. Even if you can know the exact number of user who will face the bug you still have to make a human choice (also ROI policy can also include bug hits stats to extimate costs). As today there's no mechanical way to know a priori the perfect thing to do. – JoulinRouge Oct 18 '16 at 08:22

score 5 · Answer 8 · answered Oct 17 '16 at 16:26

tl;dr This is why RESOLVED/WONTFIX is a thing. Just don't overuse it - technical debt can pile up if you're not careful. Is this a fundamental problem with your design, likely to cause other problems in the future? Then fix it. Otherwise? Leave it be until it becomes a priority (if it ever does).

score 4 · Answer 9 · answered Oct 17 '16 at 20:14

There are lots of answers here discussing evaluating the cost of the bug being fixed as opposed to leaving it. They all contain good advice, but I'd like to add that the cost of a bug is often underestimated, possibly hugely underestimated. The reason is that existing bugs muddles the waters for continued development and maintenance. Making your testers keep track of several "won't fix" bugs while navigating your software trying to find new bugs make their work slower and more prone to error. A few "won't fix" bugs that are unlikely to affect end users will still make continued development slower and the result will be buggier.

score 2 · Answer 10 · answered Oct 18 '16 at 07:39

One thing I've learned in my years of coding is that a bug will come back. The end user will always discover it and report back. Whether you will fix the bug or not is "merely" a priority and deadline matter.

We've had major bugs (in my opinion major) that were decided against fixing in one release, only to become a show stopper for the next release because the end user stumbled upon it over and over again. The same vice-versa - we were pushed to fix a bug in a feature that nobody uses, but it was handy for management to see.

score 2 · Answer 11 · edited Oct 20 '16 at 10:58

There are three things here:

Principles

This is one side of the coin. To some extent, I feel it is good to insist on fixing bugs (or bad implementations, even if they "work"), even if nobody is noticing it.

Look at it this way: the real problem is not necessarily the bug, in your example, but the fact that a programmer thought it was a good idea to implement the loop in this fashion, in the first place. It was obvious from the first moment, that this was not a good solution. There are now two possibilities:

The programmer just did not notice. Well... a programmer should develop an intuition of how his code runs. It is not like recursion is a very difficult concept. By fixing the bug (and sweating through all the additional work), he maybe learns something and remembers it, if only to avoid the additional work in the future. If the reason was that he just not had enough time, management might learn that programmers do need more time to create higher quality code.
The programmer did notice, but deemed it "not a problem". If this is left to stand, then a culture of laissez-faire is developed that will, ultimately, lead to bugs where it really hurts. In this particular case, who cares. But what if that programmer is developing a banking application next time, and decides that a certain constellation will never happen. Then it does. Bad times.

Pragmatism

This is the other side. Of course you would likely, in this particular case, not fix the bug. But watch out - there is pragmatism, and then there is pragmatism. Good pragmatism is if you find a quick but yet solid, well founded solution for a problem. I.e., you avoid overdesigning stuff, but the things you actually implement are still well-thought-out. Bad pragmatism is when you just hack something together which works "just so" and will break at the first opportunity.

Fail fast, fail hard

If in doubt, fail fast and fail hard.

This means, amongst others, that your code notices the error condition, not the environment.

In this example, the least you can do is to make it so the hard runtime error ("stack depth exceeded" or something like that) does not occur, by replacing it by a hard exception of your own. You could, for example, have a global counter and arbitrarily decide that you bail out after 1000 videos (or whatever number is high enough never to occur in normal use, and low enough to still work in most browsers). Then give that exception (which can be a generic exception, e.g. a RuntimeException in Java, or a simple string in JavaScript or Ruby) a meaningful message. You do not have to go to the extent to create a new type of exceptions or whatever you do in your particular programming language.

This way, you have

...documented the problem inside the code.
...made it a deterministic problem. You know that your exception will happen. You are not at the whim of changes in the underlying browser technology (think about not only PC browser, but also smartphones, tablets or future tech).
...made it easy to fix it when you eventually do need to fix it. The source of the problem is pointed out by your message, you will get a meaningful backtrack and all that.
...still wasted no time doing "real" error handling (remember, you never expect the error to occur).

My convention is to prefix such error messages with the word "Paranoia:". This is a clear sign to me and everybody else that I never expect that error to pop off. I can clearly separate them from "real" exceptions. If I see one like that in a GUI or a logfile, I know for sure that I have an earnest problem - I never expected them to occur, after all. At this point I go into crunch mode (with a good chance to solve it quickly and rather easily, as I know exactly where the problem occurred, saving me from a lot of spurious debugging).

I'm sorry if you feel this way about how soon I accepted an answer. In my defense I just didn't knew the question would have >10,000 views and that many answers at the time of the acceptance. Anyway I still haven't changed my mind on the best answer. — Tiago Marinho, Oct 19 '16 at 16:47
@TiagoMarinho, no problem, the comment was not primarily targetted at you personally, and I did not expect you to reconsider. ;) I'm more stumped by the motivations of whoever voted to actually delete my answer... Also, there is quite some downvoting for several answers here without any comments. Not sure if that is the way it's done in this particular area of SE. — AnoE, Oct 20 '16 at 07:02
I wonder if, in this case at least, the treatment is better than the cure. If you're deciding whether to do special handling for a design flaw that you've already identified, it makes sense to compare the full lifecycle cost of a) implementing the error-handling and potentially acting on the error when it occurs by fixing the design, or b) just fixing the design in the first place. — jaxter, Oct 21 '16 at 14:55
@jaxter, exactly. Hence my approach of opening the mind to a bugfix (even if it seems overkill), but when you decide not to fix the bug, then at least implement the fail-fast thing. Obviously, if the fail-fast solution is more expensive than the "real" bugfix in the first place, then avoid it and do the real bugfix. — AnoE, Oct 21 '16 at 14:59

score 1 · Answer 12 · answered Oct 17 '16 at 22:46

A post-it on a senior developer's desk at my workplace says

Does it help anyone?

I think that's often a good starting point for the thought process. There are always lots of things to fix and improve - but how much value are you actually adding? ...whether that's in usability, reliability, maintainability, readability, performance... or any other aspect.

score 0 · Answer 13 · answered Oct 19 '16 at 13:04

Three things come to mind:

First, the impact of an identified bug needs to be thoroughly investigated before the decision to leave the bug in the code can be made in a responsible manner. (In your example I at once thought about the memory leak the ever-growing stack represents and which might make your browser slower and slower with each recursion.) This thorough investigation often takes longer than fixing the bug would, so I'd prefer fixing the bug in most cases.

Second, bugs have a tendency to have more impact than one thinks at first. We are all very familiar with working code because this is the "normal" case. Bugs on the other hand are an "exception". Of course, we all have seen lots of bugs, but we have seen way more working code overall. We therefore have more experience with how working code behaves than with how buggy code behaves. There are gazillions of books about working code and what it will do in which situations. There are close to none about the behavior of buggy code.

The reason for this is simple: Bugs are not order but chaos. They often have a trace of order left in them (or put it the other way round: They don't destroy the order completely), but their buggy nature is a destruction of the order the programmer wanted. Chaos itself tends to defy being estimated correctly. It is way harder to say what a program with a bug will do than what a correct program will do just because it does not fit into our mental patterns anymore.

Third, your example contained the aspect that fixing the bug would mean to have to redesign the program. (If you strip this aspect, the answer is simple: Fix the bug, it should not take too long because no redesign is necessary. Otherwise:) In such a case I'd lose trust in the program the way it currently is designed. The redesign would be a way to restore that trust.

All that said, programs are things which people use, and a missing feature or a second, really cumbersome bug elsewhere can have priority over fixing your bug. Of course then take the pragmatic way and do other things first. But never forget that a first quick estimation of the impact of a bug can be utterly wrong.

Please leave a comment when you downvote. We ought to know what the critique is in order to improve the answer. — Alfe, Oct 20 '16 at 21:45

score 0 · Answer 14 · edited Jun 16 '20 at 10:01

Low probabilty / Mild consequences = Low priorty fix

If the probability of ocurrence is very low
If the consequences of the ocurrence are mild
Then the bug does not pose a threat, then is not a priority fix.

But this can not become a crutch for lazy developers...

What "very low ocurrence" even mean?
What "mild consequences" even mean?

To state the probability of ocurrence is very low and the consequences are mild, the developement team must understand the code, the usage patterns and security.

Most developers gets suprised that things they originally tought will never happen, actually happen a lot

Our educational system doesn't teach probability and logic very well. Most persons, including most software engineers have a broken logic and broken proability intuition. Experience with real world problems and experience with extensive simulations are the only way I know to fix this.

Confront your intuition with real world data

It is important to make several logs to be able to follow the usage patterns. Fill the code with assertions of things you think should not happen. You will get surprised that they do. That way you will be able to confront your intuition with hard data and refine it.

My example of a mild problem and a measure of control

In a e-commerce site I worked a long time ago, another programmer made a mistake: In some obscure condition the system debited the client one cent less than registered in the logs. I discovered the bug because I made reports to identify differences between the logs and the account ballances to make the accounting system more resilient. I never fixed this bug because the difference was very small. The difference was calculated daily and was lower than US$ 2.00 monthly. It so happen that we was developing an entirelly new system that in a year should replace the current one. Make no sense to divert resources from potentially profitable project to fix something that costs US$ 2.00 monthly and was subjected to an appropriated measure of control.

Conclusion

Yes, there are bugs that does not need to be fixed right away, that are not important enough to delay new feature development. However the system must have control of the ocurence of this bug to make sure it is small because we can not trust our own intuition.

score -1 · Answer 15 · answered Oct 20 '16 at 21:50

-1

I think this is asking the wrong question from the start.

The question isn't "should I fix this bug or should I not fix this bug". Any developer has a limited amount of time. So the question is "what is the most useful thing that I can do in one hour, or four hours, or a week".

If fixing that bug is the most useful thing to do, because it improves the software by the largest amount for most people, then fix the bug. If you could make bigger improvements elsewhere, by adding features that people are missing, or by fixing a more important bug, then do these other things. And extra bonus points for anything that makes your future development more efficient.

answered Oct 20 '16 at 21:50

gnasher729

44,814
4
64
126

Not sure the utilitarian metric works best here. Clearly, the video player example was designed to be low-impact, but even that's not foolproof. One answerer already cited the 24/7 promo loop in a lobby Use Case, and another might be a kiosk at a sales/tech convention that runs a week. Both would cost rep and/or money to the business, so non-trivial. – jaxter Oct 21 '16 at 14:57
That would mean fixing the bug provides more benefits than originally expected, so it should go higher up in priorities. So you will have more success in life if your opinion of what is most useful agrees with reality as much as possible. – gnasher729 Nov 02 '16 at 09:21