128

Imagine you are creating a video player in JavaScript. This video player loops the user's video repeatedly. Each time a new loop begins, the player runs a recursive function that calls itself N times, N being the number of times the video has looped, so sooner or later the browser will throw a "too much recursion" RangeError.

Probably no one will use the loop feature that much. Your application will never throw this error, not even if the user leaves it looping for a week, but the bug still exists. Solving it would require you to redesign the way looping works in your application, which would take a considerable amount of time. What do you do? Why?

  • Fix the bug

  • Leave the bug

Shouldn't you only fix bugs people will actually stumble into? When does bugfixing become overkill, if it ever does?
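For concreteness, the looping design described above might look like the following sketch (all names are invented for illustration; a real player would do actual work where the inner comment is):

```javascript
// Hypothetical sketch of the flawed design: each finished loop triggers a
// recursion whose depth equals the number of loops completed so far.
function replayHistory(n) {
  if (n === 0) return;   // base case
  // ...re-apply the effects of one past loop here...
  replayHistory(n - 1);  // one stack frame per remaining loop
}

// Called each time the video finishes a loop. Because the recursion depth
// grows with loopCount, a large enough loopCount eventually throws
// "RangeError: Maximum call stack size exceeded" ("too much recursion"
// in Firefox).
function onLoopEnded(loopCount) {
  replayHistory(loopCount);
}
```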

  • 9
    Btw, since video playing is an asynchronous task you will not get a stack overflow from a recursive approach at all. So for this particular example: no, the problem doesn't even exist :-D – Bergi Oct 17 '16 at 13:47
  • 95
    Don't mess with my example case scenario mate – Tiago Marinho Oct 17 '16 at 13:48
  • 5
    You are starting from a wrong assumption: that this is even a bug. – PlasmaHH Oct 17 '16 at 14:49
  • 15
    @PlasmaHH I'm using this hypothetical scenario to explain my question. Whether the bug actually exists doesn't matter at all. – Tiago Marinho Oct 17 '16 at 14:53
  • 13
    @TiagoMarinho: the point I am trying to make is: sometimes it's just the right thing to do to define such a scenario as the intended behaviour. – PlasmaHH Oct 17 '16 at 14:54
  • 9
    If you start with assumptions, make very sure those assumptions are right, because the video player of yours that I use for my customers will give me headaches: I'm running a station in kiosk mode for weeks on end, defeating your use case. At the very least, document the behaviour. – Pieter B Oct 17 '16 at 16:44
  • 24
    Why on Earth would you run such a loop using recursion in the first place? You might not want to fix the bug, but you sure ought to reconsider your design process :-) – jamesqf Oct 17 '16 at 17:09
  • 11
    Most commercial software development won't ever fix a bug in a piece of released software that does not affect a customer (i.e. is not reported by a customer but found in-house). There's simply no budget for such fixes. – tofro Oct 17 '16 at 17:27
  • 28
    This seems more like a business question. You have to prioritize based on the cost-to-fix, and the impact/frequency of the bug. – Casey Kuball Oct 17 '16 at 17:28
  • 4
    Thou shalt not suffer a bug to live! – Mawg says reinstate Monica Oct 18 '16 at 07:45
  • 4
    As a developer, you document the issue as a bug report and let management decide when to fix it. – Simon Richter Oct 19 '16 at 10:53
  • 1
    I'd hate to be the developer that has to troubleshoot this problem when a customer has a glitch and only .03 seconds of video comes through so you're looping many times a second causing a very strange early failure. The cost of finding and fixing at this point is staggering, would have been comparatively free to just fix the bug in the first place--just a few man-days of dev work vs man-weeks of dealing with customers, debugging, analyzing, etc... and you would have ended up with better code! – Bill K Oct 19 '16 at 21:47
  • 3
    can i just point out that 2^53 is quite a large number. Even with a 1 second video it would take 104,249,991,374 days of 24/7 video playing for the bug to appear. It was already pointed out that the async nature means the bug doesn't really exist, but even if the bug did exist, the user's screen would stop working long before anyone saw this bug actually happen. Actually, after 285,616,414 years it's not clear human civilisation will still exist... – David Meister Oct 20 '16 at 13:46
  • @DavidMeister that's exactly the point: Is it overkill to fix a bug that is so unlikely to be triggered? Others have taken the example literally and pointed out that such a bug could be triggered more often if the user's video lasts like 0.01 seconds, missing the whole point of the question. But you get the idea. – Tiago Marinho Oct 20 '16 at 14:02
  • 1
    And yes, I may have exaggerated the example case a bit, since 2^53 loops is unreachable. Nothing is going to last that long anyway so the program will crash for something else first. – Tiago Marinho Oct 20 '16 at 14:06
    Just gonna throw this out here: not all bugs are necessarily difficult to fix. For instance, there is a thing called tail recursion, which lets recursion compile into loops. I know this is just an example, but it is an important thing to consider: sometimes fixing the most obscure and unreachable bug leads to the development of something far more useful in other areas. Would I go and rewrite the entire video player? Absolutely not. However, if hypothetically tail recursion optimization didn't exist yet, then solving this bug might lead to creating tail recursion optimization. – user64742 Oct 21 '16 at 02:22
  • 1
    tl;dr don't bugfix excessively if it costs you money or time, but if you have the time and you just want to fix a difficult bug, then you might as well try. You might find out something awesome in the process that you can share with other people in similar situations. Obviously I'm not referring to proprietary things; I simply mean that if you find out something useful for programming in general by solving a really nasty bug, then it might be something you may or may not choose to share. But hey, now your code lacks that nasty bug everyone else's video player has! – user64742 Oct 21 '16 at 02:25
  • @TiagoMarinho no, I don't get the idea. This example is not a bug unless you're willing to classify all software as a bug. Any software that you run for 2^53 * 0.1 seconds would destroy any hardware that it runs on through simple wear and tear. We're talking 20 million years of constant usage after all. The problem is that if we're willing to concede that the example is in fact "a bug" then there is nothing that is not a bug. It's like saying that UUIDs are "a bug" because there is a chance of collision... – David Meister Oct 21 '16 at 14:53
    @TiagoMarinho the problem here is that any sensible definition of a "bug" or "defect" comes with an idea of both risk and impact. If either the risk or impact are literally zero within the lifetimes of literally everyone who will ever come in contact with the software then it isn't a defect at all. Now, we can say that there are things that are bugs with an almost-but-not-quite-zero risk/impact, and that is fair, but this example is nowhere near that scale. The thing is, once we are talking about something with measurable risk/impact, the answer to the question becomes self evident... – David Meister Oct 21 '16 at 15:01
  • @TiagoMarinho perhaps a difficulty with the question is the assumption that any code that could be theoretically reached and throws an error or otherwise halts/impairs the system must be "a bug" but this is not the case. Consider the example of creating a random number generator without modulo bias. To fix this very real and potentially quite serious bug (a predictably biased prng) you must introduce a non-zero chance that your code will hang indefinitely - http://stackoverflow.com/questions/10984974/why-do-people-say-there-is-modulo-bias-when-using-a-random-number-generator – David Meister Oct 21 '16 at 15:12
  • @TiagoMarinho but the recursive approach is also not a bug in javascript due to the async nature of the operation... – David Meister Oct 22 '16 at 00:23
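To make Bergi's point concrete, here is a sketch (names invented) of why an asynchronously restarted loop cannot overflow the stack: each call schedules the next one and returns, so the stack fully unwinds before the next iteration begins.

```javascript
// Async "loop": playOnce simulates playing the clip once and invoking a
// callback when it ends (a real player would hook the media element's
// 'ended' event instead). The recursive-looking call happens on a fresh,
// empty stack each time, so the call depth never grows.
function loopVideo(playOnce, timesLeft, done) {
  if (timesLeft === 0) return done();
  playOnce(() => loopVideo(playOnce, timesLeft - 1, done));
}

// Simulated async playback: defer the callback to a microtask, so
// loopVideo has already returned by the time the next iteration runs.
const playOnce = (onEnded) => queueMicrotask(onEnded);
```

Replacing `queueMicrotask(onEnded)` with a direct `onEnded()` call would turn this back into plain recursion and bring the stack overflow back.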

15 Answers

165

You have to be pragmatic.

If the error is unlikely to be triggered in the real world and the cost to fix is high, I doubt many people would consider it a good use of resources to fix. On that basis I'd say leave it but ensure the hack is documented for you or your successor in a few months (see last paragraph).

That said, you should use this issue as a "learning experience": the next time you implement looping, don't use recursion unnecessarily.

Also, be prepared for that bug report. You'd be amazed how good end users are at pushing against the boundaries and uncovering defects. If it does become an issue for end users, you're going to have to fix it - then you'll be glad you documented the hack.

mcottle
  • 1
    Note that the video player and its looping feature are completely hypothetical, hahah. But yeah, I agree. – Tiago Marinho Oct 17 '16 at 05:49
  • 122
    Totally agree with "You'd be amazed how good end users are at pushing against the boundaries and uncovering defects." – Spotted Oct 17 '16 at 07:05
  • 77
    End users are in no way restricted by what you think is a reasonable use of your software. There will be users who want to loop a video forever. It's a feature that your software provides, so they will use it. – gnasher729 Oct 17 '16 at 09:01
  • 37
    @gnasher729 "10-hour XXXX" videos on Youtube is a good identifier that indeed, some people just want to loop something forever. – Chris Cirefice Oct 17 '16 at 13:16
  • 24
    Another problem: If your software is popular, then someone encounters a bug that indeed happens in a rare situation only, posts it on the internet, and suddenly everyone and their dog says "this software is rubbish, it crashes if I loop a video for a day". Or a competitor uses it to demonstrate how easy it is to crash your application. – gnasher729 Oct 17 '16 at 13:19
  • 2
    Could have ended the answer after the first sentence. – Ant P Oct 17 '16 at 15:40
  • 1
    This is the answer. Stick it in the bug tracker and fix it when you can. –  Oct 17 '16 at 20:59
  • 4
    Emphasis on the last paragraph. Did you know that MacOS Classic would crash if it received 32,768 consecutive "mouse press" events without an intervening "mouse release" event? – Mark Oct 17 '16 at 21:20
  • 1
    @Mark That is absolute proof of "how good end users are at pushing against the boundaries and uncovering defects" I know I'm amazed... – Jerry Jeremiah Oct 18 '16 at 01:20
  • 3
    Such code could easily be installed on some embedded machine in a mall, or maybe on consumer-grade fridge displays, and cause freezes (pun not intended) after a week of non-stop looping. – user1306322 Oct 18 '16 at 13:30
  • 5
    I agree this is the best answer but pjc50's answer below should be considered as well. Ask yourself: What is the worst thing that could happen? Suppose your player was used to display some kind of safety warning at the entrance to a facility. The player crashes and the warning ceases, then someone is hurt or even killed because they never got the warning. Be sure to include a very clear disclaimer that the software should not be used for mission critical applications. Also be sure to HANDLE the error condition and not just allow it to crash. – O.M.Y. Oct 18 '16 at 14:00
  • 2
    Realize that ignoring this actually is NOT pragmatic in this case. True, edge cases are generally things you document and largely ignore, but this is NOT an edge case in the truest sense. Is it rare? Maybe, but if the application is a video looper, it is a huge problem if there is a bug in the video looper! There is literally a point at which the application crashes during the desired use. Edge cases are cases where the user is using it for an unintended purpose, not using it longer than intended. If this were real, it should be fixed immediately, because it is the focus of the app. – EvSunWoodard Oct 19 '16 at 19:08
80

There was a similar bug in Windows 95 that caused computers to crash after 49.7 days. It was only noticed some years after release, since very few Win95 systems stayed up that long anyway. So there's one point: bugs may be rendered irrelevant by other, more important bugs.
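The 49.7-day figure falls straight out of the arithmetic: the bug was reportedly in a timer that kept elapsed milliseconds in a 32-bit counter, which wraps after 2^32 ms.

```javascript
// A 32-bit millisecond tick counter wraps after 2^32 ms, i.e. ~49.7 days.
const wrapMs = 2 ** 32;                          // 4294967296 ms
const wrapDays = wrapMs / (1000 * 60 * 60 * 24); // 86400000 ms per day
console.log(wrapDays.toFixed(1));                // prints "49.7"
```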

What you have to do is a risk assessment for the program as a whole and an impact assessment for individual bugs.

  • Is this software on a security boundary?
  • If so, can this bug result in an exploit?
  • Is this software "mission critical" to its intended users? (See the list of things the Java EULA bans you from using it for)
  • Can the bug result in data loss? Financial loss? Reputational loss?
  • How likely is this bug to occur? (You've included this in your scenario)

And so on. This affects bug triage, the process of deciding which bugs to fix. Pretty much all shipping software has very long lists of minor bugs which have not yet been deemed important enough to fix.

pjc50
  • 2
    I also recall the (hardware) bug in some Intel CPUs where a specific floating point value went all wrong. –  Oct 17 '16 at 16:11
  • 5
    @WilliamKappler https://en.wikipedia.org/wiki/Pentium_FDIV_bug is what I believe you are referring to. Was up for a year before anybody noticed it. – Jeutnarg Oct 17 '16 at 20:20
  • @Jeutnarg I thought it was more recent than that, but I don't remember the details. I could be conflating a few different bugs. –  Oct 17 '16 at 20:38
  • @william Maybe you're thinking of the TSX transactional memory bug which resulted in the instructions being disabled in a microcode update. – Jeffrey Bosboom Oct 17 '16 at 21:51
  • 1
    That 49.7 day bug cost Microsoft dearly in reputation. – gnasher729 Oct 17 '16 at 22:51
  • 10
    @gnasher729 - Not really, they were already at the bottom and still digging :) Most people had to re-install Win 95 more frequently than 49.7 days IIRC. – mcottle Oct 18 '16 at 03:15
  • 1
    @mcottle Silly. One, this has nothing to do with reinstalls. Two, of course a home system wasn't designed to run 24/7. Windows 95+ had to make a lot of cuts to make things barely work on the typical home computer. Did your Commodore 64 run for 60 days? Or your AtariST? Or your Amstrad CPC? Or your Amiga? Do you think Unixes of the time survived that long? Apple IIe could, but that was a business machine as well - it's not like Windows NT had the problem. If you hosted a server on Windows 95, you were an idiot (or you were doing it for fun :)). Windows 95 won the home computer market. – Luaan Oct 18 '16 at 08:09
  • 4
    @Luaan The comment was intended as a lighthearted dig at M$, hence the smiley after the first sentence. They were behind the eightball with '95 because it came out very late in 95 (probably because having Win95 released in 1996 would have been a bad look), half baked (Remember the USB BSOD?) and inclined to become unstable and require regular reinstalls hence my second sentence - which never mentioned running a server on Windows 95, I don't know where you got THAT from (flashbacks?). The second release CD improved matters but the initial release of '95 was a doozy. – mcottle Oct 18 '16 at 08:26
  • 5
    TBH I think it was the "Windows for Warships" fiasco that did more reputational damage ( http://archive.wired.com/science/discoveries/news/1998/07/13987 ) and that was using NT. Unix machines of that time could manage multi-year uptimes, even using (very early) versions of Linux. All the home computers were also capable of high uptime, although rarely used that way. I saw BBC micros embedded in educational exhibits a decade after they were obsolete. – pjc50 Oct 18 '16 at 08:59
  • @Luaan: yes, Unix/Linux desktop and server systems of the time did routinely manage uptimes many times longer than that. You're right about other systems like the Atari ST, though. IDK if I ever left mine on for that long, but given the lack of memory protection in the ST / STe, I think I usually ended up rebooting more often than that (since I played lots of games and used multiple TSR programs that could interact in weird ways and leave the system unstable). – Peter Cordes Oct 18 '16 at 20:46
  • https://en.wikipedia.org/wiki/MIM-104_Patriot#Failure_at_Dhahran is another classic example of rounding errors building up when something was left running longer than expected. – armb Oct 19 '16 at 15:51
  • 1
    @PeterCordes Although I loved my Atari 1040 STF I was never going to mistake it for MilSpec. It once locked up running Cubase in the time it took me to eat a meal. The drum loop that couldn't be killed - without a hard reset - which got me running back to Steinberg Pro 24. Never had the guts to gig it. IIRC I had to boot it up using a 3.5" floppy. Can't remember if that was the O/S or just my config. Awesome value for the era and the paper white monitor was great. A good option for impoverished students that couldn't afford a Mac. – mcottle Oct 20 '16 at 05:29
  • @mcottle: IIRC, the whole OS was in ROM, but everyone had their favourite terminate-and-stay-resident 3rd-party software on a boot disk. My dad had a Mega4 STe (4MB RAM vs. 1MB in the ST 1040, and 16MHz CPU vs. the 8MHz in the ST), with an 80MB hard drive, which was pretty awesome. When I was first learning C, before I got a PC to Linux, I was using gcc on the Atari. I taught myself assembly language mostly because the compile times were so high (while I was tweaking a Mandelbrot program I typed in from a book, to make it run faster), before realizing you could beat the compiler :P. – Peter Cordes Oct 20 '16 at 06:52
33

The other answers are already very good, and I know your example is just an example, but I want to point out a big part of this process that hasn't been discussed yet:

You need to identify your assumptions, and then test those assumptions against corner cases.

Looking at your example, I see a couple assumptions:

  • The recursive approach will eventually cause an error.
  • Nobody will see this error because videos take too long to play to reach the stack limit.

Other people have discussed the first assumption, but look at the second assumption: what if my video is only a fraction of a second long?
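A quick back-of-the-envelope check of that corner case (the quarter-second clip length is an invented example, chosen as an exact binary fraction so the arithmetic below is exact):

```javascript
// How fast does the loop count grow for a very short clip?
const clipSeconds = 0.25;                        // hypothetical tiny clip
const loopsPerDay = (24 * 60 * 60) / clipSeconds;
console.log(loopsPerDay);                        // prints 345600
// Typical JS engines overflow the stack at very roughly 1e4 frames, so a
// recursion whose depth tracks the loop count would die within seconds,
// not weeks.
```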

And sure, maybe that's not a very common use case. But are you really sure that nobody will upload a very short video? You're assuming that videos have a minimum duration, and you probably didn't even realize you were assuming anything! Could this assumption cause any other bugs elsewhere in your application?

Unidentified assumptions are a huge source of bugs.

Like I said, I know that your example is just an example, but this process of identifying your assumptions (which is often harder than it sounds) and then thinking of exceptions to those assumptions is a huge factor in deciding where to spend your time.

So if you find yourself thinking "I shouldn't have to program around this, since it will never happen" then you should take some time to really examine that assumption. You'll often think of corner cases that might be more common than you originally thought.

That being said, there is a point where this becomes an exercise in futility. You probably don't care if your JavaScript application works perfectly on a TI-89 calculator, so spending any amount of time on that is just wasted.

The other answers have already covered this, but coming up with that line between "this is important" and "this is a waste of time" is not an exact science, and it depends on a lot of factors that can be completely different from one person or company to another.

But a huge part of that process is first identifying your assumptions and then trying to recognize exceptions to those assumptions.

  • Very good point Kevin. Note my comment on the selected answer above that focuses on the analysis question What's the worst thing that could happen? – O.M.Y. Oct 18 '16 at 14:08
  • Another assumption here is that an ever-growing stack will only lead to problems when it reaches an overflow size. In fact, the stack can be a normal resource this bug is constantly leaking. The whole browser could become slower and slower by tiny bits on each iterat^H^H^H^H^H^Hrecursion. – Alfe Oct 19 '16 at 12:31
  • 1. The OP never said the problem was caused by a growing stack. It could just as easily be caused by an error in a counter routine (dec --> div/0 ?). 2. If the problem is a stack overflow problem, then shouldn't this question be posted on Stack Overflow? <rimshot!> ;-D – O.M.Y. Oct 20 '16 at 15:30
  • @O.M.Y. Who is that comment directed towards? – Kevin Workman Oct 20 '16 at 15:32