2

Say I use GUIDs as keys. Considering the chance of a duplicate GUID being generated being what they are, is code like this considered a code smell?

Guid newID = GenerateGuid();
while (dictionary.HasKey(newID))
    newID = GenerateGuid();
dictionary.Add(newID, value);
Cole Tobin
  • 1,469
  • 2
    What kind of GUID (UUID) are you using? IIRC, the only one with any chance of collision is version 4. – JimmyJames Mar 23 '22 at 16:40
  • 6
    I would not consider this as a code smell. I would consider this as a sign of paranoia. See also The Birthday "Paradox" and Guid Collisions – Doc Brown Mar 23 '22 at 16:48
  • 3
    But: "If you think you are too paranoid, you are not paranoid enough" :-) – gnasher729 Mar 23 '22 at 17:13
  • 3
    I knew I had forgotten something really important... GUIDs are supposed to be Globally Unique. It's not about the 10,000 GUIDs that you generate, its about the billions of GUIDs that others generate that might enter the same system as yours. So checking is pointless, because you only check a tiny, tiny portion of all GUIDs. – gnasher729 Mar 23 '22 at 18:11
  • 1
    All this does is avoid an exception (maybe). Look up TryAdd(). Does most of this for you and it’s already debugged. – candied_orange Mar 23 '22 at 21:11
  • 6
    The main problem here is not the paranoia of the extra check, but the fact that the check does not fulfil the promises that paranoia requires. In theory (or in practice with an inappropriate UUID implementation) if two nodes could generate at the same time a duplicate UUID (step 1), they would both not find it as duplicate in step 2 (because not yet added), and one would fail when adding (step 3). Definitively, some worries for concurrency and a tryAdd() should be recommended here, and this is the real smell !! – Christophe Mar 23 '22 at 23:23
  • 1
    @Christophe is absolutely right. Depending on your language, you need a lock, semaphore, synchronized block, or similar. – user949300 Mar 25 '22 at 05:51

3 Answers3

6

If UUIDs are generated correctly, it is practically impossible to have collisions. Checking for collisions is then a waste of time.

The check is not a code smell per se, but it suggests that the programmer didn't understand UUIDs – a red flag that invites closer scrutiny of the surrounding code.

If you are concerned about collisions, it would be better to look at your GenerateGuid() function in detail. Which UUID version does it use? If it is a random-based UUID, is the randomness source cryptographically secure? If it is a MAC-based UUID, how are the MAC addresses allocated? If it is a namespace/hash UUID, then of course uniqueness depends solely on uniqueness of the input data.

amon
  • 134,135
  • 3
    I understand UUIDs, but what about the programmer who implemented them? Feel free to check and fire sn assertion if there are duplicates. Especially in portable code that uses different GUID generators and you can't check them all. – gnasher729 Mar 23 '22 at 17:12
  • 3
    Is "practically impossible" a sufficient guarantee in life critical systems? Isn't premature optimization the root of all evil (i.e. is the overhead of the extra belt-and-braces check really relevant in all the use cases)? ;-) Btw, the real smell is that the algorithm used to avoid dupplicates, does not avoid duplicates in a concurrency context (see my comment under the question). – Christophe Mar 23 '22 at 23:26
  • 2
    @Christophe The chance of collisions are much smaller than e.g. a RAID array with a database becoming unrecoverable. I could assign every living human one UUID v4 per second, and it would take on average 11 years to see a collision. Most systems don't handle billions of billions of objects, though. Your point about concurrency is extremely good, and echoes the closing remark of the link Doc posted: “There's a much higher chance of something else being wrong … indeed, adding code to work around guid collisions is quite likely to increase the likelihood of a bug in your program.” – amon Mar 23 '22 at 23:36
  • 1
    I remember an enterprise backup system that calculated checksum and only copied files once if multiple files had the same checksum. So it would be possible that a file wouldn’t be backed up because it had the same checksum as another file. The chance was calculated to be lower than the chance of a meteorite hitting the data centre, and another meteorite hitting your computer. – gnasher729 Mar 25 '22 at 21:29
3

You can reasonably expect that an UUID is unique and that the probability of collision is extremely low, as Amon already explained.

However, if life and death depend on this uniqueness, for example in large mission-critical systems that are meant to be up and running for very long time, you could consider the extra check to prevent harm. In the end, it's up to you to decide how to best balance the cost of the extra-verification of an improbable issue and the cost of the impact if the event would occur.

If you do the check, you should do it well. You may for example consider TryAdd(). Not only is it shorter and conveys better your intent, it's also more suitable for concurrency:

do newID = Guid.NewGuid();      
  while (! dictionary.TryAdd(newID, value));   // check and add in one go

Because, your snippet would not work concurrently unless you'd use some locking scheme that would immediately create a concurrency bottleneck. What could for example go wrong if in theory two threads or nodes could generate at the same time a duplicate UUID (step 1), is that they would both not find their UUID as duplicate in your while condition (step 2) as it not yet added, and one thread would fail when adding the key (step 3).

Christophe
  • 77,139
  • 1
    Concurrency is not a blanket given. We don't have enough of a context here to decide that the dictionary is being shared between contexts. Concurrency-friendly implementations can be complex and blindly implementing it by default starts brushing up against YAGNI. – Flater Mar 25 '22 at 09:05
  • 1
    @Flater If there is no concurrency, and everything runs on a single thread on a local computer, there are simpler ways to generate a (locally) unique identifier than going for a universal UUID, isn’t it? – Christophe Mar 25 '22 at 12:08
  • 1
    You don't have to use the same machine/runtime in order to run into conflicts. For example, services scaled across multiple instances will run into similar potential conflicts when the data eventually gets merged into the same datasource. – Flater Mar 25 '22 at 12:25
  • 1
    @Flater the rest of my answer stays valid in a non-concurrent case. I've edited to clarify why TryAdd() would still be the better option even in a non-concurrent context. – Christophe Mar 25 '22 at 18:01
  • Not sure why you felt the need to delete and repost the exact same comment hours later. – Flater Mar 25 '22 at 18:57
2

Code smells imply bad code. At worst, your approach is unnecessary.

You'll find that generally speaking, you're not expected to double check the uniqueness of a GUID/UUID. They are by design astronomically unlikely to yield conflicts.

That being said, there can always be fringe reasons that warrant it. For example, the software may drive a critical process that should make every attempt at preventing failure (NASA and medical devices are easy examples here); or you could be using a custom UUID generation method that is proven to be as unique as you'd expect (however, I would be more inclined to tackle that uncertainty using testing rather than adding checks in the production code).

Flater
  • 49,580