Score: 42

I've been implementing a network protocol, and I require packets to have unique identifiers. So far, I've just been generating random 32-bit integers, and assuming that it is astronomically unlikely that there will be a collision during the lifespan of a program/connection. Is this generally considered an acceptable practice in production code, or should one devise a more complex system to prevent collisions?

Phoenix
  • Why is using a sequential integer not going to cut it? – whatsisname Dec 30 '16 at 03:18
  • Why don't you just use an incrementing int? GUIDs, which are designed to have the uniqueness properties you describe, are 128 bits in size, not 32. – Robert Harvey Dec 30 '16 at 03:19
  • Because multiple computers will need to create IDs that are unique among the set of IDs created by any connected computer. – Phoenix Dec 30 '16 at 03:20
  • Can you use GUIDs? – Robert Harvey Dec 30 '16 at 03:20
  • If the answer is that 32 bits of random data isn't enough and I should use 128 bits, I'd accept that. – Phoenix Dec 30 '16 at 03:22
  • Alternatively, assign a channel number to each connected computer, and use an incrementing sequence id. The two numbers combined (with the channel number taking up the high-order bits) become your new unique id; a sketch of this scheme appears after these comments. – Robert Harvey Dec 30 '16 at 03:23
  • Then just partition out ranges of the IDs to each computer. – whatsisname Dec 30 '16 at 03:25
  • Also, what is the lifespan of a program/connection in this situation? – whatsisname Dec 30 '16 at 03:29
  • There is not necessarily a cap to the lifespan of the program. – Phoenix Dec 30 '16 at 03:31
  • If there is no cap to the lifespan, you will run out of unique ids, regardless of whether they are 32, 128, or 2048 bits long. You will need a method of reusing ids. It could be as simple as odd ids are generated during odd hours and even ids are generated during even hours, guaranteeing at least 60 minutes before an Id gets reused. Or a command by the central server to switch to the alternate set. – AJNeufeld Dec 30 '16 at 05:01
  • @AJNeufeld: A sequential id, 2048 bits long, would require more packets than atoms in the universe by hundreds of orders of magnitude to encounter a collision. If you're going to make absolute statements, at least make them actually reasonable. – whatsisname Dec 30 '16 at 06:02
  • @whatsisname (You mean, observable universe.) Humour isn't acceptable? Besides, we are talking time, not number of atoms, and without a cap on the lifespan (such as the heat death of the universe), you will eventually hit the limit. As for reasonable, a clear mechanism for reuse was a reasonable statement. – AJNeufeld Dec 30 '16 at 06:32
  • I might be wrong here, but IIRC what TCP/IP does is generate a random start value for a new connection, then uses sequential ids starting from that point. – CompuChip Dec 30 '16 at 10:29
  • The comment of @Basilevs is on point here. This is a solved problem by TCP. I suggest you take a look at http://www.ietf.org/rfc/rfc1185.txt and http://www.ietf.org/rfc/rfc1323.txt as stated in the answer linked by Basilevs. – Machado Dec 30 '16 at 11:28
  • @Phoenix I don't understand. Why do you think random IDs are less likely to collide than sequential ones? There's an obvious argument for the reverse -- at least with sequential IDs, a machine won't collide with itself. With random ones, you can collide with yourself. Chances of colliding with other machines are unaffected by this choice. No? – David Schwartz Dec 30 '16 at 12:05
  • If your "random number generator" guarantees that a particular number will not be repeated until every other number has been generated, it is a very poor random number generator! By the same logic, the only possible "random" sequence of coin tosses would be HTHTHTHTHT.... – alephzero Dec 30 '16 at 14:51
  • @DavidSchwartz I was thinking that if they are sequential with random starting place, and the one with the lesser start is generating faster, it will eventually reach a range where all IDs are duplicates, which is horrible. – Phoenix Dec 30 '16 at 16:21
  • "I require packets to have unique identifiers" What is the consequence of this requirement being violated? If you require unique identifiers, in the strictest reading of the word, you must have a centralized system doling out identifiers (like how MACs are assigned to individual network card companies). Most likely you have a softer definition of "require." Understanding that level of softness will dramatically change the answers you receive. – Cort Ammon Dec 30 '16 at 17:44
  • @AJNeufeld: if you're trying to go for humor, you need to make it more obvious, there are more people on this site than you and me, people with a wide variety of skill levels that might not be able to recognize your 'humor'. – whatsisname Dec 30 '16 at 19:23
  • @Phoenix Ahh, so the amount of harm scales with the number of duplicates? And one or two duplicates is just fine? That changes everything. – David Schwartz Dec 30 '16 at 21:31
  • @alephzero, linear congruential generators are a popular form of simple RNG that guarantee non-repetition. However, the guarantee only holds if you're using the full output range. If you fold the range down to produce a series of coin flips, it'll look random. – Mark Dec 30 '16 at 21:46
  • @Mark Arguably if you can share state you can just use sequential integers. I'd assume OP means the output from independent RNG. – Maja Piechotka Dec 31 '16 at 03:35
  • @Mehrdad 2^32 billion dollars? I'd be happy even with 2^-10 billion dollars ... – Hagen von Eitzen Dec 31 '16 at 12:31
  • @HagenvonEitzen: Whoops!! Shame on me, I've fixed and re-commented haha, thanks for catching that. – user541686 Dec 31 '16 at 12:55
  • Dude, I think you need to rethink what "astronomical" means. Astronomical is (pretty much by definition) something that is unlikely to be ever within human (i.e., earthly) reach. Even without the birthday collision issue, 4 billion is not an "astronomical" number. There are more people on Earth than that, and even my 2" flash drive can hold 32 times that many bytes. Trump would be insulted if you told him he's worth 2^32 dollars (well, maybe not; I'm generously assuming he'd understand). An astronomical number would be, like, at LEAST 2^64 (but probably more like 2^128 or larger)... – user541686 Dec 31 '16 at 12:56
  • The idea of using a GUID has already been suggested. Is there anything that holds you back? – Theodoros Chatzigiannakis Dec 31 '16 at 12:56
  • @CortAmmon: That's not really accurate. If you use 256-bit random identifiers, for example, the probability of a collision is much smaller than the probability of a neutrino hitting the ram in your central authority and causing it to issue the same identifier more than once. – R.. GitHub STOP HELPING ICE Jan 01 '17 at 01:52
  • @R.. That probability is greater than 0. Hence why I pointed to the need to clarify the meaning of the word. There are applications where a 256-bit random identifier would not be accepted, not because they wouldn't work, but because the product is being held to a specification that requires a deterministic proof that no collisions occur. – Cort Ammon Jan 01 '17 at 03:40
  • @CortAmmon: I understand that but think it's a mistaken specification, because there's no deterministic proof that a physical computing device produces the same results, with 100% certainty, as the formal model it was designed to implement. – R.. GitHub STOP HELPING ICE Jan 01 '17 at 06:03
  • @R.. That is true, but that won't get you paid =) And, honestly, if I were designing the safeguards on a nuclear reactor, I'd be a bit picky about "random probabilities" anyways. I'd be demanding a deterministic solution. – Cort Ammon Jan 01 '17 at 06:17
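A minimal Python sketch of the channel-plus-counter scheme Robert Harvey describes above; the bit widths and names are illustrative assumptions, not anything stated in the thread:

    import itertools

    CHANNEL_BITS = 8     # assumption: at most 256 connected computers
    SEQUENCE_BITS = 24   # assumption: per-channel counter, wraps after ~16.7M packets

    def make_id_generator(channel):
        """Yield 32-bit IDs: channel number in the high bits, counter in the low bits."""
        assert 0 <= channel < (1 << CHANNEL_BITS)
        counter = itertools.count()
        def next_id():
            return (channel << SEQUENCE_BITS) | (next(counter) % (1 << SEQUENCE_BITS))
        return next_id

    next_id_for_node_3 = make_id_generator(3)
    print(hex(next_id_for_node_3()), hex(next_id_for_node_3()))  # 0x3000000 0x3000001

IDs from different channels can never collide; each node only has to avoid reusing its own sequence numbers.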

10 Answers

Score: 143

Beware the birthday paradox.

Suppose you are generating a sequence of random values (uniformly, independently) from a set of size N (N = 2^32 in your case).

Then, the rule of thumb for the birthday paradox states that once you have generated about sqrt(N) values, there is at least a 50% chance that a collision has occurred, that is, that there are at least two identical values in the generated sequence.

For N = 2^32, sqrt(N) = 2^16 = 65536. So after you have generated about 65k identifiers, it is more likely that two of them collide than not! If you generate an identifier per second, this would happen in less than a day; needless to say, many network protocols operate way faster than that.
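To put concrete numbers on this, here is a minimal Python sketch (the function names are mine, purely illustrative) that computes the exact collision probability and then simulates drawing 32-bit IDs until the first repeat:

    import random

    N = 2 ** 32  # number of possible 32-bit identifiers

    def collision_probability(n, space=N):
        """Exact probability that n uniform draws from `space` values contain a duplicate."""
        p_no_collision = 1.0
        for k in range(n):
            p_no_collision *= (space - k) / space
        return 1.0 - p_no_collision

    print(collision_probability(65536))   # ~0.39, already close to a coin flip
    print(collision_probability(80000))   # ~0.52, past the 50% mark

    # Empirical check: draw random 32-bit IDs until one repeats.
    seen = set()
    packet_id = random.getrandbits(32)
    while packet_id not in seen:
        seen.add(packet_id)
        packet_id = random.getrandbits(32)
    print(len(seen))  # typically tens of thousands of IDs (on the order of sqrt(2^32))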

nomadictype
  • +1. In my last job, one of our partners actually used this approach to generate random identifiers (not for networking packets, but for a shared business object ultimately created by end customers). When I queried the data with an eye toward this, I found that on average, there were two to three pairs of duplicates every day. (Fortunately, this only broke things if the duplicates were created within four hours of each other, which happened a bit less often. But still.) – ruakh Dec 30 '16 at 18:28
  • For what it's worth, the sqrt(N) approximation is accurate up to a constant factor; for N = 2^32, the actual threshold is 77164, as this is the smallest n for which (1 - 1/N)(1 - 2/N)···(1 - (n-1)/N) drops below 1/2. (A short script after these comments reproduces this number.) – wchargin Dec 30 '16 at 22:07
  • @wchargin: There's really nothing magical about the probability hitting 0.5; what's notable is that the probability is increasing relatively fast with increasing N. If 32-bit identifiers would have a slight but non-trivial chance of a random collision, a 40-bit identifier would have almost none. – supercat Dec 31 '16 at 01:13
  • @supercat: That's all true. I just figured that if one provides such a constant, one might as well give an accurate value :-) – wchargin Dec 31 '16 at 01:26
  • @wchargin: I prefer to think in terms of where one needs to start worrying about duplicates. If one goes much below sqrt(N) the probabilities of collisions drop off rapidly, to the point that one can safely say that they won't happen unless there is a severe defect in the random generator. – supercat Dec 31 '16 at 01:53
  • @wchargin -- You were technically correct, which (as we all know) is the best kind of correct. Also, even at 77,164, this approach is so flawed as to belong in a 1st year book of Things You Shouldn't Do. – Peter Rowell Jan 01 '17 at 23:55
  • @supercat, there is something magical about 0.5. It's the point at which you can tell someone they've thrown their fate open to a coin-toss. People can understand that much more clearly than "it gets bad fast", and making people understand is magical. – sh1 Mar 07 '17 at 22:13
  • @sh1: If one needs to reduce the probability of any duplicates to below the probability of a person winning a state lottery two weeks in a row (buying one ticket per week), the number of items one can afford to generate before the risk becomes unacceptable would be less than sqrt(N), but I think each fourfold increase in (N-1) would more than double the number of items one could use beyond the first (until the sample space includes over a trillion items, the risk of duplication would be zero if one picked one item, and unacceptably large if one picked two or more). – supercat Mar 08 '17 at 00:15
  • @supercat, if everybody had a strong intuitive understanding of lottery statistics there would be no lotteries, so I don't think it's a good basis for explanation. – sh1 Mar 08 '17 at 17:42
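A few lines of Python reproduce the exact threshold quoted in these comments (the variable names are mine):

    N = 2 ** 32
    p_no_collision = 1.0
    n = 1
    while p_no_collision >= 0.5:
        p_no_collision *= (N - n) / N  # the (n+1)-th ID must miss the n already drawn
        n += 1
    print(n)  # 77164: the smallest count for which a collision is more likely than not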
Score: 12

It is widely considered acceptable to rely on random numbers being unique if those numbers have enough bits. There are cryptographic protocols where repeating a random number will break the entire security. And as long as there aren't serious vulnerabilities in the random number generator being used, that hasn't been a problem.

One of the algorithms for generating UUIDs will effectively generate an ID consisting of 122 random bits and assume it will be unique. And two of the other algorithms rely on a hash value truncated to 122 bits being unique, which has roughly the same risk of collisions.

So there are standards relying on 122 bits being enough to make a random ID unique, but 32 bits is definitely not enough. With 32-bit IDs it only takes about 2¹⁶ IDs before the risk of a collision reaches 50%, because with 2¹⁶ IDs there are close to 2³¹ pairs, each of which could be a collision.

Even 122 bits is less than I would recommend in any new design. If following some standardization is important to you, then use UUIDs. Otherwise use something larger than 122 bits.

The SHA-1 hash function, with an output of 160 bits, is no longer considered secure, in part because 160 bits is not enough to guarantee uniqueness of the outputs. Modern hash functions have outputs of 224 to 512 bits. Randomly generated IDs should aim for the same sizes to ensure uniqueness with a good safety margin.
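As a minimal Python sketch of those recommendations (the identifier names are illustrative, not part of any standard):

    import secrets
    import uuid

    packet_id_uuid = uuid.uuid4()             # standard UUID: 122 random bits
    packet_id_128 = secrets.token_bytes(16)   # 128 random bits from the OS CSPRNG
    packet_id_256 = secrets.randbits(256)     # 256 random bits, a comfortable margin

    print(packet_id_uuid, packet_id_128.hex(), hex(packet_id_256))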

kasperd
  • SHA-1 is considered insecure because there are specific attacks (i.e. non-random) against the algorithm itself that can find collisions faster than brute force, not because there's a high chance of a random collision. A rough estimation says that with 122 bits and a generation rate of 1 billion (10^9) IDs per second, it would take over 73 years before reaching a 50% chance of a collision. – 8bittree Dec 30 '16 at 16:33
  • sqrt(2^122) = 2^61, roughly 2.3 quintillion UUIDs – noɥʇʎԀʎzɐɹƆ Dec 30 '16 at 23:20
  • @8bittree The bitcoin network computes 2⁷⁰ SHA2 hashes every 10 minutes. Had that been SHA1 hashes it would only take a week to produce a collision. If UUIDs were produced at the same speed that bitcoin computes hashes it would take less than 2 seconds to produce a collision. – kasperd Dec 30 '16 at 23:32
  • Bitcoin is all about trying to find collisions, and is immensely popular and has had dedicated hardware designed specifically for finding hashes. Now, sure, if the OP is planning to create a wildly popular cryptocurrency, or something similar, then they might need hundreds or thousands of bits per ID. But immediately assuming that those are the requirements might be encouraging far more work than necessary if a standard UUID library is sufficient. – 8bittree Dec 30 '16 at 23:40
  • @8bittree If using standard libraries is any advantage, then by all means go for UUID. But pulling some random bytes out of urandom is not more work than using a UUID library. I just implemented both in Python for comparison, and each method was exactly 25 characters of source code. – kasperd Dec 30 '16 at 23:52
Score: 4

It depends on both the probability of failure and the consequences of failure.

I remember a debate between software and hardware people where the hardware people considered that an algorithm with a small probability of wrong results (something like 1 failure in 100 years) was acceptable, and the software people thought this was anathema. It turned out that the hardware folks routinely calculated expected failure rates, and were very used to the idea that everything would give wrong answers occasionally, e.g. due to disturbances caused by cosmic rays; they found it strange that software folks expected 100% reliability.

Michael Kay
Score: 3

I would call this bad practice. Random number generators simply don't create unique numbers, they just create random numbers, and a random distribution is likely to include some duplicates. You can make this circumstance acceptably unlikely by adding in an element of time: get the current time from the system clock in milliseconds and combine it with the random value. Something like this:

parseToInt(toString(System.currentTimeMillis()) + toString(Random.makeInt()))

This will go a long way. Obviously, to truly guarantee uniqueness you need to use a UUID/GUID. But they can be expensive to generate; the above is likely sufficient, as the only possibility of overlap is if the random generator produced a duplicate in the same millisecond.

Fresheyeball
  • 1ms can be a long time in some systems. – quant_dev Dec 30 '16 at 12:36
  • This doesn't actually decrease the chance of collision at all. The probability of a collision after N numbers is exactly equal to that of the OP's original solution. The trick of using the current time as a seed is typically used when assigning keys sequentially. – Cort Ammon Dec 30 '16 at 17:42
  • @CortAmmon I've not done the math in some time, but I am confident it decreases the probability of collision. I'm suggesting an incremental improvement. In the end, you can't avoid collision entirely without UUIDs. Apparently, this guy is not having bad collision problems today with just random numbers, so I think his approach is a good idea. – Fresheyeball Dec 30 '16 at 18:22
  • @Fresheyeball I am confident that it has no effect, unless Random.makeInt() does not actually generate a uniform distribution from the integer's minimum value to the integer's maximum value. For every past value generated by this function, there is a random value from makeInt which, for this exact time step, generates that value, creating a collision. Since all values from makeInt are equiprobable, the probability of a collision is exactly equal to that of the probability of a collision without the addition of time. – Cort Ammon Dec 30 '16 at 19:50
  • I have seen/used/thought about this approach but usually in the sense of a string concatenation (so timepart_randomPart) - if the random integer was toString'd with zero padding to make it consistently appear in the "lower 32 bits" of the resulting value (which would have to be a long long, since current time milli is a long) then you are basically offsetting random blocks with an incrementing counter. – Mikeb Dec 30 '16 at 20:45
  • @CortAmmon this isn't using the current time as a seed, and it does definitely make a difference as long as those N numbers weren't all generated during the same millisecond, because two numbers with different timestamp parts never collide. If you imagine the other answer's example of one packet per second having a 50% chance of collision in less than one day, this one has a 0% chance of collision at one packet per second, at least up until the time that currentTimeMillis wraps around. – hobbs Dec 31 '16 at 05:14
  • @hobbs You forget about integer overflow. Now if the key the OP used was a structure containing 2 integers, one containing System.currentTimeMillis and one containing Random.makeInt(), then the probability of a collision goes down substantially (a sketch of such a structure appears after these comments). However, that is not what the code in this example does. Given any previous time and random value, and any current time, the probability of collision is identical to the probability of two random numbers colliding in the first place. – Cort Ammon Dec 31 '16 at 06:11
  • @CortAmmon I think you missed what my post actually does. I didn't multiply the values, I string concatenated them. So it's the logical equivalent of an Int Int tuple. You proved my point. – Fresheyeball Dec 31 '16 at 21:11
  • What does parseToInt do? I had assumed it took a string and parsed it into an int – Cort Ammon Dec 31 '16 at 21:17
  • Yes that's right, but the resulting Int is just an encoding of two distinct Ints. It's functionally the equivalent of a tuple with two Ints, but easier to work with on the type level. – Fresheyeball Dec 31 '16 at 21:55
  • So what does parseToInt do with a string which describes a number outside of the domain of Int? (or are you using a language here that has an arbitrary precision integer class?) – Cort Ammon Dec 31 '16 at 23:45
  • @CortAmmon it's pseudocode. That consideration is language-dependent. – Fresheyeball Jan 01 '17 at 07:33
  • @Fresheyeball That consideration is fundamental to the functioning of the algorithm you provided, given that nearly 100% of all calls to it will result in a string parsing outside of the domain of int. – Cort Ammon Jan 01 '17 at 07:42
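For what it's worth, a minimal Python sketch of the two-field structure discussed in these comments (the names and widths are illustrative); keeping the timestamp and the random part as separate fields sidesteps the integer-overflow problem entirely:

    import random
    import time

    def make_packet_id():
        millis = int(time.time() * 1000)  # current time in milliseconds
        noise = random.getrandbits(32)    # 32 random bits
        return (millis, noise)            # collides only if both parts match

    # Or pack both parts into one wide integer (which therefore needs more than 32 bits):
    def make_packet_id_int():
        millis = int(time.time() * 1000)
        return (millis << 32) | random.getrandbits(32)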
Score: 1

Sure, you've got a pretty low probability of two random 32-bit integers being identical, but it's not completely impossible. The appropriate engineering decision is based on what the consequences of collisions would be, an estimate of the volume of numbers you're generating, the lifetime over which uniqueness is required, and what happens if a malicious user starts attempting to cause collisions.

Score: 1

Built into some of the answers above is the assumption that the random number generator is indeed 'flat' - that any two numbers are equally likely to be generated next.

That's probably not true for most random number generators, most of which use some high-order polynomial repeatedly applied to a seed.

That said, there are many systems out there that depend on this scheme, usually with UUIDs. For example, every object and asset in Second Life has a 128-bit UUID, generated randomly, and they rarely collide.
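For context, here is a minimal sketch of that kind of deterministic recurrence, using a linear congruential generator with the widely published Numerical Recipes constants (the simplest case of the idea, not a claim about any particular library):

    def lcg(seed):
        """Linear congruential generator: state -> (a*state + c) mod 2^32."""
        state = seed
        while True:
            state = (1664525 * state + 1013904223) % (2 ** 32)
            yield state

    gen = lcg(42)
    print([next(gen) for _ in range(3)])

Over its full 2^32-step period this generator never repeats a value, which is exactly why guaranteed non-repetition and independent random draws are not the same thing.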

Anniepoo
Score: 0

It can be acceptable to assume that random numbers will be unique but you have to be careful.

Assuming your random numbers are uniformly distributed, the probability of a collision is roughly (n²/2)/k, where n is the number of random numbers you generate and k is the number of possible values a "random" number can take.

You don't put a number on "astronomically unlikely", so let's take it as 1 in 2³⁰ (roughly one in a billion). Let's further say you generate 2³⁰ packets (if each packet represents about a kilobyte of data, then this means about a terabyte of total data, large but not unimaginably so). We find we need a random number with at least 2⁸⁹ possible values.

Firstly, your random numbers need to be big enough. A 32-bit random number can have at most 2³² possible values. For a busy server that is nowhere near high enough.

Secondly, your random number generator needs to have a sufficiently large internal state. If your random number generator only has a 32-bit internal state, then no matter how big the value you generate from it, you will still only get at most 2³² possible values.

Thirdly, if you need the random numbers to be unique across connections rather than just within a connection, your random number generator needs to be well-seeded. This is especially true if your program is restarted frequently.

In general the "regular" random number generators in programming languages are not suitable for such use. The random number generators provided by cryptography libraries generally are.
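A short Python sketch of that back-of-the-envelope calculation (the helper name is mine; the target numbers mirror the ones above):

    import math

    def required_bits(n_ids, max_collision_probability):
        """Bits needed so n_ids random IDs collide with at most the given probability,
        using the approximation P(collision) ~ n^2 / (2k) with k = 2**bits."""
        k = n_ids ** 2 / (2 * max_collision_probability)
        return math.ceil(math.log2(k))

    print(required_bits(2 ** 30, 2 ** -30))  # 89 bits, as computed above
    print(required_bits(2 ** 16, 0.5))       # ~32 bits: the birthday bound for 32-bit IDs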

Peter Green
Score: 0

A lot of people have already given high-quality answers, but I'd like to add a few minor points: first, @nomadictype's point about the birthday paradox is excellent.

Another point: randomness isn't as straightforward to generate and define as people might ordinarily assume. (In fact, there are statistical tests for randomness available.)

With that said, it's important to be aware of the Gambler's Fallacy, which is a statistical fallacy where people assume that independent events somehow influence each other. Random events are generally statistically independent of each other - i.e. if you randomly generate a "10" it doesn't change your future probability of generating more "10"s in the least. (Maybe someone could come up with an exception to that rule, but I'd expect that that would be the case for pretty much all random number generators).

So my answer is that if you could assume that a sufficiently-long sequence of random numbers were unique, they wouldn't really be random numbers because that would be a clear statistical pattern. Also, it would imply that each new number isn't an independent event because if you generate, for example, a 10 that would mean that the probability of generating any future 10s would be 0% (it couldn't possibly happen), plus that would mean that you'd increase the odds of getting a number other than 10 (i.e. the more numbers you generate, the higher the probability of each of the remaining numbers becomes).

One more thing to consider: the chance of winning the Powerball off of playing a single game is, as I understand it, approximately 1 in 175 million. However, the odds of someone winning are considerably higher than that. You're more interested in the odds of someone "winning" (i.e. being a duplicate) than in the odds of any particular number "winning"/being a duplicate.

  • If one is generating 4096-bit identifiers in such a way that every bit is equally likely to be 0 or 1 independent of any other bit that has been generated in the same or any other identifier, the probability that any two identifiers would ever match would be vanishingly small even if one were to randomly generate a different identifier for each of the roughly 4.0E81 atoms in the observable universe. The fact that such identifiers would almost certainly be unique would not in any way make them "non-random". – supercat Dec 31 '16 at 01:09
  • @supercat That's true - given a sufficiently large number it's highly unlikely that there will be duplicates, but it's not impossible. It really depends how bad the consequences of non-uniqueness are whether what the OP is describing is a good idea. – EJoshuaS - Stand with Ukraine Dec 31 '16 at 02:32
  • If the probability of a random chance collision is smaller than the probability of a meteor strike obliterating the devices that rely upon the unique ids, from an engineering perspective there's no need to worry about the former. There would be a big need to worry about anything that could cause the random numbers not to be independent, but random collisions would be a non-issue. – supercat Dec 31 '16 at 16:07
  • @supercat I think you're misreading this; see the other answer on the birthday paradox. I think a collision is far more likely than you're calculating - the OP's just using a 32-bit number, so I'm not sure where you're getting 4096 from, and as nomadictype showed, the probability of an eventual collision with a number of that length is actually surprisingly high. – EJoshuaS - Stand with Ukraine Dec 31 '16 at 17:02
  • You're right that a 32-bit number is too short even for small populations if collisions are totally unacceptable. If one uses a number which is sufficiently large one can reduce the probability of random collisions to the point where one can safely assume they Just Won't Happen, and in many cases using a larger number may be better than trying to use other means of ensuring uniqueness, since the latter generally requires having access to state transitions that cannot be undone or rolled back, even if a system's clock is reset or the system is reloaded from a backup. – supercat Dec 31 '16 at 17:13
  • @supercat I agree, especially if having a duplicate's not too catastrophic - it's still important to remember that "highly unlikely" isn't the same as "impossible" though, especially given a large number of "tries". The odds of winning the Lottery even once is 1 in 175 million, and yet people win - a few people have even won several times, apparently. Even if a particular number has a low chance of being a duplicate, that doesn't necessarily mean that there's a low chance of any number being a duplicate (each individual has a low chance of winning the lottery, but someone often wins). – EJoshuaS - Stand with Ukraine Dec 31 '16 at 17:40
  • 175 million is only about 2^27. 2^4096 is an enormous number on a completely different scale. A counter anecdote to the lottery is the idea that most likely, no two people have ever shuffled a deck of cards into the same order (when shuffling properly), and probably never will before humans die out. – whatsisname Dec 31 '16 at 19:43
  • @whatsisname The OP is only using a 32-bit number (not a 4096-bit number), which allows only a little over 4 billion values, so I think the analogy works here. – EJoshuaS - Stand with Ukraine Dec 31 '16 at 23:26
  • @EJoshuaS: you were arguing with supercat who had an example of a 4096 bit number. – whatsisname Dec 31 '16 at 23:30
Score: 0

It doesn't matter how many bits you use - you CANNOT guarantee that two "random" numbers will be different. Instead, I suggest that you use something like the IP address or other network address of the computer and a sequential number, preferably a HONKIN' BIG sequential number - 128 bits (obviously unsigned) sounds like a good start, but 256 would be better.
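A minimal Python sketch of that suggestion (the field widths and helper names are my own choices, not the answerer's):

    import itertools
    import uuid

    NODE_ID = uuid.getnode()      # 48-bit MAC-derived hardware address
    _counter = itertools.count()  # per-node monotonically increasing sequence number

    def next_packet_id():
        """Concatenate the node address (high bits) with a large per-node counter."""
        return (NODE_ID << 80) | next(_counter)  # 48 + 80 = 128-bit identifier

In a real system the counter would have to survive restarts (or be re-seeded from a clock), otherwise the same (node, sequence) pair could be issued twice.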

Score: -1

No, of course not. Unless the RNG you're using samples without replacement, there's a chance - however small - of duplication.