
Birthday paradox is way too important to omit from this list!

For example if you're uploading stuff to a bucket, you can compute its hash first to figure out if a duplicate already exists and if so, skip the upload.

Why can you do this? What if it was just a hash collision? Shouldn't you still compare the contents to really make sure they are the same?

Turns out if your hash function is N bits, you need on the order of 2^(N/2) items before you expect to see two of them hash to the same value by chance. If you choose a 256-bit cryptographic hash function like SHA256, that's 2^128. This probability is so low you have a higher chance of encountering a cosmic ray bit flip!

https://en.wikipedia.org/wiki/Content-addressable_storage

https://en.wikipedia.org/wiki/Birthday_problem
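A minimal sketch of that dedup-by-digest idea (assuming Python, with a plain dict standing in for the real bucket/object store):

```python
import hashlib

def upload_if_new(data: bytes, bucket: dict) -> str:
    """Content-addressable storage sketch: key each object by its SHA-256 digest.
    `bucket` is a stand-in for a real object store."""
    key = hashlib.sha256(data).hexdigest()
    if key not in bucket:   # same digest => treat as duplicate content
        bucket[key] = data  # only upload previously unseen content
    return key

store = {}
k1 = upload_if_new(b"hello", store)
k2 = upload_if_new(b"hello", store)  # skipped: identical digest
assert k1 == k2 and len(store) == 1
```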



I don't quite understand how the birthday paradox relates to your example. Your example is how a hash collision is a very low probability event. But the birthday paradox is typically stated as odds being much higher than you expect.

The hash collision problem is a birthday paradox, I suppose, but the probabilities are so low that that part isn't stressed in your post.

I feel like a better example of the Birthday Paradox would be something like:

I automatically give all my users a 5-character random ID. I figure there are nearly 12 million (26^5) combinations, so I should easily be able to support a hundred thousand users, since (naively) I calculate that the probability of the 100,001st user matching any existing user is less than 1%. [1]

But if you calculate it correctly as the birthday paradox, you'll see that the probability of having a collision given 100,000 users is ~100%. [2]

[1] 1 - (1-(1/11881376))^100000 < 0.01

[2] 1 - (((11881376-1)/11881376)^((100000 * (100000-1))/2)) = 1
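Both footnote calculations check out numerically; here's a quick Python verification:

```python
m = 26 ** 5  # 11,881,376 possible 5-character IDs
n = 100_000  # users

# [1] Naive view: chance that one new user collides with n existing IDs
p_single = 1 - (1 - 1 / m) ** n
print(f"one new user vs {n} existing: {p_single:.4f}")  # ~0.0084, under 1%

# [2] Birthday view: chance of ANY collision among n users (pairwise)
pairs = n * (n - 1) // 2  # ~5 billion pairs
p_any = 1 - ((m - 1) / m) ** pairs
print(f"any collision among {n} users: {p_any:.6f}")  # ~1.0
```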


> Turns out if your hash function is N bits you will need to have 2^(N/2) items before you see two hashed to the same thing by chance.

No no no. That's not how probability works. You will see that many on average, but you don't "need to" anything before you can see a collision. You could get collisions on the next 100 files. It's just unlikely. The bitspace bounds the denominator of match probability for each file independently, it does not count files. Unlikely events happen all the time somewhere.


> No no no. That's not how probability works.

Pretty sure GP understands how probability works, maybe take a more charitable interpretation of their comment?


> Pretty sure GP understands how probability works, maybe take a more charitable interpretation of their comment?

There are many circumstances where I fully agree with charitable interpretation, but this is not one of them. In this case it's tantamount to saying that it's ok to mislead people because they should just know better. (especially in light of their reply)

If you are correct about what they understand, then I'd rather they were more charitable to the other people who don't already understand it by not writing things that are wrong and potentially dangerous. Someone out there is going to learn about this by googling, see 19 out of 20 people get it wrong, implement a system that relies on the majority false information, and then destroy the lives of a bunch of innocent people. It may not be someone you know, and it may not impact your neighbors, but it happens _all_ the time. I care about the harm of spreading misinformation, and I think you should as well.


The precise statement is that you need to have O(2^(N/2)) items before the probability of finding a hash collision is greater than 50% (or whatever nontrivial percentage). English is hard and I think whoever cares about the details can look it up themselves.
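You can watch that 2^(N/2) threshold empirically by truncating a hash to something small, say 32 bits, and counting draws until the first repeat (a Python sketch; the median should land near sqrt(2 ln 2) * 2^16, roughly 77,000):

```python
import hashlib
import os

def items_until_collision(bits: int = 32) -> int:
    """Hash random inputs with SHA-256 truncated to `bits` bits,
    returning how many items it took to see the first collision."""
    seen = set()
    count = 0
    nbytes = bits // 8
    while True:
        h = hashlib.sha256(os.urandom(16)).digest()[:nbytes]
        count += 1
        if h in seen:
            return count
        seen.add(h)

# Any single run can collide early or late; the median of a few runs
# concentrates around ~1.18 * 2^(bits/2).
runs = sorted(items_until_collision() for _ in range(5))
print("median items until 32-bit collision:", runs[2])
```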


> I think whoever cares about the details can look it up themselves

Your statement was an incredibly common and often repeated misperception of probability and circumstance. Someone looking it up for themselves, as you suggest, is more likely than not to encounter dozens of wrong statements on the subject before they ever find a right one. I think it behooves us to not add more misinformation to the pile.


The precise statement should be Omega, not O, since you're doing a lower-bound.


"This probability is so low you have a higher chance of encountering a cosmic ray bit flip!"

We commonly conceive of our computers as 100% accurate, but as you observe here, for this and other reasons they aren't. The distribution of errors is exceedingly pathological; the vast majority of errors you will encounter are not independent but are highly correlated, because some particular bit of hardware is flawed in some manner.

Ignore those for a moment, and consider just the purely-random failures, like cosmic or thermal bit flips. The rate on these is still non-zero. Very small, but high enough that we've pretty much all encountered them, usually without realizing it. (While it is true that a single bit flip can bring a program down if it is the correct bit, the average bit flip will have no visible manifestation of any kind.)

This creates a "noise floor" for our computations. Any event whose probability falls below this "noise floor" can be treated as having the same zero probability you assign to totally random hardware failure.

The probability of two random pieces of content having the same SHA256 hash by random chance is in practice zero, and you may write your code that way. The potential problem that one may wish to defend against is the possibility that two pieces of content have the same SHA256 hash for non-random reasons, which is to say, the possibility that it will be broken. But the defense against that is rather different. There's a lot of nasty possibilities that lie above this noise floor that still need to be dealt with.

I have seen this concept kinda bothers people sometimes. But you are justified in just ignoring anything below this noise floor. There's basically never a reason to worry about this noise floor, because once you understand what it really means, you will see you lack the tools to deal with it. Given an error, the probability that the error is the result of some non-random systemic issue is much higher than the probability that it was a truly random error. Even if you care about reliability a lot, like in space or medicine, the problem you need to deal with is failing hardware. If you try to write code to deal with the noise floor, it is much more likely to be getting invoked due to systemic, non-random issues... after all, a "low probability" failure on a network link that corrupts one packet a day is still multiple orders of magnitude above this noise floor.


Also, if you choose an unbroken cryptographic hash like SHA256, and you stumble on a collision, you can later publish it and become famous.

(Of course, you'd need enough logging and debugging to be able to reproduce that.)


Flipped bits are not rare at all. Flipped bits are very common and your computer (or datacenter) has so many transistors that I assure you, bits are flipping all over the place.


In fact, this is exactly why things like ECC memory exist -- https://en.wikipedia.org/wiki/ECC_memory


If you are hashing k unique elements, N=2lg(k) is a good rule of thumb for sizing your hash functions. A Birthday Attack [1] analysis provides the exact collision probabilities for arbitrary (k, N).

[1] https://en.m.wikipedia.org/wiki/Birthday_attack


The most useful formula I’ve seen related to the birthday problem is the approximation p = (n^2)/(2m), where n is the number of random choices made, m is the number of choices available, and p is the probability of at least one collision. You can derive the usual formula for the number of hash bits necessary to avoid a collision with probability 1/2 (given a set of cardinality n) by just substituting 1/2 for p and solving for m. But this formula is much more useful than that special case since it lets you set the collision probability to be arbitrarily small.
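Rearranged for sizing, that approximation gives the hash width needed for a target collision probability (a Python sketch; the approximation is tight for small p and only rough near p = 1/2):

```python
import math

def bits_needed(n_items: int, p_collision: float) -> int:
    """From p ~= n^2 / (2m): solve for m >= n^2 / (2p),
    then take bits = ceil(log2(m))."""
    m = n_items ** 2 / (2 * p_collision)
    return math.ceil(math.log2(m))

print(bits_needed(100_000, 0.5))   # 50% threshold: 34 bits
print(bits_needed(100_000, 1e-9))  # one-in-a-billion: 63 bits
```

Note how the first case recovers the usual rule: for p = 1/2 it reduces to roughly 2 log2(n) bits, i.e. the 2^(N/2) threshold discussed above.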


Sure, the probability of a collision is exceedingly low if you use SHA256. And if you take care of hashing _the whole content_.

But in many programming applications (e.g. hash maps) you wouldn't use a cryptographic hash because it's too slow. Then it becomes important to design a hash function that is "good enough" and, unfortunately, it's quite easy to implement "hashCode" (or whatever your programming language calls it) in a broken way for your types and suddenly you have collisions. I've seen exactly this problem in my own code.
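In Python, for example, the equivalent trap is a `__hash__` that ignores part of the value (hypothetical `Point` class purely for illustration):

```python
class Point:
    """Illustrative bug: a hand-rolled __hash__ that ignores one field."""

    def __init__(self, x: int, y: int):
        self.x, self.y = x, y

    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash(self.x)  # bug: y is ignored, so (1, 0)..(1, n) all collide

# Every point with the same x lands in the same bucket, so dict/set
# lookups silently degrade from O(1) to O(n).
pts = {Point(1, y) for y in range(1000)}
print(len(pts), len({hash(p) for p in pts}))  # 1000 distinct values, 1 hash
```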



