18

I want to SHA256 hash phone numbers in order to hide them. Is this a good idea? Are there any other ways I could make this safe?

Jack Resone
  • 10
    What are you doing with the phone numbers that you need to hide them? – Aman Grewal Jun 30 '20 at 19:28
  • 7
    Remark (not an answer): there's a special thing with phone numbers. Typically, adding a suffix leaves the number functional. Sometimes adding a space or other character, or some prefix, will do the same. E.g. where I live, 0123456789 is the same as 01234567890, 01234567891, 01 23 45 67 89, +33123456789. That makes blacklists based on phone numbers (hashed, encrypted, whatever) easy to circumvent unless there is a strong normalization of the phone numbers, and that's not easy. – fgrieu Jul 01 '20 at 07:21
  • @fgrieu: do you have some sources for that? 0123456789 is OK. It is not the same as 01234567891 (with an extra 1). Depending on the provider you get different answers (Orange will say that the number is not attributed). The spaces are irrelevant (but I think you mean you can traditionally write a correct number in different ways - in that case 01.23.45.67.89 is common in France as well). – WoJ Jul 01 '20 at 09:51
  • 1
    What is your actual problem? When someone is trying to hide a phone number this indicates they need more to hide. The articles below the answer How can frequency analysis be applied to modern ciphers? might be helpful to your real cause. – kelalaka Jul 01 '20 at 10:47
  • @WoJ: Source is experiments. When I dial my mobile number (Orange) on my fixed phone (OVH) adding an extra digit, or vice versa, the call goes through. When a server I run sends automated SMS (through OVH), same. On this, I have witnessed users registering multiple times with the same mobile phone by adding such suffixes, and trying to do so by adding spaces or dots (these ones are easy to filter). Since sending an SMS has a cost, and one of the intents of sending an SMS is to limit multiple registrations, that's an issue. There is no catch-all solution to tell when digits become insignificant. – fgrieu Jul 01 '20 at 11:01
  • Does the adversary know they're looking at the hash of a phone number? If not, it's a bit of a different story, although I can't think of a situation that would prompt one to ask this question where the fact that the hashed data is a phone number wouldn't be known or easily discoverable. – David Z Jul 01 '20 at 11:36
  • 4
    @David Z: Kerckhoffs's (second) principle or Shannon's «the enemy knows the system» makes it necessary to assume adversaries know what you hash. Plus, the popular use of phone numbers as passwords makes it plausible they are tested when attacking a password database. – fgrieu Jul 01 '20 at 11:47
  • 1
    @fgrieu: this is really interesting. I just tried to call my 01 number from my mobile (Orange), adding an extra digit and the call was rejected ("number not assigned"). I then tried to call the mobile + extra digit from my company phone (a 01...) and the call was straight rejected (probably because of the filtering at my company, which would reject a "malformed" number). Finally, I called the mobile from my fixed line at home (with the extra digit) and, tadam, it works. It may be that mobile numbers are more forgiving - anyway thanks for the TIL. – WoJ Jul 01 '20 at 11:51
  • 1
    @WoJ It has more to do with how the call is initiated. Historically in the US, you could advertise your phone number with extra digits (for example, you chose your number because it would spell out ACME FOODS or something.). Your phone number is really 226-3366, but a customer could dial 226336637 because the exchange would simply stop "listening" after the final 6; the 3 and 7 simply weren't part of the number as dialed. – chepner Jul 01 '20 at 12:42
  • 9
  • 1
    @fgrieu I'm familiar with the principle, but I made the point because of a situation where, say, there are a bunch of hashes of assorted data, some of which happen to be phone numbers. The attacker might not know which ones. (If you assume otherwise, that goes beyond Kerchoff's principle and starts to approach "the attacker knows the 'plaintext'".) Basically there is some missing information about the threat model which I thought would be useful to clarify explicitly. But you do make a good point about phone numbers being early on the list of brute force attempts. – David Z Jul 01 '20 at 18:04
  • Hashing for added security - yeah, why not. Relying on hash only for security - nope. Reasons are elaborated in the responses. – Zsolt Szilagy Jul 03 '20 at 12:21

5 Answers

35

No, it is not a good idea to hash phone numbers. There are only a limited number of phone numbers, so it is pretty easy for an adversary to hash all of them and compare each result with the stored hash. Generally the adversary doesn't even have to deal with all telephone numbers, only a subset of them (for a specific country or other logically distinct group).
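To make the brute-force argument concrete, here is a minimal sketch in Python. The phone number, the known prefix and the number of unknown digits are all made-up assumptions for illustration; a real attacker would use GPU tooling rather than a Python loop.

```python
import hashlib
from typing import Optional


def sha256_hex(number: str) -> str:
    """Unsalted SHA-256 of a phone number given as a digit string."""
    return hashlib.sha256(number.encode("ascii")).hexdigest()


def brute_force(target_hash: str, prefix: str, unknown_digits: int) -> Optional[str]:
    """Try every number made of `prefix` followed by `unknown_digits` digits."""
    for i in range(10 ** unknown_digits):
        candidate = f"{prefix}{i:0{unknown_digits}d}"
        if sha256_hex(candidate) == target_hash:
            return candidate
    return None


# Hypothetical stored value: the attacker only sees this hash.
stored_hash = sha256_hex("0123456789")

# Assumption: the attacker knows the first five digits (e.g. an area code),
# leaving only 10**5 candidates to try.
recovered = brute_force(stored_hash, prefix="01234", unknown_digits=5)
print(recovered)
```

The loop covers the whole candidate range, so the size of that range, not the strength of SHA-256, decides how long recovery takes.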

You could use a slow password hash with salt and work factor, but that only means that the time required is multiplied by a large, constant value. It won't change the order of the number of operations. If the subset is small enough, it may not deter an adversary from performing all the necessary calculations.

In this case you will probably need to encrypt the phone numbers instead. Or use a keyed hash such as HMAC. For both options you need to perform key management on the secret key though; it's not as easy as just hashing the number.
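A keyed hash could look roughly like the sketch below, using Python's standard `hmac` module. The key is generated on the spot purely for demonstration; where the key actually lives (key store, HSM, and so on) is exactly the key management problem mentioned above and is not solved here. The function name `phone_tag` is hypothetical.

```python
import hashlib
import hmac
import secrets

# Assumption: in a real system the key is loaded from a key store or HSM,
# never generated ad hoc or hard-coded like this.
key = secrets.token_bytes(32)


def phone_tag(key: bytes, number: str) -> str:
    """HMAC-SHA256 of the phone number under a secret key."""
    return hmac.new(key, number.encode("ascii"), hashlib.sha256).hexdigest()


tag = phone_tag(key, "0123456789")

# With the key, a lookup recomputes the tag and compares in constant time.
matches = hmac.compare_digest(tag, phone_tag(key, "0123456789"))

# With a different key, even the correct number gives an unrelated tag,
# so enumerating all phone numbers is useless without the key.
wrong_key_tag = phone_tag(secrets.token_bytes(32), "0123456789")
print(matches, wrong_key_tag == tag)
```

Because the tag is deterministic under a fixed key, equal numbers still map to equal tags, which keeps lookups and duplicate checks possible.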

Maarten Bodewes
  • 7
    "Limited" is not a good quantifier. $2^{256}$ is also limited. I would rather say there are at most $10^{10}$ phone numbers ($\approx 2^{34}$), and actually fewer: in ABC-XYZ-XXYY, ABC is usually less than $10^{3}$. – kelalaka Jun 30 '20 at 19:50
  • 2
    Yeah, only a few billion, you don't really need too much calculations for that. And it could be that you just want to check if a few phone numbers are in there, in that case you might be able to count them on your fingers. – Maarten Bodewes Jun 30 '20 at 20:24
  • 1
    @kelalaka The (significant) number space for phone numbers is 15 digits long. You are just looking at some local part of that. So it's more like 2^50... - The rest of your comment is valid, of course. – I'm with Monica Jul 01 '20 at 07:14
  • 4
    @I'mwithMonica I'm not sure you can even say that much; libphonenumber claims some German numbers exceed the ITU-T 15-digit standard and some systems allow you to automatically dial internal extensions, which could be any length. On the flip side, an attacker knowing that the target number is in a particular region might have as few as 5 digits to guess. I don't think it's really meaningful to talk about the possible range of numbers without knowing more context of the application. – IMSoP Jul 01 '20 at 11:03
  • 1
    @I'mwithMonica 2^50 seems like a lot. I dare say that humans have less than 10 phone numbers on average : that would be 2^36 at most. – Eric Duminil Jul 01 '20 at 15:07
  • @EricDuminil You're right, but as an attacker you'd need to know which 2^36 out of the range of 2^50 theoretically valid phone numbers those are. - Getting all (previously unknown) phone numbers for all hashes in a given database will thus be a problem. But just verifying that a specific phone number - or set of specific numbers - is in the database (and fetching the matching record(s)) would be a lot easier. – I'm with Monica Jul 01 '20 at 20:52
  • 2
    @I'mwithMonica: Keep in mind that even if the total space of all possible phone numbers is quite large, an attacker may still be able to accomplish quite a lot by just assuming "some local part of that" space (especially since they'll know enough about the application to be able to guess which local part). If they manage to crack only half the phone numbers in the set, that's still probably a huge problem. – ruakh Jul 01 '20 at 21:44
  • Can you explain why encryption is a useful alternative? If OP can securely store the key, then they could have securely stored the hashes, so what is the difference? In either scenario if the data (key / hashes) is leaked then there is a problem. – JBentley Jul 02 '20 at 08:55
  • There is not much of a difference, if somebody gets hold of the key then the telephone number is relatively easy to find in both cases. The problem is of course that both require you to have a secret key and the key management that comes with it. The issue that it requires a key is the main one; after that there are any number of methods to implement the protection of the number. Maybe I should have written "protection with secret" but most people would have assumed encryption in that case as well. I've adjusted the answer though. – Maarten Bodewes Jul 02 '20 at 09:08
  • @IMSoP: is right; when I worked for Lucent one of the big projects was to push the maximum number length from 18 to 35. And this was already back in 1999. – MSalters Jul 02 '20 at 14:54
  • Just like with IPv6 I presume that this is not because the number space isn't large enough for the total numbers, but because there is a lot of logical division going on. I mean, earth population is large, but we're not anywhere near 10^35 (and we are dang unlikely to reach that number without destroying it, as if the current population isn't enough for that). That also means an adversary doesn't have to try and hash that many numbers of course. – Maarten Bodewes Jul 02 '20 at 15:51
  • @MaartenBodewes Your conclusion doesn't follow from your premise - the number allocated may be vastly smaller than the number possible with a given length, but that fact is only useful to an adversary if they have perfect knowledge of which numbers are allocated. Otherwise, it's like saying they don't need to try every password, because most strings have never been used as a password. There might be an equivalent of a dictionary attack using publicly listed phone numbers, but as I said before, it's impossible to comment on that without context which this question doesn't provide. – IMSoP Jul 03 '20 at 08:55
  • A password doesn't (or rather shouldn't) have many distinguishing properties. Yes, without context we're kind of stuck, but when I mull this over I can find more and more situations where confirming a guess is important, or where finding a set of phone numbers from a larger set would be interesting (collisions are easier to find due to the birthday problem). There may be situations where this is not the case, but even then the total number of existing phone numbers is relatively low. There may be one or more cases where a password hash could suffice - but generally it won't. – Maarten Bodewes Jul 03 '20 at 09:29
  • @MaartenBodewes I think there are two separate weaknesses here: one is that a dictionary attack may be more useful against a list of phone number hashes than a list of password hashes because phone numbers may be more predictable; the other is that a brute force attack (not using any dictionary) may be feasible because most phone numbers will be limited in length. Calculating all possible phone numbers can probably be made infeasible, because there are at least 10^15 of them; calculating a useful dictionary to attack a particular use case is harder to defend against. – IMSoP Jul 03 '20 at 11:33
13

It is always a bad idea to hash data that has a limited length or character set.

A phone number in Germany, for example, normally has no more than 12 digits. The first digit is always a 0, and the vast majority of numbers are longer than 3 digits, as those are normally reserved for emergency services.

This effectively leaves us with 10^11-10^3 possible combinations. The time required to brute force this many combinations depends greatly on the algorithm used.
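As a rough back-of-the-envelope sketch, the dependence on the algorithm can be expressed as candidates divided by hash rate. The hash rates below are hypothetical placeholders, not benchmark results; plug in measured figures for the algorithm and hardware you care about.

```python
def search_space(max_free_digits: int = 11, short_numbers: int = 10**3) -> int:
    """Number of candidates: 10^11 minus the 10^3 short (reserved) numbers."""
    return 10**max_free_digits - short_numbers


def seconds_to_exhaust(candidates: int, hashes_per_second: float) -> float:
    """Worst-case time to try every candidate at a given hash rate."""
    return candidates / hashes_per_second


candidates = search_space()

# Hypothetical rates, for illustration only: a slow tuned password hash,
# a single CPU on a fast hash, and a GPU rig on a fast hash.
for label, rate in [("1e4 H/s", 1e4), ("1e7 H/s", 1e7), ("1e11 H/s", 1e11)]:
    hours = seconds_to_exhaust(candidates, rate) / 3600
    print(f"{label}: about {hours:.2f} hours")
```

The point of the calculation is that the numerator is fixed and small, so only the hash rate is left for the defender to influence.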

With MD5, which is absolutely insecure nowadays, 8x Nvidia GTX 1080s running Hashcat get through this in less than 10 minutes. Unfortunately, in my experience, there are still thousands if not millions of services hashing even passwords with insecure algorithms.

With bcrypt, for example, you could slow this down by a factor of more than 2000; however, this would still be incredibly insecure. Normally the cost also has to be set according to the backend performance requirements. If an attacker could guess the location of the phone numbers to crack, it would be a matter of seconds.

You have the same problem when trying to hash IP addresses; it's also not a secure way to hide the plaintext.

dmuensterer
  • 3
    Phone numbers in Germany do not always start with 01. That's only the case for mobile phone numbers. The numbers 2 through 9 are all used by area codes for land lines. There's a lot of detail in https://en.wikipedia.org/wiki/List_of_dialling_codes_in_Germany – Nzall Jul 01 '20 at 12:48
  • @Nzall You are obviously correct. I only thought of mobile numbers! Still, quite a small set :-) – dmuensterer Jul 01 '20 at 13:26
  • Phone numbers in Germany can, actually, be 3 digits long. They are either emergency services, then, or only locally dialable, though. But there are local area nets with (usually legacy) three digit phone numbers. – I'm with Monica Jul 01 '20 at 20:57
  • 2
    A tenfold increase is a serious underestimate. I can compute SHA-512 at 36 MHash/s on my CPU, taking about four minutes to brute-force all possible phone numbers. – Mark Jul 01 '20 at 21:40
  • bcrypt is adaptive and can be made as slow as you like making that 9 day estimation very suspect. The substance of the answer remains correct. – Schwern Jul 02 '20 at 16:53
  • Thank you for your answers! I clarified my points for better understanding. @Schwern, could you please provide a source for that statement? I was referring to the following hashcat benchmark: https://gist.github.com/epixoip/a83d38f412b4737e99bbef804a270c40 – dmuensterer Jul 02 '20 at 17:19
  • 1
    @dmuensterer That's how bcrypt works, you must give it a cost setting. See https://gist.github.com/epixoip/a83d38f412b4737e99bbef804a270c40#gistcomment-1796273 and https://gist.github.com/epixoip/a83d38f412b4737e99bbef804a270c40#gistcomment-1796418 in that thread. – Schwern Jul 02 '20 at 17:23
  • @Schwern Great addition, thank you. – dmuensterer Jul 02 '20 at 17:30
13

In general, this problem is known as the small input space problem for hash functions. In short, simple hashing won't be secure.

When you hash data (here a phone number) and an attacker tries to find an input value that matches the hash value, this is called a pre-image attack. For a secure cryptographic hash function, a pre-image attack requires $\mathcal{O}(2^n)$ time, where $n$ is the output size of the hash function; for SHA256, $n=256$.

If the input space is small, this gives the attacker a huge boost: they only need to brute force the small space. If 10-digit phone numbers are stored, the attacker needs to search a space of only $\approx 2^{34}$, and with 15 digits only $\approx 2^{50}$. Even the latter is highly achievable with a good GPU; see the hashcat performance. Therefore one needs a way either to slow the attacker down or to make the attack harder.
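The exponents can be checked with a few lines of Python: a number with $d$ free decimal digits carries $d \cdot \log_2 10$ bits. The helper name `digit_bits` is just for this sketch. The exact values come out slightly below the rounded-up $2^{34}$ and $2^{50}$ used here.

```python
import math


def digit_bits(digits: int) -> float:
    """Bits of search space for a number with `digits` free decimal digits."""
    return digits * math.log2(10)


for digits in (10, 15):
    print(f"{digits} digits -> about 2^{digit_bits(digits):.1f} candidates")

# Compare with the generic pre-image cost for SHA-256 (n = 256).
gap_bits = 256 - digit_bits(15)
print(f"the attacker saves roughly 2^{gap_bits:.0f} work for 15-digit inputs")
```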

  • To make the attack slower, slow and memory-hard hash functions such as Scrypt or Argon2id can be preferred. The cost can be adjusted according to the target's capabilities. For example, using 100K iterations will slow the attacker down by a factor of 100K, i.e. reduce the search space they can cover in a given time by a factor of approximately $2^{17}$. As an upper bound on computing power, the collective power of the Bitcoin miners can reach $\approx 2^{92}$ double SHA256 computations in a year. If your enemy has this power, slowing will not help much.

    Another choice is a per-record salt, as stated in the other answers, together with a slow and memory-hard hash function. This only slows the attack and prevents pre-computed tables such as rainbow tables: the attacker's execution time grows with the number of target hashes.

  • To make it harder, HMAC can be preferred. This is a keyed hash function and can be instantiated with SHA256, too. An attacker without the key has no way to attack the hash value. Another way is encryption. Although phone numbers should be unique, ECB mode is deterministic, and that can be used to mount an attack that identifies a number: attackers can register and enter a target phone number as their own to locate the target's position in the database. Therefore an IND-CPA secure mode such as CBC or CTR should be preferred.
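The "slower" option from the first bullet, a per-record salt combined with a memory-hard function, might be sketched as follows with `hashlib.scrypt` from the Python standard library (available when Python is built against OpenSSL 1.1 or later). The cost parameters and the helper names `slow_hash`, `store` and `verify` are illustrative assumptions, not recommendations.

```python
import hashlib
import hmac
import os
from typing import Tuple


def slow_hash(number: str, salt: bytes) -> bytes:
    """Memory-hard hash of a phone number with a per-record salt."""
    # Assumption: n, r, p are example values; tune them to your own hardware.
    return hashlib.scrypt(number.encode("ascii"), salt=salt,
                          n=2**14, r=8, p=1, dklen=32)


def store(number: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest) to keep in the database for one record."""
    salt = os.urandom(16)  # fresh random salt for every record
    return salt, slow_hash(number, salt)


def verify(number: str, salt: bytes, digest: bytes) -> bool:
    """Check a candidate number against a stored (salt, digest) pair."""
    return hmac.compare_digest(slow_hash(number, salt), digest)


salt, digest = store("0123456789")
print(verify("0123456789", salt, digest), verify("0123456780", salt, digest))
```

Note that the per-record salt makes direct lookups by phone number impossible without trying every record, which is one practical reason the keyed options in the second bullet are often preferred.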

Both HMAC and encryption leave an additional problem to be solved: the storage of the keys. For this, Hardware Security Modules (HSMs) can be preferred. The keys cannot be extracted from the modules, and the HMAC and encryption operations can be performed on these devices. If attackers gain access to the application server that uses the HSM, the only hope is that their access is limited to using the HSM as a slave, without being able to extract the keys.

Conclusion: use encryption or HMAC. If you fear the loss or exposure of the keys, store them in an HSM and perform the encryption/HMAC on the HSM.

kelalaka
7

As an alternative, you can salt the phone numbers to avoid pre-calculation attacks.

A known salt will help against an adversary who has already hashed all possible phone numbers, but it only forces them to do that work once more (the adversary just has to recalculate all the hashes with the salted phone numbers).

If you can keep the salt private, that raises the bar on brute force attacks (essentially you are adding the salt's bits of entropy to the entropy inherent in the phone numbers).
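The difference between the pre-calculation attack and the recalculation can be shown with a toy example. The candidate range is deliberately tiny (four unknown digits after an assumed known prefix) so the sketch runs instantly; the numbers are made up.

```python
import hashlib
import os


def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Attacker's precomputed table of unsalted hashes for a toy candidate range.
candidates = [f"012345{i:04d}" for i in range(10**4)]
table = {h(c.encode("ascii")): c for c in candidates}

number = "0123456789"
salt = os.urandom(16)
salted_hash = h(salt + number.encode("ascii"))

# The precomputed table no longer helps against the salted hash...
found_in_table = salted_hash in table

# ...but if the salt is known, redoing the work once recovers the number.
recovered = next(
    (c for c in candidates if h(salt + c.encode("ascii")) == salted_hash),
    None,
)
print(found_in_table, recovered)
```

If the salt were secret, the final loop would be impossible without also guessing the salt, which is the case the last paragraph describes.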

Kelly Trinh
  • 5
    Yes, this will avoid pre-calculation attacks using a salt, but this will not change the order of operations for brute forcing a limited set. And if you are applying a salt then using a password hash / PBKDF with an additional work factor adds to the complexity. – Maarten Bodewes Jul 01 '20 at 07:38
  • As @MaartenBodewes already correctly stated, this won't make it much more secure because a limited, short set is always very easy to crack. – dmuensterer Jul 01 '20 at 10:41
  • 5
    If you can keep the salt private you can just keep the hash private. – user253751 Jul 01 '20 at 10:46
  • 7
    A secret salt is called a "pepper". Of course, you might as well encrypt with a secret key instead of using a secret pepper if you can guess the telephone number by trying all options once you have the secret. But yeah, it can be used, and a pepper might be a good idea. – Maarten Bodewes Jul 01 '20 at 11:39
  • @dmuensterer depending on a length of a salt. First, salt is random different string for each phone number hashed (if it is secret fixed one, it's called pepper, as Maarten notes). So with good enough RNG and big enough salt, you can make it secure against precomputed rainbow tables (but not against brute forcing). Still, if they need them, it's better than storing them in plaintext, especially if there is a lot of them... – Matija Nalis Jul 01 '20 at 12:18
  • @Matija Nalis A rainbow is not needed because the set is so small that it is a piece of cake to brute force every single one. – dmuensterer Jul 01 '20 at 13:26
  • @dmuensterer true enough if he went for single-pass SHA1 or something similarly weak. As mentioned in other comments, using some good KDF (e.g. argon2, scrypt or whatever is current these days) increases per-attempt complexity significantly. – Matija Nalis Jul 01 '20 at 13:51
3

An alternative is to encrypt the phone number, as proposed in the previous answers. For example, the Mobile Connect identity service encrypts the MSISDN (i.e. the phone number) using a specific algorithm.

This GSMA specification gives information about decoding the payload :

Following are the example of encrypted MSISDN passed:

  • with URL encoding:

      login_hint=ENCR_MSISDN%3A0bb3020c7758f34e012da3f0bf13dc7674b3a9527
      6e804388d5aae4a034fe442a65e03027d0651da3b0646df6c11d3c5d6f46879480b
      623bd5024d9e0879727f46fbd1e8f5383a115678ea638a4ba5399a2dd37138246e
      db06718bb44be98f5331a1331902d6333993642e2f25197961ee0b0a14ddf66083
      4d49f7f385d82cad5a12003cd8aa235a92b71589110d76df382eab80b12a8dfa6d0
      5b4ca548538ac4b09a2868448957604eb52b1ceecc89dfe836e7113e51645c2a14f
      ff900228a8475983435647e88552a96eb692685b12abfc7ae0ad2bc23d30b3c8d82
      8ca101e186455b4d618a8c9022662ee1c5b8ffea40defdb92a20dce39bdbedcbf78
      5a2e
    
  • The serving operator recognises the input of the encrypted MSISDN and decodes the base64 encoded data.

  • The serving operator applies their private key to decode the RSA coded data.

  • The decrypted string looks like 441234567890|dasd23231139dskdeirirewr0234043ekewrwe4034c.

  • The serving operator then extracts the initial (numeric) portion of the decrypted data as the MSISDN separated by (|) pipe and uses this for any relevant purpose in API services/user sign-in.

In this case, it uses RSA encryption, and the private key is known only to the Mobile Network Operator.

So an implementation would be:

base64(encrypt(44123456789|randomString))
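The plaintext handling around that formula can be sketched with the Python standard library. This only covers building the `msisdn|randomString` payload and extracting the MSISDN again after decryption; the RSA step itself is deliberately left out, since it needs a cryptography library and the operator's key pair. The function names are hypothetical and not taken from the GSMA specification.

```python
import secrets


def build_payload(msisdn: str) -> bytes:
    """Plaintext to be encrypted: '<msisdn>|<random string>'."""
    if not (msisdn.isascii() and msisdn.isdigit()):
        raise ValueError("MSISDN must consist of ASCII digits only")
    # The random part makes each payload (and thus each ciphertext) differ.
    return f"{msisdn}|{secrets.token_hex(16)}".encode("ascii")


def extract_msisdn(decrypted: bytes) -> str:
    """Take the numeric portion before the pipe, as the operator would."""
    msisdn, separator, _ = decrypted.decode("ascii").partition("|")
    if separator != "|" or not msisdn.isdigit():
        raise ValueError("payload is not of the form '<msisdn>|<random>'")
    return msisdn


payload = build_payload("441234567890")
# Assumption: encryption, transport and decryption happen in between.
print(extract_msisdn(payload))
```

The random suffix matters: without it, encrypting the same number twice under a deterministic scheme would yield the same ciphertext and allow equality tests on the encrypted values.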