The Pure Crypto Hash
Remarks on Security

The following hash function is based on a user's public key modulus and on the basic function ModExp(A, B, C), which calculates A**B mod C. It produces a 256-bit hash value and applies ModExp() to every single character of the message text as a round function.

   LargePrime = 106603488380168454820927220360012878679207958575989291522270608237193062808643
   H_0 = Modulus mod LargePrime
   H_(i+1) = ((H_i + 19) XOR ModExp((H_i + H_0), (ord(character_i) + 19), LargePrime)) mod LargePrime
As the basic function ModExp() is multiplicative, the hash function has to be designed to be multiplication-free (see [And-93]), i.e. it should be computationally infeasible to find any x, y, z such that hash(x) * hash(y) equals hash(z). Therefore ModExp() is used in feed-forward mode and its output is XORed with the previous hash value, which depends on the chaining variable H_i and a single character of the message. Adding the small prime to the new hash value also contributes to breaking the multiplicative homomorphism of ModExp(). To avoid modular exponentiation with a zero exponent (which would result in a chaining variable of 1), a small prime (currently 19) is added to the exponent. In the next round the new hash value becomes the base of the exponentiation, with the following character as the exponent. It is inevitable that with a very small probability the output of ModExp() is 0 or 1; in both cases the next character being processed would be ignored. Including the lower 256 bits of the public key modulus in the base of the exponentiation prevents such an interruption of the exponentiation chain.
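
As a small illustration of the property the design has to destroy, the multiplicative homomorphism of plain modular exponentiation can be checked directly with Python's built-in pow(); the numbers a, b and e below are arbitrary and serve only as an example:

    # (a*b)^e mod p == ((a^e mod p) * (b^e mod p)) mod p holds for any a, b, e
    p = 106603488380168454820927220360012878679207958575989291522270608237193062808643
    a, b, e = 1234567, 7654321, 42
    assert pow(a * b, e, p) == (pow(a, e, p) * pow(b, e, p)) % p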

The hash function itself can be coded in very few lines of Python:

    def ModExp(Base, Exp, Mod):
        # square-and-multiply: computes Base**Exp mod Mod
        Hash = 1
        X = Exp
        Factor = Base
        while X > 0:
            Remainder = X % 2
            X = X // 2
            if Remainder == 1:
                Hash = Hash * Factor % Mod
            Factor = Factor * Factor % Mod
        return Hash

    def create_hash(Message, Modulus):
        # second factor from the RSA-155 challenge, factored Aug. 1999
        Prime = 106603488380168454820927220360012878679207958575989291522270608237193062808643
        SmallPrime = 19
        LastHash = 0
        Hash = HashModulus = Modulus % Prime
        for Character in Message:
            LastHash = Hash + SmallPrime
            Power = ModExp(Hash + HashModulus, ord(Character) + SmallPrime, Prime)
            Hash = (Power ^ LastHash) % Prime
        return Hash
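
A call could look like the following; the modulus shown is an arbitrary placeholder for illustration, not a real public key modulus:

    ExampleModulus = 123456789012345678901234567890123456789   # placeholder modulus
    print(hex(create_hash("Attack at dawn", ExampleModulus)))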

One-wayness

A hash function has to be a one-way function, so that you can easily obtain the pair (X, hash(X)) if you start with X, but not if you start with hash(X).

What would it take to make the hash invertible?

For very small plaintexts of only a few characters it would simply be more efficient to run an exhaustive search over the plaintext than to calculate roots mod prime to recover the chaining variable H_i. During the processing of every single character some information is thrown away in the modular exponentiation, and mixing the result with the hash value of the previous round makes it computationally infeasible to recover the chaining variables from the final 256-bit output.
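
A minimal sketch of such an exhaustive search, reusing create_hash() from above; the helper name, the printable search space and the use of a placeholder modulus are illustrative assumptions only:

    import itertools, string

    def find_short_preimage(target, modulus, length):
        # try every printable string of the given (very small) length
        for candidate in itertools.product(string.printable, repeat=length):
            text = "".join(candidate)
            if create_hash(text, modulus) == target:
                return text
        return None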

Collision freedom

A hash function should be collision-free, so that it is computationally infeasible to find two different inputs X and Y that hash to the same value (hash(X) = hash(Y)). If X is the text used for a digital signature, Y is called a preimage of hash(X) and could be used to forge the signature.

How hard is it to find a preimage for a given hash value?

Firstly, let us look at the effects of a possible manipulation of input bits across the border between two adjacent characters. A manipulation of a single character cannot be compensated by another one in a deliberate way, because the alteration of the first character will give a different remainder (mod prime) which is not only masked with the old chaining variable but is also used as the new base in the next round, propagating the change through the rest of the chain. It would be easier to attempt a compensation with the immediately following character than with any character further down the road; since the opponent knows H_i she can calculate the effect of an increased or decreased exponent, but she then has to compensate this effect using the new base with a different exponent, for which she has only 255 choices. This reduces the probability of a successful compensation (the new message giving the same hash value as the old one) drastically.
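
The compensation attempt can be sketched as follows; the helper function, the way the first character is altered and the placeholder modulus are illustrative assumptions, not part of PCP:

    def try_compensation(message, position, modulus):
        # change the character at 'position' and try all 256 values for the
        # following character, looking for the original hash value
        original = create_hash(message, modulus)
        altered = chr((ord(message[position]) + 1) % 256)
        changed = message[:position] + altered + message[position + 1:]
        for value in range(256):
            candidate = changed[:position + 1] + chr(value) + changed[position + 2:]
            if candidate != message and create_hash(candidate, modulus) == original:
                return candidate
        return None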

But is there a possible compensation by toggling some bits inside a given character?

Changing bits inside a character results in an additional factor base^k (where k is a small integer smaller than 256) which is not divisible by the prime used in the hash function. Because the base of the exponentiation is always smaller than twice the prime, there are no common factors and the additional factor cannot be a multiple of the prime. Thus the additional factor can never result in the same output of ModExp() and will always lead to a different hash value. The same argument holds for bit manipulations making the exponent smaller. As the new and old outputs of ModExp() differ only by a factor base^k, producing the same remainder would require that the prime divides the factor base^k, which is impossible because the base is smaller than twice the prime.

Pseudo-collisions

Inside the basic function ModExp() a small prime (currently 19) is used to prevent zero exponents that would otherwise ruin the hash chain when binary data is hashed. The small prime thus defines a lower boundary for the exponents used in ModExp(), making it possible to calculate small roots mod prime for selected chaining variables. If it were possible to find a different chaining variable X' that produces the same ModExp() output (X^19 mod prime) with the small prime as the exponent, one could substitute the current chaining variable X with X', pretending that up to the point where the zero character appears in the message a different message M' leads to the new chaining variable X', which would serve as a pseudo-collision for the rest of the message. But even if such a pseudo-collision X' were found, XORing X' with the output of ModExp() would not produce the same hash value as the original chaining variable X.

Clustering of hash values

Will every possible output value of the pure hash function (0 < hash < prime) appear as a hash value, or will hash values cluster and reduce the unpredictability of the hashes?

Giving a satisfying answer to this question is almost impossible, as a significant statistical analysis of the hash function's behaviour regarding the clustering of its output values cannot be performed, due to the sheer number of possible hash values. Although it is questionable whether empirical results obtained with much smaller primes (used to reduce the number of hash values) can be extrapolated to the full 256-bit hash space, some of the following observations indicate the hash function's ability to scatter its output values:

Having performed a series of statistical tests on the hash function with smaller primes and different input material, I will try to draw conclusions from the results, hoping that they shed some light on the quality of the hash function. I would like to present the details of my tests in a separate paper and explain only the most significant results here.

To reduce the number of possible hash values, small primes (31, 97, 997, 9973 and 32999) were used, and the input data was taken from an entropy file of 1110000 bytes read from /dev/random. Each block of 10 bytes was hashed and the number of collisions for each hash value was recorded. The data showed that some hash values attracted more collisions than average, but those exceeded the average by a factor of 3 to 4 only. Thus the most probable hash value was no more than 4 times more likely to receive a collision than the majority of others. The number of hash values that never got a hit grew larger the bigger the prime was and, consequently, the smaller the average number of collisions became. With the prime 32999 and 111000 random input texts, 1.5 percent of all hash values never appeared, while an average of 3.2 collisions was found for most of the hash values.
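
A sketch of this kind of experiment is given below; it assumes a variant of create_hash() that takes the prime as a parameter and hashes raw bytes, and it uses a placeholder modulus, so it is an illustration of the setup rather than the exact test harness behind the figures above:

    import os
    from collections import Counter

    def create_hash_small(MessageBytes, Modulus, Prime, SmallPrime=19):
        # same round function as create_hash(), but with a configurable prime
        # and raw bytes as input (iterating over bytes yields integers)
        Hash = HashModulus = Modulus % Prime
        for Byte in MessageBytes:
            LastHash = Hash + SmallPrime
            Power = pow(Hash + HashModulus, Byte + SmallPrime, Prime)
            Hash = (Power ^ LastHash) % Prime
        return Hash

    Counts = Counter()
    Modulus = 123456789012345678901234567890123456789      # placeholder modulus
    for _ in range(111000):
        Counts[create_hash_small(os.urandom(10), Modulus, 32999)] += 1
    print(max(Counts.values()), sum(Counts.values()) / len(Counts))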

The second result emerged from larger amounts of input. Again using the prime 32999, I found that with over a million inputs of 10 bytes of random data the maximum number of collisions was only twice as high as the average, while with only 111000 inputs the maximum was 4 times the average. One would expect that with more and more data fed into the hash function the difference between the maximum and the average number of collisions slowly disappears. Using ordinary text files taken from the Linux manual pages as inputs, I obtained almost the same results (for around 150000 inputs) with 80-character text strings as with 10 bytes of random data.

These observations are preliminary and should be taken with great caution, but for now I do not see any sign that the hash function does not scatter its output values properly.

The Dependence on the Public Key Modulus

Usually a hash function depends on the input text only, but the pure crypto hash depends on the user's public key modulus as well. On the one hand the modulus is used instead of a fixed initialisation vector, and on the other hand it is added to the chaining variable forming the base of the exponentiation in each round. In PCP the hash function is always used in the context of a user's public key, either to create a signature or to provide a chain of challenges for pre-encryption padding of plaintext. Because the hash function is only meaningful in connection with a particular public key, its modulus can be used in creating the hash value. Whenever a signature has to be verified or a ciphertext has to be decrypted, a public key is involved anyway and can therefore be part of the hashing process as well. The use of the modulus has the further advantage of making birthday attacks much more difficult, because finding a pair of different texts that hash to the same value would not only require a lot of precomputation but would also work for one particular public key only, so that the whole process has to be repeated for every other public key.

The fact that by chance two different messages (m1 and m2) can hash to the same value when hashed under two different public key moduli cannot be exploited for cheating, as long as the correct public keys are used for verification or decryption. Talking about the hash of a message is meaningless unless you know which key the hash was created for, so in PCP we write hash(message, key) instead of hash(message). Another advantage of the hash function's dependence on a public key modulus comes into play when we look at the way in which the secret key is protected for storage on the local computer system (see the relevant chapter for details).

PCP uses two different secret keys, one for signing (a long-term, high-security key) and one for encryption (which can be considerably shorter than the signing key). Even if you use the same passphrase to protect them, the different moduli ensure that each secret key is protected with a different storage key, because hash(passphrase, signingkey) differs from hash(passphrase, encryptionkey).
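
In terms of the code above this means the following; the two moduli are placeholders, and with overwhelming probability the two hash values differ:

    SigningModulus    = 123456789012345678901234567890123456789     # placeholder
    EncryptionModulus = 987654321098765432109876543210987654321     # placeholder
    assert create_hash("my passphrase", SigningModulus) != create_hash("my passphrase", EncryptionModulus)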

References

[And-93]
Ross Anderson: The Classification of Hash Functions.
In: Codes and Ciphers (Proceedings of the Fourth IMA Conference on Cryptography and Coding, December 1993), IMA (1995), pp. 83-93.