## Introduction

With the rapid growth of using electronic files, manually verifying the content of every file in a file system not only time consuming, but also lead to human-error during checking and therefore infeasible.

In the early days of computing forensics, verifying file integrity has played an important role. As the data stored in a suspected disk is vulnerable and retained for evidential use, forensic specialists are often required to acquire an exact mirror image of a suspect's disk drive for comprehensive examination. For this reason, a strong cryptographic hash function is required which can offer a useful and handy way for an examiner to verify data integrity.

There are several well-known hashing algorithms used in cryptography. These include the
 Message digest hash functions, MD2, MD4, and MD5, which are used for hashing messages into a shorter value called a message digest. The Secure Hash Algorithm (SHA), a standard algorithm, that makes a larger (160-bit) message digest and is similar to MD4.

## Mathematical theory of hash functions

The mathematical theories of hash functions guarantee the following properties:
 If a file F gives a hash value H1, then every single bit of H1 is a function of all bits of F. If a file F gives a hash value H1, then modifying F by a single bit will result in a totally different hash value. If a file F gives a hash value H1, then given another hash value H2 not equal to H1, it is computationally impossible to purposely modify part of F (such as modifying the last 10 bytes) such that the newly modified file will produce H2 as the hash value. The chance of two randomly selected files having the same hash value is extremely small. For example, the chance of two files have the same MD5 hash value, which has 128 bits, will be 1/(2^128), roughly equal to 1/(3.4 X 10^38), or roughly the chance of one in 340 billion billion billion billion. When compared with real life scenarios: the published chance of winning first prize in the Hong Kong Mark Six (the lotto game in Hong Kong which randomly pick 6 number from 1 to 47) is one in 10,737,573, and the published probability of winning the United States Pennsylvania Super 6 Lotto is one in 39 millions. Therefore, the chance of having two files with the same MD5 hash values is similar to the chance of winning 30,000 billion billion billion Hong Kong Mark Six first prize. The chance of two files having identical SHA-1 hash values is even smaller since a SHA-1 have value has 160 bits.

## Cryptographic Hash Algorithms: MD5 and SHA1

MD5
MD5 is a message digest algorithm that was developed by Professor Ronald L. Rivest of MIT. MD5 hash algorithm is used to verify data integrity, it takes as input a message of arbitrary length, and produces as output a 128-bit "fingerprint" or "message digest" of the input. It is conjectured that it is computationally infeasible to produce two messages having the same message digest, or to produce any message having a given prespecified target message digest.

The MD5 algorithm is designed to be quite fast on 32-bit machines. In addition, the MD5 algorithm does not require any large substitution tables; the algorithm can be coded quite compactly. The MD5 algorithm is an extension of the MD4 message-digest algorithm. MD5 is slightly slower than MD4, but is more "conservative" in design. MD5 was designed because it was felt that MD4 was perhaps being adopted for use more quickly than justified by the existing critical review; because MD4 was designed to be exceptionally fast, it is "at the edge" in terms of risking successful cryptanalytic attack.

For more detailed description on MD5, please refer to http://www.ietf.org/rfc/rfc1321.txt?number=1321.

SHA1
The Secure Hash Algorithm (SHA) was developed by NIST and is specified in the Secure Hash Standard (SHS, FIPS 180). Although slower than MD5, this larger digest size makes it stronger against brute force attacks. The Secure Hash Algorithm takes a message of less than 264 bits in length and produces a 160-bit message digest which is designed so that it should be computationally expensive to find a text which matches a given hash.

The SHA-1 is called secure because it is computationally infeasible to find a message which corresponds to a given message digest, or to find two different messages which produce the same message digest.

For more detailed description on SHA-1, please refer to http://www.ietf.org/rfc/rfc3174.txt?number=3174

## Hash Values Comparison Example

File Integrity Checking Application The following scenario and figure illustrate the most common way on how to use the uniqueness hash value to verify if files have been modified.

Scenario:

1. An officer has built a set of hash values for Folder A and save this set of values in a database (Hash Set A).

2. After a few months, he would like to verify if any unauthorized personnel have tampered his files.

3. He uses the same hash function to build a new hash set (Hash Set B) for Folder A again and use a file integrity tool to do a comparison between these two hash sets (Hash Set A and Hash Set B).

4. The mismatched results are reported after the comparison process.

figure 1: Hash Values Comparison Example