Hash functions

A gentle introduction to the backbone of data integrity.

Published May 4, 2026 in Cryptography · 12 min read

There is an operation that computers perform thousands of times a minute, yet most of us don’t know how it works. It’s an operation that is used

  • in all sorts of data structures, the most important of which is the hash table
  • every time you download a file (checksums)
  • on every blockchain
  • when you log in to something using your password
  • for generating pseudorandom numbers

I am talking about hash functions!

Despite how widely they are used, you may not have heard of hash functions. So, I embark on a goal to make hash functions the first thing you think of when someone says hash, not hashtag or hash brown (a nerd can dream).

Definition

A hash function is a function that takes in arbitrary-length input and produces a fixed-length output, often called the hash value or digest.

We will focus on a specific type of hash function with the following additional constraint: the function must always return the same hash value for the same input (i.e. it is deterministic).

Let’s see an example with a popular hash function called SHA-256:

"Hello World!"
→ 7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069

"Hello World! This is a longer sentence."
→ bd7f28d225df2728aa4964dfcad05ef955513972fed1e187e41d1275e67884a4
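You can reproduce the first digest above yourself with Python's standard-library hashlib module:

```python
import hashlib

# SHA-256 always produces a 32-byte (256-bit) digest, whatever the input length
digest = hashlib.sha256(b"Hello World!").hexdigest()
print(digest)
# 7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069
```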

Cryptographic hash functions

There are two key properties that make a hash function cryptographic, i.e. secure enough to be used for data integrity.

1. Collision-resistance

It is “difficult” to find 2 inputs that map to the same hash value.


Defining what “difficult” means is extremely important! There are many different notions of “difficult” in cryptography. For this article, we will define “difficult” as computationally difficult, meaning that trying to find a collision would probabilistically (since we can always get lucky) require far too many computations (e.g. 2^128 operations).
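To get a feel for how difficulty scales with output size, here is a small sketch (the helper name is mine): if we truncate SHA-256 to just 16 bits, a brute-force search finds a collision almost instantly, whereas for the full 256-bit output the birthday bound puts the expected cost around 2^128 attempts.

```python
import hashlib

def short_hash(data: bytes) -> str:
    # truncate SHA-256 to 16 bits (4 hex chars): only 65,536 possible outputs
    return hashlib.sha256(data).hexdigest()[:4]

seen: dict[str, int] = {}
collision = None
for i in range(70_000):  # pigeonhole guarantees a repeat within 65,537 tries
    h = short_hash(str(i).encode())
    if h in seen:
        collision = (seen[h], i)
        break
    seen[h] = i
print("collision found:", collision)
```

In practice the loop stops after a few hundred iterations, which matches the birthday bound of roughly 2^(n/2) tries for an n-bit hash.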

Hash tables care about this property too: when two keys/inputs have the same hash, that is called a collision, and collisions are bad because the table must do extra work, such as probing for another slot or chaining the colliding entries, to keep all of the values.
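As an illustrative sketch (not how Python's built-in dict works), a chained hash table puts colliding keys into the same bucket, so lookups in that bucket degrade to a linear scan:

```python
NUM_BUCKETS = 8
buckets = [[] for _ in range(NUM_BUCKETS)]

def put(key: str, value) -> None:
    # keys whose hashes agree mod NUM_BUCKETS collide into the same bucket
    buckets[hash(key) % NUM_BUCKETS].append((key, value))

def get(key: str):
    # on a collision the bucket holds several entries, so we scan it linearly
    for k, v in buckets[hash(key) % NUM_BUCKETS]:
        if k == key:
            return v
    return None

put("apple", 1)
put("banana", 2)
```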

2. Pre-image resistance

Formally: given some hash value y, it is difficult to find an input x such that the hash of x is y. We can intuitively think of this as the “one-wayness” property: it is easy to go from input to output but difficult to go from output to input (the reverse direction).

While not a direct requirement, we can intuitively think of these properties as being encompassed by the “avalanche” effect, where we want any change in the input to produce a completely different hash value such that the relationship between the input and its hash value is complex.
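We can measure the avalanche effect with a quick sketch: change one character of the input and count how many of SHA-256's 256 output bits flip. For a good hash, roughly half of them do.

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    # count differing bits: XOR the bytes, then count the 1 bits
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

h1 = hashlib.sha256(b"Hello World!").digest()
h2 = hashlib.sha256(b"Hello World?").digest()  # one character changed
print(bit_diff(h1, h2), "of 256 bits flipped")
```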

Let’s hash out the details

Let’s construct a new hash function, roughly based on the SHA-2 family (e.g. SHA-256) of hash functions. These are based on a general iterated hashing technique known as the Merkle–Damgård construction. This type of construction was used in the MD family of hash functions and also in SHA-0, SHA-1 and Whirlpool, among others! Our output will be 128 bits, or equivalently 16 bytes.

Padding & Blocking

First, dealing with arbitrary-size data usually means processing it either one byte at a time or in fixed-size chunks called blocks. We will use blocks, which means the input must be padded so its length is a multiple of the block size. A common technique is padding with null (zero) bytes.

Since our block size is 16 bytes and each character of Hello World! is 1 byte (these are ASCII characters, so 1 byte each in UTF-8), we have 12 bytes; with null padding, we get the block Hello World!\x00\x00\x00\x00, making the total length 16 bytes (12 original + 4 zero bytes).
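A minimal sketch of null padding (the helper name is mine). Note that it is ambiguous: an input that already ends in a zero byte pads to exactly the same block as one that doesn't.

```python
def null_pad(data: bytes, block_size: int = 16) -> bytes:
    # append zero bytes until the length is a multiple of block_size
    return data + b"\x00" * (-len(data) % block_size)

null_pad(b"Hello World!")      # b'Hello World!\x00\x00\x00\x00'
null_pad(b"Hello World!\x00")  # same padded block: ambiguous!
```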

However, this kind of padding is not generally secure and can be exploited in cryptographic attacks called padding attacks or padding oracle attacks. While these mainly target public-key encryption and block ciphers, it’s standard to use better padding schemes for all kinds of cryptographic functions.

One such padding technique is known as PKCS#7. It works by calculating the number of padding bytes required, and then each padding byte holds that number. For example, Hello World! becomes Hello World!\x04\x04\x04\x04, where \x04 is the value 4, since 4 bytes of padding were required. We will use PKCS#7 padding for our hash function.
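A minimal PKCS#7 sketch (this mirrors what PyCryptodome's pad does; the function name is mine). One subtlety: if the input already fills the last block exactly, a whole extra block of padding is added, which is what keeps unpadding unambiguous.

```python
def pkcs7_pad(data: bytes, block_size: int = 16) -> bytes:
    # between 1 and block_size padding bytes, each holding the padding length
    n = block_size - (len(data) % block_size)
    return data + bytes([n]) * n

pkcs7_pad(b"Hello World!")  # b'Hello World!\x04\x04\x04\x04'
```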

Here is an example spanning 4 blocks:

(figure: a padded input split into four 16-byte blocks)

Iterated Hashing

The idea behind our hashing function is that for each block, we run it through a “compression function” which takes in the previous hash value and the current block data and outputs the next hash value. This is known as Merkle–Damgård construction.

There are 2 main components:

  • a compression function f: it takes in two values and spits out one value (hence it compresses its inputs). Here, it takes in a hash value and a block of input data.
  • the initial hash value in the chain of hash values, called the IV (initial value). This must be fixed and public.

(figure: the Merkle–Damgård chain, where each compression step feeds its output hash into the next)

The key result proved by Merkle and Damgård was: if the compression function is collision resistant, then the constructed hash function is collision resistant.

The code for a Merkle–Damgård construction is pretty simple:

```python
def merkle_damgard_hash(input_bytes: bytes) -> bytes:
    H_i = IV
    for block in blocks:
        H_i = compression_fn(block, H_i)
    return H_i
```

Compression Function

Merkle–Damgård makes our life easier in the overall construction, but we still need a collision resistant compression function. Thankfully, the Davies–Meyer construction shows that you can create a collision resistant compression function from a block cipher (block ciphers are used for symmetric encryption/decryption, which I will cover in the future).

The key components are:

  • E_K(M) : this is a block cipher with key K that takes message M as input and spits out a ciphertext (the message converted to secret form)
  • ⊕ : represents a bitwise XOR operation (equivalent to doing bit addition with no carryovers)
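A quick check of the “bit addition with no carryovers” intuition, plus the property that makes XOR so useful in these constructions (applying it twice undoes itself):

```python
a, b = 0b1100, 0b1010
assert a ^ b == 0b0110   # each bit position added mod 2, carries discarded
assert (a ^ b) ^ b == a  # XOR is its own inverse
```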

According to Davies–Meyer, a collision resistant compression function is E_I(H) ⊕ H, where I is the input block and H is the previous hash value. Or, more simply, in code:

```python
def compression_fn(block, H_i):
    block_cipher = create_block_cipher(key=block)
    return xor(block_cipher.encrypt(H_i), H_i)
```

The only step left is figuring out what block cipher to use. The gold standard for block ciphers is the Advanced Encryption Standard (AES), so we will use that.

Lastly, we will make a small change to the Davies–Meyer approach to come up with a (trivially) “novel” construction called the JatinoHash!

Davies–Meyer computes E_I(H) ⊕ H; JatinoHash computes E_I(H) ⊕ H ⊕ I (just an extra XOR with the input block). In practice, this does not strengthen the security properties of the hash function and only makes it more expensive to compute, so it’s really only useful as a learning tool.

Putting it All Together

Here is the completed code, which uses the padding and AES implementations from the PyCryptodome package:

```python
#! pip install pycryptodome
from Crypto.Cipher import AES
from Crypto.Util.Padding import pad
from pathlib import Path

IV = b'a 16 byte string'
BLOCK_SIZE_BYTES = 16  # we are using 128-bit AES

# Python's XOR operator ^ works on ints, so we need a bytes-level XOR
def bytes_xor(a: bytes, b: bytes, c: bytes) -> bytes:
    return bytes(x ^ y ^ z for x, y, z in zip(a, b, c))

def jatino_compression_fn(input_block: bytes, previous_value: bytes) -> bytes:
    # ECB is basically doing independent single-block operations
    cipher = AES.new(input_block, AES.MODE_ECB)
    # slight modification to Davies-Meyer: extra XOR with the input block
    return bytes_xor(cipher.encrypt(previous_value), input_block, previous_value)

# follows the standard Merkle–Damgård construction
def jatino_hash(input_bytes: bytes) -> bytes:
    # pad input_bytes so its length % BLOCK_SIZE_BYTES == 0
    input_bytes = pad(input_bytes, BLOCK_SIZE_BYTES)
    # break the padded input into separate blocks of block size
    blocks = []
    for i in range(0, len(input_bytes), BLOCK_SIZE_BYTES):
        blocks.append(input_bytes[i:i + BLOCK_SIZE_BYTES])
    H_i = IV
    # iteratively apply the compression function
    for block in blocks:
        H_i = jatino_compression_fn(block, H_i)
    return H_i
```

Testing Our Hash Function

Calling .hex() on a bytes value just prints it as a hexadecimal string (hex string).

```python
jatino_hash(b'Hello World!').hex()  # 5d701ec2810d7d1121be1aacf6c2d058
jatino_hash(b'hello World!').hex()  # 493862a5b565375c34909d0ef5e02762
jatino_hash(b'Hello World!').hex()  # 5d701ec2810d7d1121be1aacf6c2d058
jatino_hash(Path("file1.txt").read_bytes()).hex()  # 723e419c468d6ba41b2bcbde7c31a3c8
jatino_hash(Path("file2.txt").read_bytes()).hex()  # c4d0484ad55fe2f55078b9eca26ac83c
```

We can observe the following:

  • the same input always gives the same hash (as desired)
  • even a single character change in the input completely changes the hash (as desired)

Sponge: A Different Approach

In recent times, a new paradigm has also appeared: the sponge construction. The SHA-3 family of hash functions is based on this new construction, and the idea behind it is well worth checking out.


I wanted to give a shout-out, since my curiosity in cryptography was sparked by taking the CO 487 course with Prof. Douglas Stebila at UWaterloo!