{
    "componentChunkName": "component---src-templates-blog-post-js",
    "path": "/hash-functions/",
    "result": {"data":{"contentfulBlog":{"slug":"hash-functions","title":"Hash functions","publishDate":"May 4, 2026","readTime":12,"topic":"Cryptography","tagline":{"tagline":"A gentle introduction to the backbone of data integrity."},"body":{"body":"There is an operation that computers like do 1000s of times a minute, yet most of us don’t know how it works. It’s an operations that is used\n\n- in all sorts of data structures. Most important of which is a hash table\n- every time you download a file ([checksums](https://en.wikipedia.org/wiki/Checksum))\n- on every block chain\n- when you try to login to something using your password\n- for generating pesudorandom numbers\n\nI am talking about **hash functions**!\n\nDespite the abundance of its use, you may not have heard of hash functions. So, I embark on a goal to make hash functions the first thing you think of when some one says hash and not hash tag or hash brown (a nerd can dream).\n\n# Definition\n\nA hash function is a function that takes in arbitrary length input and produce fixed length output, often called the hash value or digest. \n\nWe will focus on a specific type of hash functions which have the following additional constraint - the function should always return the same hash value for the same input.\n\nLet’s see an example with a popular hash function called SHA-256:\n\n`Hello World!`  → `7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069`\n\n`Hello World! This is a longer sentence.` → `bd7f28d225df2728aa4964dfcad05ef955513972fed1e187e41d1275e67884a4`\n\n# Cryptographic hash functions\n\nThese are 2 key properties that make a **cryptographic** hash function i.e. secure to be used for data integrity perspective.\n\n## 1. Collision-resistance\n\nIt is “difficult” to find 2 inputs that map to the same hash value.\n\n> Defining what is “difficult” is extremely important! There are many different types of “difficult” in cryptography. For this article, we will difficulty as being computationally difficult. Meaning that trying to find a collision would probabilistically (since we can always get lucky) require too many computations (e.g. 2^128 operations).\n> \n\nThis is a key property required by hash tables because when two keys/inputs have the same hash, that is called a **collision** and collisions are not good because we may have to move existing values in the hash tables.\n\n## 2. Pre-image resistance\n\nFormally, it means that suppose given some hash value *y,* then it is difficult to find an input x such that its hash value is y. We can intuitively think of this property as the “**one-wayness**” property as it says that it is easy to go from input to output but difficult to go from output to input (the reverse direction).\n\nWhile not a direct requirement, we can intuitively think of these properties as being encompassed by the “avalanche” effect, where we want any change in the input to produce a completely different hash value such that the relationship between the input and its hash value is complex.\n\n# Let’s hash out the details\n\nLet’s construct a new hash function, which is roughly based on the SHA-2 family (e.g SHA 256) of hash functions. These are based on a general iterated hashing technique known as Merkle–Damgård construction. This type of construction was used in the MD family of hash functions and also SHA-0, SHA-1 and Whirpool hash functions among others! Our output will be 128 bits or equivalently 16 bytes.\n\n## Padding & Blocking\n\nFirst, dealing with arbitrary size data usually means either processing the data one by one (one byte at a time) or in fixed-size chunks called **blocks**. A common padding technique is padding by adding null or 0 bytes. \n\nSince our block size is 16 bytes, when we have the input `Hello World!` , each character is 1 byte (in the simple UTF-8 encoding), then we have 12 bytes, so with null padding, we get the following block: `Hello World!\\x00\\x00\\x00\\x00` making the total length 16 bytes (12 original + 4 zero bytes).\n\nHowever, this kind of padding is not usually secure and can be exploited for cryptographic attacks called padding attacks or padding oracle attacks ([here’s](https://archiv.infsec.ethz.ch/education/fs08/secsem/bleichenbacher98.pdf) the most famous one). While, they are mainly used to target public-private key encryption and block ciphers, its standard to use better padding functions for all kinds of cryptographic functions.\n\nOne such padding technique is known as **PKCS#7,** and works by calculating the number of padding bytes required and then each padding holds the value of the number of padding bytes. Here is an example with `Hello World!` , it will become `Hello World!\\x04\\x04\\x04\\x04'` , where `\\x04` is the value 4 since 4 bytes of padding were required. We will be using the pkcs7 padding for our hash function.\n\nHere is an example spanning 4 blocks:\n\n![blocking](//images.ctfassets.net/q8k0ufon75o2/77hhNkWjdhYECLifMeMI3a/dcd7345b73f68e52882d4c211b4b0900/blocking.svg)\n\n## Iterated Hashing\n\nThe idea behind our hashing function is that for each block, we run it through a “compression function” which takes in the previous hash value and the current block data and outputs the next hash value. This is known as __Merkle–Damgård__ construction.\n\nThere are 2 main components:\n\n- a compression function *f* it takes in 2 values and spits out 1 value (hence its compresses the inputs). It takes in a hash value and a block of input data.\n- the initial hash value in the chain of hash values is called the IV (initial value). this must be fixed and public\n\n![merkle-damgard](//images.ctfassets.net/q8k0ufon75o2/eWgo1cTwv3lgL338wRifO/f217648e2af59cd5f64e20c67ece72c9/merkle-damgard.svg)\n\nThe key result proved by Merkle–Damgård was that: *if compression function is collision resistant then the constructed hash function is collision resistant*.\n\nThe code for a Merkle–Damgård construction is pretty simple:\n\n```python\ndef merkle_damgard_hash(input_bytes: bytes) -> bytes:\n    H_i = IV\n\n    for block in blocks:\n        H_i = compression_fn(block, H_i)\n\n    return H_i\n```\n\n## Compression Function\n\nMerkle–Damgård makes our life easier in the overall construction but now we still need to find a collision resistant  compression function. Thankfully, Davies-Meyer proved that you can create a create a collision resistant compression function using block ciphers (these are used for symmetric encryption/decryption which I will cover in the future).\n\nThe key components are:\n\n- `E_K(M)` : this is a block cypher which has a key `K` and takes message `M` as input and spits out a cipher text (cipher text is basically the message converted to secret form)\n- `⊕` : represents a bitwise XOR operation (equivalent to doing bit addition with no carryovers)\n\nAccording to __Davies-Meyer__, a collision resistant compression function is: `E_I(H) ⊕ H` where `I` is the input block and `H` is the previous hash value. Or more simply represented in code:\n\n```python\ndef compression_fn(block, H_i):\n    block_ciper = create_block_cipher(key=block)\n    return xor(block_cipher.encrypt(H_i), H_i)\n```\n\nThe only step left is figuring out what block cipher to use. The gold standard for block ciphers is the Advanced Encryption Scheme (AES). Thus, we will use that.\n\nLastly, we will make a small change to the davies-meyer approach to come up with a (trivially) “novel” construction called the **JatinoHash!**\n\nDavies-Meyer does the following: `E_I(H) ⊕ H` , JatinoHash does `E_I(H) ⊕ H ⊕ I` (so just an extra XOR with the input block). In practice, this does not strengthen the security properties of the hash function and only makes it more computationally expensive to calculate. So, its really only useful as a learning tool.\n\n## Putting it All Together\n\nHere is the completed code which uses padding function and AES implementations from the [PyCryptodome package](https://www.pycryptodome.org/):\n\n```python\n#! pip install pycryptodome\n\nfrom Crypto.Cipher import AES\nfrom Crypto.Util.Padding import pad\nfrom pathlib import Path\n\nIV = b'a 16 byte string'\nBLOCK_SIZE_BYTES = 16   # we are using 128 bit AES\n\n# python XOR operator ^ is between bits, so need a bytes level XOR\ndef bytes_xor(a: bytes, b: bytes, c: bytes) -> bytes:\n    return bytes(x ^ y ^ z for x,y,z in zip(a,b,c))\n\ndef jatino_compression_fn(input_block: bytes, previous_value: bytes):\n    cipher = AES.new(input_block, AES.MODE_ECB)   # ECB is basically doing independent single block operations\n    return bytes_xor(cipher.encrypt(previous_value), input_block, previous_value)  # slight modification to Davies meyer\n\n# follows standard Merkle–Damgård construction\ndef jatino_hash(input_bytes: bytes) -> bytes:\n    # pad input_bytes so its length % BLOCK_SIZE_BYTES = 0\n    input_bytes = pad(input_bytes, BLOCK_SIZE_BYTES)\n    # break input bytes into separate blocks of block size\n    blocks = []\n    for i in range(0, len(input_bytes), BLOCK_SIZE_BYTES):\n        blocks.append(input_bytes[i:i+BLOCK_SIZE_BYTES])\n\n    H_i = IV\n\n    # iteratively apply compression function\n    for block in blocks:\n        H_i = jatino_compression_fn(block, H_i)\n\n    return H_i\n```\n\n## Testing Our Hash Function\n\nCalling `.hex()` on a bytes variable just prints it out as a hexadecimal number strin (hex string)\n\n```python\njatino_hash(b'Hello World!').hex() # 5d701ec2810d7d1121be1aacf6c2d058\njatino_hash(b'hello World!').hex() # 493862a5b565375c34909d0ef5e02762\njatino_hash(b'Hello World!').hex() # 5d701ec2810d7d1121be1aacf6c2d058\n\njatino_hash(Path(\"file1.txt\").read_bytes().hex()) # 723e419c468d6ba41b2bcbde7c31a3c8\njatino_hash(Path(\"file2.txt\").read_bytes().hex()) # c4d0484ad55fe2f55078b9eca26ac83c\n```\n\nWe can observe the following:\n\n- the same input always gives the same hash (as desired)\n- even a single character change in the input completely changes the hash (as desired)\n\n---\n\n# Sponge: A Different Approach\n\nIn recent times, a new paradigm has also appeared - the **sponge construction**. The SHA-3 family of hash functions is based on this new construction. You can check out the idea behind sponge construction [here](https://keccak.team/sponge_duplex.html).\n\n---\n*I wanted to give a shout out since my curiosity in cryptography was sparked by taking CO 487 course with [Prof. Douglas Stebila](https://www.douglas.stebila.ca/) at UWaterloo!*"}}},"pageContext":{"slug":"hash-functions"}},
    "staticQueryHashes": ["3000541721"]}