{
    "componentChunkName": "component---src-templates-blog-post-js",
    "path": "/yapping-concisely/",
    "result": {"data":{"contentfulBlog":{"slug":"yapping-concisely","title":"Yapping Concisely","publishDate":"May 19, 2026","readTime":14,"topic":"Research","tagline":{"tagline":"2 basic approaches for text compression."},"body":{"body":"While we want more powerful devices, bigger screens, more storage, etc., I think we can all agree: the smaller the files, the better. Whether it allows us to save a lot more things on a storage device, or communicate over low bandwidth, **compression** is one of the building blocks of our digital reality.\n\nSo, today let’s learn and implement 2 basic text compression techniques.\n\n---\n\n# Can you handle the loss?\n\nThere are 2 fundamental categories of compression techniques:\n\n## 1. Lossy compression\n\nThis is commonly used in images and videos. An example you may be very familiar with is the JPEG (or JPG) format. For images, video and audio, losing some information may be ok. A change in one pixel’s value will likely not stop you from recognizing whether the image is of a cat or a dog.\n\n## 2. Lossless compression\n\nAs the name implies, lossless compression is where we don’t discard any information from the original input when we do the compression. This is the main technique for text compression because even a change of one character can completely alter the meaning of a message: “Look a flying mat!” vs “Look a flying bat!”.\n\n---\n\nLet’s take a look at 2 different techniques that arise from looking at the problem from different perspectives.\n\n# Run Length Encoding (RLE)\n\n> RLE asks *how can we compress sequences of repeating characters*?\n> \n\nWhich of these 2 strings looks easier to compress?\n\n1. `AAAAAAAAAAAAABBBBBBBB`\n2. `ABAABAAABAAAABAAABBBB`\n\nIntuitively, we think that the first string is easier to compress. That is because it’s just a sequence of `A`s followed by a sequence of `B`s. All we would need to do is indicate the length of the run for each letter. 
So the compressed text would be `A13B8`.\n\nThis is the foundation of the Run Length Encoding technique!\n\nHere is a very simple implementation I came up with, which just keeps track of the previous character and keeps adding to a counter while the current character matches the previous one:\n\n```python\ndef compress(input: str) -> str:\n    compressedStr = \"\"\n    prevChar = None\n    charCount = 0\n\n    for char in input:\n        if prevChar != char:\n            # a new run starts: flush the previous run (if any)\n            if prevChar is not None:\n                compressedStr += f\"{prevChar}{charCount}\"\n            prevChar = char\n            charCount = 0\n        charCount += 1\n\n    # flush the final run (skipped for empty input)\n    if prevChar is not None:\n        compressedStr += f\"{prevChar}{charCount}\"\n\n    return compressedStr\n```\n\nAnd the corresponding decode method:\n\n```python\ndef decompress(input: str) -> str:\n    decompressedStr = \"\"\n\n    i = 0\n    while i < len(input):\n        # each run is one character followed by its decimal count\n        char = input[i]\n        charCount = \"\"\n        j = i + 1\n\n        while j < len(input) and input[j].isdigit():\n            charCount += input[j]\n            j += 1\n\n        decompressedStr += (char * (int(charCount)))\n        i = j\n\n    return decompressedStr\n```\n\nAs you may have figured, this implementation does not work if the input text contains any digits! There are many techniques to work around this, like adding a special escape character (which must not appear in the input) between the count and the character, as well as storing the count before the character. E.g. `A12B` → `1#A1#11#21#B` where `#` is the special escape character being used.\n\nLet’s test it out on the 2 strings we considered above:\n\n```python\ncompress(\"AAAAAAAAAAAAABBBBBBBB\") # A13B8\ncompress(\"ABAABAAABAAAABAAABBBB\") # A1B1A2B1A3B1A4B1A3B4\ndecompress(compress(\"AAAAAAAAAAAAABBBBBBBB\")) # AAAAAAAAAAAAABBBBBBBB\n```\n\nWe can see that RLE actually makes the text LONGER if there are not enough runs of the same character, since space is taken up by storing the count of each run. E.g. 
`ABAB` → `A1B1A1B1` ends up making the text double the original length! So much for compression!\n\nYou can imagine that this technique can be quite useful for images where you might have a bunch of repeated pixels!\n\n# Huffman Coding\n\n> Huffman Coding asks *what is the most efficient representation if each appearing character must be mapped to bits?*\n> \n\nHuffman coding shifts focus from just looking at the characters themselves to how the characters are stored. Characters like `A`, `B`, `C`, etc. are all stored by mapping to a sequence of bits. Typically, these simple characters are 1 byte (8 bits) long. For example, according to the ASCII encoding, `A` is `01000001` (65) and `B` is `01000010` (66).\n\nNow with this knowledge, we can treat the input as a stream of bytes, where each byte is a character. So, let’s go back to both our dummy example strings and represent them as bytes.\n\nThe string `AAAAAAAAAAAAABBBBBBBB` is stored as the following:\n\n`01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000010 01000010 01000010 01000010 01000010 01000010 01000010 01000010`\n\nAn intuitive compression to store these bits is to assign `A` = 0 and `B` = 1. In other words 0 is the **code** of `A`.\n\nSo, the long bit sequence above just becomes `000000000000011111111`. If I just gave this bit sequence to someone, they would have no idea what it represents, so we also need to store a table of which character is mapped to which bit sequence, i.e. for this example: `A` = 0 and `B` = 1.\n\nThis also seems to be better than the RLE approach, as `ABAABAAABAAAABAAABBBB` can be mapped to `010010001000010001111`, which is far shorter than the result from RLE (using the same codes as for the previous input string).\n\nThis looks easy enough! … Hold your horses. What if we want to encode the string: `AAAABBBCC`?\n\nWe already have a code for `A` and `B`, so maybe we can just make `C` = `01`. 
This seems intuitive as `A` and `B` occur more frequently in the input, so they should have shorter codes than `C`, which occurs less frequently.\n\nSo our string becomes: `00001110101`. If you give this to our decoding function, how does it know whether the string was `AAAABBBABAB` or `AAAABBBABC` or `AAAABBBCC`? Thus, we can’t just come up with the encodings on a whim.\n\nMore formally, Huffman coding is built on 2 key ideas:\n\n- We want to assign the shorter codes to the most frequent characters\n- We want the codes to form a **prefix code**. A prefix code means that no code is the prefix (the beginning) of another code. In our example, the code for `A` was a prefix of our code for `C`.\n\n## Huffman Tree\n\nThe technique developed by David Huffman is based on building up a special binary tree - called a Huffman Tree! In this tree, each leaf represents a character, and the sequence of directions taken from the root to reach that leaf (left = 0, right = 1) is the Huffman code for that character. \n\nHere is the Huffman Tree produced from `AAAABBBCC` :\n\n![abc huffman](//images.ctfassets.net/q8k0ufon75o2/5mRuj2yeVuHViUac0GALPj/53b171456b1f222da2ca6452463d31c3/abc_huffman.svg)\n\n*In this diagram the empty circles are regular nodes (not representing a character), and the filled circles are leaf nodes which represent an actual character in the input string.*\n\nBy only having leaves represent characters, we guarantee that the codes form a prefix code, since no character node can have descendants representing a different character. \n\nSecondly, we want leaf nodes representing more frequently occurring characters to be closer to the root (at a lower level), so that they have a shorter distance to the root and hence shorter Huffman codes. We can achieve this using a bottom up construction where we construct the leaf nodes with the furthest distance from the root first!\n\n## Compression\n\nSo, here is how we will construct the Huffman tree:\n\n1. 
Go through the input and make a list of characters and their frequency of occurrence in the input text.\n\nIn a loop do the following until only 1 entry remains in the list:\n\n1. Take the 2 least frequent entries out of the list and make them the left and right children of a new subtree.\n2. Insert that subtree back into the list as a new entry whose frequency is the sum of the frequencies of the 2 entries we took out.\n\nOnce we have just 1 entry in the list, that entry is the root node of the tree. So we can see that we build the tree from the bottom up by combining the least frequently occurring characters first and working up to the most frequent.\n\nWhen doing this in code, we make use of a data structure called a heap (a min-heap here), which always keeps the smallest element at the front and makes repeatedly taking the least frequent entries efficient. So here is the code to generate the tree:\n\n```python\nfrom heapq import heappop, heappush\n\nclass HuffmanNode:\n    def __init__(self, char: str | None, freq: int):\n        self.char = char  # None for internal (joining) nodes\n        self.freq = freq\n        self.left = None\n        self.right = None\n\n    # < (less than) operator is used by heapq to do ordering\n    def __lt__(self, other):\n        return self.freq < other.freq\n\ndef generate_huffman_tree(inputStr: str):\n    # calculate character frequencies\n    frequencies: dict = {}\n    for char in inputStr:\n        frequencies[char] = frequencies.get(char, 0) + 1\n\n    # create heap\n    heap = []\n    for char, frequency in frequencies.items():\n        heappush(heap, HuffmanNode(char, frequency))\n\n    # repeat removing 2 items from heap\n    while len(heap) >= 2:\n        nodeLeft = heappop(heap)\n        nodeRight = heappop(heap)\n\n        joiningNode = HuffmanNode(None, nodeLeft.freq + nodeRight.freq)\n        joiningNode.left = nodeLeft\n        joiningNode.right = nodeRight\n        heappush(heap, joiningNode)\n\n    return 
heappop(heap)\n```\n\nNow, we can just create the dictionary from the Huffman tree and compress an input string:\n\n```python\nfrom dataclasses import dataclass\n\n@dataclass\nclass CompressedText:\n    text: str\n    huffmanCode: dict[str, str]\n\ndef compress(inputStr: str) -> CompressedText:\n    assert inputStr != \"\"\n\n    # create huffman tree\n    tree = generate_huffman_tree(inputStr)\n\n    # convert huffman tree to dictionary\n    huffmanCode = dict()\n    tree_to_dict(tree, huffmanCode, \"\")\n\n    # use dictionary to convert input to compressed text\n    output = \"\"\n    for char in inputStr:\n        output += huffmanCode[char]\n\n    return CompressedText(output, huffmanCode)\n\ndef tree_to_dict(node: HuffmanNode, codes: dict[str, str], path: str):\n    if node.char is not None:\n        # a lone root leaf (only 1 distinct character) has an empty path, so fall back to \"0\"\n        codes[node.char] = path or \"0\"\n\n    if node.left:\n        tree_to_dict(node.left, codes, path + \"0\")\n    if node.right:\n        tree_to_dict(node.right, codes, path + \"1\")\n```\n\n## Decompression\n\nDecompression is very straightforward once we have the dictionary mapping characters to Huffman codes, as we can just reverse it and do dictionary lookups. Because no code is a prefix of another, the first match we find is always the right one:\n\n```python\ndef decompress(compressed: CompressedText) -> str:\n    i = 0\n    output = \"\"\n    inverse_dict = {v: k for k, v in compressed.huffmanCode.items()}\n\n    while i < len(compressed.text):\n        j = i + 1\n\n        # find matching pattern\n        while compressed.text[i:j] not in inverse_dict:\n            j += 1\n\n        output += inverse_dict[compressed.text[i:j]]\n\n        i = j\n\n    return output\n```\n\n## A Real World Example\n\nLet’s run our implementation of Huffman Coding on a text file containing the contents of an English translation of *The Prince* by Nicolo Machiavelli (obtained from [Project Gutenberg](https://www.gutenberg.org/files/1232/1232-h/1232-h.htm#chap00)):\n\n```python\nfrom pathlib import Path\n\ncompressed = compress(Path('the_prince.txt').read_text())\n```\n\nThe generated Huffman codes 
are:\n\n```python\n{\n    \" \": \"000\",\n    \"r\": \"00100\",\n    \"u\": \"001010\",\n    \"\\n\": \"001011000\",\n    \"“\": \"001011001000\",\n    \"8\": \"00101100100100\",\n    \")\": \"00101100100101\",\n    \"0\": \"0010110010011\",\n    \"O\": \"00101100101\",\n    \"C\": \"0010110011\",\n    \"U\": \"0010110100000\",\n    \"Y\": \"0010110100001\",\n    \"W\": \"001011010001\",\n    \"1\": \"00101101001\",\n    \"S\": \"0010110101\",\n    \"I\": \"001011011\",\n    \"z\": \"00101110000\",\n    \"M\": \"00101110001\",\n    \"R\": \"0010111001\",\n    \"j\": \"00101110100\",\n    \"(\": \"00101110101000\",\n    \"Q\": \"001011101010010\",\n    \"Æ\": \"0010111010100110\",\n    \"‘\": \"0010111010100111\",\n    \"9\": \"0010111010101\",\n    \":\": \"001011101011\",\n    \"E\": \"0010111011\",\n    \".\": \"00101111\",\n    \"t\": \"0011\",\n    \"m\": \"010000\",\n    \"f\": \"010001\",\n    \"G\": \"01001000000\",\n    \"4\": \"010010000010\",\n    \"5\": \"010010000011\",\n    \"F\": \"0100100001\",\n    \";\": \"010010001\",\n    \"A\": \"010010010\",\n    \"]\": \"01001001100\",\n    \"[\": \"01001001101\",\n    \"N\": \"0100100111\",\n    \"v\": \"0100101\",\n    \"w\": \"010011\",\n    \"o\": \"0101\",\n    \"d\": \"01100\",\n    \",\": \"011010\",\n    \"y\": \"011011\",\n    \"a\": \"0111\",\n    \"i\": \"1000\",\n    \"n\": \"1001\",\n    \"V\": \"10100000000\",\n    \"L\": \"10100000001\",\n    \"H\": \"1010000001\",\n    \"—\": \"1010000010000\",\n    \"7\": \"10100000100010\",\n    \"?\": \"10100000100011\",\n    \"6\": \"1010000010010\",\n    \"J\": \"1010000010011\",\n    \"D\": \"10100000101\",\n    \"B\": \"1010000011\",\n    \"k\": \"10100001\",\n    \"T\": \"101000100\",\n    \"P\": \"1010001010\",\n    \"X\": \"101000101100\",\n    \"-\": \"101000101101\",\n    \"K\": \"101000101110\",\n    \"’\": \"101000101111\",\n    \"q\": \"1010001100\",\n    \"2\": \"101000110100\",\n    \"3\": \"101000110101\",\n    \"”\": \"10100011011\",\n    
\"x\": \"101000111\",\n    \"p\": \"101001\",\n    \"l\": \"10101\",\n    \"h\": \"1011\",\n    \"e\": \"110\",\n    \"s\": \"1110\",\n    \"b\": \"111100\",\n    \"g\": \"111101\",\n    \"c\": \"11111\",\n}\n\n```\n\nWe can see that the codes align pretty closely with our intuition. The shortest code is given to the SPACE character since it’s used the most frequently. The longest codes are given to rarely used characters like `Æ`, `6`, `7` (hehe).\n\nWhen measuring the compression level, we see that the original text is `1453672` bits (each character is assumed to be 1 byte/8 bits). The compressed text uses just `799781` bits. This is a 45% reduction.\n\nWith this larger example, I hope you can see how the compression is much better (for reference, RLE nearly doubles the size for *The Prince* example, so almost a -100% reduction). There is, however, a tradeoff between compression and runtime performance. For a simpler method like RLE, we just do one linear scan of the input sequence and we are done. Huffman coding has greater time and space complexity, since we have to compute the frequency dict and the Huffman tree.\n\n---\n\n# Not Good Enough\n\nThe techniques covered in this article are rarely used on their own today. This is because character-level compression alone just isn’t good enough. Modern compression techniques also need to find patterns and find a more compressed representation for those patterns. Originally, this article was also supposed to cover the LZW algorithm, which is concerned with finding patterns. However, it turned out to be quite a bit more complicated than the techniques mentioned here. To avoid making a really long article, I am working on writing a dedicated article to show the evolution of the LZ77 algorithm to LZ78 and then LZW!"}}},"pageContext":{"slug":"yapping-concisely"}},
    "staticQueryHashes": ["3000541721"]}