Yapping Concisely

2 basic approaches for text compression.

Published May 19, 2026in Research14 min read

While we want more powerful devices, bigger screens, storage, etc. I think we can all agree - the smaller the size of files the better. Whether it allows us to save a lot more things on a storage device, or communicate in low bandwidth, compression is one of the building blocks of our digital reality.

So, today let’s learn and implement 2 basic text compression techniques.

Can you handle the loss?

There are 2 fundamental categories of compression techniques:

1. Lossy compression

This is commonly used in images and videos. An example you may be very familiar with are JPEGs or JPGs. For videos and audio, losing some information may be ok. A change in one pixels value will likely not stop you from recognizing whether the image is of a cat or a dog.

2. Lossless compression

As the name implies, lose less compression is where we don’t discard any information from the original input when we do the compression. This is the main technique for text compression because even a change of one character can completely alter the meaning of a message: “Look a flying mat!” vs “Look a flying bat!”.

Let’s take a look at 2 different techniques that arise from looking at the problem from different perspectives.

Run Length Encoding (RLE)

RLE asks how can we compress sequences of repeating characters?

Which of these 2 sentence looks easier to compress and more compressible?

AAAAAAAAAAAAABBBBBBBB
ABAABAAABAAAABAAABBBB

Intuitively, we think that the first sentence is easier to compress and more compressible. It’s because its just a sequence of As followed by a sequence of Bs. All we would need to do is indicate the length of the sequences for each letter. So the compressed text would be A13B8 .

This is the foundation of the Run Length Encoding technique!

Here is a very simple implementation I came up with which just keeps track of the previous character and keeps adding to a counter if current character matches previous character:

def compress(input: str) -> str:
    compressedStr = ""
    prevChar = None
    charCount = 0

    for char in input:
        if prevChar != char:
            if prevChar is not None:
                compressedStr += f"{prevChar}{charCount}"
            prevChar = char
            charCount = 0
        charCount += 1

    compressedStr += f"{prevChar}{charCount}"

    return compressedStr

And the corresponding decode method:

def decompress(input: str) -> str:
    decompressedStr = ""

    i = 0
    while i < len(input):
        char = input[i]
        charCount = ""
        j = i + 1

        while j < len(input) and input[j].isdigit():
            charCount += input[j]
            j += 1

        decompressedStr += (char * (int(charCount)))
        i = j

    return decompressedStr

As you may have figured, this implementation does not work if the input text contains any numbers! There are many techniques to work around this like adding a special escape character (must not appear in the input) between the count and the character as well storing the count before the character. E.g. A12B → 1#A1#11#21#B where # is the special escape character being used.

Let’s test it out on the 2 strings we considered above:

compress("AAAAAAAAAAAAABBBBBBBB") # A13B8
compress("ABAABAAABAAAABAAABBBB") # A1B1A2B1A3B1A4B1A3B4
decompress(compress("AAAAAAAAAAAAABBBBBBBB")) # AAAAAAAAAAAAABBBBBBBB

We can see that RLE actually makes the text LONGER if there are not enough runs of the same character since space is taken up by storing the count of each character. E.g. ABAB → A1B1A1B1 ends up making the text double the original length! So much for compression!

You can imagine that this technique can be quite useful for images where you might have a bunch of repeated pixels!

Huffman Coding

Huffman Coding asks what is the most efficient representation if each appearing character must be mapped to bits?

Huffman coding shifts focus from just looking at the characters themselves to how the characters are stored. Characters like A, B, C, etc are all stored by mapping to a sequence of bits. Typically, these simple characters are 1 byte (8 bits long). For example, according to the ASCII encoding, A is 01000001 (65) and a is 01000010 (66).

Now with this knowledge, we can treat the input as a stream of bytes, where each byte is a character. So, lets go back to both our dummy example strings and represent them as bytes.

The string AAAAAAAAAAAAABBBBBBBB is stored as the following:

01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000001 01000010 01000010 01000010 01000010 01000010 01000010 01000010 01000010

An intuitive compression to store these bits is to assign A = 0 and B = 1. In other words 0 is the code of A.

So, the long bit sequence above just becomes 000000000000011111111 . If I just gave this bit sequence to someone, they would have no idea what it represents so we need to also store a table of what character is mapped to which bit sequence i.e. for this example: A = 0 and B = 1.

This is also seems to better than the RLE approach as ABAABAAABAAAABAAABBBB can be mapped 010010001000010001111 which is far better than result from RLE (using the same character encoding in the previous input string).

This looks easy enough! … Hold your horses. What if we want to encode the string: AAAABBBCC ?

We already have a code for A and B , so maybe we can just make C = 01 . This seems intuitive as A , B occur more frequently in the input so it should have shorter encoding than C which occurs less frequently.

So our string becomes: 0000111010101 . If you give this to our decoding function how does it know whether the string was AAABBBABABAB or AAAABBBABC or AAAABBBCC ? Thus, we can’t just come up with the encodings on a whim.

More formally, Huffman coding is built on 2 key ideas:

We want to assign the smaller/shorter codes to the most frequent characters
We want all codes to be prefix code. Prefix code means that no code is the prefix (the beginning) of another code. In our example, the code for A was a prefix for our code for C .

Huffman Tree

The technique developed by David Huffman is based on building up a special binary tree - called a Huffman Tree! In this the leaves of the tree each represents a character and the direction taken to get to that leaf (left = 0, right = 1) is the Huffman code for that character.

Here is the Huffman Tree produced from AAAABBBCC :

abc huffman

In this diagram the empty circles are regular nodes (not representing a character), and the filled circles are leaf nodes which represent an actual character in the input string.

By only having leaves represent characters, we guarantee that all the codes are prefix codes, since, no character node can have descendants representing a different character.

Secondly, we want leaf nodes representing more frequently occurring characters to be be closer to the root (at a lower level), so that they have shorter distance to the root and hence, have shorter Huffman codes. We can achieve this using a bottom up construction where we construct the leaf nodes with the furthest distance from the root first!

Compression

So, here is how we will construct the Huffman tree:

Go through the input and make a list of characters and their frequency of occurrence in the input text.

In a loop do the following until only 1 character remains in the list:

Take the 2 least frequently occurring characters in the list and insert them into a subtree where each character is the left or right leaf.
Sum up the frequencies of the 2 characters we took out and insert a new character in the list that is the sum of the frequencies of the 2 characters

Once we have just 1 character in the list, that is the “root” character or root node in the tree. So we can see that we build the tree up from the bottom up by combining the least frequently occurring characters and building up to the most frequent.

When doing this in code, we make use of a data structure called a heap which maintains a sorted order of elements and makes repeatedly taking the least frequent characters efficient. So here is the code to generate the tree:

from heapq import heappop, heappush

class HuffmanNode:
    def __init__(self, char: str | None, freq: int,):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    # < (less than) operator is used by heapq to do ordering
    def __lt__(self, other):
        return self.freq < other.freq

def generate_huffman_tree(inputStr: str):
    # calculate character frequencies
    frequencies: dict = {}
    for char in inputStr:
        frequencies[char] = frequencies.get(char, 0) + 1

    # create heap
    heap = []
    for char, frequency in frequencies.items():
        heappush(heap, HuffmanNode(char, frequency))

    # repeat removing 2 items from heap
    while len(heap) >= 2:
        nodeLeft = heappop(heap)
        nodeRight = heappop(heap)

        joiningNode = HuffmanNode(None, nodeLeft.freq + nodeRight.freq)
        joiningNode.left = nodeLeft
        joiningNode.right = nodeRight
        heappush(heap, joiningNode)

    return heappop(heap)

Now, we can just create the dictionary from the Huffman tree and compress an input string:

from dataclasses import dataclass

@dataclass
class CompressedText:
    text: str
    huffmanCode: dict[str, str]

def compress(inputStr: str) -> CompressedText:
    assert inputStr != ""

    # create huffman tree
    tree = generate_huffman_tree(inputStr)

    # convert huffman tree to dictionary
    huffmanCode = dict()
    tree_to_dict(tree, huffman_code, "")

    # use tree to convert input to compressed text
    output = ""
    for char in inputStr:
        output += huffmanCode[char]

    return CompressedText(output, huffmanCode)

def tree_to_dict(node: HuffmanNode, dict: dict[str, str], path: str):
    if node.char is not None:
        dict[node.char] = path

    if node.left:
        tree_to_dict(node.left, dict, path + "0")
    if node.right:
        tree_to_dict(node.right, dict, path + "1")

Decompression

Decompression is very straightforward once we have the dictionary mapping input to Huffman code as we can just reverse it and do dictionary look ups:

def decompress(compressed: CompressedText) -> str:
    i = 0
    output = ""
    inverse_dict = {v: k for k, v in compressed.huffman_code.items()}

    while i < len(compressed.text):
        j = i + 1

        # find matching pattern
        while compressed.text[i:j] not in inverse_dict:
            j += 1

        output += inverse_dict[compressed.text[i:j]]

        i = j

    return output

A Real World Example

Let’s run our implementation of Huffman Coding on a text file containing the contents of a English translation of The Prince by Nicolo Machiavelli (obtained from Project Gutenberg):

compressed = compress(Path('the_prince.txt').read_text())

The generated Huffman codes are:

{
    " ": "000",
    "r": "00100",
    "u": "001010",
    "\n": "001011000",
    "“": "001011001000",
    "8": "00101100100100",
    ")": "00101100100101",
    "0": "0010110010011",
    "O": "00101100101",
    "C": "0010110011",
    "U": "0010110100000",
    "Y": "0010110100001",
    "W": "001011010001",
    "1": "00101101001",
    "S": "0010110101",
    "I": "001011011",
    "z": "00101110000",
    "M": "00101110001",
    "R": "0010111001",
    "j": "00101110100",
    "(": "00101110101000",
    "Q": "001011101010010",
    "Æ": "0010111010100110",
    "‘": "0010111010100111",
    "9": "0010111010101",
    ":": "001011101011",
    "E": "0010111011",
    ".": "00101111",
    "t": "0011",
    "m": "010000",
    "f": "010001",
    "G": "01001000000",
    "4": "010010000010",
    "5": "010010000011",
    "F": "0100100001",
    ";": "010010001",
    "A": "010010010",
    "]": "01001001100",
    "[": "01001001101",
    "N": "0100100111",
    "v": "0100101",
    "w": "010011",
    "o": "0101",
    "d": "01100",
    ",": "011010",
    "y": "011011",
    "a": "0111",
    "i": "1000",
    "n": "1001",
    "V": "10100000000",
    "L": "10100000001",
    "H": "1010000001",
    "—": "1010000010000",
    "7": "10100000100010",
    "?": "10100000100011",
    "6": "1010000010010",
    "J": "1010000010011",
    "D": "10100000101",
    "B": "1010000011",
    "k": "10100001",
    "T": "101000100",
    "P": "1010001010",
    "X": "101000101100",
    "-": "101000101101",
    "K": "101000101110",
    "’": "101000101111",
    "q": "1010001100",
    "2": "101000110100",
    "3": "101000110101",
    "”": "10100011011",
    "x": "101000111",
    "p": "101001",
    "l": "10101",
    "h": "1011",
    "e": "110",
    "s": "1110",
    "b": "111100",
    "g": "111101",
    "c": "11111",
}

We can see that the codes would align pretty closely with our general guess. The shortest code is given the the SPACE character since its used the most frequently. The longest code is given to less used characters like Æ , 6 , 7 (hehe).

When measuring the compression level, we see that the original text is 1453672 bits (each character is assumed to be 1 byte/8 bits). The compressed text uses just 799781 bits. This is a 45% reduction.

With this larger example, I hope you can see how the compression is much better (for reference RLE nearly doubles the size for The Prince example; so almost a -100% reduction). Although there is a tradeoff between compression and runtime performance. We can see that for the simpler method like RLE, we just do one linear scan of the input sequence and we are done. However, for Huffman coding, it has greater time and space (have to computer Huffman tree, frequency dict) complexity.

Not Good Enough

The techniques covered in this article are not really used at all today. This is because doing character level compression just isn’t good enough. Modern compression techniques also need to find patterns and find a more compressed representation for those patterns. Originally, this article was also supposed to cover the LZW algorithm which is concerned about finding patterns. However, it turned out to be quite a bit more complicated than the techniques mentioned here. To avoid making a really long article, I am working on writing a dedicated article to show the evolution of the LZ77 algorithm to LZ78 and then LZW!