Implementation

Data-structures:

  • Priority-queue

  • Prefix tree

English

  • Place all frequencies of nodes into a priority queue, by ascending order

  • Select two most minimum frequencies (delete min x2), sum these together, and insert a new frequency fw+1f_{w+1}.

  • Use this to build internal nodes for a binary-tree and adds an addition bit to each of the subtrees involved, bottom up. Have leaf nodes contain the data at the bottom.

  • Stop once priority queue is <1 (the root node is reached).

Self Notes

  • Node represents both an internal node within the tree, as well as the base nodes for the original characters. I used two different constructors to identify them both.

  • Analyzing the file had just meant finding the frequency of characters, and then doing some math to discover the total number of bits used. I used the following calcuations to analyize the compression.

// original
bitlength = log_2(#unique_chars)
byte_size_per_char = bit_length * freq
total_byes = sum(byte_size_per_char)

// compressed
count_bits = #of bits to reach parent node from base
byte_size_per_char = count_bits * freq
total_bytes = sum(byte_size_per_char)
  • Initially I defined the base characters the base of the tree in base_nodes hashmap.

  • Inserting the node meant adding having two base nodes point to a combined parent node.

  • With id_char_bit_seq, I decided to not go the extra mile of hashing the user id. This is ok since the number of unique characters tends not to exceed 300 (ASCII).

Last updated