Decompress String

In this exercise, you're going to decompress a compressed string.

Your input is a compressed string of the formatnumber[string]and the decompressed output form should be thestringwrittennumbertimes. For example:

Example:
input: 3[abc]4[ab]c
output: asabcabcabcababababc

Other rules

  • Number can have more than one digit. For example,10[a]is allowed, and just means aaaaaaaaaa

  • One repetition can occur inside another. For example,2[3[a]b]decompresses intoaaabaaab

  • Characters allowed as input include digits, small English letters and brackets[].

  • Digits are only to represent amount of repetitions.

  • Letters are just letters.

  • Brackets are only part of syntax of writing repeated substring.

  • Input is always valid, so no need to check its validity.

The Idea: The first thing that needs to be done is splitting the input strings into tokens we can recognize. This will simplify things a lot later. The tokens we are interested in are brackets, numbers, and characters. Next we step into an algorithm that will match brackets. We can simulate this using a stack. On the instance we find a closing ], the top of the stack will reveal the opening brace. We can use the top of the index to extract two details: the string, and the multiplier for the string, of which will appear to the right and left of the opening delimiter respectfully.

Then come the most crucial part of the algorithm: 1) If we find a previous instance in the stack that is + 1 from its current length, we do this: tree_level[len(s)] += multiplier * (my_str + tree_level[len(s) + 1]), otherwise we do this: tree_level[len(s)] += (multiplier * my_str). The first identifies when we have a nested case: a[2b[3c]], for example. so that ccc from stack 2 becomes bbccc in stack 1. The point is that a could have access to bbcc by looking at stack 0 + 1, which becomes abbccc. Eventually the root node is reached, and we can return 0th index of tree_level.

In the second case, if there isnt a child (+1 stack index), then we append to the string on the same level. This would be equivalent to the non-nested case. For example, 3[abc]3[d], where both d and abc point to the same stack level (0).

Complexity:

import re
import collections


def extract_components(tokens, open_i):
    my_str = tokens[open_i + 1]
    multiplier = tokens[open_i - 1]
    return int(multiplier), my_str


def uncompress_str(compr):
    if not compr:
        return ""

    # delimit by '[' or ']' or by numbers, then by default the characters will be remain and be delimited as well
    # then filter out spaces to ensure proper / expected indexing in our algorithm
    tokens = re.split('(\[|\]|[0-9]+)', compr)
    tokens = [token for token in tokens if token != '']
    tree_level = collections.defaultdict(str)
    s = []

    for i, token in enumerate(tokens):
        if token == '[':
            s.append(i)
        elif token == ']':
            multiplier, my_str = extract_components(tokens, s.pop())
            if my_str.isnumeric():  # checks for instance of empty string
                my_str = ''
            if len(s) + 1 in tree_level:
                tree_level[len(s)] += multiplier * (my_str + tree_level[len(s) + 1])
            else:
                tree_level[len(s)] += (multiplier * my_str)
    return tree_level[0]

Testing

t = '2[abc]2[aaa3[ok2[zz]3[qq2[d]]]]'
print(uncompress_str(t))  # 'abcabcaaaokzzzzqqddqqddqqddokzzzzqqddqqddqqddokzzzzqqddqqddqqddaaaokzzzzqqddqqddqqddokzzzzqqddqqddqqddokzzzzqqddqqddqqdd'

t = '2[abc]3[b]0[nope]2[aaa3[ok]]'
print(uncompress_str(t))  # 'abcabcbbbaaaokokokaaaokokok'

t = '2[3[a]1[b]]'
print(uncompress_str(t))  # 'aaabaaab'

t = '3[b2[c10[d]]]'
print(uncompress_str(t))  # 'bcddddddddddcddddddddddbcddddddddddcddddddddddbcddddddddddcdddddddddd'

t = '3[abc4[ab]]'
print(uncompress_str(t))  # 'abcabababababcabababababcabababab'

t = '3[abc]4[ab]1[c]'
print(uncompress_str(t))  # 'abcabcabcababababc'

t = '10[a]'
print(uncompress_str(t))  # 'aaaaaaaaaa'

t = '1[b2[c3[d2[abcd]]]]'
print(uncompress_str(t))  # 'bcdabcdabcddabcdabcddabcdabcdcdabcdabcddabcdabcddabcdabcd'

Last updated