> For the complete documentation index, see [llms.txt](https://maksimdan.gitbook.io/interview-practice-problems/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://maksimdan.gitbook.io/interview-practice-problems/leetcode_sessions/trees-and-graphs/html-parser.md).

# HTML Parser

How would you parse out HTML tags from a string?

**Example:**

```
input:
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>

---

output:
[('h1', 'My First Heading'), 
('p', 'My first paragraph.'), 
('body', '\n\n<h1>My First Heading</h1>\n<p>My first paragraph.</p>\n\n'), 
('html', '\n<body>\n\n<h1>My First Heading</h1>\n<p>My first paragraph.</p>\n\n</body>\n'),
```

**Note:** This is not meant to be a completely robust implementation. It obviously does not parse every aspect of html.

**The Idea:** The key is to identify the grammar of an html tag, and use a stack to facilitate the deconstruction of the string. An html tag has the following simple grammar: `html ::= '<' tag '>' exp '</' tag '>'` where tag and exp are arbitrary strings. Each token can be uniquely identified on the basis of what came before it within the stack.

**Complexity:** O(n) time and O(n) space (consider the html string: `'<a>b</a>'`)

```python
from enum import Enum


class Token(Enum):
    OPEN1 = 1
    OPEN2 = 2
    CLOSE1 = 3
    CLOSE2 = 4


def html_parser(html_str):
    init_html = len("<!DOCTYPE html>")
    assert(html_str[0:init_html] == "<!DOCTYPE html>")
    html_str = html_str[init_html:]
    tag_expr = []
    s = []
    for i, char in enumerate(html_str):
        # this if statement is ultimately just a placeholder for the second statement
        if char == '<' and i+1 < len(html_str) and html_str[i+1] != '/':
            s.append((i, Token.OPEN1))

        # this will be used to identify the start of the expression
        elif char == '>' and len(s) != 0 and s[-1][1] == Token.OPEN1:
            s.pop()
            s.append((i, Token.CLOSE1))

        # this identifies the end of the expression
        elif char == '<' and i+1 < len(html_str) and s[-1][1] == Token.CLOSE1:
            tmp_expr = html_str[s[-1][0] + 1:i]
            s.pop()
            s.append((i, Token.OPEN2))

        # this identifies the tag name (see that the first tag name is ignored)
        elif char == '>' and len(s) != 0 and s[-1][1] == Token.OPEN2:
            tmp_tag = html_str[s[-1][0] + 2:i]
            tag_expr.append((tmp_tag, tmp_expr))
            s.pop()

    return tag_expr
```

**Testing**

```python
ex_doc1 = \
"""<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>"""

print(html_parser(ex_doc1))


ex_doc2 = \
"""<!DOCTYPE html>
<html>
<body>

<p>
This paragraph
contains a lot of lines
in the source code,
but the browser
ignores it.
</p>

<p>
This paragraph
contains a lot of spaces
in the source code,
but the browser
ignores it.
</p>

<p>
The number of lines in a paragraph depends on the size of the browser window. If you resize the browser window, the number of lines in this paragraph will change.
</p>

</body>
</html>"""

print(html_parser(ex_doc2))
```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://maksimdan.gitbook.io/interview-practice-problems/leetcode_sessions/trees-and-graphs/html-parser.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
