HTML Parser

How would you parse out HTML tags from a string?

Example:

input:
<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>
<p>My first paragraph.</p>

</body>
</html>

---

output:
[('h1', 'My First Heading'), 
('p', 'My first paragraph.'), 
('body', '\n\n<h1>My First Heading</h1>\n<p>My first paragraph.</p>\n\n'), 
('html', '\n<body>\n\n<h1>My First Heading</h1>\n<p>My first paragraph.</p>\n\n</body>\n'),

Note: This is not meant to be a completely robust implementation. It obviously does not parse every aspect of html.

The Idea: The key is to identify the grammar of an html tag, and use a stack to facilitate the deconstruction of the string. An html tag has the following simple grammar: html ::= '<' tag '>' exp '</' tag '>' where tag and exp are arbitrary strings. Each token can be uniquely identified on the basis of what came before it within the stack.

Complexity: O(n) time and O(n) space (consider the html string: '<a>b</a>')

Testing

Last updated

Was this helpful?