HTML Parser
How would you parse out HTML tags from a string?
Example:
input:
<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>
---
output:
[('h1', 'My First Heading'),
('p', 'My first paragraph.'),
('body', '\n\n<h1>My First Heading</h1>\n<p>My first paragraph.</p>\n\n'),
('html', '\n<body>\n\n<h1>My First Heading</h1>\n<p>My first paragraph.</p>\n\n</body>\n'),Note: This is not meant to be a completely robust implementation. It obviously does not parse every aspect of html.
The Idea: The key is to identify the grammar of an html tag, and use a stack to facilitate the deconstruction of the string. An html tag has the following simple grammar: html ::= '<' tag '>' exp '</' tag '>' where tag and exp are arbitrary strings. Each token can be uniquely identified on the basis of what came before it within the stack.
Complexity: O(n) time and O(n) space (consider the html string: '<a>b</a>')
Testing
Last updated
Was this helpful?