Parsing HTML with Nim using NimPath

2023-07-11 00:00 | Wesley Kerfoot

This is a quick guide on how to use my library NimPath to extract information from an HTML document, using XPath.

If you are not familiar with XPath, you can learn more by searching the web for introductory material. In short, XPath is a query language that lets you write complex queries against HTML or XML documents in order to extract nodes. This is useful for many things, such as writing automated tests, web scraping, and validation.

Simple example

Here is an example of how to use the library, followed by an explanation of each part:

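# options provides get/isSome, sequtils provides toSeq/map, sugar provides =>
import std/[options, sequtils, sugar]
# assumes the library is importable under its package name
import nimpath

var parsed = parseHTML("<html><body><h1>foobar</h1></body></html>", "")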

var nodes : seq[HTMLNode] = toSeq(xpathQuery(parsed, "//*"))

assert map(nodes, (n) => n.node_name.get) == @["html", "body", "h1"]

for node in xpathQuery(parsed, "//*"):
    if node.node_name.isSome and node.node_name.get == "h1":
        assert node.textContent.get == "foobar"

The first line uses the parseHTML function to parse the HTML. We then pass the parsed document to xpathQuery along with an XPath expression (in this case //*, which matches every element in the document). xpathQuery is an iterator, so to get a sequence of results it has to be wrapped in toSeq.

The next part is an assertion that checks that we did indeed get every element, by looking at the name of each node returned. Note that xpathQuery returns the nodes it finds as a flat sequence, even though the HTML structure is nested. This means a node in the sequence can contain other nodes that also appear elsewhere in the sequence (here, html contains body, which contains h1).

We can access nodes either inside the body of the for loop over the iterator, or as elements of the sequence if we convert it with toSeq. It is important to know that the iterator takes care of cleaning things up when it finishes, so anything accessed in the loop body must be copied into another variable if you want to keep it around.
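As a minimal sketch of copying values out of the loop body, the snippet below collects the text of each h1 into a separate sequence of strings. It only uses the calls shown above (parseHTML, xpathQuery, textContent); the sample HTML and the headings variable are made up for illustration, and the import line assumes the module is importable as nimpath.

```nim
import std/options
import nimpath  # assumption: the module name matches the package name

# Hypothetical sample document with two headings.
var parsed = parseHTML("<html><body><h1>foo</h1><h1>bar</h1></body></html>", "")

# Copy the text out of each node while the iterator is still running.
# Nim strings are values, so the copies remain valid after the loop ends.
var headings: seq[string] = @[]
for node in xpathQuery(parsed, "//h1"):
  if node.textContent.isSome:
    headings.add(node.textContent.get)

assert headings == @["foo", "bar"]
```

After the loop, headings can be used freely because it no longer refers to anything owned by the iterator.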

Example showing querying subnodes

A very useful feature is that nodes returned by an xpathQuery iterator can themselves be queried with a new XPath expression. You can use either queryWithContext, or getSingleWithContext, which returns only the first result that queryWithContext would find.

This is convenient if you want to split an XPath expression into steps, for example to do some programmatic processing on a node before matching on its subnodes:

# parsed is a document returned by parseHTML, as in the first example
for node in xpathQuery(parsed, "//div"):
  for subnode in queryWithContext(node, ".//*"):
    # assume the document has only span nodes under all div nodes
    assert $subnode.node.name == "span"

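Note that you must include the . at the beginning of the XPath expression for the subnodes. It indicates that the expression is relative to the current context node, which is the node passed to queryWithContext. An expression that starts with // alone is evaluated from the root of the document.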
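To illustrate relative queries, here is a small sketch that counts the span elements under each div separately. The sample HTML and the spansPerDiv variable are hypothetical, the import line assumes the module is importable as nimpath, and the expected counts assume that queryWithContext evaluates the expression against the node it is given, as described above.

```nim
import nimpath  # assumption: the module name matches the package name

# Hypothetical document: the first div has one span, the second has two.
let html = "<html><body>" &
  "<div><span>a</span></div>" &
  "<div><span>b</span><span>c</span></div>" &
  "</body></html>"
var parsed = parseHTML(html, "")

var spansPerDiv: seq[int] = @[]
for node in xpathQuery(parsed, "//div"):
  var count = 0
  # The leading . restricts the search to descendants of this div.
  for subnode in queryWithContext(node, ".//span"):
    inc count
  spansPerDiv.add(count)

assert spansPerDiv == @[1, 2]
```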

See the tests for more examples of how to use NimPath.