For the NSAttributedString+HTML open source project I chose to implement HTML parsing with a set of NSScanner category methods. The resulting code is relatively easy to understand but has a couple of annoying drawbacks. You have to convert the NSData into an NSString, which duplicates the document and effectively doubles the amount of memory needed. Then, while parsing, I build an ad hoc tree of DTHTMLElement instances, adding yet another copy of the document in RAM.
When parsing HTML, and by extension XML, you have two modes of operation available. The first is the Simple API for XML (SAX), an event-driven, sequential-access approach where walking through the document triggers events for its individual pieces. The second is to build a tree of nodes, a Document Object Model (DOM). NSScanner lends itself to the SAX style, but in this case that is less than ideal because CSS inheritance needs some sort of hierarchy to walk up.
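To make the difference concrete, here is a minimal sketch in plain C of the event-driven style and of the bookkeeping it pushes onto the handler. It is not libxml code: `sax_handler`, `sax_parse`, `path_ctx` and `last_text_ancestors` are hypothetical names invented for this illustration, and the scanner assumes a toy markup with only `<tag>` and `</tag>` (no attributes, comments, entities or void elements). The parser itself keeps no tree, so the handler maintains its own stack of open elements in order to walk up the ancestors of a text run.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical callback set, loosely modelled on SAX-style handlers. */
typedef struct {
    void (*start_element)(void *ctx, const char *name);
    void (*end_element)(void *ctx, const char *name);
    void (*characters)(void *ctx, const char *text, size_t len);
} sax_handler;

/* Walks the buffer once and fires events; only <tag> and </tag> are understood. */
void sax_parse(const char *doc, const sax_handler *h, void *ctx)
{
    assert(doc && h);
    const char *p = doc;
    while (*p) {
        if (*p == '<') {
            const char *end = strchr(p, '>');
            if (!end) break;                      /* unterminated tag: stop */
            int closing = (p[1] == '/');
            const char *start = p + 1 + closing;
            char name[32];
            size_t len = (size_t)(end - start);
            if (len >= sizeof name) len = sizeof name - 1;
            memcpy(name, start, len);
            name[len] = '\0';
            if (closing) h->end_element(ctx, name);
            else h->start_element(ctx, name);
            p = end + 1;
        } else {
            const char *next = strchr(p, '<');
            size_t len = next ? (size_t)(next - p) : strlen(p);
            h->characters(ctx, p, len);
            p += len;
        }
    }
}

#define MAX_DEPTH 16

/* Handler state: the stack of currently open elements. */
typedef struct {
    char stack[MAX_DEPTH][32];
    int depth;
    char out[128];   /* ancestors of the last text run, innermost first */
} path_ctx;

static void on_start(void *c, const char *name)
{
    path_ctx *ctx = c;
    if (ctx->depth < MAX_DEPTH)
        snprintf(ctx->stack[ctx->depth], sizeof ctx->stack[0], "%s", name);
    ctx->depth++;    /* keep counting even when too deep to record */
}

static void on_end(void *c, const char *name)
{
    path_ctx *ctx = c;
    (void)name;
    if (ctx->depth > 0) ctx->depth--;
}

static void on_text(void *c, const char *text, size_t len)
{
    path_ctx *ctx = c;
    (void)text; (void)len;
    int top = ctx->depth < MAX_DEPTH ? ctx->depth : MAX_DEPTH;
    ctx->out[0] = '\0';
    /* Walk up the hierarchy, as a CSS inheritance lookup would. */
    for (int i = top - 1; i >= 0; i--) {
        strncat(ctx->out, ctx->stack[i], sizeof ctx->out - strlen(ctx->out) - 1);
        if (i > 0)
            strncat(ctx->out, " ", sizeof ctx->out - strlen(ctx->out) - 1);
    }
}

/* Parses doc and returns the ancestor chain of its last text run. */
const char *last_text_ancestors(const char *doc, path_ctx *ctx)
{
    static const sax_handler handler = { on_start, on_end, on_text };
    memset(ctx, 0, sizeof *ctx);
    sax_parse(doc, &handler, ctx);
    return ctx->out;
}
```

Calling `last_text_ancestors("<body><p>Hi <b>there</b></p></body>", &ctx)` leaves the chain `b p body` in `ctx.out` for the text run `there`. A DOM would give you this chain for free through parent pointers, at the cost of keeping the whole tree in memory.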
In this post we will begin to explore the industry-standard libxml2 library and see how we can thinly wrap it in Objective-C so that it plays nicely with our code.