Taming HTML Parsing with libxml (1)

For the NSAttributedString+HTML Open Source project I chose to implement parsing of HTML with a set of NSScanner category methods. The resulting code is relatively easy to understand but has a couple of annoying drawbacks. You have to duplicate the NSData and convert it into an NSString, effectively doubling the amount of memory needed. Then, while parsing, I am building an ad hoc tree of DTHTMLElement instances, adding yet another copy of the document in RAM.

When parsing HTML, and by extension XML, you have two operating modes available. With the Simple API for XML (SAX), walking through the document triggers events for its individual pieces. The second method is to build a tree of nodes, a Document Object Model (DOM). NSScanner lends itself to SAX-style parsing, but in this case that is less than ideal because CSS inheritance needs some sort of hierarchy to walk up.

In this post we will begin to explore the industry-standard libxml library and see how we can thinly wrap it in Objective-C so that it plays nicely with our code.

Getting libxml into your Xcode project is straightforward. Fortunately for us libxml is so old and established that you can find it already installed on Unix, Mac and iOS platforms. There are two kinds of libraries in C: static and dynamic. libxml is the latter, which you can recognize by the .dylib extension.

Adding the Library

First we need to add the library providing all the XML and HTML structures and functions. We are actually using libxml2; the file libxml2.dylib is a symbolic link to libxml2.2.dylib.

Next, because libxml is not a framework that would package the necessary headers with it, we also need to tell Xcode where the headers can be found. Since libxml also comes with OS X, its headers, like those of all other OS X system libraries, can be found in /usr/include. Add /usr/include/libxml2 to the Header Search Paths and we’re set.

Now all we need to do to access libxml’s parsing functions and data structures is to add the appropriate import. Most of the internal structures are shared between the XML and HTML parsers, so we just need the HTMLparser header.

    #import <libxml/HTMLparser.h>

Document Structure

Before we get into parsing let me show you how libxml represents HTML documents. Everything in libxml is a node. Because C does not have a concept of objects, the classical method of representing a tree is to have C structs with member variables pointing to other structs. A child is just a pointer to the child struct/node. If there can be more than one item, i.e. a list, this is represented by a linked list where the first node points to the next and so on until the very last node, whose next pointer is NULL.

The smallest unit in libxml is the xmlNode structure, which is defined as such:

    /**
     * xmlNode:
     *
     * A node in an XML tree.
     */
    typedef struct _xmlNode xmlNode;
    typedef xmlNode *xmlNodePtr;
    struct _xmlNode {
        void           *_private;   /* application data */
        xmlElementType   type;      /* type number, must be second ! */
        const xmlChar   *name;      /* the name of the node, or the entity */
        struct _xmlNode *children;  /* parent->childs link */
        struct _xmlNode *last;      /* last child link */
        struct _xmlNode *parent;    /* child->parent link */
        struct _xmlNode *next;      /* next sibling link */
        struct _xmlNode *prev;      /* previous sibling link */
        struct _xmlDoc  *doc;       /* the containing document */

        /* End of common part */
        xmlNs           *ns;        /* pointer to the associated namespace */
        xmlChar         *content;   /* the content */
        struct _xmlAttr *properties;/* properties list */
        xmlNs           *nsDef;     /* namespace definitions on this node */
        void            *psvi;      /* for type/PSVI informations */
        unsigned short   line;      /* line number */
        unsigned short   extra;     /* extra data for XPath/XSLT */
    };

The useful links, depicted in the above chart, are children, last, parent, next, prev and doc.

The type value indicates the role this node plays. If it is a tag then it is an XML_ELEMENT_NODE.
The contents of a tag are represented by an XML_TEXT_NODE. Attributes are XML_ATTRIBUTE_NODE. Note that even if the original HTML does not contain a DTD, html or body tag, these will be implied by the parser.

Let’s Parse Already

I sense that you are growing impatient with me. OK, OK, we’re getting right to it now that you understand how libxml represents DOMs. Assume we have some HTML data downloaded from the web; its NSURL is in _baseURL.

    // NSData data contains the document data
    // encoding is the NSStringEncoding of the data
    // baseURL is the document's base URL, i.e. location

    CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
    CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
    const char *enc = CFStringGetCStringPtr(cfencstr, …
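
To make the linking scheme from the Document Structure section more tangible, here is a minimal sketch in plain C. It does not use libxml at all: DemoNode, DemoNodeType and every demo_ function are hypothetical names invented for this illustration, and the struct keeps only the linking fields (children, last, parent, next, prev) described above. It shows how a child list is kept as a linked list, how a depth-first walk follows children and next, and how walking up parent yields the kind of hierarchy that CSS inheritance needs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for xmlNode: only the linking fields. */
typedef enum { DEMO_ELEMENT_NODE, DEMO_TEXT_NODE } DemoNodeType;

typedef struct DemoNode {
    DemoNodeType     type;
    const char      *name;
    struct DemoNode *children; /* first child */
    struct DemoNode *last;     /* last child */
    struct DemoNode *parent;   /* child->parent link */
    struct DemoNode *next;     /* next sibling */
    struct DemoNode *prev;     /* previous sibling */
} DemoNode;

/* Append child to the end of parent's child list, keeping all links consistent. */
static void demo_append_child(DemoNode *parent, DemoNode *child)
{
    child->parent = parent;
    child->next = NULL;
    child->prev = parent->last;
    if (parent->last != NULL)
        parent->last->next = child;   /* link after the current last child */
    else
        parent->children = child;     /* first child of this parent */
    parent->last = child;
}

/* Depth-first walk: count the element nodes in the subtree rooted at node. */
static int demo_count_elements(const DemoNode *node)
{
    if (node == NULL)
        return 0;
    int count = (node->type == DEMO_ELEMENT_NODE) ? 1 : 0;
    for (const DemoNode *child = node->children; child != NULL; child = child->next)
        count += demo_count_elements(child);
    return count;
}

/* Walk up the parent links; the root has depth 0. */
static int demo_depth(const DemoNode *node)
{
    int depth = 0;
    while (node->parent != NULL) {
        node = node->parent;
        depth++;
    }
    return depth;
}

/* Build html > body > (p > text, br) and return the number of element nodes. */
static int demo_sample_element_count(void)
{
    DemoNode html = { DEMO_ELEMENT_NODE, "html" };
    DemoNode body = { DEMO_ELEMENT_NODE, "body" };
    DemoNode p    = { DEMO_ELEMENT_NODE, "p" };
    DemoNode text = { DEMO_TEXT_NODE,    "text" };
    DemoNode br   = { DEMO_ELEMENT_NODE, "br" };

    demo_append_child(&html, &body);
    demo_append_child(&body, &p);
    demo_append_child(&p, &text);
    demo_append_child(&body, &br);

    assert(body.children == &p && body.last == &br);
    assert(demo_depth(&text) == 3);
    return demo_count_elements(&html);
}
```

The same loop shape (start at children, follow next until NULL, recurse for each child) should carry over to real xmlNode trees, since they expose fields with these names; the allocation and cleanup that libxml handles for you are deliberately left out of this sketch.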