XpinDoc, a utility library for processing XML documents. Useful for generating output translations (such as groff, LaTeX and HTML) from XML source files, as well as for data capture applications.
XpinDoc is a programmer's library in C providing a framework for processing XML documents. This XpinDoc implementation is built "on top" of James Clark's expat XML parser. (The implementation is designed to be adaptable to other parsers; work in this area is in progress.)
The XpinDoc library is suitable for many common XML processing applications, such as document translation and data extraction. The programmer's interface offers a familiar, event-driven, user-definable callback architecture. The purpose of this interface is to simplify the application development effort to three steps:
The idea is to allow the developer to focus on the "logic" of the document itself. Compared to building an application up from scratch with a generic API such as SAX/SAX2, the XpinDoc library provides some advantages:
XpinDoc is designed primarily for processing standalone documents. However, documents with external entities and default attribute definitions are easily pre-parsed with a supplemental external parser, such as xmllint from Daniel Veillard's libxml, or osx from the OpenSP/OpenJade project (originally the SP/nsgmls parser by James Clark). Such external parsers may also be useful to validate the document against an external DTD--or other validation scheme--as the document is authored. Final processing of the document may then be passed to the XpinDoc application. As an example, consider the following command pipe where xpinman is a XpinDoc application:
Here osx is used to validate the XML document and resolve external entities, xpinman performs the translation, and tidy further cleans up the HTML output.
Strengths:
Limitations:
    #include <xpindoc.h>

    /* Escape HTML-special characters in character data before output. */
    void my_html_filter(XpinDoc X, const char *data, int len)
    {
        const char *c = data;

        while(len > 0){
            if(*c == '<')
                X->put_string(X, "&lt;");
            else if(*c == '>')
                X->put_string(X, "&gt;");
            else if(*c == '&')
                X->put_string(X, "&amp;");
            else
                X->put_char(X, *c);
            ++c;
            --len;
        }
    }

    void my_startdoc(XpinDoc xpin)
    {
        xpin->put_string(xpin, "<html>\n");
        // ...
    }

    /* One handler serves both the start and the end event of the element. */
    void my_title(XpinDoc xpin)
    {
        int event_type = xpin->event_type(xpin);

        if(event_type == XPIN_START_ELEMENT){
            xpin->put_string(xpin, "<title>");
        }
        if(event_type == XPIN_END_ELEMENT){
            xpin->put_string(xpin, "</title>\n");
        }
    }

    /* ... */

    void my_enddoc(XpinDoc xpin)
    {
        xpin->put_string(xpin, "</html>");
    }

    int main(int argc, char **argv)
    {
        XpinDoc xpin = new_xpindoc(XPIN_DEFAULT);
        if(xpin == NULL)
            my_die("error creating xpindoc");

        /* set default mode to output: */
        xpin->set_datamode(xpin, XPIN_OUTPUT);

        /* filter incoming chardata content for html output: */
        xpin->set_filter(xpin, &my_html_filter);

        /* set the event handlers: */
        xpin->set_handler(xpin, "Start_Document", &my_startdoc);
        xpin->set_handler(xpin, "<title/>", &my_title);
        xpin->set_handler(xpin, "/*/para/emph", &my_bolder);
        xpin->set_handler(xpin, "<?html-css?>", &my_css);
        xpin->set_handler(xpin, "End_Document", &my_enddoc);
        // ...

        /* run the parse: */
        xpin->parse_stream(xpin, stdin);

        /* clean up: */
        xpin->free(xpin);

        return 0;
    }
The above application snippet, saved in the file testxpin.c, can be compiled and linked with the XpinDoc library as follows:
Compile the code and install the library. You are now ready to write XpinDoc applications.
In the current release, the Makefile targets include:
Note: make targets for building a shared library, or for installing the library, documentation, etc., are not provided in this release. These steps may be easily performed "by hand" according to one's platform and system preferences.
Note also that the current release is dependent on the expat parser library, which is available separately. Expat should be installed before building XpinDoc.
A XpinDoc application is accessed and controlled through a top-level XpinDoc object:
    #include <xpindoc.h>

    int my_main()
    {
        int flags = XPIN_NAMESPACE | XPIN_QNAME;
        XpinDoc X = new_xpindoc(flags);
        // ...
    }
The flags argument may be used to control the features of the parser object. Flags may be combined (OR'd) as shown. The flags currently implemented include:
A XpinDoc parse may be configured to operate in one of two modes, through the set_datamode() method:
    int my_main()
    {
        int mode = XPIN_OUTPUT;
        // ...
        X->set_datamode(X, mode);
        // ...
    }
The mode argument may take one of the following values:
The set_mydata() method may be used to pass any arbitrary supplementary data to event handlers, where it may be retrieved by the mydata() method:
    void my_handler(XpinDoc X)
    {
        //...
        mydata = (struct mydata *)X->mydata(X);
        //...
    }

    int my_main()
    {
        struct mydata *mydata = NULL;
        // ...
        X->set_mydata(X, mydata);
        // ...
    }
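The round trip of a user-data pointer can be tried without the library. The sketch below is not XpinDoc code: `DemoDoc` is a hypothetical stand-in that only mimics the method-pointer calling style shown above, and `struct counter` and every `demo_` name are assumptions made for illustration. It shows a handler that receives nothing but the parser object and still reaches application state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a parser object; NOT the XpinDoc type. */
typedef struct demo_doc *DemoDoc;
struct demo_doc {
    void *userdata;                         /* opaque pointer owned by the caller */
    void  (*set_mydata)(DemoDoc, void *);
    void *(*mydata)(DemoDoc);
};

static void  demo_set_mydata(DemoDoc d, void *p) { d->userdata = p; }
static void *demo_mydata(DemoDoc d)              { return d->userdata; }

/* Initialise the stand-in object with its two "methods". */
void demo_init(DemoDoc d)
{
    assert(d != NULL);
    d->userdata   = NULL;
    d->set_mydata = demo_set_mydata;
    d->mydata     = demo_mydata;
}

/* Application state shared between handlers (illustrative). */
struct counter { int paragraphs; };

/* A handler in the style used in this document: it is given only the
 * parser object and recovers its own state through mydata(). */
void count_para(DemoDoc d)
{
    struct counter *c;

    assert(d != NULL);
    c = (struct counter *)d->mydata(d);
    if (c != NULL)          /* tolerate a parse with no user data set */
        c->paragraphs++;
}
```

Checking the pointer for NULL inside the handler keeps it safe to install even when the main routine never supplies any data.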
The heart of a XpinDoc application is the set_handler() method:
    void my_handler(XpinDoc X)
    {
        //...
    }

    int my_main()
    {
        // ...
        err = X->set_handler(X, keystr, &my_handler);
        // ...
    }
Where the keystr argument is a nul-terminated constant character string taking one of the following forms:
Additionally, several default handlers may be installed by specifying keystr as one of the following exact (case-insensitive) strings:
To develop a XpinDoc application, it is necessary to understand the simple dispatch logic used in calling the installed handlers. By way of illustration, the following (ugh!) ASCII chart sketches the flow of control that XpinDoc uses for processing a start element event:
    Test           Description                     Action
    ---------      --------------------------      --------------------

 1. NAMESPACE      is node namespace non-null
                   and handler installed ?         yes --> call handler --+
                             |                                            |
                             no                                           |
                             \/                                           |
 2. "/tagpath"     is handler installed for                               |
                   node matching tagpath                                  |
                   expression ? (LIST search)      yes --> call handler --+
                             |                                            |
                             no                                           |
                             \/                                           |
 3. "<tag>"        is handler installed for                               |
                   node matching tag ?                                    |
                   (HASH search)                   yes --> call handler --+
                             |                                            |
                             no                                           |
                             \/                                           |
 4. default        is a default handler                                   |
                   installed ?                     yes --> call handler --+
                             |                                            |
                             no                                           |
                             |                                            \/
                             +--> (do nothing) ------------------> continue parse
A brief explanation and rationale for the dispatch logic:
At most one handler will be called for an element.
(1.) If namespace-aware parsing is on, a "NS_Start_Element" handler is defined, and the current element has a non-null namespace URI, the defined handler will be called for the element. That is, as described in the following XpinDoc snippet:
    Xpin_Node N = X->node(X);
    if((X->ns_parser(X) != 0) && (N->ns_uri(N) != NULL)){
        //...
This allows an application, if it chooses, to "filter out" (ALL!) elements not belonging to the native namespace and pass them to a special handler.
(2.) If a handler is defined for a tagpath expression matching the current element, this handler will be called for the element. Tagpath expressions are able to specify an element's position and relationship to other elements in a document more specifically than the tagname. This allows fine-grained control to be applied before more general control.
As an example, consider a handler installed for the tagpath expression "/*/emph/emph", and another installed for the tagname "<emph>". The tagpath handler will be called for nested "<emph>" elements, while the tagname handler will catch other instances.
Note that tagpath handlers are installed in a list object, and items are tested in the same order as they are inserted. The first matching handler will be used for the element. This means that the application should install more specific tagpath handlers before less specific handlers. That is, a handler for "/*/item/list" should be installed before "/*/list".
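The effect of insertion order can be seen in a small stand-alone model. This is a sketch, not the library's matcher: it assumes that a pattern with a leading wildcard segment matches any element path ending in the rest of the pattern, and `demo_rule`, `demo_path_matches` and `demo_first_match` are hypothetical names. Handlers are represented by plain strings so the result is easy to inspect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed semantics, for illustration only: a pattern that starts with
 * a wildcard segment matches every path that ends in the remainder of
 * the pattern; any other pattern must equal the path exactly. */
int demo_path_matches(const char *pattern, const char *path)
{
    const char *tail;
    size_t plen, tlen;

    assert(pattern != NULL && path != NULL);
    if (strncmp(pattern, "/*/", 3) != 0)
        return strcmp(pattern, path) == 0;

    tail = pattern + 2;                 /* keep the '/' that starts the tail */
    plen = strlen(path);
    tlen = strlen(tail);
    return plen >= tlen && strcmp(path + plen - tlen, tail) == 0;
}

/* One installed handler: a pattern plus a label standing in for the
 * function pointer a real application would store. */
struct demo_rule {
    const char *pattern;
    const char *handler_name;
};

/* Scan the rules in insertion order and return the label of the first
 * match, or NULL when no rule matches. */
const char *demo_first_match(const struct demo_rule *rules, size_t n,
                             const char *path)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (demo_path_matches(rules[i].pattern, path))
            return rules[i].handler_name;
    return NULL;
}
```

With the rule for "/*/item/list" placed first, the path "/doc/item/list" selects it; with the order reversed, the more general "/*/list" rule matches first and the specific rule is never reached.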
Note also that if namespace processing is enabled with usage of qualified names (XPIN_NAMESPACE | XPIN_QNAME), the tagpath expression for elements in a foreign namespace will include the namespace prefix. The application may install handlers for elements in a foreign namespace by specifying the prefix in the tagpath expression, such as "/*/book:para" or "/*/groff:tbl", etc. All elements with a particular namespace prefix may be handled by using a wildcard tagpath expression such as "/*/db:*".
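How a prefix wildcard might be compared against a qualified element name can be sketched as follows. The function is a hypothetical illustration and makes no claim about how XpinDoc implements matching; it handles a single pattern segment that is a literal qualified name, a bare star, or a prefix followed by a colon and a star.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 when the qualified name matches the pattern segment, else 0.
 * Hypothetical helper: the three accepted segment forms are an
 * assumption drawn from the examples in the text. */
int demo_segment_matches(const char *segment, const char *qname)
{
    size_t slen;

    assert(segment != NULL && qname != NULL);
    slen = strlen(segment);

    /* A bare star accepts any element name. */
    if (strcmp(segment, "*") == 0)
        return 1;

    /* A prefix wildcard: compare the prefix and its colon, then require
     * at least one character of local name after the colon. */
    if (slen >= 3 && strcmp(segment + slen - 2, ":*") == 0)
        return strncmp(segment, qname, slen - 1) == 0
            && qname[slen - 1] != '\0';

    /* Otherwise the segment is a literal qualified name. */
    return strcmp(segment, qname) == 0;
}
```

Comparing the prefix together with its colon keeps a pattern for the prefix "db" from also accepting names whose prefix merely starts with those letters, such as "dbx:para".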
(3.) If a handler is defined for the element's tagname, this handler will be called for the element. Tagname handlers are installed in hash objects, so handlers may be installed in any order.
Note also that if namespace processing is enabled with usage of qualified names (XPIN_NAMESPACE | XPIN_QNAME), the tagname for elements in a foreign namespace will include the namespace prefix. The application may then install handlers for elements in a foreign namespace by specifying the prefix in the tagname, such as "<html:table>", "<poem:verse>", etc.
(4.) Finally, if no handler for the element has yet been found, and a default handler has been installed for "Start_Element", then this handler will be used for the element. Otherwise the application will not call any handler for the element, and parsing will continue to the next event.
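The four tests described above reduce to a short chain of conditions. The following stand-alone model is a sketch of that ordering only: the lookups themselves are replaced by boolean inputs, and `demo_state`, `demo_choice` and `demo_dispatch` are hypothetical names that do not appear in the library.

```c
#include <assert.h>
#include <stddef.h>

/* Which kind of handler is selected for a start element event. */
enum demo_choice {
    DEMO_NONE,        /* no handler: the parse simply continues */
    DEMO_NAMESPACE,   /* test 1 */
    DEMO_TAGPATH,     /* test 2 */
    DEMO_TAGNAME,     /* test 3 */
    DEMO_DEFAULT      /* test 4 */
};

/* Outcome of each lookup, supplied by the caller in this model. */
struct demo_state {
    int ns_parsing;       /* namespace-aware parsing is on */
    int ns_handler;       /* a namespace start-element handler is installed */
    int node_has_ns_uri;  /* current element has a non-null namespace URI */
    int tagpath_match;    /* a tagpath handler matches the element */
    int tagname_match;    /* a tagname handler exists for the element */
    int default_handler;  /* a default start-element handler is installed */
};

/* Apply the tests in order; the first one that succeeds wins, so at
 * most one handler is ever selected for an element. */
enum demo_choice demo_dispatch(const struct demo_state *s)
{
    assert(s != NULL);
    if (s->ns_parsing && s->ns_handler && s->node_has_ns_uri)
        return DEMO_NAMESPACE;
    if (s->tagpath_match)
        return DEMO_TAGPATH;
    if (s->tagname_match)
        return DEMO_TAGNAME;
    if (s->default_handler)
        return DEMO_DEFAULT;
    return DEMO_NONE;
}
```

Because each test returns as soon as it succeeds, a tagname handler is never consulted for an element that a tagpath handler already matched, which is the behaviour the nested "emph" example relies on.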
A XpinDoc object provides:
During the course of a parse, a XpinDoc application may access one or more of the following objects:
Each of these objects is described in its own section below.
To be continued...
XpinDoc isn't particularly innovative or ground-breaking. Historically, XpinDoc follows from an earlier SGML utility called "SpinDoc" that I implemented in Python. This, in turn, was influenced primarily by David Megginson's SGMLS.pm library in Perl. (Megginson's work, of course, went on to be highly influential in the development of SAX.) XpinDoc has also been influenced by instant/transpec, Cost, and other SGML/XML tools.
Please see the source distribution for additional documentation and sample XpinDoc applications.