XpinDoc
An XML parse instance, document
processing utility library.
Introduction
XpinDoc is a programmer's library in C providing a
framework for processing XML documents. This XpinDoc implementation
may be built "on top" of either James Clark's expat parser,
or Richard Tobin's RXP parser. (The implementation is
designed to be easily adapted to other parsers; further work in
this area is coming soon.)
The XpinDoc library is suitable for many common XML processing
applications, such as document translation and data extraction. The
programmer's interface offers a familiar, event-driven,
user-definable callback architecture. The purpose of this interface
is to simplify the application development effort to three
steps:
- define methods for specific events in the document
- install them in the XpinDoc object
- let XpinDoc rip
The idea is to allow the developer to focus on the "logic" of
the document itself. Compared to building an application up from
scratch with a generic API such as SAX/SAX2, the XpinDoc library
provides some advantages:
- built-in dispatch mechanism to user-designated handlers
- node container objects, with all element, attribute, namespace,
and content data readily available
- accessible hierarchical list of parent element nodes, from
current element to document root
- installable chardata input filter
- flexible capture or output modes; even in output mode,
application can capture chardata within the "text" object provided
by the current node
- stack-based output "writer" object: direct chardata to output,
file, pipe, text object, or null
- "tagpath" element node descriptor, using fnmatch()-style
wildcard and comparison methods
- supports alternative/customized "xpin_reader" interfaces;
source input need not even be XML
- etc...
Using expat as the default XML parser, XpinDoc is designed
primarily for processing standalone documents. However, documents
with external entities and default attribute definitions are easily
pre-parsed with a supplemental external parser, such as
xmllint from Daniel Veillard's libxml, rxp from
Richard Tobin's RXP, or osx from the OpenSP/OpenJade project
(originally the SP/nsgmls parser by James Clark). Such external
parsers may also be useful to validate the document against an
external DTD--or other validation scheme--as the document is
authored. Final processing of the document may then be passed to
the XpinDoc application. As an example, consider the following
command pipe where xpinman is a XpinDoc application:
osx xpindoc_manual.xml | xpinman | tidy > xpindoc_manual.html
Here osx is used to validate the XML document and resolve
external entities, xpinman performs the translation, and tidy
further cleans up the HTML output.
Strengths:
- for those who prefer C
- lightweight, portable, fast
- namespace support
- ability to "lock-in" XML processing applications and distribute
standalone binaries
- may be built with alternative parsers
- tested on FreeBSD, OpenBSD and Linux platforms
Limitations:
- present implementation does not support "wide" characters
- build/install engineering needs work
- this documentation is pathetic
Synopsis
#include
void
my_html_filter(XpinDoc X, const char *data, int len)
{
    const char *c = data;

    /* escape the characters that are special in HTML: */
    while(len > 0){
        if(*c == '<') X->put_string(X, "&lt;");
        else if(*c == '>') X->put_string(X, "&gt;");
        else if(*c == '&') X->put_string(X, "&amp;");
        else X->put_char(X, *c);
        ++c;
        --len;
    }
    return;
}
void
my_startdoc(XpinDoc xpin)
{
    xpin->put_string(xpin, "\n");
    // ...
}
void
my_title(XpinDoc xpin)
{
    int event_type = xpin->event_type(xpin);
    if(event_type == XPIN_START_ELEMENT){
        xpin->put_string(xpin, "");
    }
    if(event_type == XPIN_END_ELEMENT){
        xpin->put_string(xpin, "\n");
    }
}
/* ... */
void
my_enddoc(XpinDoc xpin)
{
    xpin->put_string(xpin, "");
}
int
main(int argc, char **argv)
{
    XpinDoc xpin = new_xpindoc(XPIN_DEFAULT);
    if(xpin == NULL)
        my_die("error creating xpindoc");
    /* set default mode to output: */
    xpin->set_datamode(xpin, XPIN_OUTPUT);
    /* filter incoming chardata content for html output: */
    xpin->set_filter(xpin, &my_html_filter);
    /* set the event handlers: */
    xpin->set_handler(xpin, "Start_Document", &my_startdoc);
    xpin->set_handler(xpin, "", &my_title);
    xpin->set_handler(xpin, "/*/para/emph", &my_bolder);
    xpin->set_handler(xpin, "", &my_css);
    xpin->set_handler(xpin, "End_Document", &my_enddoc);
    // ...
    /* run the parse: */
    xpin->parse_stream(xpin, stdin);
    /* clean up: */
    xpin->free(xpin);
    return 0;
}
The above application snippet, saved in the file
testxpin.c, may be compiled and linked with the XpinDoc
library as follows:
gcc -Wall -O2 -o testxpin testxpin.c -lxpindoc
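The escaping done by my_html_filter in the synopsis can be tried on its own, without XpinDoc. What follows is only a minimal sketch under stated assumptions: html_escape() is a hypothetical helper, not part of the XpinDoc API, and it writes into a caller-supplied buffer where the real filter would call put_string() and put_char().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an XpinDoc function): escape len bytes of data
 * into out, which has room for outsz bytes. The result is always
 * NUL-terminated; input that does not fit is silently truncated.
 * Returns the number of bytes written, excluding the NUL. */
size_t
html_escape(const char *data, size_t len, char *out, size_t outsz)
{
    size_t n = 0;

    assert(data != NULL && out != NULL && outsz > 0);
    for (size_t i = 0; i < len; i++) {
        const char *rep = &data[i];   /* default: copy the byte unchanged */
        size_t rlen = 1;

        switch (data[i]) {
        case '<': rep = "&lt;";  rlen = 4; break;
        case '>': rep = "&gt;";  rlen = 4; break;
        case '&': rep = "&amp;"; rlen = 5; break;
        default: break;
        }
        if (n + rlen >= outsz)
            break;                    /* keep one byte free for the NUL */
        memcpy(out + n, rep, rlen);
        n += rlen;
    }
    out[n] = '\0';
    return n;
}
```

Passing the length explicitly mirrors the (data, len) arguments of the filter callback, since chardata handed to a filter need not be NUL-terminated.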
Installation
Compile the code and install the library. You are now ready to
write XpinDoc applications.
In the current release, the Makefile targets include:
- "make library", to build a static library
- "make xpinman", to build the XpinDoc application for processing
the documentation
- "make manual", to translate the manual into HTML format
Note: make targets for building a shared library, or for
installing the library, documentation, etc., are not provided in
this release. These steps may be easily performed "by hand"
according to one's platform and system preferences.
Note also that the current release requires at least one of the
supported XML parsers, expat or RXP, which are available
separately. These libraries should be built before building
XpinDoc. The default Makefile with XpinDoc supports the expat
library; see Makefile.rxp and README.rxp for working with the RXP
parser.
Getting Started
A XpinDoc application is accessed and controlled through a
top-level XpinDoc object:
#include
int
my_main()
{
    int flags = XPIN_NAMESPACE | XPIN_QNAME;
    XpinDoc X = new_xpindoc(flags);
    // ...
}
The flags argument may be used to control the features of the
parser object. Flags may be combined (OR'd) as shown. The flags
currently implemented include:
- XPIN_DEFAULT
  No flags are set.
- XPIN_NAMESPACE
  Enables namespace-aware parsing.
- XPIN_QNAME
  If namespace processing is enabled, the qualified element name
  (prefix:local_name) is used for the tagname and tagpath expressions
  of elements in a foreign namespace; otherwise, the local_name is
  used.
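The naming rule behind XPIN_QNAME can be pictured with a small stand-alone sketch. Everything here is an assumption made for illustration: the SK_ flag values are invented (the real XPIN_ values are not documented in this section), sk_tagname() is a hypothetical function, and a non-empty prefix is taken to mean "element in a foreign namespace".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Invented stand-ins for XPIN_NAMESPACE and XPIN_QNAME; the real values
 * may differ. Distinct bits let the flags be OR'd together. */
enum { SK_NAMESPACE = 1 << 0, SK_QNAME = 1 << 1 };

/* Hypothetical helper: choose the name used for tagname matching.
 * Writes "prefix:local" into buf only when both flags are set and the
 * element carries a prefix; otherwise writes the local name alone. */
const char *
sk_tagname(int flags, const char *prefix, const char *local,
           char *buf, size_t bufsz)
{
    int want_qname = (flags & SK_NAMESPACE) && (flags & SK_QNAME);

    assert(local != NULL && buf != NULL && bufsz > 0);
    if (want_qname && prefix != NULL && prefix[0] != '\0')
        snprintf(buf, bufsz, "%s:%s", prefix, local);
    else
        snprintf(buf, bufsz, "%s", local);
    return buf;
}
```

With both flags set, a para element carrying the prefix book would be named book:para; with XPIN_QNAME absent, or with namespace processing off, it would be plain para.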
A XpinDoc parse may be configured to operate in one of two
modes, through the set_datamode() method:
int
my_main()
{
    int mode = XPIN_OUTPUT;
    // ...
    X->set_datamode(X, mode);
    // ...
}
The mode argument may take one of the following values:
- XPIN_CAPTURE
  Content is "captured" to Text object of current node
  (default)
- XPIN_OUTPUT
  Content is output through the current Writer object
The set_mydata() method may be used to pass any arbitrary
supplementary data to event handlers, where it may be retrieved by
the mydata() method:
void
my_handler(XpinDoc X)
{
    struct mydata *mydata;
    //...
    mydata = (struct mydata *)X->mydata(X);
    //...
}
int
my_main()
{
    struct mydata *mydata = NULL;
    // ...
    X->set_mydata(X, mydata);
    // ...
}
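The user-data mechanism is the usual C idiom of carrying an opaque pointer alongside a callback. The sketch below shows the pattern in isolation; struct sk_doc and the sk_ functions are hypothetical stand-ins for the XpinDoc object and its set_mydata()/mydata() methods, not the library's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the XpinDoc object: only the user-data slot. */
struct sk_doc {
    void *mydata;
};

void
sk_set_mydata(struct sk_doc *d, void *data)
{
    assert(d != NULL);
    d->mydata = data;
}

void *
sk_mydata(const struct sk_doc *d)
{
    assert(d != NULL);
    return d->mydata;
}

/* Application-defined state shared between handlers. */
struct counters {
    int titles;
};

/* A handler receives only the document object and recovers the
 * application state from it, so no global variables are needed. */
void
sk_count_title(struct sk_doc *d)
{
    struct counters *c = sk_mydata(d);

    if (c != NULL)          /* NULL: no user data has been installed */
        c->titles++;
}
```

Because the library never looks inside the pointer, the application is free to hang any structure off it, provided that structure outlives the parse.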
The heart of a XpinDoc application is the set_handler()
method:
void
my_handler(XpinDoc X)
{
    //...
}
int
my_main()
{
    // ...
    err = X->set_handler(X, keystr, &my_handler);
    // ...
}
Here the keystr argument is a NUL-terminated constant character
string taking one of the following forms:
- ""
  handler is installed for start element events with tagname
  matching tag
- ""
  handler is installed for end element events with tagname
  matching tag
- ""
  handler is installed for start and end element events with
  tagname matching tag
- "/tagpath"
  handler is installed for start and end element events with a
  tagpath matching "/tagpath"
- ""
  handler is installed for processing instruction event with
  target matching pi
Additionally, several default handlers may be installed by
specifying keystr as one of the following exact (case-insensitive)
strings:
- "Start_Document"
  handler is installed for the start of the document
- "End_Document"
  handler is installed for the end of the document
- "Start_Element"
  handler is installed as the default handler for start element
  events
- "End_Element"
  handler is installed as the default handler for end element
  events
- "NS_Start_Element"
  handler is installed as the default handler for start element
  events for elements having a non-null namespace URI (requires
  namespace processing enabled)
- "NS_End_Element"
  handler is installed as the default handler for end element
  events for elements having a non-null namespace URI (requires
  namespace processing enabled)
- "Processing_Instruction"
  handler is installed as the default handler for processing
  instruction events
- "Start_CDATA"
  handler is installed for the start of CDATA blocks
- "End_CDATA"
  handler is installed for the end of CDATA blocks
- "ERROR"
  handler is installed for error events raised by XpinDoc
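Since these key strings are matched exactly but without regard to case, a lookup over them might look like the sketch below. This is an illustration rather than XpinDoc's code: sk_default_slot() and its table are hypothetical, and strcasecmp() from POSIX <strings.h> is simply one convenient way to compare case-insensitively.

```c
#include <assert.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp() (POSIX) */

/* The default-handler keys listed above, in the same order. */
static const char *const sk_default_keys[] = {
    "Start_Document", "End_Document", "Start_Element", "End_Element",
    "NS_Start_Element", "NS_End_Element", "Processing_Instruction",
    "Start_CDATA", "End_CDATA", "ERROR"
};

/* Hypothetical helper: return the table index of keystr, compared
 * case-insensitively over the whole string, or -1 if it is not one of
 * the default-handler keys. */
int
sk_default_slot(const char *keystr)
{
    size_t n = sizeof sk_default_keys / sizeof sk_default_keys[0];

    assert(keystr != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcasecmp(keystr, sk_default_keys[i]) == 0)
            return (int)i;
    }
    return -1;
}
```

Under this scheme "start_document" and "START_DOCUMENT" select the same slot, while a partial key such as "Start_Doc" selects none.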
Handler Dispatch Logic
To develop a XpinDoc application, it is necessary to understand
the simple dispatch logic used in calling the installed handlers.
By way of illustration, the following (ugh!) ascii chart sketches
the flow of control that XpinDoc uses for processing a start
element event:
     Test           Description                    Action
     ---------      --------------------------     --------------------
 1.  NAMESPACE      is node namespace non-null
                    and handler installed ?        yes --> call handler --+
                            no                                            |
                            |                                             |
                            V                                             |
 2.  "/tagpath"     is handler installed for                              |
                    node matching tagpath                                 |
                    expression ?                                          |
                    (LIST search)                  yes --> call handler --+
                            no                                            |
                            |                                             |
                            V                                             |
 3.  ""             is handler installed for                              |
                    node matching tag ?                                   |
                    (HASH search)                  yes --> call handler --+
                            no                                            |
                            |                                             |
                            V                                             |
 4.  default        is a default handler                                  |
                    installed ?                    yes --> call handler --+
                            no                                            |
                            |                                             |
                            V                                             V
                       (do nothing) ---------------------------->  continue parse
A brief explanation and rationale for the dispatch logic:
At most one handler will be called for each element event.
(1.) If namespace-aware parsing is on, and if a
"NS_Start_Element" handler is defined, and if the current element
has a non-null namespace URI, the defined handler will be called
for the element. That is, as described in the following XpinDoc
snippet:
Xpin_Node N = X->node(X);
if((X->ns_parser(X) != 0) && (N->ns_uri(N) != NULL)){
    //...
}
This allows an application, if it chooses, to route all elements
that do not belong to the native namespace to a single special
handler.
(2.) If a handler is defined for a tagpath expression matching
the current element, this handler will be called for the element.
Tagpath expressions are able to specify an element's position and
relationship to other elements in a document more specifically than
the tagname. This allows fine-grained control to be applied before
more general control.
As an example, consider a handler installed for the tagpath
expression "/*/emph/emph", and another installed for the tagname
emph. The tagpath handler will be called for nested
emph elements, while the tagname handler will catch other
instances.
Note that tagpath handlers are installed in a list object, and
items are tested in the same order as they are inserted. The first
matching handler will be used for the element. This means that the
application should install more specific tagpath handlers before
less specific handlers. That is, a handler for "/*/item/list"
should be installed before "/*/list".
Note also that if namespace processing is enabled with usage
of qualified names (XPIN_NAMESPACE | XPIN_QNAME), the tagpath
expression for elements in a foreign namespace will include the
namespace prefix. The application may install handlers for elements
in a foreign namespace by specifying the prefix in the tagpath
expression, such as "/*/book:para" or "/*/groff:tbl", etc. All
elements with a particular namespace prefix may be handled by using
a wildcard tagpath expression such as "/*/db:*".
A handler installed with a tagpath expression of "/*" will act
as a default handler for any elements not previously matched. Note
that such a handler would effectively prevent any tagname handler
from being called.
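The effect of installation order can be seen with POSIX fnmatch() directly. This is a sketch under assumptions, not XpinDoc's implementation: the sk_ names are hypothetical, a tagpath is assumed to be the slash-joined element names from the document root (for example "/doc/item/list"), and fnmatch() is called with flags set to 0 so that "*" may span several path levels.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>

/* One installed tagpath handler, reduced to a pattern and an id. */
struct sk_rule {
    const char *pattern;
    int handler_id;
};

/* Hypothetical helper: scan the rules in insertion order and return the
 * id of the first one whose pattern matches tagpath, or -1 if none do. */
int
sk_first_match(const struct sk_rule *rules, size_t nrules,
               const char *tagpath)
{
    assert(rules != NULL && tagpath != NULL);
    for (size_t i = 0; i < nrules; i++) {
        /* flags == 0: '*' also matches '/' characters */
        if (fnmatch(rules[i].pattern, tagpath, 0) == 0)
            return rules[i].handler_id;
    }
    return -1;
}
```

With "/*/item/list" listed before "/*/list", the path "/doc/item/list" reaches the more specific rule. With the order reversed, "/*/list" matches first and the specific rule is never consulted. A trailing "/*" rule matches every path and therefore behaves as a catch-all.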
(3.) If a handler is defined for the element's tagname, this
handler will be called for the element. Tagname handlers are
installed in hash objects, so handlers may be installed in any
order.
Note also that if namespace processing is enabled with usage
of qualified names (XPIN_NAMESPACE | XPIN_QNAME), the tagname for
elements in a foreign namespace will include the namespace prefix.
The application may then install handlers for elements in a foreign
namespace by specifying the prefix in the tagname, that is, by
using a tagname of the form prefix:local_name.
(4.) Finally, if no handler for the element has yet been found,
and a default handler has been installed for "Start_Element", then
this handler will be used for the element. Otherwise the
application will not call any handler for the element, and parsing
will continue to the next event.
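The four tests can be condensed into one function that reports which kind of handler would be chosen for a start element. This is a simplified model for illustration only: the sk_ types are hypothetical, the tagname table is a plain array searched with strcmp() where XpinDoc uses a hash, and tagpath patterns are matched with fnmatch() using flags of 0 as an assumption.

```c
#include <assert.h>
#include <fnmatch.h>
#include <stddef.h>
#include <string.h>

enum sk_choice { SK_NONE, SK_NS_DEFAULT, SK_TAGPATH, SK_TAGNAME, SK_DEFAULT };

/* Hypothetical model of the handler registry for start element events. */
struct sk_registry {
    int ns_parser;                /* nonzero: namespace-aware parsing on */
    int has_ns_default;           /* "NS_Start_Element" handler installed */
    const char *const *tagpaths;  /* tagpath patterns, in insertion order */
    size_t ntagpaths;
    const char *const *tagnames;  /* exact tagnames with handlers */
    size_t ntagnames;
    int has_default;              /* "Start_Element" handler installed */
};

/* At most one choice is returned per event, tested in the order 1 to 4. */
enum sk_choice
sk_dispatch(const struct sk_registry *r, const char *ns_uri,
            const char *tagpath, const char *tagname)
{
    assert(r != NULL && tagpath != NULL && tagname != NULL);

    /* 1. namespace default handler, for nodes with a namespace URI */
    if (r->ns_parser && r->has_ns_default && ns_uri != NULL)
        return SK_NS_DEFAULT;

    /* 2. tagpath handlers: first matching pattern wins */
    for (size_t i = 0; i < r->ntagpaths; i++)
        if (fnmatch(r->tagpaths[i], tagpath, 0) == 0)
            return SK_TAGPATH;

    /* 3. tagname handlers: exact match */
    for (size_t i = 0; i < r->ntagnames; i++)
        if (strcmp(r->tagnames[i], tagname) == 0)
            return SK_TAGNAME;

    /* 4. default handler, if any; otherwise nothing is called */
    return r->has_default ? SK_DEFAULT : SK_NONE;
}
```

In this model, a registry holding the pattern "/*/emph/emph" and the tagname emph sends a nested emph to the tagpath handler and any other emph to the tagname handler, which is the behaviour described under (2.) above.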
XpinDoc Objects
A XpinDoc object provides:
- a parser for the document instance
- a registry for callback handlers
- access to objects generated during the parse
During the course of a parse, a XpinDoc application may access
one or more of the following objects:
- Xpin_Event
  An XML event and its associated data, such as:
  - start element
  - end element
  - processing instruction
  - etc.
- Xpin_Node
  Container object with access methods to the character data
  within an XML element, also providing:
  - tagname, and namespace information
  - XpinDoc "tagpath"
  - element attributes
  - element contents
- Xpin_PI
  An XML processing instruction.
Each of these objects is described in its own section below.
To be continued...
History
XpinDoc isn't particularly innovative or ground-breaking.
Historically, XpinDoc follows from an earlier SGML utility of mine
called "SpinDoc" implemented in Python. This, in turn, was
influenced primarily by David Megginson's SGMLS.pm library in Perl.
(Megginson's work, of course, went on to be highly influential in
the development of SAX.) XpinDoc has also been influenced by
instant/transpec, Cost, and other SGML/XML tools.
Conclusions
Please see the source distribution for additional documentation
and sample XpinDoc applications.