Showing posts with label XML. Show all posts

Test hello rapidxml

This header-only C++ library is a bit outdated, but it is still commonly used when the job has to be done fast and the requirements are not too sophisticated. You can get the source code from SourceForge, where you can also find a slim technical manual.

There are at least a couple of rapidxml characteristics you should be aware of before you start working with it.

Rapidxml parsing is destructive. The xml_document::parse() method takes as input a non-constant, null-terminated C-string, which it uses as its own internal buffer. The nodes of the document point into that buffer, so it also has to stay alive as long as the document is in use. If you want to keep your XML as it is, pass in a copy of it.
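One way to keep the original text intact is to hand the parser a private, writable copy. What follows is a minimal sketch that uses only the standard library: the helper name makeParseBuffer is made up for this example, and the rapidxml call is shown just as a comment, since it is not compiled here.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns a writable, null-terminated copy of xml.
// The copy is what a destructive parser is allowed to modify.
std::vector<char> makeParseBuffer(const std::string& xml)
{
    std::vector<char> buffer(xml.begin(), xml.end());
    buffer.push_back('\0'); // parse() expects a null-terminated string
    return buffer;
}

// Intended usage with rapidxml (not compiled here):
//   std::vector<char> buffer = makeParseBuffer(original);
//   rapidxml::xml_document<> doc;
//   doc.parse<0>(buffer.data()); // buffer has to outlive doc
```

A std::vector<char> is used instead of a std::string so that the writable storage and its terminator are explicit, but other choices would work as well.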

Preconditions are usually checked with assertions; exceptions are thrown from the xml_document::parse() method only. So be careful about what you pass to an asserting function (for instance, xml_node::last_node() requires the node to have at least one child: it asserts that its first_node is not NULL), and wrap the parse call in a try/catch.

I have written a test case (using the Google Test framework) that shows how to parse a simple XML and read the information in it. Notice that I just read a document, without performing any editing on it; this keeps the example simple enough.
#include "rapidxml/rapidxml.hpp"
#include <gtest/gtest.h>
#include <cstring>  // strcmp
#include <iostream> // std::cout

TEST(RapidXml, simple)
{
  char buffer[] = "<root><first>one</first><second>two</second><third>whatever</third></root>"; // 1

  rapidxml::xml_document<char> doc; // 2
  ASSERT_NO_THROW(doc.parse<0>(buffer)); // 3

  rapidxml::xml_node<char>* root = doc.first_node(); // 4
  ASSERT_TRUE(root);
  ASSERT_STREQ("root", root->name()); // 5

  bool fields[4] {}; // 6
  for(rapidxml::xml_node<char>* node = root->first_node(); node != NULL; node = node->next_sibling()) // 7
  {
    if(strcmp(node->name(), "first") == 0) // 8
    {
      ASSERT_STREQ("one", node->value());
      fields[0] = true;
    }
    else if(strcmp(node->name(), "second") == 0)
    {
      ASSERT_STREQ("two", node->value());
      fields[1] = true;
    }
    else if(strcmp(node->name(), "third") == 0) // 9
    {
      fields[2] = true;
    }
    else // 10
    {
      fields[3] = true; // unexpected!
      std::cout << "Unexpected node: " << node->name() << std::endl;
    }
  }

  EXPECT_TRUE(fields[0]); // 11
  EXPECT_TRUE(fields[1]);
  EXPECT_TRUE(fields[2]);
  EXPECT_FALSE(fields[3]);
}
1. Remember that rapidxml is going to change this C-string (NULL-terminated array of characters) for its own purposes.
2. The xml_document class template has a template parameter that defaults to char. If you want to save some typing you can rewrite this line without specifying the parameter, relying on the char default:
rapidxml::xml_document<> doc;
3. xml_document::parse() expects an int as template parameter; pass zero to get the default behavior. In your code you should try/catch this call for the rapidxml::parse_error exception (it extends std::exception). Here I assert that it should not throw.
4. xml_document IS-A xml_node, so I call on doc the xml_node::first_node() method to get the first document child. If doc has no child, first_node() returns a NULL pointer, otherwise we have a pointer to that node.
5. I expect the root to be there, so I assert that it is not zero (AKA false), then I get its name and I assert it is as expected. xml_node IS-A xml_base, where we can see that the name() method never returns NULL: if the node has no name, an empty C-string is returned instead.
6. Root has three children. I want to ensure I see all of them and nothing more. This bunch of booleans keeps track of them. They are all initialized to false (through the handy C++ empty list initializer) and then, in the following loop, when I see one of them I set the corresponding flag to true. There are four booleans, and not three, because I also want to flag the case of an unexpected child.
7. The for-loop is initialized by getting the first root child, then we get the next sibling, until we reach the end of the family (a NULL is returned). We should pay attention when using xml_node::next_sibling(), since it asserts when the current node has no parent. But here we call next_sibling() on a node that is surely a child of another node.
8. For the first and second nodes, we want to ensure that each has a specific value, hence the assertions.
9. The third node could have any value, I just set the flag when I see it.
10. In case an unexpected node is detected, I keep track of this anomaly by setting the corresponding flag.
11. Check if the expectations are confirmed.


From DOM to file by DOMLSSerializer

We have used XercesDOMParser to parse an XML into a DOM document, so that we could access and modify it programmatically.

If we want to go the other way round, getting a text representation of a DOM document, we can use DOMLSSerializer.

Actually, the code is not as simple as one could expect. I tried to keep it as short as I could, and here is the result:

void dumpDom(DOMNode* node) // 1.
{
  DOMImplementationLS* impl = (DOMImplementationLS*)
    DOMImplementationRegistry::getDOMImplementation(L"LS"); // 2.
  if(impl == 0)
    return;

  DOMLSSerializer* serializer = impl->createLSSerializer(); // 3.
  if(serializer == 0)
    return;

  StdOutFormatTarget ft; // 4.
  DOMLSOutput* output = impl->createLSOutput(); // 5.
  output->setByteStream(&ft); // 6.

  try {
    std::cout << "---" << std::endl;
    if(node)
      serializer->write(node, output); // 7.
    std::cout << std::endl << "---" << std::endl;
  }
  catch(const XMLException& xe) {
    std::wcout << "XML Exception: " << xe.getMessage() << std::endl;
  }
  catch(const DOMException& de) {
    std::wcout << "DOM Exception: " << de.getMessage() << std::endl;
  }
  catch (...) {
    std::cout << "Unexpected Exception" << std::endl;
  }

  output->release(); // 8.
  serializer->release();
}

1. As input parameter the function expects an XML DOM node that is dumped with all its children and grandchildren (and so on). If we pass to this function the document node, as returned by XercesDOMParser::getDocument(), for instance, the entire XML DOM document is printed.
2. We specify LS (Load and Save) as the feature that the DOM implementation we retrieve should support, and we use the returned value through its LS interface.
3. We ask the implementation for a serializer, the object in charge of converting the DOM to a human-readable form.
4. StdOutFormatTarget implements XMLFormatTarget for the standard output.
5. We should use a DOMLSOutput as output destination.
6. In this way we say that the output should go to the standard output.
7. The real job is done in this line (!)
8. We release both output and serializer, to indicate that they are not in use anymore.


SAX parsing of character content

We override DocumentHandler::characters() to let the Xerces SAX parser take some action with the character content of our XML document.

The basic idea is the same as in the previous posts, but we have to pay attention to the fact that there is no guarantee that a single call to characters() delivers the whole character content of an element. This implies that we have to add some logic to our handler class to implement it correctly.

As an example we use an XML document like this:

<?xml version="1.0" encoding="UTF-8"?>
<train>
<car type="Engine">
<color>Black</color>
<!-- ... more stuff here -->
</car>
<car type="Baggage">
<color>Green</color>
<weight>80 tons</weight>
<!-- ... more stuff here -->
</car>

<!-- ... more stuff here -->

<car type="Caboose">
<color>Red</color>
<!-- ... more stuff here -->
</car>
</train>

As a result we would like to have this output to the standard console:

Engine has color Black
Baggage has color Green
...
Caboose has color Red

To get this, we rewrite our SimpleHandler: we add a few private data members to keep track of the current element and its character content, we rewrite the startElement() method, and we add two new methods, characters() and endElement().

Here are the changes:

// ...

namespace
{
  const XMLCh* const ELEM_CAR = L"car";
  const XMLCh* const ATTR_TYPE = L"type";
  const XMLCh* const ELEM_COLOR = L"color";
}

class SimpleHandler : public HandlerBase
{
  // ...

private:
  bool isColor; // 1.
  std::wstring carType; // 2.
  std::wstring carColor; // 3.

public:
  SimpleHandler() : isColor(false) {} // 4.

  /**
   * override HandlerBase::startElement(name, attrs)
   */
  void startElement(const XMLCh* const name, AttributeList& attrs)
  {
    if(wcscmp(name, ELEM_CAR) == 0) // 5.
    {
      const XMLCh* const type = attrs.getValue(ATTR_TYPE);
      if(type != 0)
        carType = type;
    }
    else if(wcscmp(name, ELEM_COLOR) == 0) // 6.
    {
      isColor = true;
    }
  }

  /**
   * override HandlerBase::characters(buffer, size)
   */
  void characters(const XMLCh* const buffer, const XMLSize_t size)
  {
    // only the first size characters of buffer belong to this chunk
    if(isColor) // 7.
      carColor.append(buffer, size);
  }

  /**
   * override HandlerBase::endElement(name)
   */
  void endElement(const XMLCh* const name)
  {
    if(isColor) // 8.
    {
      std::wcout << carType.c_str() << " has color " << carColor.c_str() << std::endl;
      isColor = false;
      carColor.clear();
    }
  }
};

1. isColor is used to signal when the parser is working with a "color" element.
2. carType is the wide character string where we locally store the value for the "type" attribute for the current "car" element.
3. carColor is the wide character string for the current "color" character content.
4. Until we are told otherwise, we assume we are not in a "color" element.
5. If the starting element is a "car" we try to get the value of its "type" attribute. If we succeed, we store it in carType (2).
6. Otherwise, we check if the starting element is a "color".
7. We are in a "color" element, so we append the current chunk of characters to the carColor wide string.
8. The SAX parser is evaluating the end tag of a "color" element: we output the generated string containing its character content, and then clean the color local state.
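To see why the accumulation matters, here is a small stand-alone sketch that mimics the three callbacks without using Xerces at all. The ColorCollector class and its simplified signatures are assumptions made for this example only; the point is just that the text of one element may arrive in more than one chunk.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a SAX handler: same callback names, no Xerces involved.
class ColorCollector
{
public:
    void startElement(const std::wstring& name, const std::wstring& typeAttr = L"")
    {
        if (name == L"car")
            carType_ = typeAttr;
        else if (name == L"color")
            inColor_ = true;
    }

    // May be called several times for one element; only the first
    // size characters of chunk belong to the current call.
    void characters(const wchar_t* chunk, std::size_t size)
    {
        if (inColor_)
            carColor_.append(chunk, size);
    }

    void endElement(const std::wstring& name)
    {
        if (name == L"color" && inColor_)
        {
            lines_.push_back(carType_ + L" has color " + carColor_);
            inColor_ = false;
            carColor_.clear();
        }
    }

    const std::vector<std::wstring>& lines() const { return lines_; }

private:
    bool inColor_ = false;
    std::wstring carType_;
    std::wstring carColor_;
    std::vector<std::wstring> lines_;
};
```

Feeding it "Bl" and then "ack" between the start and the end of a "color" element gives a single line for the car, which is the behavior we want from the real handler too.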

More details on Xerces (but in its Java implementation) and SAX in chapter 12 of Beginning XML by David Hunter et al. (Wrox).


SAX parser and StartElement

Now we are about to modify our Xerces-C SAX parsing example to let the parser react to the start of a new element.

Given the way the SAXParser is designed, we just have to change our callback class that implements DocumentHandler (actually, we extend HandlerBase, which derives from it), overriding the startElement() method.

Our function must have this declaration:
void startElement(const XMLCh* const, AttributeList&)
where the first parameter is the element name, and the second is the list, possibly empty, of the associated attributes.

Here is the change in the code:

class SimpleHandler : public HandlerBase
{
public:
  // ...

  // ... we add this new function:

  /**
   * override HandlerBase::startElement(name, attrs)
   */
  void startElement(const XMLCh* const name, AttributeList& attrs)
  {
    if(wcscmp(name, L"car") == 0 && attrs.getLength() == 1) // 1.
    {
      std::wcout << "Start a car: " << attrs.getName(0) << // 2.
        " [" << attrs.getType(XMLSize_t(0)) << // 3.
        "] = \"" << attrs.getValue(XMLSize_t(0)) << '\"' << std::endl;
    }
    else
    { // 4.
      std::wcout << "Start element: " << name << std::endl;
      for(XMLSize_t i = 0; i < attrs.getLength(); ++i)
      {
        std::wcout << "Attribute " << attrs.getName(i) <<
          " [" << attrs.getType(i) <<
          "] = \"" << attrs.getValue(i) << '\"' << std::endl;
      }
    }
  }
};

1. This piece of code is executed only for specific elements: the ones named "car". Notice that here Xerces works with wide character strings, so we use wcscmp() instead of the plain strcmp(). In the second part of the condition we ensure that the element has one and only one attribute, using the AttributeList::getLength() method.
2. AttributeList::getName() returns the name of the attribute specified by index.
3. AttributeList::getType() and AttributeList::getValue() are a bit trickier, because both of them are overloaded, and can be called passing either the index or the name of the attribute. The nuisance is that we have to specify the type explicitly: we can't pass just the constant 0, otherwise the compiler wouldn't know if we mean it as a NULL pointer or as an index.
4. Generic case: we output the element name and all its attributes, if any.
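The ambiguity described in point 3 is plain C++ overload resolution, and it can be reproduced without Xerces. In this sketch lookup() is a made-up function with two overloads shaped like the index/name pair; it is not part of any library.

```cpp
#include <cstddef>
#include <string>

// Two overloads shaped like the index/name pair described above
// (lookup is a made-up function, not part of Xerces).
std::string lookup(std::size_t /*index*/) { return "by index"; }
std::string lookup(const wchar_t* /*name*/) { return "by name"; }

std::string lookupFirst()
{
    // lookup(0) would not compile: the literal 0 is an int, and converting it
    // to std::size_t or to a null pointer are equally good matches.
    return lookup(std::size_t(0)); // the explicit type selects the index overload
}
```

Calling lookup(L"type") picks the name overload with no help, because a wide string literal converts only to the pointer parameter.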

More details on SAX (referring to the Xerces-J implementation) in chapter 12 of Beginning XML by David Hunter et al. (Wrox).


SAX parsing with Xerces

Besides an XML DOM parser, Xerces makes a SAX parser available too.

Using the DOM parser we have access to the complete XML document, and we can navigate through it as we wish - so it is usually the best option for small XML documents and when we want to have full control over them.

Using the SAX parser, on the other hand, works nicely when the XML document is so big that it becomes impractical to use the DOM parser, which has to load all of it in memory before letting us do our job.

With SAX we create instead a class specifying the behavior we want when some specific event is generated by the parser, and we pass an instance of this class to the parser, letting it call back our methods.

Say that we just want to be notified when SAX finds the start and the end of the passed XML document. We create a class that extends HandlerBase:

#include <xercesc/sax/HandlerBase.hpp>
#include <iostream>

XERCES_CPP_NAMESPACE_USE

class SimpleHandler : public HandlerBase
{
public:
  /**
   * override HandlerBase::startDocument()
   */
  void startDocument()
  {
    std::cout << "Start document" << std::endl;
  }

  /**
   * override HandlerBase::endDocument()
   */
  void endDocument()
  {
    std::cout << "End document" << std::endl;
  }
};

And we use an instance of this class in a function like this:

#include <xercesc/parsers/SAXParser.hpp>

XERCES_CPP_NAMESPACE_USE

void saxParse(const XMLCh* filename) // 1.
{
  SAXParser parser;
  SimpleHandler handler;
  parser.setDocumentHandler(&handler); // 2.

  try {
    parser.parse(filename); // 3.
  }
  catch (const XMLException& xe) {
    std::wcerr << "XML Exception: " << xe.getMessage() << std::endl;
    return;
  }
  catch (const SAXParseException& se) {
    std::wcerr << "SAX Parse Exception: " << se.getMessage() << std::endl;
    return;
  }
  catch (...) {
    std::cerr << "Unexpected Exception" << std::endl;
    return;
  }
}

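1. As usual in these Xerces examples, we use wide character strings - on this platform (Windows / VC++) XMLCh is a synonym for wchar_t. On other platforms XMLCh is typically a different 16-bit character type, so there wide strings can't be passed directly.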
2. Here we pass our custom handler object to the parser - before actually parsing the file - so that it can call back our methods to perform the required functionality at the expected time.
3. Finally, we parse the XML file, ready to catch any possible exception, even the unexpected ones.

The required Xerces initialization and termination are shown in a previous post.

More details on SAX in chapter 12 of Beginning XML by David Hunter et al. (Wrox).


Root element name

As we have seen while doing the basic setup for Xerces, this framework is quite low level.

This impression is confirmed when we try to write a function that just gets the name of the root element from an XML file.

Here is the code of such a function, which should be called after Xerces has been initialized, and before it is terminated:

#include <iostream>

#include <xercesc/dom/DOM.hpp>
#include <xercesc/parsers/XercesDOMParser.hpp>
#include <xercesc/sax/SAXException.hpp>

XERCES_CPP_NAMESPACE_USE

void dumpRootName(const XMLCh* filename) // 1.
{
  try
  {
    XercesDOMParser parser; // 2.
    parser.parse(filename);
    if(parser.getErrorCount() != 0)
      std::wcerr << parser.getErrorCount() << " errors parsing " << filename << std::endl; // 3.
    else
    { // 4.
      // the document node can also have comments or a doctype as children,
      // so we ask for the document element instead of the first child
      DOMElement* root = parser.getDocument()->getDocumentElement();
      if(root != 0)
        std::wcout << "Root name: " << root->getNodeName() << std::endl;
    }
  }
  catch (const DOMException& e) // 5.
  {
    std::wcerr << "DOM Exception on " << filename << ": " << e.code << std::endl;
  }
  catch (const XMLException& e)
  {
    std::wcerr << "XML Exception on " << filename << ": " << e.getMessage() << std::endl;
  }
  catch (const SAXException& e)
  {
    std::wcerr << "SAX Exception on " << filename << ": " << e.getMessage() << std::endl;
  }
  catch (...)
  {
    std::wcerr << "Unexpected exception on " << filename << std::endl;
  }
}

1. On this platform XMLCh is actually a synonym of wchar_t, so the function has as input parameter a wide char pointer to the name of the file storing the XML we are interested in.
2. Actually we can't complain too much: Xerces provides us with a lot of useful classes. Among them, XercesDOMParser is a highly configurable DOM parser, which is what we need for our current task. Here we use it in its default configuration, which works just fine.
3. As we said, the filename is a wide char string, so we use a wide console stream to output it.
4. If no error has been detected, we extract the name from the root element. Since the parser succeeded, getDocument() should give back a valid XML document; its getDocumentElement() method returns the root element, skipping any comment or doctype node that could come before it.
5. XercesDOMParser could throw DOMException, XMLException, SAXException, and we should always be prepared to catch something unexpected.


RAII for Xerces

We have already seen how to install and run Apache Xerces-C 3 on Windows 7 / VC++ 2010.

Before we start using it, let's simplify our life a bit by creating a little class that spares us the bore of explicitly initializing and terminating Xerces.

The rationale behind it is that Xerces is a resource that has to be initialized before use and released at the end, so it makes perfect sense to apply the RAII (Resource Acquisition Is Initialization) idiom.

Besides, it is very easy to design and implement. It is just a matter of writing this tiny class:
#include <iostream>
#include <xercesc/util/PlatformUtils.hpp>
XERCES_CPP_NAMESPACE_USE

class XercesManager
{
public:
   XercesManager()
   {
      std::cout << "Initializing Xerces" << std::endl;
      XMLPlatformUtils::Initialize();
   }

   ~XercesManager()
   {
      std::cout << "Terminating Xerces" << std::endl;
      XMLPlatformUtils::Terminate();
   }
};
Given this wrapper class, our main becomes:
int main(int argc, char* argv[])
{
   try
   {
      XercesManager xm;
      someFunction(argc, argv); 
   }
   catch(const XMLException& ex)
   {
      std::wcout << "Failure on Xerces: " << ex.getMessage() << std::endl;
   }

   system("pause");
   return 0;
}
Where in someFunction() there will be the actual code requiring Xerces.

We put a XercesManager instance on the stack. Its construction triggers the Xerces initialization, and when we leave the scope (both in case of exception and of regular termination) the termination call is made by the destructor.
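The order of events can be checked with a stand-alone sketch in which the Xerces calls are replaced by logging. ScopedResource, runGuarded() and the global events vector are names invented for this example; nothing here comes from Xerces.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Records what happens, so the order of events can be inspected afterwards.
std::vector<std::string> events;

// Same shape as XercesManager, but with the Xerces calls replaced by logging.
class ScopedResource
{
public:
    ScopedResource() { events.push_back("initialize"); }
    ~ScopedResource() { events.push_back("terminate"); }

    // A guard like this should not be copied, or terminate would run twice.
    ScopedResource(const ScopedResource&) = delete;
    ScopedResource& operator=(const ScopedResource&) = delete;
};

void runGuarded(bool fail)
{
    try
    {
        ScopedResource guard;
        events.push_back("work");
        if (fail)
            throw std::runtime_error("failure while working");
    }
    catch (const std::runtime_error&)
    {
        // The guard has already been destroyed when we get here.
        events.push_back("caught");
    }
}
```

Notice that in the failing case "terminate" is recorded before "caught": the destructor runs during stack unwinding, before the handler is entered. It is worth keeping this in mind when the handler wants to use something that belongs to the released resource.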

And now we can focus on the real job.


XSLT in the mode for multiple usage

We are using XSLT and Saxon to generate an HTML file from an XML document.

Here the issue is that we want to process a node from the input XML more than once. Can we do that? Yes, if we specify a "mode" attribute in the xsl:apply-templates and xsl:template elements.

As an example let's consider this XML, which represents a minimal version of the interesting XML book I'm reading while writing these posts:

<?xml version="1.0" encoding="UTF-8"?>
<Book>
<Title>Beginning XML, 4th Edition</Title>
<Authors>
<Author>David Hunter</Author>
<Author>Danny Ayers</Author>
<Author>al</Author>
</Authors>
<Year>2007</Year>
<Chapters>
<Chapter number="1" title="What is XML?">
XML is a markup language, derived from SGML.</Chapter>
<Chapter number="2" title="Well-formed XML">
To be well-formed an XML document must satisfy several rules about its
structure.</Chapter>
<Chapter number="3" title="Namespaces">
To help unambiguously identify the names of elements and attributes the
notion of an XML namespace is used.</Chapter>
<Chapter number="4" title="DTD">
A document type definition, DTD, is a way to specify the permitted
structure of an XML document.</Chapter>
<Chapter number="5" title="Schemas">
W3C XML Schema and Relax NG are two schema languages to specify the
structure of XML documents.</Chapter>
</Chapters>
</Book>

This is the HTML we would like to get from the input XML:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Beginning XML, 4th Edition</title>
</head>
<body>
<h3>Beginning XML, 4th Edition</h3>
<p>by David Hunter, Danny Ayers, &amp; al.</p>

<h3>Table of Contents</h3>
<p><b>1:</b>What is XML?</p>
<p><b>2:</b>Well-formed XML</p>
<p><b>3:</b>Namespaces</p>
<p><b>4:</b>DTD</p>
<p><b>5:</b>Schemas</p>

<h3>1. What is XML?</h3>
<p>XML is a markup language, derived from SGML.</p>

<h3>2. Well-formed XML</h3>
<p>To be well-formed an XML document must satisfy several
rules about its structure.</p>

<h3>3. Namespaces</h3>
<p>To help unambiguously identify the names of elements and
attributes the notion of an XML namespace is used.</p>

<h3>4. DTD</h3>
<p>A document type definition, DTD, is a way to specify
the permitted structure of an XML document.</p>

<h3>5. Schemas</h3>
<p>W3C XML Schema and Relax NG are two schema languages to
specify the structure of XML documents.</p>
</body>
</html>

The solution requires us to use an XSLT like this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<html>
<head>
<title><xsl:value-of select="/Book/Title"/></title>
</head>
<body>
<h3><xsl:value-of select="/Book/Title"/></h3>
<p>by <xsl:apply-templates select="/Book/Authors/Author"/></p> <!-- 1 -->
<h3>Table of Contents</h3>
<xsl:apply-templates select="/Book/Chapters/Chapter" mode="TOC"/> <!-- 2 -->
<xsl:apply-templates select="/Book/Chapters/Chapter" mode="full"/> <!-- 3 -->
</body>
</html>
</xsl:template>

<xsl:template match="Author">
<xsl:value-of select="."/>
<xsl:if test="position() != last()"> <!-- A -->
<xsl:text>, </xsl:text>
</xsl:if>
<xsl:if test="position() = last()-1"> <!-- B -->
<xsl:text>&amp; </xsl:text>
</xsl:if>
<xsl:if test="position() = last()"> <!-- C -->
<xsl:text>.</xsl:text>
</xsl:if>
</xsl:template>

<xsl:template match="Chapter" mode="TOC"> <!-- D -->
<p>
<b><xsl:value-of select="@number"/>:</b>
<xsl:value-of select="@title"/>
</p>
</xsl:template>

<xsl:template match="Chapter" mode="full"> <!-- E -->
<h3><xsl:value-of select="@number"/>. <xsl:value-of select="@title"/></h3>
<p><xsl:value-of select="."/></p>
</xsl:template>

</xsl:stylesheet>

The template matching the root node has three xsl:apply-templates elements:
1. It is not relevant for the mode attribute usage, but it is quite interesting for the usage of xsl:if elements combined with XPath functions. In (A) we see that a comma-blank string is put in the output document if the current element passed to the template has a position() different from last(). In (B) we check if the current position() is last() - 1; if so we also put an ampersand. And finally (C), if this is the last() element, we close the sentence with a full stop.
2. This apply-templates specifies a select and a mode attribute; the template applied here is the one matching both of them: (D). This is the template for the table of contents, which makes use of the number and title attributes of the passed Chapter element.
3. We use the same select attribute to choose the template, but this time the mode is "full", so we pick up the (E) template, in which the child text of the current element is used too.
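The three tests in the Author template can be read as ordinary conditions on a 1-based position. As a rough cross-check, this is how the same punctuation logic could look in C++; formatAuthors() is a name made up for this sketch, and it works on plain strings, not on XML.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mirrors the three xsl:if tests; position is 1-based, as in XPath.
std::string formatAuthors(const std::vector<std::string>& authors)
{
    std::string out;
    const std::size_t last = authors.size();
    for (std::size_t position = 1; position <= last; ++position)
    {
        out += authors[position - 1];
        if (position != last)       // (A)
            out += ", ";
        if (position + 1 == last)   // (B): position() = last() - 1
            out += "& ";
        if (position == last)       // (C)
            out += ".";
    }
    return out;
}
```

With the three authors of the example the result is "David Hunter, Danny Ayers, & al.", the same text we expect in the generated paragraph (there the ampersand is escaped, as HTML requires).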

More information on XSLT and Saxon in chapter eight of Beginning XML by David Hunter et al. (Wrox).


Sorting in XSLT

In XSLT we can sort a list of elements with the xsl:sort element.

The starting point is this XML file:

<?xml version="1.0" encoding="UTF-8"?>
<Objects>
<Object name="Car">
<Characteristic>Hard</Characteristic>
<Characteristic>Shiny</Characteristic>
<Characteristic>Has 4 wheels</Characteristic>
<Characteristic>Internal Combustion Engine</Characteristic>
</Object>
<Object name="Orange">
<Characteristic>Fruit</Characteristic>
<Characteristic>Juicy</Characteristic>
<Characteristic>Dimpled skin</Characteristic>
<Characteristic>Citrus</Characteristic>
</Object>
<Object name="Giraffe">
<Characteristic>Tall</Characteristic>
<Characteristic>Four legs</Characteristic>
<Characteristic>Big spots</Characteristic>
<Characteristic>Mammal</Characteristic>
</Object>
<Object name="Prawn Cracker">
<Characteristic>Crisp</Characteristic>
<Characteristic>Savoury</Characteristic>
<Characteristic>Off white</Characteristic>
<Characteristic>Edible</Characteristic>
</Object>
</Objects>

We want to generate an HTML file where the Object elements are ordered by ascending name, and where each object has its Characteristic elements ordered by content in descending alphabetical order:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Object Characteristics</title>
</head>
<body>
<h3>Characteristics of Car</h3>
<ul>
<li>Shiny</li>
<li>Internal Combustion Engine</li>
<li>Has 4 wheels</li>
<li>Hard</li>
</ul>
<h3>Characteristics of Giraffe</h3>
<ul>
<li>Tall</li>
<li>Mammal</li>
<li>Four legs</li>
<li>Big spots</li>
</ul>
<h3>Characteristics of Orange</h3>
<ul>
<li>Juicy</li>
<li>Fruit</li>
<li>Dimpled skin</li>
<li>Citrus</li>
</ul>
<h3>Characteristics of Prawn Cracker</h3>
<ul>
<li>Savoury</li>
<li>Off white</li>
<li>Edible</li>
<li>Crisp</li>
</ul>
</body>
</html>

Here is the XSLT that we use:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<html>
<head>
<title>Object Characteristics</title>
</head>
<body>
<xsl:apply-templates select="/Objects/Object">
<xsl:sort select="@name"/> <!-- 1 -->
</xsl:apply-templates>
</body>
</html>
</xsl:template>

<xsl:template match="Object">
<h3>Characteristics of <xsl:value-of select="@name"/></h3>
<ul>
<xsl:for-each select="Characteristic">
<xsl:sort select="." order="descending"/> <!-- 2 -->
<li><xsl:value-of select="."/></li>
</xsl:for-each>
</ul>
</xsl:template>

</xsl:stylesheet>

1. This time apply-templates is not an empty element, as it usually is: it has a child xsl:sort element whose select attribute specifies the value used for sorting. In this case we use the name attribute of the Object element. The xsl:sort element has another attribute, order, which defaults to ascending. Since we actually want the Object elements sorted by ascending name, we can leave it out.
2. Inside xsl:template we use xsl:sort again, here as a child of an xsl:for-each element. We specify that we work with the current value (using the dot notation) and that the order is descending.
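For comparison, the same two-level ordering could be sketched in C++ with std::sort. The Object struct and sortObjects() are invented for this example, and the comparison is a plain byte-wise one, which is only an approximation of what an XSLT processor does with its collation rules.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

struct Object
{
    std::string name;
    std::vector<std::string> characteristics;
};

// Objects ascending by name, each characteristic list descending.
void sortObjects(std::vector<Object>& objects)
{
    std::sort(objects.begin(), objects.end(),
              [](const Object& a, const Object& b) { return a.name < b.name; });
    for (Object& object : objects)
        std::sort(object.characteristics.begin(), object.characteristics.end(),
                  std::greater<std::string>());
}
```

The outer sort plays the role of the xsl:sort under apply-templates, the inner one the role of the xsl:sort under for-each.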

More information on XSLT and Saxon in chapter eight of Beginning XML by David Hunter et al. (Wrox).


Looping in XSLT

In XSLT we can implement a loop using an xsl:for-each element.

As an example, consider this XML, that should be used for an electronic version of a "guess what I am" game:

<?xml version="1.0" encoding="UTF-8"?>
<Objects>
<Object name="Car">
<Characteristic>Hard</Characteristic>
<Characteristic>Shiny</Characteristic>
<Characteristic>Has 4 wheels</Characteristic>
<Characteristic>Internal Combustion Engine</Characteristic>
</Object>
</Objects>

From the given XML we want to generate this HTML:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Object Characteristics</title>
</head>
<body>
<h3>Characteristics of Car</h3>
<ul>
<li>Hard</li>
<li>Shiny</li>
<li>Has 4 wheels</li>
<li>Internal Combustion Engine</li>
</ul>
</body>
</html>

To do that, in our XSLT we pass the Object element to a template, and there we iterate over all its Characteristic elements:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<html>
<head>
<title>Object Characteristics</title>
</head>
<body>
<h3>Characteristics of <xsl:value-of select="Objects/Object/@name"/></h3>
<xsl:apply-templates select="/Objects/Object"/>
</body>
</html>
</xsl:template>

<xsl:template match="Object">
<ul>
<xsl:for-each select="Characteristic">
<li>
<xsl:value-of select="."/>
</li>
</xsl:for-each>
</ul>
</xsl:template>

</xsl:stylesheet>

More information on XSLT and Saxon in chapter eight of Beginning XML by David Hunter et al. (Wrox).


XSLT conditional processing

If we want to process an element only when a condition is satisfied, we can use the xsl:if tag. If we need something more sophisticated, we have a more complex alternative, based on the xsl:choose tag, similar to the C switch construct.

Let's show the usage of both of them with a couple of examples. As a base for them we'll use this XML:

<?xml version="1.0" encoding="UTF-8"?>
<Characters>
<Character age="99">Julius Caesar</Character>
<Character age="23">Anne Boleyn</Character>
<Character age="41">George Washington</Character>
<Character age="45">Martin Luther</Character>
<Character age="800">Methuselah</Character>
<Character age="119">Moses</Character>
<Character age="50">Asterix the Gaul</Character>
</Characters>

A list of characters, each of them having an attribute, age, that we want to use as a selector for taking a decision.

xsl:if

Here we want to generate an HTML file containing all the Character elements having a suspicious age, more than 110. Like this:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Age check on Characters</title>
</head>
<body>
<h3>The recorded age is unusually high.</h3>
<p><b>Methuselah</b> is older than expected.
Please check if <b>800</b> is correct.
</p>

<p><b>Moses</b> is older than expected.
Please check if <b>119</b> is correct.
</p>
</body>
</html>

This is an XSLT that we could use for that:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:output method="html"/>
<xsl:template match="/">
<html>
<head>
<title>Age check on Characters</title>
</head>
<body>
<h3>The recorded age is unusually high.</h3>
<xsl:apply-templates select="/Characters/Character"/>
</body>
</html>
</xsl:template>

<xsl:template match="Character">
<xsl:if test="@age &gt; 110 ">
<p><b><xsl:value-of select="."/></b> is older than expected.
Please check if <b><xsl:value-of select="@age"/></b> is correct.</p>
</xsl:if>
</xsl:template>

</xsl:stylesheet>

In the "test" attribute of the xsl:if tag we check a condition. When it is true, the content of the tag is put in the resulting document. Notice that we refer to age as an attribute of the current element (using @), and that the greater-than symbol is written as the &gt; entity. A literal '>' is actually legal inside an XML attribute value, while '<' must always be escaped as &lt;, so escaping both is a common habit.

xsl:choose

Now we want to generate a slightly more complex HTML, something like this:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Age check on all Characters</title>
</head>
<body>
<h3>The following is the assessment of the age data.</h3>
<p><b>Julius Caesar</b> - ok.</p>
<p><b>Anne Boleyn</b> - ok.</p>
<p><b>George Washington</b> - ok.</p>
<p><b>Martin Luther</b> - ok.</p>
<p><b>Methuselah</b> - please check if <b>800</b> is the correct age.</p>
<p><b>Moses</b> - please check if <b>119</b> is the correct age.</p>
<p><b>Asterix the Gaul</b> - ok.</p>
</body>
</html>

So, if the age is suspicious, we generate an alert as before, otherwise we tell the user that the character looks good to us.

This is the XSLT I have used to generate that document:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<html>
<head>
<title>Age check on all Characters</title>
</head>
<body>
<h3>The following is the assessment of the age data.</h3>
<xsl:apply-templates select="/Characters/Character"/>
</body>
</html>
</xsl:template>

<xsl:template match="Character">
<xsl:choose>
<xsl:when test="@age &gt; 110 ">
<p><b><xsl:value-of select="."/></b> - please check if
<b><xsl:value-of select="@age"/></b> is the correct age.</p>
</xsl:when>
<xsl:otherwise>
<p><b><xsl:value-of select="."/></b> - ok.</p>
</xsl:otherwise>
</xsl:choose>
</xsl:template>

</xsl:stylesheet>

The test is performed exactly as in the xsl:if, but the structure is a bit more complex. We have an xsl:choose tag that includes an xsl:when, working like the xsl:if, but used in conjunction with an xsl:otherwise that specifies the alternative path.
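If you are more at ease with C++ than with XSLT, the difference between the two constructs can be pictured with this sketch. The function names and the kSuspiciousAge constant are made up for the example, and the markup is left out: only the text of each line is produced.

```cpp
#include <string>

// The age limit used by both stylesheets.
const int kSuspiciousAge = 110;

// Like xsl:if: produces text only when the test holds, otherwise nothing.
std::string reportIfSuspicious(const std::string& name, int age)
{
    if (age > kSuspiciousAge)
        return name + " is older than expected. Please check if "
             + std::to_string(age) + " is correct.";
    return "";
}

// Like xsl:choose with xsl:when and xsl:otherwise: always produces text.
std::string assess(const std::string& name, int age)
{
    if (age > kSuspiciousAge)   // xsl:when
        return name + " - please check if " + std::to_string(age)
             + " is the correct age.";
    return name + " - ok.";     // xsl:otherwise
}
```

An xsl:choose can hold more than one xsl:when, which would correspond to a chain of else-if branches before the final fallback.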

More information on XSLT and Saxon in chapter eight of Beginning XML by David Hunter et al. (Wrox).


Deep copying with xsl:copy-of

Still talking about XSLT with Saxon, we have just seen how to use xsl:copy, which we can think of as a sort of shallow copy. Now we'll look at xsl:copy-of, which could be described as a deep copy mechanism: we use it to copy a full element, its attributes and children included, from input to output.

An example of usage for xsl:copy-of could be this one. We have an XML with information about a purchase order:

<?xml version="1.0" encoding="UTF-8"?>
<Order>
<From>ThisStuff</From>
<To>Buyer</To>
<DeliveryAddress>
<Street>Street</Street>
<City>City</City>
<State>State</State>
<ZipCode>12345</ZipCode>
<!-- more delivery information to be added -->
</DeliveryAddress>
<!-- other stuff here -->
</Order>

We want to generate another XML for the invoice. There, the root element is called Invoice, the From and To elements are swapped, and the DeliveryAddress is kept unchanged:

<?xml version="1.0" encoding="UTF-8"?>
<Invoice>
<From>Buyer</From>
<To>ThisStuff</To>
<DeliveryAddress>
<Street>Street</Street>
<City>City</City>
<State>State</State>
<ZipCode>12345</ZipCode>
<!-- more delivery information to be added -->
</DeliveryAddress>
<!--more invoice information to be added.-->
</Invoice>

To get such a result, we could use this transformation:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<Invoice>
<xsl:apply-templates select="/Order/To" />
<xsl:apply-templates select="/Order/From" />
<xsl:apply-templates select="/Order/DeliveryAddress" />
<xsl:comment>more invoice information to be added.</xsl:comment> <!-- 1 -->
</Invoice>
</xsl:template>

<xsl:template match="To">
<xsl:element name="From">
<xsl:value-of select="."/> <!-- 2 -->
</xsl:element>
</xsl:template>

<xsl:template match="From">
<xsl:element name="To">
<xsl:value-of select="."/>
</xsl:element>
</xsl:template>

<xsl:template match="DeliveryAddress">
<xsl:copy-of select="."/> <!-- 3 -->
</xsl:template>

</xsl:stylesheet>

1. That's how we insert a comment in the output XML, using the xsl:comment element.
2. A select on "." (dot) returns the current element.
3. Here we are using copy-of, so the complete element in input is copied to the output XML. Notice that even the comment is copied.
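
To see the difference with the shallow copy, here is a sketch of what the DeliveryAddress template would look like using xsl:copy instead. This variant is given only for comparison, it is not part of the transformation above:

```xml
<!-- shallow copy: only the DeliveryAddress element itself is copied -->
<xsl:template match="DeliveryAddress">
  <xsl:copy/>
</xsl:template>
```

With this template the output would contain an empty DeliveryAddress element, without Street, City, State, ZipCode, and without the comment.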

More information on XSLT and Saxon in chapter eight of Beginning XML by David Hunter et al. (Wrox).


From attributes to child elements

Another context in which it makes sense to use the xsl:copy element (I'm in XSLT with Saxon mode, if you wonder) is where you want to do just the opposite of the previous post. There we transformed child elements into attributes, here we are about to create child elements from attributes.

What we want to do now, actually, is reverse the transformation we have done in the previous post. Now we get in input what was the expected output, where the Person elements have two attributes and no child nodes, and we want in output an XML where the Person elements have no attributes but child nodes instead.

Here is the transformation we are going to apply:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<People>
<xsl:apply-templates select="/People/Person"/>
</People>
</xsl:template>

<xsl:template match="Person">
<xsl:copy> <!-- 1 -->
<xsl:element name="FirstName"> <!-- 2 -->
<xsl:value-of select="@FirstName"/> <!-- 3 -->
</xsl:element>
<xsl:element name="LastName">
<xsl:value-of select="@LastName"/>
</xsl:element>
</xsl:copy>
</xsl:template>

</xsl:stylesheet>

1. Again, we are generating a stripped-down copy of the current input node.
2. As a child node to the newly generated Person, we create a FirstName element.
3. And as value we use the attribute (notice the @ sign) FirstName of the input Person element.
And then the same for the LastName.
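
If Person had many attributes, listing them one by one would be tedious. A possible generalization, given here as a sketch of mine and not as the approach of the book, loops over all the attributes and uses name() inside an attribute value template to name each new element:

```xml
<xsl:template match="Person">
  <xsl:copy>
    <!-- @* selects every attribute of the current Person -->
    <xsl:for-each select="@*">
      <!-- {name()} is evaluated to the attribute name, e.g. FirstName -->
      <xsl:element name="{name()}">
        <xsl:value-of select="."/>
      </xsl:element>
    </xsl:for-each>
  </xsl:copy>
</xsl:template>
```

The order in which the attributes are visited is not guaranteed, so the child elements may not come out in the same order in which the attributes were written.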

More information on XSLT and Saxon in chapter eight of Beginning XML by David Hunter et al. (Wrox).


From child elements to attributes

Still working on XSLT with Saxon, here we'll have a look at the xsl:copy element, which copies just a bare node from the input XML to the resulting file, without its descendant nodes or attributes, if any.

A usage for such a stripped-down copy is shown here, where we want to transform an XML turning the child nodes into attributes. This is the input:

<?xml version="1.0" encoding="UTF-8"?>
<People>
<Person>
<FirstName>Tom</FirstName>
<LastName>Smith</LastName>
</Person>
<Person>
<FirstName>Bill</FirstName>
<LastName>Krill</LastName>
</Person>
<Person>
<FirstName>Phil</FirstName>
<LastName>Delphi</LastName>
</Person>
</People>

We want to get this as output:

<?xml version="1.0" encoding="UTF-8"?>
<People>
<Person FirstName="Tom" LastName="Smith"/>
<Person FirstName="Bill" LastName="Krill"/>
<Person FirstName="Phil" LastName="Delphi"/>
</People>

Here is the XSLT we use:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

<xsl:template match="/">
<People>
<xsl:apply-templates select="/People/Person"/> <!-- 1 -->
</People>
</xsl:template>

<xsl:template match="Person"> <!-- 2 -->
<xsl:copy> <!-- 3 -->
<xsl:attribute name="FirstName"> <!-- 4 -->
<xsl:value-of select="FirstName"/> <!-- 5 -->
</xsl:attribute>
<xsl:attribute name="LastName">
<xsl:value-of select="LastName"/>
</xsl:attribute>
</xsl:copy>
</xsl:template>

</xsl:stylesheet>

1. we use apply-templates on each person
2. here is the called template relative to (1)
3. copy the current element (a Person, actually)
4. generate a FirstName attribute in the copied (stripped down) node
5. set as value-of the output FirstName attribute the value of the input FirstName child
And then the same for a second attribute.
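
When the element and attribute names are known in advance, the same result can also be written with a literal result element and attribute value templates (the curly brackets). This is an alternative sketch, not the xsl:copy approach discussed in this post:

```xml
<xsl:template match="Person">
  <!-- {FirstName} and {LastName} are replaced by the string value
       of the corresponding child elements of the current Person -->
  <Person FirstName="{FirstName}" LastName="{LastName}"/>
</xsl:template>
```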

And here is how I called Saxon:
java -jar c:\dev\saxon\saxon9he.jar in.xml change.xslt -o:out.xml

More information on XSLT and Saxon in chapter eight of Beginning XML by David Hunter et al. (Wrox).


XSLT with Saxon

The Home Edition of Saxon is freely available on SourceForge for Java and .NET, and the Java version is my choice.

A typical usage of XSLT is transforming an internal XML data representation to a format more suitable for a client application.

Let's see an example. Say that we have an XML like this:
<?xml version="1.0" encoding="UTF-8"?>
<People>
<Person>
<Name>Winston Churchill</Name>
<Description>Winston Churchill was a mid 20th century British politician who
became famous as Prime Minister during the Second World War.</Description>
</Person>
<Person>
<Name>Indira Gandhi</Name>
<Description>Indira Gandhi was India’s first female prime minister
and was assassinated in 1984.</Description>
</Person>
<Person>
<Name>John F. Kennedy</Name>
<Description>JFK, as he was affectionately known, was a United States
president who was assassinated in Dallas, Texas.</Description>
</Person>
</People>

The root element, People, contains a list of Person elements, each of them having a Name and a Description.

We want to transform it to HTML, so as to show it to the user as a web page that should look like this:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Information about 3 people.</title>
</head>
<body>
<h3>Winston Churchill</h3>
<p>Winston Churchill was a mid 20th century British politician who
became famous as Prime Minister during the Second World War.
</p>
<h3>Indira Gandhi</h3>
<p>Indira Gandhi was India’s first female prime minister
and was assassinated in 1984.
</p>
<h3>John F. Kennedy</h3>
<p>JFK, as he was affectionately known, was a United States
president who was assassinated in Dallas, Texas.
</p>
</body>
</html>

That's what XSLT is good at. What we have to do is write a stylesheet saying how our XSLT processor has to generate the output. In our case, we want something like:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<html>
<head>
<title>Information about <xsl:value-of select="count(/People/Person)"/> people.</title>
</head>
<body>
<xsl:apply-templates select="/People/Person"/>
</body>
</html>
</xsl:template>

<xsl:template match="Person">
<h3><xsl:value-of select="Name"/></h3>
<p><xsl:value-of select="Description"/></p>
</xsl:template>
</xsl:stylesheet>

Line 1: This document is an XML.
Line 2: Actually, it is an XSLT stylesheet, version 2.0. The namespace URI dates back to 1999, when XSLT 1.0 was published by the W3C, and it has been kept unchanged for version 2.0.
Line 3: We ask the processor to start from the root of the document (the "/" path).
Line 6: We call the function count() specifying the path of the objects that we want to count, in this case the Persons under People, and we put the expression in the select attribute of the xsl element value-of. That's how we get the resulting value in the generated document.
Line 9: Sort of calling a function or, better, applying the template that is defined below, passing as parameter each /People/Person element found.
Line 14: Here is the template that we "call" from line 9. As input we expect a Person, and we use its Name and Description, putting them in the select attribute of an xsl value-of element: the former in an HTML third-level header tag, the latter in a paragraph.
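
The expected output above contains a meta tag that is not in the stylesheet. It comes from the HTML output method, which the processor selects by default when the first element written is html. As a sketch, the choice can be made explicit by adding an xsl:output element as a child of xsl:stylesheet (the encoding and indent attributes are my additions, they are not required):

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- explicitly ask for HTML serialization instead of relying on the default -->
  <xsl:output method="html" encoding="UTF-8" indent="yes"/>

  <!-- the two templates shown above go here, unchanged -->
</xsl:stylesheet>
```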

Having the input XML and the XSLT transformation, we just have to call Saxon from the command line to do the magic:
java -jar c:\dev\saxon\saxon9he.jar input.xml example.xslt -o:out.html
To make it run correctly, the java\bin folder should be in your system path, and you should explicitly give the path to the Saxon executable JAR (putting it in the CLASSPATH does not work, since the class path is ignored when java is run with -jar). The input.xml and the example.xslt, as described above, are supposed to be in the current folder, and the result is sent to out.html, again in the current folder.

Remember to put a colon after the -o, otherwise you get the puzzling error message "Command line option -o requires a value".

More information on XSLT and Saxon in chapter eight of Beginning XML by David Hunter et al. (Wrox).


XSD global and local elements

In an XML Schema we can specify an element by creating a local type, by using a global type, or by referencing an already existing global element.

The elements declared as direct children of the "schema" element are implicitly global. All the elements declared anywhere else are local, and can be used only in that context.

In the first XSD we have written we have a global element, name, and three local ones, children of name.

We can make all the elements global, and use the required ones in "name" by reference:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:target="ThisThreadXMLSchema"
targetNamespace="ThisThreadXMLSchema"
elementFormDefault="qualified">
<element name="first" type="string"/>
<element name="middle" type="string"/>
<element name="last" type="string"/>
<element name="name">
<complexType>
<sequence>
<element ref="target:first"/>
<element ref="target:middle"/>
<element ref="target:last"/>
</sequence>
<attribute name="title" type="string"/>
</complexType>
</element>
</schema>
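
A side effect of making every element global is that each of them can also be used as the root of a valid document. For instance, this minimal document, an example of mine, would be accepted by the schema above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- "first" is a global element, so it is allowed as document root -->
<first xmlns="ThisThreadXMLSchema">John</first>
```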

We can extract the complex type from the "name" element definition, put it under "schema", and then use it as "type" for the "name" element definition:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:target="ThisThreadXMLSchema"
targetNamespace="ThisThreadXMLSchema"
elementFormDefault="qualified">
<element name="first" type="string"/>
<element name="middle" type="string"/>
<element name="last" type="string"/>
<complexType name="NameType">
<sequence>
<element ref="target:first"/>
<element ref="target:middle"/>
<element ref="target:last"/>
</sequence>
<attribute name="title" type="string"/>
</complexType>
<element name="name" type="target:NameType"/>
</schema>
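
Once the type has a name, other elements can share it. As a sketch, assuming we also need a hypothetical "alias" element with the same structure as "name", a single extra line under "schema" would be enough:

```xml
<!-- hypothetical second global element reusing the same named type -->
<element name="alias" type="target:NameType"/>
```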

More information on XML Schema in chapter five of Beginning XML by David Hunter et al. (Wrox).


XML Schema

Instead of using DTD we can use an XML Schema to validate an XML document.

DTD is still popular, and the best choice in some cases, but, as we are going to see, XML Schema has a few relevant advantages that make it worth using.

An XML Schema is not embedded in the XML document, but stored in a standalone file, usually identified by the xsd extension.

Here is an XSD that defines an XML with a root element, "name", having an attribute, "title", and a sequence of three contained elements:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
xmlns:target="ThisThreadXMLSchema"
targetNamespace="ThisThreadXMLSchema"
elementFormDefault="qualified">
<element name="name">
<complexType>
<sequence>
<element name="first" type="string"/>
<element name="middle" type="string"/>
<element name="last" type="string"/>
</sequence>
<attribute name="title" type="string"/>
</complexType>
</element>
</schema>

The attribute of the root element is declared after the sequence of elements in it. It is not mandatory, so it may or may not be present in our XML.
On the other hand, first, middle and last are in a sequence, so all three of them must be in the XML, in the specified order.

Here is the example of a valid XML for the specified XSD (saved in the same directory with name example.xsd):

<?xml version="1.0" encoding="UTF-8"?>
<name
xmlns="ThisThreadXMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="ThisThreadXMLSchema example.xsd"
title="Mr.">
<first>John</first>
<middle>Fearless</middle>
<last>Smith</last>
</name>
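
If the constraints should be different, they can be adjusted on the single declarations. The following variant of the "name" element is a sketch of mine, not part of the original example: it makes the middle name optional and the title mandatory.

```xml
<element name="name">
  <complexType>
    <sequence>
      <element name="first" type="string"/>
      <!-- minOccurs="0": middle can be omitted -->
      <element name="middle" type="string" minOccurs="0"/>
      <element name="last" type="string"/>
    </sequence>
    <!-- use="required": a name without title is not valid anymore -->
    <attribute name="title" type="string" use="required"/>
  </complexType>
</element>
```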

More information on XML Schema in chapter five of Beginning XML by David Hunter et al. (Wrox).


The element's attributes

Defining an attribute in DTD is similar to what we have done for the elements.

Naturally, we define attributes for elements. So, after saying that we are talking about attributes, using the keyword ATTLIST, we specify the related element, then the name we want to give to the attribute, its type, and a constraint on its value.

Here is an example:

<!ELEMENT reference (#PCDATA)>
<!ATTLIST reference source CDATA #REQUIRED>

We defined source as an attribute for reference, specifying that it is just a character string, and that it is mandatory.
Now, in an XML defined according to a DTD including these lines, a reference should look like this:
<reference source="12">42</reference>
If the attribute is not mandatory, we declare it as #IMPLIED.

If an attribute is used as an identifier, we use the keyword ID, and we should ensure that its value follows the XML naming rules and is unique in the document.

Often it makes sense for an attribute to accept just a few values. We can enforce this requirement by giving an enumerated list of the acceptable values:
<!ATTLIST contact kind (Friend | Business | Relative) #IMPLIED>
Being #IMPLIED, the kind attribute may be missing, but if it is there, it must have one of the listed values:
<contact kind="Relative">

We can set a default value, specifying it as a double-quote delimited string:
<!ATTLIST contact kind (Friend | Business | Relative) "Friend">

A #FIXED attribute is used to implement a constant value.

It is possible to specify more than one attribute, either in the same ATTLIST tag:

<!ATTLIST contact kind (Friend | Business | Relative) "Friend"
code CDATA #IMPLIED>

Or putting each attribute in its own tag. Just a matter of taste.
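
Putting the pieces together, here is a sketch of a small document with an internal DTD that uses the kinds of attribute declaration discussed above. The id and version attributes, and all the values, are made up for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE contact [
  <!ELEMENT contact (#PCDATA)>
  <!-- id: mandatory identifier, its value must be a valid XML name -->
  <!-- kind: enumerated, defaults to Friend when omitted -->
  <!-- version: constant, if present it must be exactly 1.0 -->
  <!ATTLIST contact id      ID    #REQUIRED
                    kind    (Friend | Business | Relative) "Friend"
                    version CDATA #FIXED "1.0">
]>
<contact id="c42" kind="Business">John Smith</contact>
```

Notice that a plain "42" would not be accepted as id, since an XML name can't start with a digit.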

For more details I suggest reading Beginning XML by David Hunter et al. (Wrox).


What's in an element

We have already seen a DTD example, containing element declarations that, after the ELEMENT keyword, include the element name and the element content model:
<!ELEMENT name (first, middle, last)>
If an element contains other elements, they could be organized in sequences or choices.

The name element defined above contains a sequence of elements. The name element must contain all three specified elements, and they have to be in the specified order.

Say that we want to store contacts in our XML in two alternative formats, as names or as references. This constraint is written:
<!ELEMENT contact (name | reference)>
A variation on this schema is a combination of sequences and choices. So, we could write:
<!ELEMENT contact (name | (reference, referee))>
meaning that we can choose between inserting the name or the sequence of two elements, reference and referee.

An element could have mixed content. We have seen how to say that an element includes text data:
<!ELEMENT description (#PCDATA)>
but this won't work if we want to allow tags (for instance the HTML em or strong tags) in it. We should explicitly specify which tags can be used in that context:
<!ELEMENT description (#PCDATA | em | strong)*>
Here we say that in description free text is expected, and possibly some em and strong tags. Notice the star (*) at the end, meaning that text and tags can be repeated, in any order, any number of times.

Sometimes we expect no content for an element; a typical example is the HTML br tag. If we want the element "marker" to be empty, we specify it in the DTD like this:
<!ELEMENT marker EMPTY>
At the opposite end, we could leave total freedom on the element content, saying that any content is accepted:
<!ELEMENT description ANY>
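
As a recap, here is a sketch of a document with an internal DTD combining a sequence, mixed content and an empty element. The root element "note" and the text in it are made up for the example:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE note [
  <!-- sequence: a description followed by a marker -->
  <!ELEMENT note (description, marker)>
  <!-- mixed content: text with optional em and strong tags, in any order -->
  <!ELEMENT description (#PCDATA | em | strong)*>
  <!ELEMENT em (#PCDATA)>
  <!ELEMENT strong (#PCDATA)>
  <!-- marker must have no content -->
  <!ELEMENT marker EMPTY>
]>
<note>
  <description>A <em>very</em> old friend, <strong>not</strong> a colleague.</description>
  <marker/>
</note>
```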

For more details I suggest reading Beginning XML by David Hunter et al. (Wrox).


DTD for XML

We can define the content of an XML in a DTD (Document Type Definition). The DTD could be embedded in the XML file or, more commonly, placed in a standalone file.

Here is an example of DTD:

<!ELEMENT name (first, middle, last)>
<!ELEMENT first (#PCDATA)>
<!ELEMENT middle (#PCDATA)>
<!ELEMENT last (#PCDATA)>

We are saying that the XML should consist of a root element named "name" that should include three elements, "first", "middle", "last", each of them a string of characters.

Say that this DTD is stored in a file named example.dtd, and we want example.xml, in the same folder, to use it. We'll write that xml in this way:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE name PUBLIC "-//ThisThread//First DTD//EN" "example.dtd">
<name>
<first>John</first>
<middle>Fearless</middle>
<last>Smith</last>
</name>

The DOCTYPE declaration specifies in the XML the root element name and how to get the DTD. We could provide an id for the DTD using a Formal Public Identifier (FPI) in this format:
-//Owner//Class Description//Language//Version
After the FPI we specify the file name where the DTD is stored.
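
As mentioned at the beginning, the DTD can also be embedded in the XML file. A minimal sketch of the same document, with the declarations moved inside the DOCTYPE (the so-called internal subset), could be:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE name [
  <!ELEMENT name (first, middle, last)>
  <!ELEMENT first (#PCDATA)>
  <!ELEMENT middle (#PCDATA)>
  <!ELEMENT last (#PCDATA)>
]>
<name>
  <first>John</first>
  <middle>Fearless</middle>
  <last>Smith</last>
</name>
```

If no public identifier is needed, the external file can also be referenced with the SYSTEM keyword alone, as in <!DOCTYPE name SYSTEM "example.dtd">.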

If you need more information on XML, a good book is Beginning XML by David Hunter et al (Wrox). I'm reading it while writing this post.
