XML: Why it Matters
Abstract
The following article is intended as an introduction to XML for people who are reasonably familiar with HTML and/or other structured data file formats. XML may be the single most important innovation in structured information since HTML.
- What is XML?
- Further Information about XML
- The genealogy of XML
- A Copernican revolution
- The virtue of simplicity
- Well-formed and Valid
- How XML fits into the Internet world
- Why XHTML?
- Lux and XML
What is XML?
The World Wide Web Consortium (the "W3C") calls XML (eXtensible Markup Language) "the universal format for structured documents and data on the Web." XML is a W3C Recommendation, which is to say an adopted W3C standard.
When the W3C calls XML "universal," they mean it. In the five years or so since its introduction, XML has enabled the beginning of a transformation of the Web from a means of transmitting documents to a general means of exchanging data. For the most part, XML is being used in three ways:
- Presentation: XML allows the content of a Web page to be expressed independently of the details of its representation. This is useful, for example, if you want essentially the same page to render appropriately on both a wireless PDA and a full-function PC monitor.
- Messaging: XML has become the favored means for programs running on different computers to exchange data. This includes "remote procedure calls," in which a program running on one computer requests that a program on another computer take a particular action and return the results.
- Enterprise Application Integration: Beyond mere messaging, XML lets information from a foreign source appear in your system as though it had originated there.
XML is not just a simple solution to a complex problem. It's a simple solution to several hundred complex problems. Among the many notable applications of XML:
- XHTML reformulates HTML (the Hypertext Markup Language used for most existing Web documents) as an XML application and makes it both simpler and more powerful.
- Voice XML provides the control language for "voice browsers" that allow users to access databases over a telephone connection.
- SOAP (Simple Object Access Protocol) provides a protocol for one computer to call a function or procedure on another computer.
- MathML provides a foundation for the inclusion of mathematical expressions in Web pages.
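Of the applications listed above, SOAP is perhaps the easiest to sketch in a few lines. The following Python fragment is only an illustration of what such a request might look like: the envelope namespace is the SOAP 1.1 one, while the `GetQuote` operation, its `symbol` parameter, and the `urn:example:quotes` namespace are hypothetical names invented for this example.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:quotes"  # hypothetical application namespace


def build_request(symbol):
    """Build a SOAP-style request for a hypothetical remote GetQuote procedure."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}GetQuote")
    # The single argument of the remote call travels as a child element.
    ET.SubElement(call, f"{{{APP_NS}}}symbol").text = symbol
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request("EXAMPLE")
print(request_xml)
```

The receiving computer would parse this text, run the named procedure, and send its results back in a similar envelope.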
People have used XML to structure everything from a dictionary of East Asian historical and literary terms to the virtual worlds of online adventure games. And when it comes to business-to-business (B2B) e-commerce systems, XML is king. XML is at the heart of Microsoft BizTalk, RosettaNet, and the Open Trading Protocol, not to mention the Electronic Business XML Initiative and the XML Common Business Library.
Many standards are being developed in support of all this e-commerce activity. For example, the W3C is developing XML-based standards for security-related assertions, encryption, and digital signatures. These are currently only in draft form, but for the most part they are functioning as de facto standards.
The genealogy of XML
In 1969, Charles Goldfarb, Ed Mosher, and Ray Lorie were trying to come up with a general way to describe structured data in text form. They invented the Generalized Markup Language (GML). (While they were at it, they coined the now-standard term "markup language" as a way of sneaking all three of their initials "GML" into the name of their invention.)
Over the next seventeen years, GML slowly evolved into ISO-standard SGML. Full-blown SGML is powerful, but it's cumbersome. SGML has the sort of flexibility that practically invites trouble: confronted with two good ways to do something, SGML almost inevitably supports both, providing little guidance to developers.
In 1989, Tim Berners-Lee came up with the idea of the World Wide Web. In 1991 he went public with what is almost certainly the most important SGML application ever produced: HTML. The original HTML was also one of the simplest SGML applications ever produced. This tiny, lightweight, narrowly focused markup language evolved through several versions in the 1990s and successfully provided a means to describe the contents of documents for the emergent World Wide Web.
The general success of HTML inspired the W3C in a quest for a comparably simple, human-legible language to describe structured data, very "lightweight" and hence very suitable for use on the Web. One of HTML's few shortcomings also provided a stimulus, though: there was really no way to say that a particular HTML document was "correct" except to view it with lots of different browsers and see if they all could handle it successfully. As discussed below, XML introduces clear, testable notions of well-formed and valid documents.
Between 1996 and 1998 so many individuals and companies contributed to the definition of XML that it is virtually impossible to say who "invented" it. (Credit is conventionally given to the SGML Editorial Review Board of the W3C.) Nonetheless, XML completely defied the conventional wisdom that when a committee sets out to design a horse, they end up with a camel. Years of collective experience led to a genuinely lightweight and highly general markup language. XML is a racehorse.
A Copernican revolution
XML is one of those ideas that, once encountered, are so reasonable that it's hard to imagine you ever considered any other alternative. The Earth revolves around the Sun. Tomatoes are edible. XML is a great way to describe and structure data.
Prior to XML, every time somebody needed a way to exchange data, they spent days making decisions that were ultimately beside the point: instead of working out the characteristics of the particular data in question, they either had to work out a new scheme for exchanging data in general or had to adopt one of over a dozen cumbersome existing approaches.
Prior to XML, every small discrepancy between how two companies represent data was a crisis. Now, if both companies conform to XML, it should be reasonably straightforward to use XSLT (itself an XML application) to convert between formats.
Prior to XML, data formats probably outnumbered the companies using them. The few formats that were widely used carried almost no structural information: for example, the commonly used "comma-separated text" only means something to programs that agree exactly on which data appears in which position in a comma-separated list.
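The contrast with comma-separated text can be sketched in a few lines of Python using only the standard library. The record below is sample data invented for illustration; the point is simply that the CSV reader has to know the column position in advance, while the XML reader asks for the data by name.

```python
import csv
import io
import xml.etree.ElementTree as ET

# The same record in two formats (sample data invented for illustration).
csv_text = "USA,North America,hamburger\n"
xml_text = (
    '<country shortname="USA" continent="North America">'
    "<mostpopularfood>hamburger</mostpopularfood>"
    "</country>"
)

# CSV: both sides must already agree that column 2 holds the food.
row = next(csv.reader(io.StringIO(csv_text)))
food_from_csv = row[2]

# XML: the data names itself, so the lookup is by name, not by position.
country = ET.fromstring(xml_text)
food_from_xml = country.findtext("mostpopularfood")

print(food_from_csv, food_from_xml)
```

If a sender later adds a column to the CSV file, every positional index may shift; adding a new element or attribute to the XML record leaves the lookup by name untouched.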
The virtue of simplicity
Simplicity has been the key to XML's success in many areas where SGML never caught on. Few people ever really mastered SGML, but a programmer typically gets the hang of XML in a matter of days. XML has become the Mother of Standards. It's almost unimaginable at this time that someone would propose a data-related W3C standard that wasn't either an XML application or a further extension of the XML standard.
If you are familiar with HTML, XML will look familiar. (If you are not familiar with HTML, then you might want to skip forward to the section, "Why XHTML?") Just imagine an HTML-like language, where you could invent new tags and attributes to represent different types of information instead of just layout of Web pages. For example, the following would be a fragment of well-formed XML:
<country name="United States of America" shortname="USA" continent="North America">
  <monarch status="no" />
  <president status="yes" year="2003">
    George W. Bush
  </president>
  <primeminister status="no" />
  <mostpopularfood>
    hamburger
  </mostpopularfood>
</country>
<country name="United Kingdom of Great Britain and Northern Ireland"
         shortname="UK"
         continent="Europe">
  <monarch status="yes" year="2003">
    Elizabeth II
  </monarch>
  <president status="no" />
  <primeminister status="yes" year="2003">
    Tony Blair
  </primeminister>
  <mostpopularfood year="1950">
    fish and chips
  </mostpopularfood>
  <mostpopularfood year="2003">
    chicken curry
  </mostpopularfood>
</country>
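As a rough illustration of how a program might consume such data, the Python sketch below reads the fragment with the standard library's ElementTree parser. Because a complete XML document needs a single outermost element, the two country elements are wrapped in a `countries` element; that wrapper and the `summarize` helper are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# The fragment above, wrapped in a hypothetical <countries> root element
# so that it forms a complete document (text collapsed onto single lines).
DOCUMENT = """<countries>
  <country name="United States of America" shortname="USA" continent="North America">
    <monarch status="no" />
    <president status="yes" year="2003">George W. Bush</president>
    <primeminister status="no" />
    <mostpopularfood>hamburger</mostpopularfood>
  </country>
  <country name="United Kingdom of Great Britain and Northern Ireland"
           shortname="UK" continent="Europe">
    <monarch status="yes" year="2003">Elizabeth II</monarch>
    <president status="no" />
    <primeminister status="yes" year="2003">Tony Blair</primeminister>
    <mostpopularfood year="1950">fish and chips</mostpopularfood>
    <mostpopularfood year="2003">chicken curry</mostpopularfood>
  </country>
</countries>"""


def summarize(xml_text):
    """Map each country's short name to its filled offices and listed foods."""
    root = ET.fromstring(xml_text)
    summary = {}
    for country in root.findall("country"):
        # Offices are the child elements whose status attribute is "yes".
        offices = [child.tag for child in country if child.get("status") == "yes"]
        foods = [food.text.strip() for food in country.findall("mostpopularfood")]
        summary[country.get("shortname")] = {"offices": offices, "foods": foods}
    return summary


summary = summarize(DOCUMENT)
for shortname, details in summary.items():
    print(shortname, details)
```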
Well-formed and Valid
What do we mean when we say that the XML example here is well-formed? We mean that:
- Every element that is opened is also closed. For example, every country element is closed by a </country> tag. An empty element, such as the monarch element for the United States or the president element for the UK, is closed by a simple slash ("/") at the end of the tag that declares the element (e.g., <monarch status="no" />).
- Elements are properly nested. For example, when the document opens a mostpopularfood element inside a country element, it must close the mostpopularfood element before closing the country element.
- The values of attributes - such as name, shortname, or continent for the country element in the example above - are always placed in quotation marks.
Each of these rules impacts XHTML, the XML-based successor to HTML 4.0, but the impacts are small and straightforward. For example:
- Empty elements, such as <br> and <hr> become <br /> and <hr />, so that they are properly closed.
- When you open a paragraph with <p>, you have to close it with </p>.
- <td width=100> becomes <td width="100">.
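Well-formedness is mechanical enough that any XML parser can test it. The following is a minimal sketch, assuming Python's standard ElementTree parser as the checker; the `is_well_formed` helper and the sample strings are made up for illustration, with one passing and one failing sample for each kind of rule.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


samples = {
    "empty element closed": "<p>one<br />two</p>",
    "empty element left open": "<p>one<br>two</p>",
    "paragraph never closed": "<div><p>text</div>",
    "badly nested": "<country><mostpopularfood></country></mostpopularfood>",
    "quoted attribute": '<country name="USA" />',
    "unquoted attribute": "<country name=USA />",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'not well-formed'}")
```

Note that the parser does not try to repair or guess: a document either satisfies the rules or is rejected.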
In addition to the notion of a well-formed document, XML introduces the stricter concept of a valid document: a document that has been validated against one or more DTDs (Document Type Definitions) or XML Schemas. DTDs and XML Schemas are two different ways to define an XML application. For example, XHTML, MathML, XML Signature, etc., are each defined by a DTD or an XML Schema.
(Creating a valid document requires declaring the document type at the start of the document, but we're not going to go into that here. This has always been good practice in HTML. In XHTML it is mandatory.)
DTD is the older mechanism for defining a document type, dating back to SGML. Very shortly after the invention of XML, Microsoft and other vendors proposed schema languages written in XML itself, and that line of work led to XML Schema, which has now been adopted as a W3C Recommendation in its own right. XML Schema is itself an XML application, so developers can use all of their usual XML tools to handle schemas. DTD and XML Schema overlap heavily in what they can express, although XML Schema adds data typing and namespace awareness, and important XML applications often provide both a DTD and an XML Schema.
Using XML Schemas or DTDs assures that all of the elements, attributes, etc. are appropriate to the type of document in question and that they have an appropriate nesting relationship to one another. For example, our hypothetical president element in the example above would not be part of a valid XHTML document, because XHTML deals with the content of documents, not the presidents of countries. If you tried to insert a president element into an XHTML document, the document would still be well-formed XML, but it would no longer be valid XHTML. Similarly, in our example above, a DTD or Schema could enforce a rule that every country must have a name attribute, while leaving the year attribute of mostpopularfood optional.
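Python's standard library parses XML but does not validate it against a DTD or an XML Schema, so the sketch below hand-codes the one rule just mentioned (every country needs a name; year on mostpopularfood is optional). It is a stand-in to show the difference between well-formed and valid, not a real validator; the `check_countries` helper, the `countries` wrapper element, and the sample documents are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def check_countries(xml_text):
    """Return a list of rule violations; an empty list means the rule holds."""
    root = ET.fromstring(xml_text)  # raises ParseError if not even well-formed
    problems = []
    for index, country in enumerate(root.iter("country"), start=1):
        if country.get("name") is None:
            problems.append(f"country #{index} is missing the required name attribute")
        # The year attribute of mostpopularfood is optional, so it is not checked.
    return problems


# Well-formed and acceptable: name is present, year appears only sometimes.
good = (
    '<countries><country name="United Kingdom">'
    '<mostpopularfood year="2003">chicken curry</mostpopularfood>'
    "<mostpopularfood>fish and chips</mostpopularfood>"
    "</country></countries>"
)

# Still well-formed, but it breaks the rule: the country has no name.
bad = (
    '<countries><country shortname="UK">'
    "<mostpopularfood>fish and chips</mostpopularfood>"
    "</country></countries>"
)

good_problems = check_countries(good)
bad_problems = check_countries(bad)
print(good_problems)
print(bad_problems)
```

With a real DTD or Schema, the rule would be declared once in the definition and enforced by a validating parser rather than written as program code.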
(There are also ways to mix - and validate - elements and attributes from multiple DTDs and Schemas in the same document. While that is beyond the scope of this introductory article, it's a really neat feature, allowing common ways of achieving common goals. For example, any XML document that wants to display a mathematical expression can draw on MathML.)
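The mechanism behind this mixing is the XML namespace: each element is tagged with a URI that says which vocabulary it belongs to. As a small sketch, the Python code below parses an invented XHTML page that embeds a MathML expression and counts the elements belonging to each vocabulary. The two namespace URIs are the standard ones for XHTML and MathML; the page content and the `count_by_namespace` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# An invented page: XHTML markup with a MathML expression (x squared) inside.
PAGE = f"""<html xmlns="{XHTML_NS}">
  <body>
    <p>Squaring:
      <math xmlns="{MATHML_NS}">
        <msup><mi>x</mi><mn>2</mn></msup>
      </math>
    </p>
  </body>
</html>"""


def count_by_namespace(xml_text):
    """Count elements per namespace URI."""
    counts = {}
    for element in ET.fromstring(xml_text).iter():
        # ElementTree reports qualified tags in the form "{uri}localname".
        namespace = element.tag[1:].split("}")[0]
        counts[namespace] = counts.get(namespace, 0) + 1
    return counts


counts = count_by_namespace(PAGE)
print(counts)
```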
How XML fits into the Internet world
About now, you might ask, "What kind of server do I need for XML?" The short answer is that if you are just talking about delivering up XML documents, any Web server that handles HTML will handle XML.
Why? Well, XML is just a document type. Both HTML documents and XML documents simply consist of text. Within the file, the declaration that opens the content (a document type declaration for HTML, typically an XML declaration for XML) indicates the type of document.
On the Web, HTML documents are delivered using HTTP. You can just as easily deliver an XML document with HTTP. This has a lot of interesting consequences:
- Any mechanism you use to generate HTML dynamically (ASP, JSP, CGI/Perl, and so on) can just as easily generate XML.
- That means that all of your methods to get data out of a database into an HTML document will also work for XML.
- If it "works for XML" it can work for the full range of XML applications. For example, any Web server makes a perfectly good Voice XML server. If you want to support Voice XML browsers as well as traditional HTML Web browsers, it's just a matter of creating Voice XML pages parallel to your HTML pages. You don't need any new technology on the server end.
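To illustrate the first two points, here is a minimal sketch of database rows being turned into XML rather than HTML. It uses Python with an in-memory SQLite database purely for convenience; the table layout, the `rows_to_xml` helper, and the `countries` wrapper element are assumptions, and any server-side language could do the same job.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Turn (shortname, continent) rows into a small XML document."""
    root = ET.Element("countries")
    for shortname, continent in rows:
        ET.SubElement(root, "country", shortname=shortname, continent=continent)
    return ET.tostring(root, encoding="unicode")


# A throwaway in-memory database standing in for a real one.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE country (shortname TEXT, continent TEXT)")
connection.executemany(
    "INSERT INTO country VALUES (?, ?)",
    [("USA", "North America"), ("UK", "Europe")],
)
rows = connection.execute(
    "SELECT shortname, continent FROM country ORDER BY shortname"
).fetchall()
connection.close()

xml_page = rows_to_xml(rows)
print(xml_page)
```

A server-side script would then send this string over HTTP with an XML content type such as `application/xml`, exactly as it would send generated HTML with `text/html`.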
Naturally, when you deliver an XML document, the system that receives it must be able to do something with that document. All contemporary Web browsers render XHTML appropriately. Netscape 7.1 provides native support for MathML, allowing appropriate rendering of complex mathematical expressions; so far it requires third-party support to do the same in Internet Explorer.
However, unlike HTML, XML is not exclusively a system to deliver content for a Web browser. Many XML document types are not primarily intended to be read by a human. For example, B2B e-commerce systems use XML documents and either HTTP or other delivery protocols as a means to exchange data, request actions, etc. Contemporary Web browsers will display such a document in an appropriate format to allow you to examine it, but they cannot do much more with it: the document is intended for an e-commerce system, not a graphical browser.
Why XHTML?
What are the advantages of XHTML over earlier versions of HTML?
- XML makes it much easier to write a browser, because you don't have to try to guess the meaning of an ill-formed HTML document. This is particularly important for devices such as PDAs with a relatively small amount of memory.
- Many existing HTML 4.0 or earlier documents make sense to one browser while their syntax totally confuses another. This is because HTML is a very "loose" standard, so that when a browser sees sloppy HTML, it must do its best to make sense of it. XHTML doesn't solve all browser differences (you can bet that future browsers will still have proprietary features). Still, it tremendously increases the chance that a document that looks great in one browser will do just fine under another browser.
- Any tools designed for XML in general also become tools you can use with your XHTML. For example, a tool designed to facilitate localization of XML files is inherently useful for XHTML.
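The last point can be sketched with a generic XML parser that knows nothing about XHTML. In the Python example below, the invented page and the `extract_strings` helper are assumptions for illustration; the function simply collects the text runs that a localization tool might hand to a translator, and it would work unchanged on any other XML vocabulary.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A small invented XHTML page.
PAGE = f"""<html xmlns="{XHTML_NS}">
  <head><title>Welcome</title></head>
  <body>
    <h1>Welcome</h1>
    <p>Read our <a href="papers.html">white papers</a>.</p>
  </body>
</html>"""


def extract_strings(xml_text):
    """Collect the non-blank text runs in document order."""
    root = ET.fromstring(xml_text)
    return [text.strip() for text in root.itertext() if text.strip()]


strings = extract_strings(PAGE)
print(strings)
```

The same call would fail on sloppy HTML with unclosed tags, which is exactly the kind of guesswork that XHTML removes.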
Lux and XML
Lux staff members have extensive experience with XML in a variety of applications, including:
- Data exchange within a business-to-business (B2B) e-commerce system.
- An application to describe collections of collectible gaming cards.
- An XML application to encapsulate real-time sports results.
- Using Voice XML to telephone-enable access to database content.
- Using XML to encapsulate the contents of an error log for upload to a support system.
We have worked with XML in a variety of environments, including Microsoft .NET. These projects have included use of both DOM (Document Object Model) and SAX (the Simple API for XML) and have involved extensive experience with XSLT, XPath, etc.
Further Information About XML
The W3C maintains an official Web page about XML at http://w3c.org/XML.
A good discussion of mixing multiple namespaces (and hence multiple DTDs/Schemas) can be found at http://www-106.ibm.com/developerworks/library/x-nmspace2.html.
© 2004 The Lux Group, Inc. All rights reserved. This page is provided as a public service to the Web community. As with any copyrighted work, limited quotation with appropriate attribution for purposes of review is permitted. Links to this page are welcome. Explicit permission from Lux is required to otherwise publish, transmit, transfer or sell, reproduce, create derivative works from, or distribute this content, including by incorporating the content into any e-mail. If you wish to reproduce this content, please contact us for permission.
Lux believes that basic information like this should be shared rather than hoarded. Naturally, an article like this only constitutes an introduction to a subject. We hope that if this article has been useful to you, you will consider Lux if you have need for expertise in this area.
Lux
1008 Western Ave. Suite 601
Seattle, WA 98104
phone 206 328 9898
fax 206 328 9899
