Introduction to XML
Introduction
The second XML (Extensible Markup Language) draft is
out, but since it is only a recommendation, every implementation of XML is
a guess at what XML will eventually become. In the meantime, the
recommendation is available for discussion.
"What's in it for us," you ask? Quite a bit.
XML offers the most near-term benefits for professional web developers, in
particular those who are working on putting large numbers of complex documents
on-line. HTML is quite limiting. It does not offer very rich semantics to
describe a document.
If you're designing data-hungry sites, especially for
intranets, you should be getting excited about XML, because in XML you'll be
able to create and respond to a much richer set of data elements. That will in
turn let you build more individualized dynamic sites and pages. For example,
your site's users could access information across databases and types of data
without having to rely on a search engine.
Currently, Microsoft Internet Explorer 4.0 is the only
browser that supports XML, but more on that later.
What is XML?
XML is all about metadata and the idea that certain
groups of people have similar needs for describing and organizing the data they
use. Like HTML, XML is a set of tags and declarations -- but rather than being
concerned with formatting information on a page, XML focuses on providing
information about the data itself and how it relates to other data.
Some data types are pretty much universal (<FirstName>,
<Address>, <City>, and so forth). Others are industry- or
even company-specific (<price>, <manufacturer>, <componentID>).
Healthcare organizations, for example, have a whole set of data types and
acronyms understandable (some would say penetrable) only to claims processors.
XML allows each of these data types to be easily recognized and, for site
developers, used to create sites optimized around both the data and the people
using it.
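To make this concrete, here is a minimal sketch in Python (using the standard library module xml.etree.ElementTree) of reading a record marked up with company-specific tags. Only the tag names <price>, <manufacturer> and <componentID> come from the text above; the <part> wrapper element, the sample values and the describe_part helper are assumptions invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the <part> wrapper and all values are invented.
PART_XML = """
<part>
  <componentID>A-1042</componentID>
  <manufacturer>Acme Components</manufacturer>
  <price>12.50</price>
</part>
"""

def describe_part(xml_text):
    """Return the fields of one <part> record as a dictionary."""
    part = ET.fromstring(xml_text)
    return {
        "componentID": part.findtext("componentID"),
        "manufacturer": part.findtext("manufacturer"),
        "price": float(part.findtext("price")),  # the tag tells us this is a number
    }

if __name__ == "__main__":
    print(describe_part(PART_XML))
```

Because the tags name the data rather than its appearance, the program can pick out the price without guessing from layout.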
XML differs from HTML in three major respects:
- Information providers can define new tag and
attribute names at will.
- Document structures can be nested to any level of
complexity.
- Any XML document can contain an optional description
of its grammar for use by applications that need to perform structural
validation.
XML is not backwards compatible with existing HTML
documents, but documents conforming to HTML3.2 can easily be converted to XML,
as can generic SGML documents and documents generated from databases.
Some XML history
In November 1996 the initial XML draft was presented at
the SGML 96 Conference in Boston. Then in March 1997 the first XML Conference
was held in San Diego by the Graphic Communications Association.
In April 1997 came the initial XML Linking
Working Draft, which was revised in July. August 1997 brought the
revised XML Syntax Working Draft, and the XML Developers Day was held in
Montreal, Canada, on August 21.
In October the W3C published a note, 'W3C Data
Formats on XML, SGML, HTML, and RDF'.
In December the XML 1.0 Proposed Recommendation
arrived, and as of February 1998 the second XML draft is out.
Enough history, let's get right to it.
Will it replace HTML?
Doubtful. At least not in the near term. Initially, I
expect we will see XML used as a storage format, and HTML used as the display
format. Just run your XML document through a filter and out comes an HTML
document. This will provide backward compatibility support for legacy HTML
browsers. I think that we will continue to see this for at least 2-3 years.
Use of native XML browsers will nevertheless increase throughout that time,
eventually eclipsing HTML and relegating it to purely legacy support. This all
depends on the tools.
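The "filter" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing tool: the tag names <article>, <headline> and <para>, the TAG_MAP table and the xml_to_html function are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from document-specific XML tags to HTML display tags.
TAG_MAP = {"article": "div", "headline": "h1", "para": "p"}

def xml_to_html(xml_text):
    """Convert a small XML document into an HTML fragment using TAG_MAP."""
    def render(elem):
        html_tag = TAG_MAP.get(elem.tag, "span")  # unknown tags fall back to <span>
        parts = ["<%s>" % html_tag, escape(elem.text or "")]
        for child in elem:
            parts.append(render(child))
            parts.append(escape(child.tail or ""))  # text that follows the child
        parts.append("</%s>" % html_tag)
        return "".join(parts)
    return render(ET.fromstring(xml_text))

if __name__ == "__main__":
    source = "<article><headline>XML &amp; HTML</headline><para>Hello</para></article>"
    print(xml_to_html(source))
```

The XML stays the storage format; only the output of the filter is sent to legacy HTML browsers.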
Initially, XML will be difficult to use, and expensive.
Only large firms who have clear and distinct needs and the money needed to
support it will use it.
HTML has the advantage of being very simple to use. XML
is not difficult, but it's not that easy either. So HTML will probably continue
to be used by the general public because it's so simple to use.
Software development efforts take time though. It takes
6 months to do a good new product revision, plus a beta-testing cycle, which
means it could be a year or more before many of these products become available.
Furthermore, the XML standard isn't even all the way
hammered out. The XML data, style and linking pieces have yet to be completed.
Each is in various draft stages. We may not see a complete cohesive XML standard
before December 31, 1998. On the other hand, the whole XML standards process has
been moving along at quite a rapid pace, so we might be surprised and see
something sooner.
Considering the time it takes for a technology to
become truly mainstream, a 2-3 year adoption curve, with a couple of years
tacked on to that until we see the really spectacular implementations, is
probably not unrealistic, in my opinion.
XML and HTML complement each other. Browsers will be
able to process both, and future HTML standards will likely allow mixing HTML
and XML in the same document.
What about existing HTML documents? Am I going to have
to re-code all of them in XML? Will XML-native browsers also support HTML
documents as well? These are all open questions.
Basically, if your HTML document uses quotes around ALL
of the attributes and closes ALL of the tags, then it's awfully close to being
well formed.
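A quick way to see the difference is to hand both styles of markup to an XML parser. The sketch below uses Python's standard xml.etree.ElementTree; the is_well_formed helper and the two sample fragments are made up for illustration, and the check covers only quoting and tag closing, not the prolog.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the markup parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Typical loose HTML: unquoted attribute value, no closing tags.
LOOSE_HTML = "<P><IMG SRC=mygif.gif>Hello"
# The same content with every attribute quoted and every tag closed.
TIDY_HTML = '<P><IMG SRC="mygif.gif"/>Hello</P>'

if __name__ == "__main__":
    print(is_well_formed(LOOSE_HTML))
    print(is_well_formed(TIDY_HTML))
```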
I think that realistically, we will see browsers
that support both XML and HTML, just as early web browsers built in support for
FTP and Gopher in addition to HTML. These protocols continue to be supported. So
you won't necessarily have to convert all of your documents. On the other hand,
you may want to. In order to help facilitate that, we will probably see
HTML-to-XML conversion utilities. Naturally, the quality of the resulting
documents will vary. Some will be good. Some won't be. Automation can only take
you so far.
Building XML
XML comes in two flavors: well formed and valid.
Well-formed is the easier standard to meet. It just requires that a document has
an XML prolog, that all elements be nested cleanly, and that all start tags
have matching end tags. "Empty" tags like IMG, which don't normally
have closing tags, may end with a "/>" instead of
receiving a full end tag. For instance, the HTML:
<IMG SRC="mygif.gif">
will become
<IMG SRC="mygif.gif"></IMG>
or
<IMG SRC="mygif.gif"/>
The XML prolog is the most obvious change from either
SGML or HTML:
<?XML VERSION="1.0" RMD="NONE" ENCODING="UTF-8"?>
The VERSION attribute should always be included, to
protect documents against changes in the standard. RMD is short for Required
Markup Declaration and announces which, if any, document type definitions (DTDs)
should be applied to the document. For well-formed documents this will be
"NONE." Valid documents may use "INTERNAL" or
"ALL." ENCODING tells the parser what character encoding the
document uses. UTF-8, an encoding of Unicode, is the default. (XML parsers must
support the full 16-bit Unicode standard for international character encoding,
however.)
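The role of the encoding declaration can be illustrated with a small Python sketch. One notational caveat: Python's parser expects the lowercase form <?xml version="1.0" encoding="..."?> rather than the uppercase draft-style prolog shown above, and that form has no RMD attribute, so the sketch uses it. The make_doc helper and the <city> element are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

CITY = "\u00c5rhus"  # a place name containing one non-ASCII character

def make_doc(encoding):
    """Build a tiny document as bytes, with a declaration naming its encoding."""
    declaration = '<?xml version="1.0" encoding="%s"?>' % encoding
    return (declaration + "<city>" + CITY + "</city>").encode(encoding)

# The same text stored as two different byte sequences.
utf8_doc = make_doc("UTF-8")
latin1_doc = make_doc("ISO-8859-1")

if __name__ == "__main__":
    # The parser reads the declaration and decodes each document correctly.
    print(ET.fromstring(utf8_doc).text == ET.fromstring(latin1_doc).text)
```

The bytes on disk differ, but because each document announces its encoding, the parser recovers the same characters from both.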
These minimal changes to the world of mark-up make life
much easier for parser developers, who no longer have to support poorly coded
HTML missing half its end tags. Before a document can call itself well formed
XML, it has to meet minimum requirements. This requires some extra effort from
those creating documents, but makes it possible for programmers to build much
more reliable systems with much less effort.
Valid documents must be accompanied by a document type
definition (DTD) that defines their structure. The DTD may be included as part
of the document itself, or it may be stored in a separate document. Most complex
DTDs will probably be stored as separate documents. A DTD is basically a list of
element, entity and attribute declarations in a simplified SGML declaration
style.
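As a rough illustration, here is what a small DTD included in the document itself might look like, together with a Python sketch that lists the elements it declares. The <part> vocabulary and the declared_elements helper are hypothetical. Note that Python's standard library parser (xml.parsers.expat) reads the declarations but does not validate the document against them; a validating parser would be needed for that.

```python
import xml.parsers.expat

# Hypothetical document with its DTD included as part of the document.
DOC = """<?xml version="1.0"?>
<!DOCTYPE part [
  <!ELEMENT part (componentID, manufacturer, price)>
  <!ELEMENT componentID (#PCDATA)>
  <!ELEMENT manufacturer (#PCDATA)>
  <!ELEMENT price (#PCDATA)>
  <!ATTLIST price currency CDATA #REQUIRED>
]>
<part>
  <componentID>A-1042</componentID>
  <manufacturer>Acme Components</manufacturer>
  <price currency="USD">12.50</price>
</part>"""

def declared_elements(xml_text):
    """Return the element names declared in the document's DTD, in order."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once per <!ELEMENT ...> declaration; the content model is ignored here.
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    parser.Parse(xml_text, True)
    return names

if __name__ == "__main__":
    print(declared_elements(DOC))
```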
Web applications of XML
The applications that will drive the acceptance of XML
are those that cannot be accomplished within the limitations of HTML. These
applications can be divided into four broad categories:
- Applications that require the Web client to mediate
between two or more heterogeneous databases.
- Applications that attempt to distribute a
significant proportion of the processing load from the Web server to the Web
client.
- Applications that require the Web client to present
different views of the same data to different users.
- Applications in which intelligent Web agents
attempt to tailor information discovery to the needs of individual users.
The alternative to XML for these applications is
proprietary code embedded as "script elements" in HTML documents, and
delivered in conjunction with proprietary browser plug-ins or Java applets. XML
derives from a philosophy that data belongs to its creators and that content
providers are best served by a data format that does not bind them to
particular script languages, authoring tools, and delivery engines, but provides
a standardized, vendor-independent, level playing field upon which different
authoring and delivery tools may freely compete.
An example of the first category of XML applications
could be an information tracking system for a home health care agency. This app
could have the following functions, not all of which can be accomplished in HTML:
- Log into the hospital's web site.
- Access the patient's medical records in a Web-based
interface that represents the records for that patient with a folder icon.
- Drag the folder from the app over to the internal
database.
- Drop it into the database.
The app could use XML tags such as <allergies>,
<drug-reaction>, and so on.
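A sketch of how such a record might be read on the client side, in Python. Only the <allergies> and <drug-reaction> tag names come from the text; the <patient-record> wrapper, the <name> and <allergy> elements, the drug attribute, the sample data and the summarize helper are all assumptions for illustration, not part of any real medical vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical patient record; structure and data are invented.
RECORD_XML = """
<patient-record>
  <name>Jane Doe</name>
  <allergies>
    <allergy>penicillin</allergy>
    <allergy>latex</allergy>
  </allergies>
  <drug-reaction drug="codeine">nausea</drug-reaction>
</patient-record>
"""

def summarize(xml_text):
    """Return (list of allergies, mapping of drug to observed reaction)."""
    record = ET.fromstring(xml_text)
    allergies = [a.text for a in record.findall("allergies/allergy")]
    reactions = {r.get("drug"): r.text for r in record.findall("drug-reaction")}
    return allergies, reactions

if __name__ == "__main__":
    print(summarize(RECORD_XML))
```

Because the fields are labeled, the agency's own database could import them directly instead of scraping a formatted page.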
The House of Worship site already uses XML as a way to let
its members share information, especially on religious discourse. The move is
one of the first implementations by an independent site of the next-generation
Web authoring language. Among others, HOW introduces the <PRAYER> and
<SCRIPTURE> tags.
XML software
Good tools are going to be the thing that makes XML
work. XML is complex enough that you are not going to want to do much of it by
hand. The nice thing is that it looks like a lot of tools vendors are going to
support it. Microsoft is talking about making it the native default file format
for upcoming versions of MS Office, including MS Word, Excel and PowerPoint.
This could mean that one could potentially serve these documents directly out
onto the web without having to convert them or mark them up by hand. I suspect
that we will be seeing similar functionality available in a future version of
Corel WordPerfect as well, although I haven't heard any announcements yet. Tool
support will be necessary, not just for putting primary content documents into
XML format, but also for supporting and maintaining large collections of
documents. Tools will be necessary for creating new documents which combine
content from various different sources. There is plenty of opportunity for
innovation here.
A number of other commercial vendors are preparing XML
software tools. In addition, aided by XML's relative simplicity, many
individuals and academic institutions are undertaking XML efforts.
As part of IBM's support for the World Wide Web
Consortium's (W3C) endorsement of XML 1.0 as a Web standard, IBM has released an
alpha version of its XML for Java technology. The W3C, an international group
that oversees Web standards, is promoting XML as a language to let applications
interchange data with greater precision than standard HTML can provide.
Web developers seeking to increase their familiarity
with XML should check out XML for Java -- developed at IBM's
Tokyo Research Lab, and available on IBM's alphaWorks. XML for Java is an
XML processor written entirely in Java; with it, Web developers can parse,
process, and create XML documents.
Leading examples of XML tools available free for
non-commercial use include the following:
NXP is a validating XML parser written in Java by
Norbert Mikula.
http://www.edu.uni-klu.ac.at/~nmikula/NXP
Lark is a non-validating XML processor written in Java
by Tim Bray.
http://www.textuality.com/Lark/
XP is another non-validating XML processor written in
Java, by James Clark.
http://www.jclark.com/xml/xp/index.html
MSXML is a validating XML parser written in Java by
Microsoft.
http://www.microsoft.com/xml/parser/jparser.asp
TclXML is a validating XML parser written in Tcl by
Steve Ball.
http://tcltk.anu.edu.au/XML/
LT XML is an XML developers' toolkit from the Language
Technology Group at the University of Edinburgh.
http://www.ltg.ed.ac.uk/software/xml/
JUMBO is a Java-based XML browser designed for the
Chemical Mark-up Language, an XML application developed by Peter Murray-Rust.
http://www.venus.co.uk/~pmr/README
DSC is a DSSSL syntax checker and development
environment available from the Language Technology Group at the University of
Edinburgh.
http://www.ltg.ed.ac.uk/~ht/dsc-blurb.html
XML and style sheets
Because XML is really about specifying characteristics
of data, and not simply presenting it, you will need to write style sheets to
use it. Since DHTML, CSS and CDF are all standards supported by both Netscape
and Microsoft, you can start using XML today. Also, new tools are constantly
emerging to evaluate your XML conventions and ensure that other parsers can use
them as you intended.
The Extensible Style Language (XSL) represents an early
attempt to create a more dynamic and powerful notation for defining document
style, and to augment the capabilities of the Cascading Style Sheets work (CSS1
and CSS2) already in place at the W3C. Objectives here include a model that can
dynamically resize itself completely around base font selections (which CSS
cannot currently handle) and to provide more powerful, interactive support for
document styles and rendering. At present, this work is largely experimental and
most active development uses CSS1 or CSS2 style sheets for production. But just
as XML represents a strict subset of SGML, the work on XSL derives in large part
from the DSSSL style sheet language developed in the SGML community.
XSL can handle an unlimited number of tags, each in an
unlimited number of ways, by virtue of its extensibility. It brings advanced
layout features to the Web, such as rotated text, multiple columns, and
independent regions. It supports international scripts, all the way to mixing
left-to-right, right-to-left, and top-to-bottom scripts on a single page.
The future of XML
XML could take many different directions. Since it is
still only a recommendation, many things can (and probably will) happen. One
direction is to serve as an alternative to HTML. This particular use is going to
take a little while to mature, because you need to populate the world with the
tools to create XML documents and the people who know how to use them and put
them up on their sites. And that's going to take time, because it's going to
require that people make a shift mentally in how they conceive of data.
In the very short term the main impact of XML for Web
developers will be its use in a variety of special-purpose facilities, such as
Microsoft Corp.'s Channel Definition Format (CDF), Marimba's Open Software
Description (OSD) protocol and Vignette and its partners' Information and
Content Exchange (ICE). These are simple, easily described languages that do
special-purpose tasks such as channel description, download automation and
syndication negotiation. Anyone who wants to play in these application spaces
will have to learn how to read and write the appropriate XML-based languages.
Fortunately, this is easy, since generating XML is trivial, and parsing it can
be done with any number of freeware parsers available right now in C or Java.
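To show how little is involved, here is a minimal Python sketch that generates a small XML fragment as text and reads it back with a standard parser. The <item> and <title> element names and the make_item and read_item helpers are hypothetical and are not taken from the CDF, OSD or ICE vocabularies. The one thing a generator must get right is escaping characters such as & and < in the data.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def make_item(href, title):
    """Generate one <item> element as text, escaping markup characters."""
    # quoteattr adds the surrounding quotes and escapes the attribute value.
    return "<item href=%s><title>%s</title></item>" % (quoteattr(href), escape(title))

def read_item(xml_text):
    """Parse an <item> element and return its (href, title) pair."""
    item = ET.fromstring(xml_text)
    return item.get("href"), item.findtext("title")

if __name__ == "__main__":
    text = make_item("http://example.com/?a=1&b=2", "News & <Views>")
    print(text)
    print(read_item(text))
```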
Coming shortly thereafter will probably be RDF, or
Resource Description Framework, a general metadata exchange mechanism based on
XML, currently in the process of being drafted at W3C. This has the potential to
bring dramatic benefits to the worlds of searching, retrieval and many other
aspects of content automation.
Conclusion
If you want to be an early adopter, now is the time to
start reading the standards, looking at the specs, and thinking about how you
could use this technology. XML is not going to catch on all by itself. It takes
people to support it, to build the tools and create the content using it. XML
seems to have a lot of industry support behind it. It offers the potential to do
a lot of things that people want to be able to do.
If you think it can work for you, look into it more.
Then make an informed decision. Take a look at it. Find out how it's being used.
Try it for yourself, just to play with, or in a small pilot project. If the
tools aren't mature enough yet, then wait a few months and look again later. If
XML turns out to be a good technology it will succeed. If not, people will pass
it by.
Everyone, including Netscape, is supporting XML. Netscape
already uses it to some extent, mainly in its own internal output, and
IE4 supports it partly, but not completely. Support will be strong in the fifth
generation of both browsers. XML is extremely helpful as a means of
talking to databases and as a way to get PDF-type output while keeping access to
the data in the browser.
I urge you to learn it. If you want an
excellent use of it now, and a great program besides, get Frontier5 (http://www.scripting.com/frontier5/xml/),
which uses it to quickly generate HTML on the fly.
The bad news: At the time of this writing, browsers are
between generations, not yet fully ready to embrace these new technologies and
standards. But this lag may be just what hatching standards need, giving
developers enough time to rethink the way their Web applications should work
before a rewoven Web hits with full force, starting at the end of this year.
Useful links
The main W3C XML page:
http://www.w3.org/XML
W3C Recommendation:
http://www.w3.org/TR/REC-xml
XML Linking draft:
http://www.w3.org/TR/WD-xml-link
XSL Style Proposal:
http://www.w3.org/TR/NOTE-XSL.html
XML Data Note:
http://www.w3.org/TR/1998/NOTE-XML-data-0105/
Microsoft's XML Page:
http://www.microsoft.com/xml
The House of Worship (uses XML):
http://www.housesofworship.net/