SILMARIL Information Management Consultants

About SGML




Where SGML came from

The Standard Generalized Markup Language is an international standard (ISO 8879:1986) for specifying ways to describe text. The objective is to provide a method of identifying the components of a document (eg titles, sections, paragraphs, lists, etc) in a manner that is independent of any make, model, or manufacture of software or hardware.

By and large this has been achieved: SGML files can be used on any kind of computer anywhere in the world, and are not tied to any particular program. SGML has been around for a long time and is very stable, with lots of software and a solid industrial and research base to work from. The information in SGML files can be edited, extracted, sorted, reprocessed, updated, and displayed or printed by any SGML software, and there are hundreds of programs to choose from: editors, browsers, databases, formatters, etc. See our links page for where to go.

What SGML does

Besides giving your information independence from software and hardware manufacturers, it provides a neutral, stable platform for the storage of your text. The default format is a plaintext file (see below for examples), so it cannot go out of version, or be obsoleted by Marketing or the whim of a supplier.

It also means you can create and store your information once but re-use it many times: on the Web, for example, in print, on a CD-ROM, in audio or video, or as input to a database. There are extensive conversion systems available to output SGML in almost any form.

Why Silmaril uses SGML and XML

People with a responsibility for the management and protection of corporate, sensitive, or important information need to know that the information is stored safely, durably, and productively, in a robust format which is as impervious as possible to changes in technology. SGML-based systems allow us to offer this kind of solution.


SGML 101

This is a short tutorial about SGML structured documents. Everything here applies to XML as well: all that XML does is knock out some of the trickier bits of SGML.

SGML assumes that your documents come grouped into different types, for example letters, reports, specifications, articles, invoices, etc. Within each type of document the basic structure is the same in each instance: each document is made up of recognisable elements like headings, paragraphs, lists, images, tables, etc.

Here is an SGML specification for a simple description of one very common type of document: the Novel.

<!-- Document Type Definition for a Novel (PF Aug 1998).
     Refer to this as PUBLIC "+//Silmaril//DTD Simple Novel//EN"
                   or SYSTEM "http://www.silmaril.ie/novel.dtd" -->

<!ELEMENT novel - - (title,author,legal,dedication,preface?,chapter+)>
<!ELEMENT (title,author,legal,dedication,para) - - (#PCDATA) >
<!ELEMENT (preface,chapter) - - (title,para+)>

This Document Type Definition (DTD) says:

  1. A novel must have a title, an author, some legal blurb, and a dedication. It may (optionally, that's the question mark) have a preface, and it must have at least one chapter (the plus sign means ‘one or more’). It’s the commas which mean these elements must appear (unless they’re optional) in the order given.

    Already you can see how SGML can be used to force some elements to be present and allow others to be absent or to occur multiple times. A parser can use this information to help prevent data loss or inconsistency.

  2. Titles, the author, the legal blurb, and the dedication are simple text (known as Parsed Character Data or PCDATA). Paragraphs are also just text in this kind of novel.
  3. The preface and chapters are each made up of a title followed by one or more paragraphs (see the plus sign again?).

    Title and paragraph have already been defined so we can use them anywhere that is relevant. An SGML program will automatically let us distinguish between a title that occurs at the top of a novel and the title that is used to identify each chapter.
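To make the content models above concrete, here is a minimal sketch in Python of how the order and occurrence rules could be checked. It is not how an SGML parser works internally: the CONTENT_MODELS table and the matches_model function are our own hypothetical names, and each content model has been rewritten by hand as a regular expression over the names of the child elements.

```python
import re

# Content models from the novel DTD, rewritten by hand as regular expressions.
# Each child element name is followed by one space, so "chapter+" in the DTD
# becomes "(chapter )+" here. This table is an illustration only.
CONTENT_MODELS = {
    "novel": r"title author legal dedication (preface )?(chapter )+",
    "preface": r"title (para )+",
    "chapter": r"title (para )+",
}


def matches_model(parent, children):
    """Return True if the child element names satisfy the parent's content model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[parent], sequence) is not None


# A novel with no preface and two chapters is acceptable...
ok = matches_model(
    "novel", ["title", "author", "legal", "dedication", "chapter", "chapter"]
)
# ...but one with the dedication missing is not.
missing = matches_model("novel", ["title", "author", "legal", "chapter"])
print(ok, missing)
```

The question mark and plus sign carry over directly from the DTD, and the fixed order of names in each pattern plays the part of the commas.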

Now we can use this DTD in an SGML editor to write our novel. In the example below, Emacs/psgml was used: it reads the DTD and gives you mouse-click control over inserting elements, with the start-tags and end-tags coloured separately from the text.

<!DOCTYPE novel PUBLIC "+//Silmaril//DTD Simple Novel//EN" "novel.dtd">

<novel>

  <title>Blink and Miss</title>
  <author>Jaste de Rijn</author>
  <legal>Copyright 1998 Specimen Editions Ltd</legal>
  <dedication>For Narve, who saw and never blinked.</dedication>

<!-- In this instance we don't have any Preface -->

  <chapter>
    <title>Landing low</title>
    <para>Raz picked and pulled at the loose ropes and flapping
    flysheet of his tent as he considered how to arrange them. The
    wind was cold, and so were his fingers...</para>
  </chapter>

  <chapter>
    <title>In a quiet bar</title>
    <para>The road into town was crowded with travellers: the
    traffic seemed to be mostly holidaymakers, but there were supply
    vehicles from a dozen corporations...</para>
  </chapter>

</novel>

Notice in our example novel that we don't have any formatting. SGML systems are for storing text independently of how it ends up looking on the screen or page, so formatting details are always kept separately in a stylesheet. However, if we were using a typographical editor, we could use one stylesheet to make the display easier to read, without affecting the way it would print.

Editing the novel in a typographical editor

The screen image above shows WordPerfect SGML editing the same file. Notice the hierarchical structure display, the stylesheet editor, and the element markup guide. For printing, the tags are hidden, of course.
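The separation of text from formatting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real stylesheet language: SCREEN_STYLE, PRINT_STYLE and render are hypothetical names of our own, the "stylesheets" are plain dictionaries, and the novel is cut down to one chapter and parsed as XML with Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# A cut-down instance of the novel. The markup carries no formatting at all.
NOVEL = """<novel>
  <title>Blink and Miss</title>
  <author>Jaste de Rijn</author>
  <legal>Copyright 1998 Specimen Editions Ltd</legal>
  <dedication>For Narve, who saw and never blinked.</dedication>
  <chapter>
    <title>Landing low</title>
    <para>The wind was cold, and so were his fingers...</para>
  </chapter>
</novel>"""

# Two hypothetical stylesheets, keyed on "parent/child" so that a novel's
# title and a chapter's title can be formatted differently.
SCREEN_STYLE = {
    "novel/title": "*** {} ***",
    "novel/author": "by {}",
    "chapter/title": "== {} ==",
    "chapter/para": "{}",
}
PRINT_STYLE = {
    "novel/title": "{}",
    "novel/author": "{}",
    "novel/legal": "({})",
    "chapter/title": "Chapter: {}",
    "chapter/para": "    {}",
}


def render(root, style):
    """Format every element the stylesheet knows about; skip the rest."""
    lines = []
    for parent in root.iter():
        for child in parent:
            key = parent.tag + "/" + child.tag
            if key in style:
                text = " ".join((child.text or "").split())
                lines.append(style[key].format(text))
    return lines


root = ET.fromstring(NOVEL)
screen_lines = render(root, SCREEN_STYLE)
print_lines = render(root, PRINT_STYLE)
print("\n".join(screen_lines))
```

The same stored text produces two different renderings, and changing either dictionary never touches the novel itself.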

HTML

If you think it looks a lot like HTML, you're exactly right. HTML is also specified using SGML: it was the obvious choice for a system that had to work anywhere, on any make or model of computer. HTML is just one of thousands of descriptions of document types in daily use—it just happens to be the most widespread because of the Web.

The better end of the Web editor market now supports HTML fairly fully, with checking of syntax and warnings about the abuse of elements. That checking is badly needed: some estimates* have placed the proportion of invalid, broken, or unusable HTML files as high as 98%.

XML

XML is the new kid on the block: the Extensible Markup Language. Although it builds on SGML, a lot of the available software is experimental or beta, and there are parts of the specification which really haven't been test-driven that hard yet.

SGML has lots of extra, optional bells and whistles which make it complex to handle in a Web environment, so XML has removed all these optional features to make a simpler system for use online. See the XML FAQ for details of what and where.

XML can even be used without a DTD, making it easy to write simple structures for once-off usage. Because it is easier to program for, XML parsers are smaller and faster than SGML ones, which means they are becoming embedded in browsers. If you have MSIE 5 or DocZilla, you can view the novel in an XML version with a CSS stylesheet.
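As a rough illustration of DTD-less use, the sketch below reads a once-off structure with Python's standard xml.etree.ElementTree module, which checks only that the markup is well-formed and does not validate against a DTD. The memo document and the is_well_formed helper are hypothetical examples of our own.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# A once-off structure with no DOCTYPE declaration and no DTD anywhere.
memo = "<memo><to>Narve</to><body>Proofs are due on Friday.</body></memo>"

root = ET.fromstring(memo)
fields = {child.tag: child.text for child in root}
print(fields)

# Without a DTD the parser can still reject markup whose tags do not nest.
print(is_well_formed("<memo><to>Narve</memo>"))
```

What is lost without a DTD is the guarantee of structure: nothing here stops a memo from having two bodies or none.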

DTDs

Not all DTDs are as easy to write as our example novel above. In many cases the document types can turn out to be quite complex when they are analysed.

A DTD is used to guide the editing and creation process, allowing the editor to understand whereabouts in the document structure you are. This way, it can ensure that you are only offered valid choices when you insert new markup (eg you can't insert a <title> half-way through a <para>). It also means that documents of the same type will always be compatible with one another, because they use the same structure. A DTD is also used by subsequent processors like converters and formatters, so that the program knows in advance what is expected.
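How an editor might work out which elements to offer can be sketched as follows. This is a simplified illustration, not the method used by any particular SGML editor: the MODELS table and the allowed_next function are hypothetical names, and the sketch handles only flat sequences with occurrence indicators, as in the novel DTD, not nested groups or alternatives.

```python
# Each content model is a list of (name, minimum, maximum) slots in order;
# a maximum of None means "no upper limit". Transcribed by hand from the DTD.
MODELS = {
    "novel": [
        ("title", 1, 1), ("author", 1, 1), ("legal", 1, 1),
        ("dedication", 1, 1), ("preface", 0, 1), ("chapter", 1, None),
    ],
    "chapter": [("title", 1, 1), ("para", 1, None)],
    "para": [],  # text only: no child elements may be inserted
}


def allowed_next(model, children):
    """Return the element names that may validly follow the children so far."""
    pos, count = 0, 0  # current slot, and how often it has been filled
    for child in children:
        while pos < len(model):
            name, low, high = model[pos]
            if child == name and (high is None or count < high):
                count += 1
                break
            if count < low:
                raise ValueError("missing required element: " + name)
            pos, count = pos + 1, 0
        else:
            raise ValueError("element not allowed here: " + child)
    choices = []
    while pos < len(model):
        name, low, high = model[pos]
        if high is None or count < high:
            choices.append(name)
        if count < low:
            break  # this slot is still required, so nothing later is allowed
        pos, count = pos + 1, 0
    return choices


so_far = ["title", "author", "legal", "dedication"]
offer = allowed_next(MODELS["novel"], so_far)
print(offer)
```

At the point shown, the editor would offer a preface or a chapter and nothing else. Inside a para the list of choices is empty, which is why a title cannot be inserted there.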

For more details on anything discussed on this page, please mail info@silmaril.ie


* The late Yuri Rubinsky, speaking at SGML'96.