May 2004
CONTENTS OF THIS NEWSLETTER
INTRODUCTION
NEWS
XML FOR ARCHAEOLOGY: A ROUNDTABLE DISCUSSION
THE HERITAGE EXCHANGE PROTOCOL
AN AMATEUR ARCHAEOLOGIST'S USE OF XML
INTRODUCTION
Mark Bell
Welcome to newsletter number 3 for May 2004. Just after Easter I was at the
Computer Applications in Archaeology Conference in Prato, Italy. I was interested to
see that there were plenty of papers on XML and related subjects. Two calls for
feedback and information that circulated at the conference are listed in this newsletter.
I hope to bring you some more items arising from this CAA conference in future
newsletters.
Please note the important changes to the newsletter sign-up process given in the News
section.
The next newsletter is planned for July/August, depending on feedback and
contributions of course.
Mark Bell
NEWS
Important changes to the newsletter sign-up.
I have set up a new domain at www.archweb.co.uk and the newsletter will be sent out
from there in future. To change your subscription details you need to go to the new
site and log in. The username and password you used for the newsletter sign-up have
been moved to the new site. Once you have logged in you can then go to mailing list
preferences and change your subscription options.
There is now an option to receive the newsletter either in HTML format or in plain
text. As the HTML format seems to mangle web addresses at the moment, I have set
everyone up to receive a plain text newsletter. I will let you know when this has been
fixed.
The old newsletters are archived on the site and there will soon be an option to leave
comments about the newsletters on the site.
(Technical note: the new site is built using the phpWebSite content management
system, an open source product, written in PHP by Appalachian State University.
For further details see http://phpwebsite.appstate.edu/).
XML FOR ARCHAEOLOGY: A ROUNDTABLE DISCUSSION
Dr William Kilbride
Abstract:
Published in 1998, the eXtensible Markup Language (XML) promises to
be the first step towards the next generation Web, allowing
communities to design languages that suit their particular needs and
integrate them harmoniously into a general infrastructure. In the
five years since being launched, a number of disciplines have taken
the initiative by creating their own mark-up specifications to render
and process domain specific information. MathML is used to share
mathematical expressions; CML is used to share molecular information
in Chemistry and a number of languages now exist to share musical
notation.
In contrast, archaeologists have not risen to the challenges and
opportunities of XML to the same degree. Various projects and
services like HEIRPORT, ArcheoBlog, Spectrum and OASIS use XML to
generate, render or share files, but only do so under specific and
restricted conditions. XML tools have been relatively slow to develop
partly because the standards upon which they are based have also been
slow to emerge. Moreover, variations in organisational structures and
intellectual traditions mean that such tools and standards often have
only limited relevance and application.
This roundtable is intended as a discussion forum for those
interested in XML for archaeology. A position paper will present a
number of case studies of XML applications, and the management and
strategic context of these applications. It will highlight the
presumed benefits of a wider XML development as against the implied
costs, and will identify possible areas for long term, middle term
and short term development.
Participants will be presented with a number of discussion points to
which they will be asked to respond. The roundtable will end with a
series of recommendations on how XML can be exploited more fully for
archaeology.
Nota Bene: The success of this roundtable and its recommendations
depends in part on the expertise of the whole group present.
Participants will be asked to contribute to this discussion and are
expected to have a grasp of the issues in advance. An expert panel
will present some grounding in the topic, but the hope of the
organisers is to stimulate informed discussion.
Dr William Kilbride
Assistant Director
Archaeology Data Service
Dept of Archaeology
University of York
England YO1 7EP, UK
HEEP: HISTORICAL ENVIRONMENT EXCHANGE PROTOCOL
THE FORUM FOR INFORMATION STANDARDS IN HERITAGE (FISH)
Dr Tyler Bell
Technical Director
Oxford ArchDigital Ltd
The Historical Environment Exchange Protocol:
A CRM-based Web Service for the querying, amalgamation and exchange
of heritage information between heterogeneous data sources.
The Forum for Information Standards in Heritage (FISH) has
commissioned The Historical Environment Exchange Protocol (HEEP) as
part of the FISH Toolkit, a series of XML-based applications designed
specifically for the heritage sector. The Historical Environment
Exchange Protocol forms the core of the Heritage Web Service, an open
standard designed to facilitate interoperability in the heritage
sector. The HEEP will be released in the summer of 2004; all
components of the FISH Toolkit will be available for public use by
September 2004.
The HEEP is a transport-independent architecture which standardises
the manner in which heritage information is queried by client
applications, and the format in which the requested data is
delivered. It also standardises how HEEP-enabled servers report their
capabilities, and the format in which exceptions are reported. The
protocol does not dictate the manner, format or technical platform in
which the data is stored and managed. The Historical Environment
Exchange Protocol simply acts as a platform-independent 'connector'
between heritage datasets and the XML schema used to transport the
data.
The XML schema underlying The Historical Environment Exchange
Protocol is based on a mapping of the MIDAS standard to the CIDOC
Conceptual Reference Model (CRM). MIDAS is a data standard for
historic environment information, developed by English Heritage and
now maintained under the auspices of FISH. The CRM is a next-
generation semantic framework, developed specifically for "describing
the implicit and explicit concepts and relationships used in cultural
heritage documentation"; it is soon to be ratified as an ISO Standard.
FISH is a consortium of UK heritage institutions formed in 2001 to
"co-ordinate, develop, maintain and promote standards for the
recording of heritage information". Contributing Organisations
include The National Trust, English Heritage, The Archaeology Data
Service, The Royal Commission on the Ancient and Historical Monuments
of Scotland, The Museum Documentation Association, and several
others.
The HEEP and other elements of the FISH Toolkit are being developed
by Oxford ArchDigital Ltd. Contributions and comments are welcome
throughout the development process. All questions should be
addressed to the FISH Toolkit Project Manager, Edmund Lee
(edmund.lee@english-heritage.org.uk).
Further Information:
FISH: http://www.fish-forum.info
Oxford ArchDigital: http://oxarchdigital.com
MIDAS: http://www.jiscmail.ac.uk/files/FISH/web_midasintro.htm
The CRM: http://cidoc.ics.forth.gr/index.html
Note that HEEP (Historical Environment Exchange Protocol) was formerly called HEP
(Heritage Exchange Protocol).
AN AMATEUR ARCHAEOLOGIST'S USE OF XML
John Palmer
I would like to describe briefly the XML set-up that I use to
maintain the work-in-progress files for my research on the Roman
Purbeck stone industry.
Origins
This study began in 1996 when I found myself employed only three days
a week but fortunately not suffering a corresponding reduction of
income; I decided to use some of my spare time studying archaeology
at King Alfred's College in Winchester, in which city I had lived for
25 years. For my first long project (over the summer vacation) I put
forward a proposal in these terms:
Proposed area of study:
Shale, stone {and salt} industries of Purbeck in the Roman period
Primary sources:
Dorset County Museum collection
Poole and Wimborne museums
Sites: asking advice from County Museum
Secondary sources:
Royal Commission on Historic Monuments inventory, Dorset South-east,
1952
This being accepted (on the understanding that the braces round
{salt} meant that I would only go into this subject if I ran out of
sources for shale and stone), I spent some time that summer visiting
Dorset and exploring both the field and the library sources for the
Roman stone and shale industries. From this came a paper which was
duly submitted as coursework. (You can read it at
www.palmyra.uklinux.net/purbeck1996.html).
It was about two years after this that (being fully employed once
more) I returned to the subject of the stone industry. By this time I
had dropped the Kimmeridge Shale industry, feeling that it was
already well covered by other workers. (For an introduction to
Kimmeridge Shale, try Calkin 1953.) On the other hand Purbeck Stone
was relatively neglected, the last major review of the subject being
30 years old (Beavis 1970). I determined from the start to put my
provisional findings and working notes on the World Wide Web, so that
others with related interests might note what I was doing and
hopefully make suggestions, corrections and comments and maybe join
in the project. I have certainly never regretted this decision, and I
am very grateful to the people who have shown interest and helped
guide my efforts.
At this point it would be well worth your while to view my current
presentation of the data at www.palmyra.uklinux.net/pur-preface.html.
You should bear in mind that in its present form it is far larger and
more complex than when I began it. At the beginning the matter on the
Web consisted of two files only: the database, being basically a list
of Roman Purbeck stone artefacts, and the bibliography, basically the
reference-list from my 1996 paper augmented by other citations which
I had added since then. These two were quite easy to maintain as HTML
files, being each no more than a few tens of thousands of characters
long.
The database was basically a long unordered list <ul>, containing
items of the following general kind:
<li>
<ul>
<li><b>name</b> ..identifying name of artefact.. </li>
<li><b>site</b> ..where found and when.. </li>
<li><b>publ</b> <a href="..">..reference to publication..</a>
</li>
<li><b>desc</b> ..description.. </li>
<!-- other properties of the artefact added here -->
</ul>
</li>
The bibliography was also a long <ul> but contained items like this
one:
<li>
<a name="Bidwell1979">Bidwell PT 1979</a>,
<em>The legionary Bath House and Basilica and Forum at Exeter</em>,
Exeter City Council and Univ of Exeter:
Exeter Archaeological Reports <b>1</b>
</li>
Most of the hrefs in the database were naturally to items in the
bibliography, but from the start I allowed myself unlimited
references from anywhere in my files, to anywhere else in my own
website and also to other resources on the WWW, as I felt then as now
that these were important guides that would assist my own analysis of
the data and could also be useful to other readers.
Soon these two simple files grew. By early 2000 I had moved to Dorset
and into semi-retirement. I was now in the actual country of my study
and had easy access to the excellent library of the Dorset Natural
History and Archaeological Society, which I had joined back in 1996.
The database had split into several files, such as
Mortars (stone grinding bowls)
Other vessels (baths, basins, etc.)
Other portable artefacts
Roofing tiles of stone
Paving material
Other architectural stone
Inscriptions
Quarry sites
etc.
Moreover the internal organisation of the data on each artefact had
become quite varied. Naturally there were often many publ citations
for each artefact, and often several desc descriptions, sometimes one
from each author cited. Not every artefact even had a distinctive
name; but new properties of items, like map-references, location
(i.e. in what museum), substance (real Purbeck marble, other Purbeck
stone, etc.), and date (1st century, etc.), had been added in many
cases. The number and order of these properties varied greatly, and
this was making it difficult to study and to update the data, which
by now were becoming a resource of some archaeological importance, as
I described in a paper in the Dorset Proceedings (Palmer 2001).
The problem of keeping these data in order was not assisted by the
fact that the syntax of HTML (any version) is designed for specifying
logical subdivisions of a text, but not the significant properties of
any particular kind of subject-matter (such as stone artefacts).
Although I was accustomed to using nsgmls (James Clark) to verify the
conformance of my HTML to the appropriate DTD (document type
definition), I decided that I needed a DTD more closely related to
the subject I was studying.
This DTD took shape in the summer of 2001.
The articles recorded in each file constitute a collection. Each
article in the collection is an item. I allow myself to group the
items by inserting subheads at suitable points in the list, but this
is little more than a presentational device.
<!ELEMENT collection - - ( (subhead | item)* ) >
For convenience in defining elements in the DTD I introduce an
entity:
<!ENTITY % textvar "(#PCDATA|br|em|b|a|code|img)*" >
And also this entity, to allow myself some non-ascii characters:
<!ENTITY % ISOlat1 SYSTEM "/usr/html/sgml-lib/ISOlat1.ent">
%ISOlat1;
A subhead is just a few words:
<!ELEMENT subhead - - (%textvar;) >
An item, however, has a fixed structure in which the subdivisions
always appear in the same order. This to me is an important aid to
reading and understanding the data. (In the old HTML notation there
was nothing to enforce this order.)
<!ELEMENT item - - ( name?,
number?,
cat*,
site+,
grid*,
source*, publ*, desc*,
loc*, subst*, date*,
interp*, comment*, cont* ) >
The meanings and uses of the inner elements are explained on my website.
You'll observe that an item must have a site, but all the other parts
are optional; more than one is allowed of all parts except name,
number and site. (Actually, number is not used at all and is only in
the DTD in case I should want to start cataloguing artefacts in the
style of the great corpuses (corpora?) like RIB (Collingwood and
Wright 1965).)
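The ordering rule can also be illustrated outside the SGML toolchain. What follows is only a minimal sketch in Python, not the method used in this article (where nsgmls validates against the DTD). The function names and the sample items are assumptions made for illustration. The content model is copied from the declaration above and turned into a regular expression over the names of an item's child elements.

```python
import re
import xml.etree.ElementTree as ET

# Content model of <item>, copied from the DTD: (element name, occurrence).
ITEM_MODEL = [
    ("name", "?"), ("number", "?"), ("cat", "*"), ("site", "+"),
    ("grid", "*"), ("source", "*"), ("publ", "*"), ("desc", "*"),
    ("loc", "*"), ("subst", "*"), ("date", "*"),
    ("interp", "*"), ("comment", "*"), ("cont", "*"),
]


def model_to_regex(model):
    """Build a regex that matches a comma-terminated list of child names."""
    pattern = "".join(f"(?:{name},){occ}" for name, occ in model)
    return re.compile(pattern + r"\Z")


ITEM_RE = model_to_regex(ITEM_MODEL)


def check_item(item):
    """Return True if the children of an <item> follow the content model."""
    child_names = "".join(child.tag + "," for child in item)
    return ITEM_RE.match(child_names) is not None


# Hypothetical sample items, not taken from the real data files.
ok = ET.fromstring("<item><name>n</name><site>s</site><desc>d</desc></item>")
wrong_order = ET.fromstring("<item><desc>d</desc><site>s</site></item>")
print(check_item(ok), check_item(wrong_order))
```

An item with its parts out of order, with no site, or with two names is rejected; a real validator such as nsgmls also checks the inner elements, which this sketch does not.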
br, em and b are mere presentational devices and mean what they do in
HTML, i.e. linebreak, emphasise, and bold-face.
<!ELEMENT br - - (#PCDATA) --will normally be empty-->
<!ELEMENT em - - (%textvar;) >
<!ELEMENT b - - (%textvar;) >
a corresponds to its namesake in HTML and has some of the same
attributes. It is a bit old-fashioned in using "name" rather than
"id" for the label that is the target of a link.
<!ELEMENT a - - (%textvar;) >
<!ATTLIST a
href CDATA #IMPLIED
name CDATA #IMPLIED
target CDATA #IMPLIED >
"target" is another merely presentational device: as in HTML, it
hints to the displaying program that it is worth opening a secondary
window. code is also presentational and corresponds to its namesake
in HTML.
<!ELEMENT code - - (%textvar;) >
img introduces a picture, as in HTML.
<!ELEMENT img - - (#PCDATA) --will normally be empty-->
All the elements listed above from br to img can be used inside any
of the elements listed below, which are the main categories of
information about an item. For the meaning and use of the latter, see
my website at http://www.palmyra.uklinux.net/.
<!ELEMENT name - - (%textvar;) >
<!ELEMENT number - - (%textvar;) >
<!ELEMENT cat - - (%textvar;) >
<!ELEMENT site - - (%textvar;) >
<!ELEMENT grid - - (%textvar;) >
<!ELEMENT source - - (%textvar;) >
<!ELEMENT publ - - (%textvar;) >
<!ELEMENT desc - - (%textvar;) >
<!ELEMENT loc - - (%textvar;) >
<!ELEMENT subst - - (%textvar;) >
<!ELEMENT date - - (%textvar;) >
<!ELEMENT interp - - (%textvar;) >
<!ELEMENT comment - - (%textvar;) >
<!ELEMENT cont - - (%textvar;) >
<!--finis-->
(The above data-structure is sufficiently restrictive for my purpose,
which was to help me to be regular and consistent in the recording of
my data. Observant eyes will note that it does permit me to do some
things that make little sense, for instance to put one a element
inside another, or to insert some textual content into br or img
elements. However I feel no inclination to do these things and don't
need the added complication of the code necessary to forbid them.)
Having chosen a data-structure, the first problem was to convert the
existing HTML data to the new form, bearing in mind that the
component parts of each item had to be forced into a new order to fit
the restrictions of the new DTD. There are many ways of doing this,
and if mine seems odd, the reader should bear in mind that I was
familiar with programming in Perl and inclined to stick to the
techniques that I knew best.
My ad-hoc program html2xml reads the HTML data and converts to the
new DTD; it uses the SGML parser nsgmls to convert the HTML to a
canonical form and creates a structure of Perl objects corresponding
to the elements of the HTML; these are then picked off in the
appropriate order to create new items with correctly ordered inner
parts. Apart from the time in 2001 when I first introduced the new
DTD, I have not used my html2xml again except on one occasion when I
removed (deleted) one of my XML files by mistake!
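The reordering step at the heart of such a conversion can be sketched briefly. The following is a minimal Python illustration, not the html2xml program itself (which is written in Perl and relies on nsgmls to parse the HTML). It assumes the old HTML item has already been reduced to a list of (label, text) pairs holding plain text; the function names and sample values are hypothetical.

```python
from xml.sax.saxutils import escape

# Order of the parts of an item, as fixed by the DTD.
DTD_ORDER = ["name", "number", "cat", "site", "grid", "source", "publ",
             "desc", "loc", "subst", "date", "interp", "comment", "cont"]
RANK = {tag: position for position, tag in enumerate(DTD_ORDER)}


def reorder(properties):
    """Sort (label, text) pairs into DTD order.

    sorted() is stable, so several publ or desc entries keep the
    relative order they had in the old HTML file.
    """
    return sorted(properties, key=lambda pair: RANK[pair[0]])


def to_item(properties):
    """Write the reordered pairs out as one <item> element."""
    lines = ["<item>"]
    for label, text in reorder(properties):
        # Assumption: text is plain, so markup characters are escaped.
        lines.append(f"<{label}>{escape(text)}</{label}>")
    lines.append("</item>")
    return "\n".join(lines)


# Hypothetical properties in the haphazard order of an old HTML item.
old = [("desc", "A"), ("site", "S"), ("publ", "P1"),
       ("name", "N"), ("publ", "P2")]
print(to_item(old))
```

A label that is not in the DTD raises a KeyError here, which is a reasonable behaviour for a one-off conversion: unknown properties have to be dealt with by hand.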
I now had my data stored in XML in my new DTD in files called *.xml.
From summer 2001 onwards, all amendments and additions to the data
have been made by editing the XML files; this has kept a degree of
discipline in my data which was hard to achieve using raw HTML. Of
course, every time I amend an XML file, I have to ensure good order
by validating it against the DTD described above; I do this with
nsgmls, which is so quick and convenient I can use it many times over
within a single data-entry session.
I have not attempted to put my XML on the Web directly, as I think it
is important not to assume that all my readers will be using the very
latest in Web-browsing software! In fact, after amending any of my
XML master-files, I create a corresponding file in HTML by means of a
Perl program which goes by the name updatehtml. (Although this
program will produce correct HTML provided that the master-file is
correct XML, I occasionally verify the generated HTML using nsgmls.)
The automatically-generated HTML is, at the time of writing, XHTML
1.0.
The conversion to HTML is much simpler than the conversion out of it,
for it involves little more than a succession of string-
substitutions, the style of which will be familiar to anyone who has
used Perl or any of its antecedent programs like sed or vi. The
program works on tags, not on elements, which is satisfactory in this
case provided that matching operations are performed on both the
start- and the end-tag for the same element.
For instance, <collection> becomes <ul>:
$_ =~ s/<collection>/<ul>/;
$_ =~ s/<\/collection>/<\/ul>/;
<item> becomes a <ul> inside a <li>:
$_ =~ s/<item>/<li>\ <ul>/;
$_ =~ s/<\/item>/<\/ul><\/li>/;
The various parts of an item are all treated alike: first the start-
tags:
$_ =~ s/<name> */<li><b>name<\/b> /;
$_ =~ s/<number> */<li><b>number<\/b> /;
$_ =~ s/<cat> */<li><b>cat<\/b> /;
$_ =~ s/<site> */<li><b>site<\/b> /;
$_ =~ s/<grid> */<li><b>grid<\/b> /;
$_ =~ s/<source> */<li><b>source<\/b> /;
$_ =~ s/<publ> */<li><b>publ<\/b> /;
$_ =~ s/<desc> */<li><b>desc<\/b> /;
$_ =~ s/<loc> */<li><b>loc<\/b> /;
$_ =~ s/<subst> */<li><b>subst<\/b> /;
$_ =~ s/<date> */<li><b>date<\/b> /;
$_ =~ s/<interp> */<li><b>interp<\/b> /;
$_ =~ s/<comment> */<li><b>comment<\/b> /;
$_ =~ s/<cont> */<li><b>cont<\/b> /;
and the end-tags:
$_ =~ s/<\/name>/<\/li>/;
$_ =~ s/<\/number>/<\/li>/;
$_ =~ s/<\/cat>/<\/li>/;
$_ =~ s/<\/site>/<\/li>/;
$_ =~ s/<\/grid>/<\/li>/;
$_ =~ s/<\/source>/<\/li>/;
$_ =~ s/<\/publ>/<\/li>/;
$_ =~ s/<\/desc>/<\/li>/;
$_ =~ s/<\/loc>/<\/li>/;
$_ =~ s/<\/subst>/<\/li>/;
$_ =~ s/<\/date>/<\/li>/;
$_ =~ s/<\/interp>/<\/li>/;
$_ =~ s/<\/comment>/<\/li>/;
$_ =~ s/<\/cont>/<\/li>/;
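Because every part of an item is treated alike, the two lists of substitutions can be driven from a single table. The following is a minimal sketch of that idea in Python rather than Perl; it is not the updatehtml program, and the function name is an assumption. One difference is worth noting: the Perl substitutions shown above lack the g flag and so change only the first match on a line, whereas this sketch changes every match.

```python
import re

# The parts of an item, in DTD order.
PARTS = ["name", "number", "cat", "site", "grid", "source", "publ",
         "desc", "loc", "subst", "date", "interp", "comment", "cont"]


def xml_line_to_html(line):
    """Rewrite the tags on one line of the XML master-file as HTML."""
    line = line.replace("<collection>", "<ul>")
    line = line.replace("</collection>", "</ul>")
    line = line.replace("<item>", "<li> <ul>")
    line = line.replace("</item>", "</ul></li>")
    for part in PARTS:
        # A start-tag and any spaces after it become a labelled list item.
        line = re.sub(f"<{part}> *", f"<li><b>{part}</b> ", line)
        # The matching end-tag closes the list item.
        line = line.replace(f"</{part}>", "</li>")
    return line


print(xml_line_to_html("<site>  Example villa</site>"))
```

The sample line is invented. As in the original, the work is done on tags rather than on elements, so the output is only as well-formed as the input.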
As hinted before, I try to remain compatible with older browsers
while not neglecting new W3C recommendations, so I ensure that each a
element that is the target of a link has both "id" and "name"
attributes, both with the same value:
$_ =~ s/ name=(".*?")/ id=$1 name=$1/g; # 2003-01-14
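For readers more at home in Python, the same substitution might be written as follows. This is a sketch only; the function name is an assumption, and the sample anchor is the one from the bibliography example earlier in this article.

```python
import re


def add_id(line):
    """Give every name="..." attribute a matching id="..." attribute."""
    return re.sub(r' name=(".*?")', r' id=\1 name=\1', line)


print(add_id('<a name="Bidwell1979">Bidwell PT 1979</a>'))
```

The substitution is not idempotent: applied twice to the same line it inserts a second id attribute, so it suits a program that always regenerates the HTML from the XML master-file.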
It remains for the program to copy the front- and back-matter from
the old version of the HTML file, changing only minor details (most
importantly the date of revision wherever it appears).
Spotmaps
The front matter of many of my HTML files includes a sketch-map of
the province of Britannia indicating the geographical distribution of
the relevant class of artefacts. This is generated from the XML files
in the following way: a program spotmap scans the file for grid-
references (element grid), and generates TeX code that places a
suitable symbol at the appropriate spot on the map according to the
National Grid. The map is then drawn and annotated using TeX,
including a coastal outline, which was obtained from the website of
the (United States) National Oceanic and Atmospheric Administration
and is stated by NOAA to be in the public domain.
(Owing to differences of geographic projection between these data and
the British National Grid, there may be small errors in the placement
of some points on the maps. This will ultimately put a limit to the
usability of the NOAA coastline data. Coastlines on a true National
Grid basis can be obtained from the British Ordnance Survey, but at
present I prefer to avoid their licensing procedures and possible
charges.)
TeX is of course the typesetting program devised by Donald Knuth. For
an introduction try http://www.tug.org/, the site of the TeX User
Group.
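The scanning step described in this section can be sketched as follows. This is a Python illustration under several stated assumptions, not the spotmap program itself: it assumes that each grid element holds a lettered National Grid reference such as "SY 958 820" (a made-up example), that the map is drawn in a LaTeX picture environment with one unit to 10 km measured from the National Grid origin, and that a filled circle is the symbol wanted. It makes no correction for the projection of the coastline data.

```python
import re
import xml.etree.ElementTree as ET


def grid_to_metres(ref):
    """Convert a reference like 'SY 958 820' to (easting, northing) in metres."""
    ref = "".join(ref.split()).upper()
    match = re.fullmatch(r"([A-HJ-Z]{2})(\d{0,10})", ref)
    if not match or len(match.group(2)) % 2:
        raise ValueError(f"not a grid reference: {ref!r}")
    letters, digits = match.groups()
    # Each letter indexes a 5 x 5 block of squares; the letter I is not used.
    l1, l2 = (ord(c) - ord("A") for c in letters)
    if l1 > 7:
        l1 -= 1
    if l2 > 7:
        l2 -= 1
    e100km = ((l1 - 2) % 5) * 5 + l2 % 5
    n100km = 19 - (l1 // 5) * 5 - l2 // 5
    half = len(digits) // 2
    scale = 10 ** (5 - half)  # e.g. three digits are units of 100 m
    east = e100km * 100_000 + int(digits[:half] or 0) * scale
    north = n100km * 100_000 + int(digits[half:] or 0) * scale
    return east, north


def spot_commands(xml_text, km_per_unit=10):
    """Return one picture-mode \\put command per <grid> element."""
    commands = []
    for grid in ET.fromstring(xml_text).iter("grid"):
        east, north = grid_to_metres("".join(grid.itertext()))
        x = east / 1000 / km_per_unit
        y = north / 1000 / km_per_unit
        commands.append(f"\\put({x:.2f},{y:.2f}){{\\circle*{{0.5}}}}")
    return commands


sample = "<collection><item><site>x</site><grid>SY 958 820</grid></item></collection>"
print("\n".join(spot_commands(sample)))
```

The sample reference lies 395.8 km east and 82 km north of the grid origin, so the command printed is \put(39.58,8.20){\circle*{0.5}}.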
Printable versions
Just a few words on the most recent enhancement. Besides the Web-
presentation of my data I need a printed version in a ring-binder,
which I can carry about with me and refer to when working in a
library or in the field. I began by using the printing facilities of
my Web-browser, but rapidly felt the need for something that would
rewrite the data in a more compact form. I now have another Perl
program that rewrites the XML data as input to LaTeX, which gives a
more compact layout than anything I've managed to achieve using a
Web-browser; it reduced the thickness of the file I sometimes carry
about by about a half.
(LaTeX is an application of TeX, invented by Leslie Lamport (LaTeX, a
document preparation system, 2nd ed., Addison-Wesley 1996) which has
been much extended by the later contributions of users.)
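A conversion of this kind might look like the following minimal sketch. It is written in Python, not Perl, and is not the program described above; the choice of a LaTeX description list, the function names and the sample item are all assumptions. Inline markup such as em, b and a is simply flattened to text.

```python
import xml.etree.ElementTree as ET

# Characters that LaTeX treats specially, with safe replacements.
LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def tex_escape(text):
    """Escape special characters one at a time, avoiding double escaping."""
    return "".join(LATEX_SPECIALS.get(ch, ch) for ch in text)


def item_to_latex(item):
    """Render one <item> as a compact LaTeX description list."""
    lines = [r"\begin{description}"]
    for part in item:
        # Join the text of the part and collapse runs of whitespace.
        text = " ".join("".join(part.itertext()).split())
        lines.append(rf"\item[{part.tag}] {tex_escape(text)}")
    lines.append(r"\end{description}")
    return "\n".join(lines)


# Hypothetical item, not taken from the real data files.
item = ET.fromstring(
    "<item><site>Example &amp; Co</site>"
    "<desc>50% <em>complete</em></desc></item>")
print(item_to_latex(item))
```

A fuller version would map em and b to \emph and \textbf instead of discarding them.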
One word about XSL, XSLT and all that: I feel somewhat guilty about
not having used them but I really haven't yet felt the need. I find
that as I'm reasonably fluent in Perl, and my source XML has a very
simple structure, I can more easily make an ad-hoc program in Perl to
convert the XML to whatever I want. One incidental benefit is that I
can even carry the comments in the XML over into the output file!
Future developments
At the moment my bibliography is kept as a simple HTML file which is
hand-edited rather than created from a source file in some other
notation. This has been satisfactory up till now, but the increasing
size of the bibliography (it now holds over 600 citations) makes me
think of improvements.
The ideal would be to rewrite the bibliography as a BibTeX database,
from which I could generate
1. a full listing typeset with LaTeX,
2. an HTML version of the above, for presentation on the Web, and
3. a list of references for any paper I may write (using LaTeX of
course), in whatever style I (or the journal I was aiming at) wanted.
The main thing that has made me defer this plan till now is that
converting the bibliography from the present HTML form to BibTeX is
not straightforward and cannot be done by a simple conversion
program; the problem is analogous to converting a page-description
(such as a wordprocessor document or a PostScript file) to a
logically structured notation (such as LaTeX, or a relational
database). Probably I should take this task in hand before the
bibliography becomes any larger!
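To show what the target notation looks like, here is a minimal Python sketch that writes one BibTeX entry from data that has already been separated into fields. It does not attempt the difficult part, which is extracting those fields from the HTML. The function name, the entry type and the mapping of the Bidwell citation onto BibTeX fields are assumptions made for illustration.

```python
def bibtex_entry(key, entry_type, fields):
    """Format one BibTeX entry; fields maps field names to values."""
    body = ",\n".join(f"  {name} = {{{value}}}"
                      for name, value in fields.items())
    return f"@{entry_type}{{{key},\n{body}\n}}"


# The Bidwell citation from the bibliography example, split by hand.
print(bibtex_entry("Bidwell1979", "book", {
    "author": "Bidwell, P. T.",
    "year": "1979",
    "title": "The legionary Bath House and Basilica and Forum at Exeter",
    "publisher": "Exeter City Council and Univ of Exeter",
    "series": "Exeter Archaeological Reports",
    "number": "1",
}))
```

The key reuses the anchor name already present in the HTML bibliography, so existing links of the form #Bidwell1979 could be carried over.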
END OF NEWSLETTER 3
This newsletter is copyright © Mark Bell and the individual authors, 2004.
Please contact the editor before reproducing material from this newsletter.