Writing Descriptions

Robert J. Glushko

Suppose that I am organizing books, and I have decided that it is important for the purposes of this organizing to know the title of each book and how many pages it has. Before me I have a book, which I examine to determine that its title is Die Ringe des Saturn and it has 371 pages. The example “Basic ways of writing part of a book description” lists a few of the ways to write this description. Let us examine these various forms of writing to see what they have in common and where they differ.

Basic Ways Of Writing Part Of A Book Description

The title is Die Ringe des Saturn and it has 371 pages.

{ "book": {"title":"Die Ringe des Saturn","pages":371} }

<book pages="371"> <title>Die Ringe des Saturn</title> </book>

<div class="book">The title is  <span class="title">Die Ringe des Saturn</span> and it has <span class="pages">371 pages.</span> </div>

<http://lccn.loc.gov/96103072> <http://rdvocab.info/Elements/title> "Die Ringe des Saturn"@de ; <http://rdvocab.info/Elements/extentOfText> "371 p." .

We examine the notations, writing systems and syntax of each of these description forms, and others, in the following sections.

Notations

First, let us look at the actual marks on the page. To write you must make marks or—more likely—select from a menu of marks using a keyboard. In either case, you are using a notation: a set of characters with distinct forms.^[1] The Latin alphabet is a notation, as are Arabic numerals. Some more exotic notations include the symbols used for editorial markup and alchemical symbols.^[2] The characters in a notation usually have an ordering. Arabic numerals are ordered 1 2 3 and so on. English-speaking children usually learn the ordering of the Latin alphabet in the form of an alphabet song.^[3]

A character may belong to more than one notation. The examples in “Basic ways of writing part of a book description” use characters from a few different notations: the letters of the Latin alphabet, Arabic numerals, and a handful of auxiliary marks, including . { } " : < > / and $. Collectively, all of these characters (alphabet, numerals, and auxiliary marks) also belong to a notation called the American Standard Code for Information Interchange (ASCII).^[4]

ASCII is an example of a notation that has been codified and standardized for use in a digital environment. A traditional notation like the Latin alphabet can withstand a certain degree of variation in the form of a particular mark. Two people might write the letter A rather differently, but as long as they can mutually recognize each other’s marks as an “A,” they can successfully share a notation. Computers, however, cannot easily accommodate such variation. Each character must be strictly defined. In the case of ASCII, each character is given a number from 0 to 127, so that there are 128 ASCII characters.^[5] When using a computer to type ASCII characters, each key you press selects a character from this “menu” of 128 characters. A notation that has had numbers assigned to its characters is called a character encoding.

ASCII

	0	1	2	3	4	5	6	7
0	NUL	DLE	space	0	@	P	`	p
1	SOH	DC1	!	1	A	Q	a	q
2	STX	DC2	"	2	B	R	b	r
3	ETX	DC3	#	3	C	S	c	s
4	EOT	DC4	$	4	D	T	d	t
5	ENQ	NAK	%	5	E	U	e	u
6	ACK	SYN	&	6	F	V	f	v
7	BEL	ETB	'	7	G	W	g	w
8	BS	CAN	(	8	H	X	h	x
9	HT	EM	)	9	I	Y	i	y
A	LF	SUB	*	:	J	Z	j	z
B	VT	ESC	+	;	K	[	k	{
C	FF	FS	,	<	L	\	l	|
D	CR	GS	-	=	M	]	m	}
E	SO	RS	.	>	N	^	n	~
F	SI	US	/	?	O	_	o	DEL
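
The table can be read as a lookup: the column heading gives the first hexadecimal digit of a character's number, and the row label gives the second. The following is a minimal Python sketch of that reading, using only the built-in ord and chr functions. The helper name ascii_cell is our own, introduced purely for illustration.

```python
def ascii_cell(ch):
    """Return the (column, row) labels that locate an ASCII character in the table."""
    code = ord(ch)  # the number assigned to the character
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    column, row = divmod(code, 16)  # high and low hexadecimal digits
    return format(column, "X"), format(row, "X")


print(ord("A"))         # 65
print(ascii_cell("A"))  # ('4', '1')
print(chr(0x7B))        # {  (column 7, row B)
```

Going from a mark to its number (ord) and from a number back to a mark (chr) is all that "selecting from the menu" amounts to once a notation has been encoded.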

The most ambitious character encoding in existence is Unicode, which as of version 6.0 assigns numbers to 109,449 characters.^[6] Unicode makes the important distinction between characters and glyphs. A character is the smallest meaningful unit of a written language. In alphabet-based languages like English, characters are letters; in languages like Chinese, characters are ideographs. Unicode treats all of these characters as abstract ideas (Latin capital A) rather than specific marks (the various shapes an A takes in different typefaces). A specific mark that can be used to depict a character is a glyph. A font is a collection of glyphs used to depict some set of characters. A Unicode font explicitly associates each glyph with a particular number in the Unicode character encoding. The inability of computers to use contextual understanding to bridge the gap between various glyphs and the abstract character depicted by those glyphs turns out to have important consequences for organizing systems.

Different notations may include very similar marks. For example, modern music notation includes marks for indicating the pitch of a note, known as accidentals. One of these music notation marks is ♯ (“sharp”). The sharp sign looks very much like the symbol used in English as an abbreviation for the word number, as in We’re #1!^[7] If you were to write a sharp sign and a number sign by hand, they would probably look identical. In a non-digital environment, we would rely on context to understand whether the written mark was being used as part of music notation, or mathematical notation, or as an English abbreviation.

Computers, however, have no such intuitive understanding of context. Unicode encodes the number sign and the sharp sign as two different characters. As far as a computer using Unicode is concerned, ♯ and # are completely different, and the fact that they have similar-looking glyphs is irrelevant. That is a problem if, for example, a cataloger has carefully described a piece of music by correctly using the sharp sign, but a person looking for that piece of music searches for descriptions using the number sign (since that is what you get when you press the keyboard button with the symbol that most closely resembles a sharp sign).^[8]
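
A short Python sketch makes the mismatch concrete. It uses the standard unicodedata module to print the number and official name of each character; the catalog entry is a hypothetical description invented for this illustration.

```python
import unicodedata

number_sign = "#"      # what a standard keyboard produces
sharp_sign = "\u266f"  # the music notation character

for ch in (number_sign, sharp_sign):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Hypothetical catalog description that correctly uses the sharp sign.
catalog_entry = "Nocturne in C\u266f minor"

print("C#" in catalog_entry)       # False: the searcher typed a number sign
print("C\u266f" in catalog_entry)  # True: only the sharp sign matches
```

An organizing system that wants both searches to succeed has to map one character to the other explicitly, because nothing in the encoding relates them.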

Writing Systems

A writing system employs one or more notations, and adds a set of rules for using them. Most writing systems assume knowledge of a particular human language. These writing systems are known as glottic writing systems. But there are many writing systems, such as mathematical and musical ones, that are not tied to human languages in this way. Many of the writing systems used for describing resources belong to this latter group, meaning that (at least in principle) they can be used with equal facility by speakers of any language.

Glottic writing systems, being grounded in natural human languages, are difficult to describe precisely and comprehensively. Non-glottic writing systems, on the other hand, can be described precisely and comprehensively using an abstract model. That is the connection between the structural perspective taken in the previous section, and the textual perspective taken in this section. A non-glottic writing system is described by a particular metamodel, and structures that fit within the constraints of a given metamodel can be textually represented using one or more writing systems that are described by that metamodel.

Some writing systems are closely identified with specific metamodels. For example, XML and JSON are both 1) metamodels for structuring information and 2) writing systems for textually representing information. In other words, they specify both the abstract structure of a description and how to write it down. It is possible to conceive of other ways to textually represent the structure of these metamodels, but for each of these metamodels just one writing system has been standardized.^[9]

RDF, on the other hand, is only a metamodel, not a writing system. RDF only defines an abstract structure, not how to write that structure. So how do we write information that is structured as RDF? It turns out that we have many choices. Unlike XML and JSON, several different writing systems for the RDF metamodel have been standardized, including N-Triples, Turtle, RDFa, and RDF/XML.^[10] Each of these is a writing system that is abstractly described by the RDF metamodel.

Writing systems provide rules for arranging characters from a notation into meaningful structures. A character in a notation has no inherent meaning. Characters in a notation only take on meaning in the context of a writing system that uses that notation. For example: what does the letter I from the Latin alphabet mean? That question can only be answered by looking at how it is being used in a particular writing system. If the writing system is American English, then whether I has a meaning depends on whether it is grouped with other letters or whether it stands alone. Only in the latter case does it have an assignable meaning. However, in the arithmetic writing system of ancient Rome, which also uses the letters of the Latin alphabet as its notation, I has a different meaning: one.

This example also serves to illustrate how the ordering of a notation can differ from the ordering of a writing system that uses that notation. According to the ordering of the Latin alphabet, the twelfth letter L comes before the twenty-second letter V. But in the Roman numeric writing system, V (the number 5) comes before L (the number 50). Unless we know which ordering we are using, we cannot arrange L and V “in order.”^[11]

Roman Numerals

Roman Number	Arabic Number
I	1
V	5
X	10
L	50
C	100
D	500
M	1000
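
The two orderings can be compared directly. In this minimal Python sketch, the dictionary simply restates the table above; sorting without a key follows the ordering of the notation, while sorting by value follows the ordering of the Roman numeric writing system.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

symbols = ["V", "L", "X", "I"]

# Ordering of the notation: the position of each letter in the alphabet.
by_alphabet = sorted(symbols)

# Ordering of the writing system: the number each symbol stands for.
by_value = sorted(symbols, key=ROMAN_VALUES.get)

print(by_alphabet)  # ['I', 'L', 'V', 'X']
print(by_value)     # ['I', 'V', 'X', 'L']
```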

This kind of difference in ordering can arise in more subtle ways as well. When we alphabetically order names, we first compare the first character of each name, and arrange them according to the ordering of the writing system. The first known use of alphabetical ordering was in the Library of Alexandria about two thousand years ago, when Zenodotus arranged the collection according to the first letter of resource names.^[12] If the first characters of two names are the same, we compare the second character, and so on. We can also apply this same kind of ordering procedure to sequences of numerals. If we do, then 334 will come before 67, because 3 (the first character of the first sequence) comes before 6 (the first character of the second sequence) according to the ordering of our notation (Arabic numerals). However, it is more common when ordering sequences of numerals to treat them as decimal numbers, and thus to use the ordering imposed by the decimal system. In the decimal writing system, 67 precedes 334, since the latter is a greater number.

This difference is important for organizing systems. Computers will sort values differently depending on whether they are treating sequences of numerals as numbers or just as sequences. Some organizing systems mix multiple ways of ordering the same characters. For example, Library of Congress call numbers have four parts, and sequences of Arabic numerals can appear in three of them. In the second part, indicating a narrow subject area, and fourth part, indicating year of publication, sequences of numerals are treated as numbers and ordered according to the decimal system. In the third part, however, sequences of numerals are treated as sequences and ordered “notationally” as in the example above (334 before 67).
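
The same contrast appears whenever a program sorts numerals. The sketch below, in Python, sorts one list twice: once as character sequences and once as decimal numbers. The sample values are chosen only for illustration.

```python
values = ["334", "67", "1000", "9"]

# Treated as sequences: compared character by character, left to right.
as_sequences = sorted(values)

# Treated as numbers: compared by their decimal value.
as_numbers = sorted(values, key=int)

print(as_sequences)  # ['1000', '334', '67', '9']
print(as_numbers)    # ['9', '67', '334', '1000']
```

Which result is "in order" depends entirely on which writing system the values are assumed to belong to, so a description scheme needs to say which one applies to each field.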

Differences in ordering demonstrate just one way that multiple writing systems may use the same notation differently. For example, the American English and British English writing systems both use the same Latin alphabet, but impose slightly different spelling rules.^[13] The Japanese writing system employs a number of notations, including traditional Chinese characters (kanji) as well as the Latin alphabet (rōmaji). Often, writing systems do not share the same exact notation but have mostly overlapping notations. Many European languages, for example, extend the Latin alphabet with characters such as Å and Ü that add additional marks, known as diacritics, to the basic characters.^[14]

In organizing systems it is often necessary to represent values from one writing system in another writing system that uses a different notation, a process known as transliteration. For example, early computer systems only supported the ASCII notation, so text from writing systems that extend the Latin alphabet had to be converted to ASCII, usually by removing (or sometimes transliterating) diacritics. This made the non-ASCII text usable in an ASCII-based computerized organizing system, at the expense of information loss.
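
One common way to remove diacritics can be sketched in Python with the standard unicodedata module: decompose each character into a base letter plus combining marks, then drop the marks. This is only an illustration of the idea, not a standard transliteration procedure, and the function name strip_diacritics is our own.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after splitting letters from their diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks such as accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Å Ü rōmaji"))  # A U romaji
```

The information loss is visible in the output: once the marks are gone, nothing records that they were ever there. Letters that do not decompose into a base letter plus a mark, such as ø or ß, pass through this sketch unchanged and would need an explicit substitution table.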

Even in modern computer systems that support Unicode, however, transliteration is often needed to support organizing activities by users who cannot read text written using its original system. The Library of Congress and the American Library Association provide standard procedures for transliterating text from over sixty different writing systems into the (extended) Latin alphabet.

Syntax

The examples in “Basic ways of writing part of a book description” express the same information using different writing systems. The examples use the same notation (ASCII) but differ in their syntax: the rules that define how characters can be combined into words and how words can be combined into higher-level structures.^[15]

Consider the first entry: “The title is Die Ringe des Saturn and it has 371 pages.” The leading capital letter and the period ending this sequence of characters indicate to us that this is a sentence. This sentence is one way we might use the English writing system to express two statements about the book we are describing. A statement is one distinct fact or piece of information. In glottic writing systems like English, there is usually more than one sentence we could write to express the same statement. For example, instead of “it has 371 pages” we might have written “the number of pages is 371.” English writing also enables us to construct complex sentences that express more than one statement.^[16]

In contrast, when we create descriptions of resources in an organizing system, we generally use non-glottic writing systems in which each sentence only expresses a single statement, and there is just one way to write a sentence that expresses a given statement.^[17] These restrictions make these writing systems less expressive, but simplify their use. In particular, since there is a one-to-one correspondence between sentences and statements, we can drop the distinction and just talk about the statements of a description.

Now we return to our example and look at the structure of the statement “The title is Die Ringe des Saturn and it has 371 pages.” Spaces are used to separate the text into words, and English syntax defines the functions of those words. The verb “is” in this statement functions to link the word “title” to the phrase “Die Ringe des Saturn.” This is typical of the kind of statements found in a resource description. Each statement identifies and describes some aspect of the resource. In this case, the statement attributes the value Die Ringe des Saturn to the property title.

As we saw when we looked at description structures, we can analyze descriptions as involving properties of resources and their corresponding values or content. In a writing system like English, it is not always so straightforward to determine which words refer to properties and which refer to values. (This is why blobs are not ideal description structures.) Writing systems designed for expressing resource descriptions, on the other hand, usually define syntax that makes this determination easier. In our dictionary examples above, we used an arrow character → to indicate the relationship between properties and values.

This ease of distinguishing properties and values comes at a price, however. The syntax of English is forgiving: we can read a sentence with somewhat garbled syntax such as “371 pages it has” and often still make out its meaning.^[18] This is usually not the case with writing systems intended for expressing resource descriptions. These systems strictly define their rules for how characters can be combined into higher-level structures. Structures that follow the rules are well formed according to that system.

Consider the second entry in “Basic ways of writing part of a book description”:
```
{ "book": {"title":"Die Ringe des Saturn","pages":371} }
```
This fragment is written in JSON. As explained earlier in this chapter, JSON is a metamodel for structuring information using lists and dictionaries. But JSON is also a writing system, which borrows its syntax from JavaScript. The JSON syntax uses brackets to textually represent lists [1,2,3] and braces to textually represent dictionaries {"title":"Die Ringe des Saturn", "pages":371}. Within braces, the colon character : is used to link properties with their values, much as the verb “is” was used in the previous example. So "pages":371 is a statement assigning the value 371 to the property pages.
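
The strictness of JSON syntax is easy to observe with a parser. The following Python sketch uses the standard json module; JSON requires every property name to be written inside double quotes, so the second string, which leaves one name unquoted, is rejected as not well formed. The variable names are ours.

```python
import json

well_formed = '{ "book": {"title":"Die Ringe des Saturn","pages":371} }'
not_well_formed = '{ book: {"title":"Die Ringe des Saturn","pages":371} }'

description = json.loads(well_formed)
print(description["book"]["title"])      # Die Ringe des Saturn
print(description["book"]["pages"] + 1)  # 372, since 371 was read as a number

try:
    json.loads(not_well_formed)
    rejected = False
except json.JSONDecodeError as error:
    rejected = True
    print("not well formed:", error.msg)
```

A human reader has no trouble with the second string, which is exactly the difference between a forgiving glottic writing system and a strictly defined one.
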
The third fragment is written in XML.
```
<book pages="371"> <title>Die Ringe des Saturn</title> </book>
```
Like JSON, XML is a metamodel and also a writing system. Here we have XML elements and attributes. XML elements are textually represented as tags that are marked using the special characters <, > and /. So, this fragment of XML consists of a book element with a child element, title, and a pages attribute, each of which has some text content. In this case, pages="371" is a statement assigning the value 371 to the property pages. The difference in syntax from JSON is subtle: quotation marks surround the value, and an equals sign = rather than a colon links the property to its value.
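
A parser shows how these statements are recovered from the text. This is a minimal Python sketch using the standard xml.etree.ElementTree module; the second call omits the quotation marks around the attribute value to illustrate what happens when a fragment is not well formed.

```python
import xml.etree.ElementTree as ET

fragment = '<book pages="371"> <title>Die Ringe des Saturn</title> </book>'
book = ET.fromstring(fragment)

print(book.tag)                 # book
print(book.get("pages"))        # 371 (attribute values are always text)
print(book.find("title").text)  # Die Ringe des Saturn

try:
    ET.fromstring('<book pages=371> <title>Die Ringe des Saturn</title> </book>')
    rejected = False
except ET.ParseError as error:
    rejected = True
    print("not well formed:", error)
```

Unlike the JSON fragment, where 371 is written as a number, the XML attribute value is plain text; deciding that it denotes a number is left to whoever interprets the description.
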
The fourth is a fragment of HTML.
```
<div class="book">The title is  <span class="title">Die Ringe des Saturn</span> and it has <span class="pages">371 pages.</span> </div>
```
The writing system that HTML employs is close enough to XML to ignore any differences in syntax. In this example, the CLASS attribute contains the property name and the property value is the element content.
The fifth entry is a fragment of Turtle, one of the writing systems for RDF.
```
<http://lccn.loc.gov/96103072> <http://rdvocab.info/Elements/title> "Die Ringe des Saturn"@de ; <http://rdvocab.info/Elements/extentOfText> "371 p." .
```
Turtle provides a syntax for writing down RDF triples. Each triple consists of a subject, predicate, and object separated by spaces. Recall that RDF uses URIs to identify subjects, predicates, and some objects; these URIs are written in Turtle by enclosing them in angle brackets < >. Triples are separated by period . characters, but triples that share the same subject can be written more compactly by writing the subject only once, and then writing the predicate and object of each triple, separated by a semicolon ; character. This is what we see in the fragment above: two triples that share a subject.
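
The relationship between the long and compact forms can be sketched in a few lines of Python. This is not a Turtle parser or serializer; it only formats triples whose terms are already written in Turtle syntax, and the function names long_form and compact_form are our own.

```python
from itertools import groupby

# Each triple is (subject, predicate, object), with terms already in Turtle syntax.
triples = [
    ("<http://lccn.loc.gov/96103072>",
     "<http://rdvocab.info/Elements/title>",
     '"Die Ringe des Saturn"@de'),
    ("<http://lccn.loc.gov/96103072>",
     "<http://rdvocab.info/Elements/extentOfText>",
     '"371 p."'),
]


def long_form(triples):
    """Write every triple in full, each ended by a period."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def compact_form(triples):
    """Write each subject once, joining its predicate-object pairs with ';'."""
    lines = []
    # groupby only merges adjacent triples, so the input is assumed to be
    # ordered by subject.
    for subject, group in groupby(triples, key=lambda t: t[0]):
        pairs = " ; ".join(f"{p} {o}" for _, p, o in group)
        lines.append(f"{subject} {pairs} .")
    return "\n".join(lines)


print(long_form(triples))
print(compact_form(triples))
```

Both outputs describe the same two-triple RDF structure; only the way it is written down differs.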

The two fragments in “Writing part of a book description in Semantic XML” demonstrate XML namespaces, using terms from the Dublin Core and DocBook vocabularies, and show how readily XML supports semantic encoding of resource descriptions.

Writing Part Of A Book Description In Semantic XML

<book xmlns:dc="http://purl.org/dc/terms/" dc:extent="371 p."> <dc:title>Die Ringe des Saturn</dc:title> ... </book>

<book xmlns:db="http://www.docbook.org/xml/4.5/docbookx.dtd"> <bookinfo> <title>Die Ringe des Saturn</title> <pagenums>371 p.</pagenums>...</bookinfo> ... </book>

The first example extends the third fragment from “Basic ways of writing part of a book description.” The xmlns:dc="..." segment is a namespace declaration, which associates the prefix dc with the quoted URI; that URI identifies the terms namespace of the Dublin Core Metadata Initiative (DCMI). The child <dc:title> element and the attached dc:extent="371 p." attribute tell us that the corresponding values are attributable to the title and extent properties, respectively, from the Dublin Core namespace.

The next fragment employs the DocBook DTD namespace. We now have a <pagenums> element for which the meaning is contextually obvious, and the title is still a title. An extra layer of markup reflects the fact that this could be metadata in the source file of a book that is being edited, is in production, or is on your favorite tablet right now.^[19]
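
What a namespace declaration does can be seen by parsing the Dublin Core fragment. In this Python sketch, which uses the standard xml.etree.ElementTree module, the title element is closed with a matching </dc:title> tag and the elided content is left out. The parser replaces the dc prefix with the full namespace URI, which is what actually identifies the property.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/terms/"

fragment = (
    '<book xmlns:dc="http://purl.org/dc/terms/" dc:extent="371 p.">'
    "<dc:title>Die Ringe des Saturn</dc:title>"
    "</book>"
)
book = ET.fromstring(fragment)

# Look the element up by prefix, supplying our own prefix-to-URI mapping.
title = book.find("dc:title", {"dc": DC})

print(title.tag)                    # {http://purl.org/dc/terms/}title
print(title.text)                   # Die Ringe des Saturn
print(book.get(f"{{{DC}}}extent"))  # 371 p.
```

The prefix itself is arbitrary: a document that bound a different prefix to the same URI would yield exactly the same expanded names.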

When Tim Berners-Lee deployed HTML, its syntax contained the basic elements and attributes needed to make formal statements about the document as a whole by using <LINK/>, or about specific parts of the document by using the <A> element. Each of these elements has four attributes in common: the famous HREF attribute contains a URI that names an object resource; the NAME attribute allows the element to be the target end of a link; the REL and REV attributes contain descriptions of the link relations.

Microformats, RDFa and Microdata are the latest generation of metadata extensions to HTML. Each approach is widely used on the web and by search engines. As such, they are potential targets when transforming into HTML from richer semantic formats.

Microformats are the simplest of the three. They use controlled vocabularies of terms in REL/REV, and in the CLASS attribute, to declare high-level information types.

RDFa is RDF in Attributes. That is, RDFa is a formal specification for writing RDF expressions by using attributes in XML and HTML documents. It uses an ABOUT attribute to name the subject of the relation; the REL and REV attributes; HREF is joined by SRC and RESOURCE to name the object of the link; a TYPEOF attribute declares a type; PROPERTY and CONTENT attributes are used to attribute a value to an object’s property.

Microdata is similar, inasmuch as it uses attributes extensively. The presence of an ITEMSCOPE attribute identifies an item, while the ITEMTYPE attribute value identifies its type; ITEMID declares an item’s name or unique identifier; ITEMPROP names a property whose value the element supplies, forming a name-value pair; and ITEMREF relates this item to other elements that are outside of the scope of the container element.

The two fragments in “Writing part of a book description in RDFa or microdata” demonstrate the RDFa and microdata formats, each of which relies upon specific attributes to establish the type of the property values contained by the HTML elements. In each example, the book title is contained by a <span> element. Whereas RDFa relies upon the property attribute, the microdata example employs the itemprop attribute to specify that the content of the element is, effectively, a “title” in exactly the same sense as we know that the content of <dc:title> is a “title.”

Writing Part of a Book Description in RDFa or Microdata

<div class="book">The title is <span property="http://purl.org/dc/terms/title">Die Ringe des Saturn</span> and it has <span property="http://purl.org/dc/terms/extent">371 p.</span></div>

<div itemscope itemtype="book">The title is <span itemprop="http://purl.org/dc/terms/title">Die Ringe des Saturn</span> and it has <span itemprop="http://purl.org/dc/terms/extent">371 p.</span></div>
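
To see that the two fragments carry the same statements, we can extract the property names and values from each. The Python sketch below uses the standard html.parser module. It is a deliberately small illustration, not an RDFa or microdata processor: it handles only flat fragments like the two above, and the names PropertyCollector and collect are our own.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collect {property name: element text} for one marker attribute."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker  # "property" for RDFa, "itemprop" for microdata
        self.current = None   # property whose text is currently being read
        self.found = {}

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get(self.marker)
        if name is not None:
            self.current = name
            self.found[name] = ""

    def handle_data(self, data):
        if self.current is not None:
            self.found[self.current] += data

    def handle_endtag(self, tag):
        # Nested elements are not handled; any end tag closes the property.
        self.current = None


def collect(html, marker):
    parser = PropertyCollector(marker)
    parser.feed(html)
    parser.close()
    return parser.found


rdfa = ('<div class="book">The title is '
        '<span property="http://purl.org/dc/terms/title">Die Ringe des Saturn</span>'
        ' and it has <span property="http://purl.org/dc/terms/extent">371 p.</span></div>')
microdata = ('<div itemscope itemtype="book">The title is '
             '<span itemprop="http://purl.org/dc/terms/title">Die Ringe des Saturn</span>'
             ' and it has <span itemprop="http://purl.org/dc/terms/extent">371 p.</span></div>')

print(collect(rdfa, "property"))
print(collect(microdata, "itemprop") == collect(rdfa, "property"))  # True
```

The surrounding English words ("The title is", "and it has") are ignored by the extraction; only the marked-up spans contribute statements.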

[1] The terminology here and in the following sections comes from (Harris 1996).

[2] See http://unicode.org/charts/PDF/U1F700.pdf.

[3] Entitled “The ABC,” the song was copyrighted in 1835 by Boston music publisher Charles Bradlee. It is sung to the French melody on which Wolfgang Amadeus Mozart wrote a set of variations, commonly recognizable as Twinkle, Twinkle, Little Star.

[4] http://tools.ietf.org/html/rfc20.

[5] Only 95 of these characters are actually “marks” in the sense of being visible and printable. The other 33 ASCII characters are “control codes” that indicate things like horizontal and vertical tabs, the ends of printed lines, form feeds, and transmission control. We can think of many of these as special auxiliary marks, similar to the kind of symbols editors and proofreaders use to annotate texts.

[6] The Unicode standard is maintained by a global non-profit organization. Everything you need to know is at http://www.unicode.org/.

[7] The Chinese character 井 (water well) looks like the # character too. The # symbol was historically used to denote pounds, the Imperial unit of weight, as in 10# of potatoes. In the United Kingdom, the # character is called “hash.” We could go on, but we will leave it to you to discover more.

[8] To add to the confusion, while the American standard (ASCII) places the # character at hexadecimal position 23, the British equivalent (BS 4730) places the currency symbol £ at the same position. As a result, improperly configured computers sometimes display # in place of £ and vice versa.

[9] Recently, an alternative writing system for XML-structured data has been standardized: Efficient XML Interchange (EXI). However, it is not yet widely used.

[10] RDF/XML is a bit confusing; it is a writing system that uses XML syntax to textually represent RDF structure. This means that while XML tools can read and write RDF/XML, they cannot manipulate the graph structures it represents, because they were designed to work with XML’s tree structures.

[11] Although we use alphabetic characters today to represent Roman numerals, originally they were represented by unique symbols.

[12] It took a few hundred years before alphabetization became recursive and applied to letters other than the first (Casson 2002, p. 37). Alphabetization relies on the ordering of the writing system, not the notation. For example, Swedish and German are two writing systems that assign different orderings to the same notation.

[13] For example, the American spelling of the words “center” and “color” contrasts slightly with the British spelling of “centre” and “colour.” There are too many examples to include here. Wikipedia has a comprehensive analysis of American and British spelling differences at http://en.wikipedia.org/wiki/American_and_British_English_spelling_differences.

[14] ASCII’s 128 characters are insufficient to represent these more complex character sets, so a new family of character encodings was created, ISO 8859, in which each encoding enumerates 256 characters. Each encoding thus has more space to accommodate the additional characters of regionally specific notations. ISO 8859-5, for example, has extensions to support the Cyrillic alphabet.

[15] In discussions of glottic writing systems, “syntax” usually refers only to the rules for combining words into sentences. In discussions of programming languages, “syntax” has the broader sense we use here.

[16] Compound sentences contain two independent clauses joined by a conjunction, such as “and,” “or,” “nor,” “but.” For example: “I went to the store and I bought a book.” Complex sentences contain an independent clause joined to one or more dependent clauses. For example: “I read the book that I bought at the store.”

[17] In truth, even non-glottic writing systems designed to encode resource descriptions unambiguously can have variant forms of the same statement. For example, XML permits some variation in the way the same Infoset may be textually represented. Often these variations involve the treatment of content that may under some circumstances be treated as optional, such as white space. The difference is that in writing systems designed for resource description, these variations can be precisely enumerated and rules developed to reconcile them, while this is not generally true for glottic writing systems.

[18] Fortunately for Yoda. There are many web services for converting English to Yoda-speak; an example is http://www.yodaspeak.co.uk/.

[19] DocBook (Walsh 2010) is widely used to publish academic, commercial, industrial, scientific, and computing books, papers, and articles. The book that you are reading is encoded with DocBook markup; complete bibliographic information for the book is contained within the source files, ready to be extracted on the way into one of the latest ebook formats.

License

The Discipline of Organizing: 4th Professional Edition Copyright © 2020 by Robert J. Glushko is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.
