Structuring Descriptions
Choosing how to structure resource descriptions is a matter of making principled and purposeful design decisions in order to solve specific problems, serve specific purposes, or bring about some desirable property in the descriptions. Most of these decisions are specific to a domain: the particular context of application for the organizing system being designed and the kinds of interactions with resources it will enable. Making these kinds of context-specific decisions results in a model of that domain. (See “Abstraction in Resource Description”.)
Over time, many people have built similar kinds of descriptions. They have had similar purposes, desired similar properties, and faced similar problems. Unsurprisingly, they have converged on some of the same decisions. When common sets of design decisions can be identified that are not specific to any one domain, they often become systematized in textbooks and in design practices, and may eventually be designed into standard formats and architectures for creating organizing systems. These formally recognized sets of design decisions are known as abstract models or metamodels. Metamodels describe structures commonly found in resource descriptions and other information resources, regardless of the specific domain. While any designer of an organizing system will usually create a model of her specific domain, she usually will not create an entirely new metamodel but will instead make choices from among the metamodels that have been formally recognized and incorporated into existing standards. The resulting model is sometimes called a “domain-specific language.” Reusing standard metamodels can bring great economic advantages, as developers can reuse tools designed for and knowledge about these metamodels, rather than having to start from scratch.
In the following sections, we examine some common kinds of structures used as the basis for metamodels. But first, we consider a concrete example of how the structure of resource descriptions supports or inhibits particular uses. As we explained in Foundations for Organizing Systems, the concept of a resource de-emphasizes the differences between physical and digital things in favor of focusing on how things, in general, are used to support goal-oriented activity. Different kinds of books can be treated as information resources regardless of the particular mix of tangible and intangible properties they may have. Since resource descriptions are also information resources, we can similarly consider how their structures support particular uses, independent of whether they are physical, digital, or a mix of both.
During World War II, a British chemist named W. E. Batten developed a system for organizing patents.[1] The system consisted of a language for describing the product, process, use, and apparatus of a patent, and a way of using punched cards to record these descriptions. Batten used cards printed with matrices of 800 positions (see Figure: A Batten Card.). Each card represented a specific value from the vocabulary of the description language, and each position corresponded to a particular patent. To describe patent #256 as covering extrusion of polythene to produce cable coverings, one would first select the cards for the values polythene, extrusion, and cable coverings, and then punch each card at the 256th position. The description of patent #256 would thus extend over these three cards.
The advantage of this structure is that to find patents covering extrusion of polythene (for any purpose), one needs only to select the two cards corresponding to those values, lay one on top of the other, and hold them up to a light. Light will shine through wherever there is a position corresponding to a patent described using those values. Patents meeting a certain description are easily found due to the structure of the cards designed to describe the patents.
Of course, this system has clear disadvantages as well. Finding the concepts associated with a particular patent is tedious, because every card must be inspected. Adding a new patent is relatively easy as long as there is an index that allows the cards for specific concepts to be located quickly. However, once the cards run out of space for punching holes, the whole set of cards must be duplicated to accommodate more patents: a very expensive operation. Adding new concepts is potentially easy: simply add a new card. But if we want to be able to find existing patents using the new concept, all the existing patents would have to be re-examined to determine whether their positions on the new card should be punched: also an expensive operation.
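The trade-offs of Batten’s cards can be sketched in a few lines of Python. This is only an illustrative model, not a description of Batten’s actual apparatus: each card is represented as the set of punched positions, and the patent numbers other than 256, along with the names `cards`, `patents_matching`, and `concepts_for`, are invented for the example.

```python
# Hypothetical data: each "card" stands for one concept and is modeled as the
# set of patent numbers whose positions are punched on that card.
cards = {
    "polythene": {12, 256, 301},
    "extrusion": {256, 301, 644},
    "cable coverings": {256, 700},
}

def patents_matching(*concepts):
    """Stack the cards for the given concepts and see where light shines through."""
    punched = [cards[concept] for concept in concepts]
    return set.intersection(*punched)

def concepts_for(patent):
    """The tedious direction: every card must be inspected."""
    return {concept for concept, holes in cards.items() if patent in holes}

print(sorted(patents_matching("polythene", "extrusion")))  # [256, 301]
print(sorted(concepts_for(256)))  # ['cable coverings', 'extrusion', 'polythene']
```

Selecting patents by concept touches only the cards named in the query, while recovering the concepts for one patent has to loop over every card, which mirrors the asymmetry described above.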
The structure of Batten’s cards supported rapid selection of resources given a partial description. The kinds of structures we will examine in the following sections are not quite so elaborate as Batten’s cards. But like the cards, each kind of structure supports more efficient mechanical execution of certain operations, at the cost of less efficient execution of others.
Kinds of Structures
Sets, lists, dictionaries, trees, and graphs are kinds of structures that can be used to form resource descriptions. As we shall see, each of these kinds is actually a family of related structures. These structures are abstractions: they describe formal structural properties in a general way, rather than specifying an exact physical or textual form. Abstractions are useful because they help us to see common properties shared by different specific ways of organizing information. By focusing on these common properties, we can more easily reason about the operations that different forms support and the affordances that they provide, without being distracted by less relevant details.
Blobs
The simplest kind of structure is no structure at all. Consider the following description of a book: Sebald’s novel uses a walking tour in East Anglia to meditate on links between past and present, East and West.[2] This description is an unstructured text expression with no clearly defined internal parts, and we can consider it to be a blob. Or, more precisely, it has structure, but that structure is the underlying grammatical structure of the English language, and none of that grammatical structure is explicitly represented in a surface structure when the sentence is expressed. As readers of English we can interpret the sentence as a description of the subject of the book, but to do this mechanically is difficult.[3] On the other hand, such a written description is relatively easy to create, as the describer can simply use natural language.
A blob need not be a blob of text. It could be a photograph of a resource, or a recording of a spoken description of a resource. Like blobs of text, blobs of pixels or sound have underlying structure that any person with normal vision or hearing can understand easily.[4] But we can treat these blobs as unstructured, because none of the underlying structure in the visual or auditory input is explicit, and we are concerned with the ways that the structures of resource descriptions support or inhibit mechanical or computational operations.[5]
Sets
The simplest way to structure a description is to give it parts and treat them as a set. For example, the description of Sebald’s novel might be reformulated as a set of terms: Sebald, novel, East Anglia, walking, history. Doing this has lost much of the meaning, but something has been gained: we now can easily distinguish Sebald and walking as separate items in the description.[6] This makes it easier to find, for example, all the descriptions that include the term walking. (Note that this is different from simply searching through blob-of-text descriptions for the word walking. When treated as a set, the description Fiji, fire walking, memoir does not include the term walking, though it does include the term fire walking.)
Sets make it easy to find intersections among descriptions. Sets are also easy to create. In “Classification vs. Tagging” we looked at “folksonomies,” organizing systems in which non-professional users create resource descriptions. In these systems, descriptions are structured as sets of “tags.” To find resources, users can specify a set of tags to obtain resources having descriptions that intersect at those tags. This is more valuable if the tags come from a controlled vocabulary, making intersections more likely. But enforcing vocabulary control adds complexity to the description process, so a balance must be struck between maximizing potential intersections and making description as simple as practical.[7]
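The difference between searching a blob and testing membership in a set, and the way tag sets support intersection queries, might be sketched as follows. The resource names, the third description, and the helper `find` are assumptions made up for this illustration.

```python
# A blob-of-text description versus the same description treated as a set.
blob = "Fiji, fire walking, memoir"

# Hypothetical collection of descriptions, each a set of tags.
descriptions = {
    "rings_of_saturn": {"Sebald", "novel", "East Anglia", "walking", "history"},
    "fiji_memoir": {"Fiji", "fire walking", "memoir"},
    "hiking_guide": {"walking", "East Anglia", "guidebook"},
}

def find(*tags):
    """Return the names of resources whose descriptions include all given tags."""
    wanted = set(tags)
    return sorted(name for name, tag_set in descriptions.items() if wanted <= tag_set)

print("walking" in blob)                         # True: a substring match
print("walking" in descriptions["fiji_memoir"])  # False: the term is "fire walking"
print(find("walking"))                           # ['hiking_guide', 'rings_of_saturn']
print(find("walking", "history"))                # ['rings_of_saturn']
```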
A set is a type or class of structure. We can refine the definition of different kinds of sets by introducing constraints. For example, we might introduce the constraint that a given set has a maximum number of items. Or we might constrain a set to always have the same number of items, giving us a fixed-size set. We can also remove constraints. Sets do not contain duplicate items (think of a tagging system in which it does not make sense to assign the same tag more than once to the same resource). If we remove this uniqueness constraint, we have a different structure known as a “bag” or “multiset.”
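As a rough illustration of adding and removing constraints, the sketch below uses Python’s built-in `set` for the uniqueness constraint and `collections.Counter` as one possible stand-in for a bag. The helper `add_tag` and its `max_size` limit are hypothetical, chosen only to show a maximum-size constraint.

```python
from collections import Counter

assigned = ["walking", "history", "walking"]  # the same tag assigned twice

as_set = set(assigned)      # uniqueness constraint: duplicates collapse
as_bag = Counter(assigned)  # bag (multiset): duplicates are counted

def add_tag(tags, tag, max_size=5):
    """Return a new set with `tag` added, enforcing a maximum-size constraint."""
    updated = tags | {tag}
    if len(updated) > max_size:
        raise ValueError(f"a description may have at most {max_size} tags")
    return updated

print(sorted(as_set))     # ['history', 'walking']
print(as_bag["walking"])  # 2
```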
Lists
Constraints are what distinguish lists from sets. A list, like a set, is a collection of items, but with an additional constraint: its items are ordered. If we were designing a tagging system in which it was important that the order of the tags be maintained, we would want to use lists, not sets. For example, we might want to order the tags assigned to a resource by frequency of use, listing the most used tag first and the least used last. Unlike sets, lists may contain duplicate items. In a list, two items that are otherwise the same can be distinguished by their position in the ordering, but in a set this is not possible.
Again, we can introduce constraints to refine the definition of different kinds of lists, such as fixed-length lists. If we constrain a list to contain only items that are themselves lists, and further specify that these contained lists do not themselves contain lists, then we have a table (a list of lists of items). A spreadsheet is a list of lists.
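A minimal sketch of these ideas, assuming invented usage counts for the tags and a hypothetical helper `is_table` that checks the list-of-lists constraint:

```python
# Hypothetical usage counts for the tags assigned to one resource.
tag_counts = {"walking": 7, "history": 3, "Sebald": 12}

# A list keeps the tags in a meaningful order: most used first.
ordered_tags = sorted(tag_counts, key=tag_counts.get, reverse=True)
print(ordered_tags)  # ['Sebald', 'walking', 'history']

# Position distinguishes items, so a list may contain duplicates.
steps = ["walk", "rest", "walk"]
print(len(steps), len(set(steps)))  # 3 2

# A table: a list of rows, where each row is a list of plain items.
table = [
    ["title", "pages"],
    ["Die Ringe des Saturn", 371],
    ["Austerlitz", 416],
]

def is_table(value):
    """True if value is a list of lists whose items are not themselves lists."""
    return isinstance(value, list) and all(
        isinstance(row, list) and not any(isinstance(item, list) for item in row)
        for row in value
    )

print(is_table(table))       # True
print(is_table([[1, [2]]]))  # False: a row contains a nested list
```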
Dictionaries
One major limitation of lists and sets is that, although items can be individually addressed, there is no way to distinguish the items except by comparing their values (or, in a list, their positions in the ordering). In a set of terms like Sebald, novel, East Anglia, walking, history, for example, one cannot easily tell that Sebald refers to the author of the book while East Anglia and walking refer to what it is about. One way of addressing this problem is to break each item in a set into two parts: a property and a value. So, for example, our simple set of tags might become author: Sebald, type: novel, subject: East Anglia, subject: walking, subject: history. Now we can say that author, type, and subject are the properties, and the original items in the set are the values.
author: Sebald
type: novel
subject1: East Anglia
subject2: walking
subject3: history
This kind of structure is called a dictionary, a map or an associative array. A dictionary is a set of property-value pairs or entries. It is a set of entries, not a list of entries, because the pairs are not ordered and because each entry must have a unique key.[8] Note that this specialized meaning of dictionary is different from the more common meaning of “dictionary” as an alphabetized list of terms accompanied by sentences that define them. The two meanings are related, however. Like a “real” dictionary, a dictionary structure allows us to easily find the value (such as a definition) associated with a particular property or key (such as a word). But unlike a real dictionary, which orders its keys alphabetically, a dictionary structure does not specify an order for its keys.[9]
Dictionaries are ubiquitous in resource descriptions. Structured descriptions entered using a form are easily represented as dictionaries, where the form items’ labels are the properties and the data entered are the values. Tabular data with a “header row” can be thought of as a set of dictionaries, where column headers are the properties for each dictionary, and each row is a set of corresponding values. Dictionaries are also a basic type of data structure found in nearly all programming languages (referred to as associative arrays).
Again, we can introduce or remove constraints to define specialized types of dictionaries. A sorted dictionary adds an ordering over entries; in other words, it is a list of entries rather than a set. A multimap is a dictionary in which multiple entries may have the same key.
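The following sketch shows one way these variations might look in Python, whose built-in `dict` is an associative array. Note one assumption-laden detail: Python’s `dict` happens to remember insertion order, which goes beyond the abstract dictionary described here, although comparing two dictionaries for equality still ignores order. The multimap is approximated with `collections.defaultdict`, and the sorted dictionary with a sorted list of entries.

```python
from collections import defaultdict

# A dictionary: each property (key) is unique, so subjects need distinct keys.
description = {
    "author": "Sebald",
    "type": "novel",
    "subject1": "East Anglia",
    "subject2": "walking",
    "subject3": "history",
}
print(description["author"])  # Sebald

# Equality ignores the order in which the entries were written.
print({"author": "Sebald", "type": "novel"} == {"type": "novel", "author": "Sebald"})  # True

# A multimap: several entries may share the same key.
pairs = [
    ("author", "Sebald"),
    ("type", "novel"),
    ("subject", "East Anglia"),
    ("subject", "walking"),
    ("subject", "history"),
]
multimap = defaultdict(list)
for prop, value in pairs:
    multimap[prop].append(value)
print(multimap["subject"])  # ['East Anglia', 'walking', 'history']

# A sorted dictionary, approximated as a list of entries ordered by key.
sorted_entries = sorted(description.items())
print(sorted_entries[0])  # ('author', 'Sebald')
```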
Trees
In dictionaries as they are commonly understood, properties are terms and values are their corresponding definitions. The terms and values are usually words, phrases, or other expressions that can be ordered alphabetically. But if we generalize the notion of a dictionary as abstract sets of property-value pairs, the values can be anything at all. In particular, the values can themselves be dictionaries. When a dictionary structure has values that are themselves dictionaries, we say that the dictionaries are nested. Nesting is very useful for resource descriptions that need more structure than what a (non-nested) dictionary can provide.
Figure: Four Nested Dictionaries. presents an example of nested dictionaries. At the top level there is one dictionary with a single entry having the property a. The value associated with a is a dictionary consisting of two entries, the first having property b and the second having property c. The values associated with b and with c are also dictionaries.
If we nest dictionaries like this, and our “top” dictionary (the one that contains all the others) has only one entry, then we have a kind of tree structure. Figure: A Tree of Properties and Values. shows the same properties and values as Figure: Four Nested Dictionaries., this time arranged to make the tree structure more visible. Trees consist of nodes (the letters and numbers in Figure: A Tree of Properties and Values.) joined by edges (the arrows). Each node in the tree with a circle around it is a property, and the value of each property consists of the nodes below (to the right of) it in the tree. A node is referred to as the parent of the nodes below it, which in turn are referred to as the children of that node. The edges show these “parent of” relationships between the nodes. The node with no parent is called the root of the tree. Nodes with no children are called leaf nodes.
As with the other types of structures we have considered, we can define different kinds of trees by introducing different types of constraints. For example, the predominant metamodel for XML documents is a kind of tree called the XML Information Set or Infoset.[10]
The XML Information Set defines a specific kind of tree structure by adding very specific constraints, including ordering of child nodes, to the basic definition of a tree. The addition of an ordering constraint distinguishes XML trees from nested dictionaries, in which child nodes do not have any order (because dictionary entries do not have an ordering). Ordering is an important constraint for resource descriptions, since without ordering it is impossible to, for example, list multiple authors while guaranteeing that the order of authors will be maintained. Figure: A Tree of Properties and Values. depicts a kind of tree with a different set of constraints: all non-leaf nodes are properties, and all leaves are values. We could also define a tree in which every node has both a property and a value. Trees exist in a large variety of flavors, but they all share a common topology: the edges between nodes are directed (one node is the parent and the other is the child), and every node except the root has exactly one parent.
Trees provide a way to group statements describing different but related resources. For example, consider the description structured as a dictionary here:
author given names → Winfried Georg
author surname → Sebald
title → Die Ringe des Saturn
pages → 371
The dictionary groups together four property-value pairs describing a particular book. (The arrows are simply a schematic way to indicate property-value relations. Later in the chapter we look at ways to “write” these relations using some specific syntax.)
But really the first two entries are not describing the book; they are describing the book’s author. So, it would be better to group those two statements somehow. We can do this by nesting the entries describing the author within the book description, creating a tree structure:
author →
    given names → Winfried Georg
    surname → Sebald
title → Die Ringe des Saturn
pages → 371
Using a tree works well in this case because we can treat the book as the primary resource being described, making it the root of our tree, and adding on the author description as a “branch.”
We also could have chosen to make the author the primary resource, giving us a tree like the one in Example: Nesting book descriptions within an author description.
given names → Winfried Georg
surname → Sebald
books authored →
    1. title → Die Ringe des Saturn
       pages → 371
    2. title → Austerlitz
       pages → 416
Note that in this dictionary, the value of the books authored property is a list of dictionaries. Making the author the primary or root resource allows us to include multiple book descriptions in the tree (but makes it more difficult to describe books having multiple authors). A tree is a good choice for structuring descriptions as long as we can clearly identify a primary resource. In some cases, however, we want to connect descriptions of related resources without having to designate one as primary. In these cases, we need a more flexible data structure.
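The two trees above can be written as nested Python dictionaries and lists. This is just one possible encoding; the helper `leaves`, which walks a tree and reports each value together with the path from the root, is a hypothetical name introduced for the sketch.

```python
# The book as the root, with the author description nested as a branch.
book = {
    "author": {"given names": "Winfried Georg", "surname": "Sebald"},
    "title": "Die Ringe des Saturn",
    "pages": 371,
}

# The author as the root, with a list of book descriptions as one branch.
author = {
    "given names": "Winfried Georg",
    "surname": "Sebald",
    "books authored": [
        {"title": "Die Ringe des Saturn", "pages": 371},
        {"title": "Austerlitz", "pages": 416},
    ],
}

def leaves(node, path=()):
    """Yield (path, value) pairs for every leaf value in a nested structure."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaves(child, path + (key,))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from leaves(child, path + (index,))
    else:
        yield path, node

for path, value in leaves(book):
    print(path, "->", value)

# Each leaf is reached by exactly one path from the root.
print(dict(leaves(author))[("books authored", 1, "title")])  # Austerlitz
```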
Graphs
Suppose we were describing two books, where the author of one book is the subject of the other, as in Example: Two related descriptions:
1. author → Mark Richard McCulloch
   title → Understanding W. G. Sebald
   subject → Winfried Georg Sebald
2. author → Winfried Georg Sebald
   title → Die Ringe des Saturn
By looking at these descriptions, we can guess the relationship between the two books, but that relationship is not explicitly represented in the structure: we just have two separate dictionaries and have inferred the relationship by matching property values. It is possible that this inference could be wrong: there might be two people named Winfried Georg Sebald. How can we structure these descriptions to explicitly represent the fact that the Winfried Georg Sebald that is the subject of the first book is the same Winfried Georg Sebald who authored the second?
One possibility would be to make Winfried Georg Sebald the root of a tree, similar to the approach taken in Example: Nesting book descriptions within an author description, adding a book about property alongside the books authored one. This solution would work fine if people were our primary resources, and it thus made sense to structure our descriptions around them. But suppose that we had decided that our descriptions should be structured around books, and that we were using a vocabulary that took this perspective (with properties such as author and subject rather than books authored and books about). We should not let a particular structure limit the organizational perspective we can take, as Batten’s cards did. Instead, we should consciously choose structures to suit our organizational perspective. How can we do this?
If we treat our two book descriptions as trees, we can join the two branches (subject and author) that share a value. When we do this, we no longer have a tree, because we now have a node with more than one parent (Figure: Descriptions Linked into a Graph.). The structure in Figure: Descriptions Linked into a Graph. is a graph. Like a tree, a graph consists of a set of nodes connected by edges. These edges may or may not have a direction (“Directionality”). If they do, the graph is referred to as a “directed graph.” If a graph is directed, it may be possible to start at a node and follow edges in a path that leads back to the starting node. Such a path is called a “cycle.” If a directed graph has no cycles, it is referred to as an “acyclic graph.”
A tree is just a more constrained kind of graph. Trees are directed graphs because the “parent of” relationship between nodes is asymmetric: the edges are arrows that point in a certain direction. (See “Symmetry”.) Furthermore, trees are acyclic graphs, because if you follow the directed edges from one node to another, you can never encounter the same node twice. Finally, trees have the constraint that every node (except the root) must have exactly one parent.[11]
In Figure: Descriptions Linked into a Graph. we have violated this constraint by joining our two book trees. The graph that results is still directed and acyclic, but because the Winfried Georg Sebald node now has two parents, it is no longer a tree.
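The constraints that separate trees from graphs can be checked mechanically. The sketch below represents a directed graph as a list of (parent, child) edges; the node labels such as "book 1: author" are an assumed way of naming the property nodes, and the helpers `parents`, `is_acyclic`, and `is_tree` are hypothetical. Here a tree is taken to be an acyclic directed graph with exactly one root in which no node has more than one parent.

```python
from collections import defaultdict

def parents(edges):
    """Map every node to the set of its parents (empty for roots)."""
    result = defaultdict(set)
    for parent, child in edges:
        result[child].add(parent)
        result.setdefault(parent, set())  # make sure roots appear too
    return result

def is_acyclic(edges):
    """Repeatedly remove nodes with no remaining parents; a cycle blocks removal."""
    remaining = {node: set(ps) for node, ps in parents(edges).items()}
    while remaining:
        free = [node for node, ps in remaining.items() if not ps]
        if not free:
            return False
        for node in free:
            del remaining[node]
        for ps in remaining.values():
            ps.difference_update(free)
    return True

def is_tree(edges):
    """Acyclic, exactly one root, and no node with more than one parent."""
    counts = [len(ps) for ps in parents(edges).values()]
    return is_acyclic(edges) and counts.count(0) == 1 and all(c <= 1 for c in counts)

book2 = [
    ("book 2", "book 2: author"), ("book 2: author", "Winfried Georg Sebald"),
    ("book 2", "book 2: title"), ("book 2: title", "Die Ringe des Saturn"),
]
book1 = [
    ("book 1", "book 1: author"), ("book 1: author", "Mark Richard McCulloch"),
    ("book 1", "book 1: title"), ("book 1: title", "Understanding W. G. Sebald"),
    ("book 1", "book 1: subject"), ("book 1: subject", "Winfried Georg Sebald"),
]
joined = book1 + book2  # the shared value becomes a single node

print(is_tree(book2))      # True
print(is_acyclic(joined))  # True
print(is_tree(joined))     # False
print(sorted(parents(joined)["Winfried Georg Sebald"]))
```

In this encoding the joined structure fails the tree test for two reasons: the shared node has two parents, and the two book nodes are both roots.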
Graphs are very general and flexible structures. Many kinds of systems can be conceived of as nodes connected by edges: stations connected by subway lines, people connected by friendships, decisions connected by dependencies, and so on. Relationships can be modeled in different ways using different kinds of graphs. For example, if we assume that friendship is symmetric (see “Symmetry”), we would use an undirected graph to model the relationship. However, in web-based social networks friendship is often asymmetric (you might “friend” someone who does not reciprocate), so a directed graph is more appropriate.
Often it is useful to treat a graph as a set of pairs of nodes, where each pair may or may not be directly connected by an edge. Many approaches to characterizing structural relationships among resources (see “Structural Relationships between Resources”) are based on modeling the related resources as a set of pairs of nodes, and then analyzing patterns of connectedness among them. As we will see, being able to break down a graph into pairs is also useful when we structure resource descriptions as graphs.
In “The Document Processing World” we will use XML to model the graph shown in Figure: Descriptions Linked into a Graph. by using “references” to connect a book to its title, authors and subject. This will allow us to develop sophisticated graphs of knowledge within a single XML document instance. (See also the sidebar, Inclusions and References.)[12]
Comparing Metamodels: JSON, XML and RDF
Now that we are familiar with the various kinds of metamodels used to structure resource descriptions, we can take a closer look at some specific metamodels. A detailed comparison of the affordances of different metamodels is beyond the scope of this chapter. Here we will simply take a brief look at three popular metamodels—JSON, XML, and RDF—in order to see how they further specify and constrain the more general kinds of metamodels introduced above.
JSON
JavaScript Object Notation (JSON) is a textual format for exchanging data that borrows its metamodel from the JavaScript programming language. Specifically, the JSON metamodel consists of two kinds of structures found in JavaScript: lists (called “arrays” in JavaScript) and dictionaries (called “objects” in JavaScript). Lists and dictionaries contain values, which may be strings of text, numbers, Booleans (true or false), or the null (empty) value. Again, these types of values are taken directly from JavaScript. Lists and dictionaries can be values too, meaning lists and dictionaries can be nested within one another to produce more complex structures such as tables and trees.
Lists, dictionaries, and a basic set of value types constitute the JSON metamodel. Because this metamodel is a subset of JavaScript, the JSON metamodel is very easy to work with in JavaScript. Since JavaScript is the only programming language that is available in all web browsers, JSON has become a popular choice for developers who need to work with data and resource descriptions on the web. (See “Writing Systems” later in this chapter.) Furthermore, many modern programming languages provide data structures and value types equivalent to those provided by JavaScript. So, data represented as JSON is easy to work with in many programming languages, not just JavaScript.
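To illustrate how closely the JSON metamodel matches the data structures of another language, the sketch below round-trips a nested description through Python’s standard `json` module. The description reuses values from earlier examples; the properties `translated` and `series` are invented to show how Booleans and the null value are written.

```python
import json

# A nested description built only from dictionaries, lists, strings and numbers.
author = {
    "given names": "Winfried Georg",
    "surname": "Sebald",
    "books authored": [
        {"title": "Die Ringe des Saturn", "pages": 371},
        {"title": "Austerlitz", "pages": 416},
    ],
}

text = json.dumps(author, ensure_ascii=False, indent=2)  # structure -> JSON text
restored = json.loads(text)                              # JSON text -> structure

print(text)
print(restored == author)  # True

# Python's True and None are written as JSON's true and null.
print(json.dumps({"translated": True, "series": None}))
# {"translated": true, "series": null}
```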
XML Information Set
The XML Information Set metamodel is derived from data structures used for document markup. (See “Metadata”.) These markup structures—elements and attributes—are well suited for programmatically manipulating the structure of documents and data together.[13]
The XML Infoset is a tree structure, where each node of the tree is defined to be an “information item” of a particular type. Each information item has a set of type-specific properties associated with it. At the root of the tree is a “document item,” which has exactly one “element item” as its child. An element item has a set of attribute items, and a list of child nodes. These child nodes may include other element items, or they may be character items. (See “Kinds of Structures” below for more on characters.) Attribute items may contain character items, or they may contain typed data, such as name tokens, identifiers and references. Element identifiers and references (ID/IDREF) may be used to connect nodes, transforming a tree into a graph. (See the sidebar, Inclusions and References.)[14]
Figure: A Description Structure. is a graphical representation of how an XML document might be used to structure part of a description of an author and his works. This example demonstrates how we might use element items to model the domain of the description, by giving them names such as author and title. The character items that are the children of these elements hold the content of the description: author names, book titles, and so on. Attribute items are used to hold auxiliary information about this content, such as its language.
This example also demonstrates how the XML Infoset supports mixed content by allowing element items and character items to be “siblings” of the same parent element. In this case, the Infoset structure allows us to specify that the book description can be displayed as a line of text consisting of the original title and the translated title in parentheses. The elements and attributes are used to indicate that this line of text consists of two titles written in different languages, not a single title containing parentheses.
If not for mixed content, we could not write narrative text with hypertext links embedded in the middle of a sentence. It gives us the ability to identify the subcomponents of a sentence, so that we could distinguish the terms “Sebald,” “walking” and “East Anglia” as an author and two subjects.
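Mixed content of the kind just described can be explored with Python’s standard `xml.etree.ElementTree` module. The markup below is only a guess at what such a description might look like: the element names `author`, `name`, `book`, and `title` are assumptions, and the tree that ElementTree builds is a simplification of the Infoset rather than a full implementation of it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two titles in different languages, joined by plain
# characters (a space and parentheses) that are siblings of the title elements.
doc = """<author>
  <name>Winfried Georg Sebald</name>
  <book><title xml:lang="de">Die Ringe des Saturn</title> (<title xml:lang="en">The Rings of Saturn</title>)</book>
</author>"""

root = ET.fromstring(doc)
book = root.find("book")

# Read as a line of text, element boundaries disappear.
print("".join(book.itertext()))  # Die Ringe des Saturn (The Rings of Saturn)

# Read as structure, the two titles and their languages remain distinct.
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
for title in book.findall("title"):
    print(title.get(XML_LANG), "->", title.text)
```

Child elements come back in document order, reflecting the ordering constraint that distinguishes XML trees from nested dictionaries.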
Using schemas to define data representation formats is a good practice that facilitates shared understanding and contributes to long-term maintainability in institutional or business contexts. An XML schema represents a contract among the parties subscribing to its definitions, whereas JSON depends on out-of-band communication among programmers. The notion that “the code is the documentation” may be fashionable among programmers, but modelers prefer to design at a higher level of abstraction and then implement.
The XML Infoset presents a strong contrast to JSON and does not always map in a straightforward way to the data structures used in popular web scripting languages. Whereas JSON’s structures make it easier for object-oriented programmers to readily exchange data, they lack any formal schema language and cannot easily handle mixed content.
RDF
In Figure: Descriptions Linked into a Graph., we structured our resource description as a graph by treating resources, properties, and values as nodes, with edges reflecting their combination into descriptive statements. However, a more common approach is to treat resources and values as nodes, and properties as the edges that connect them. Figure: Treating Properties as Edges Rather Than Nodes. shows the same description as Figure: Descriptions Linked into a Graph., this time with properties treated as edges. This roughly corresponds to the particular kind of graph metamodel defined by RDF. (“Resource Description Framework (RDF)”)
We have noted that we can treat a graph as a set of pairs of nodes, where each pair may be connected by an edge. Similarly, we can treat each component of the description in Figure: Treating Properties as Edges Rather Than Nodes. as a pair of nodes (a resource and a value) with an edge (the property) linking them. In the RDF metamodel, a pair of nodes and its edge is called a triple, because it consists of three parts (two nodes and one edge). The RDF metamodel is a directed graph, so it identifies one node (the one from which the edge is pointing) as the subject of the triple, and the other node (the one to which the edge is pointing) as its object. The edge is referred to as the predicate or (as we have been saying) property of the triple.
Figure: Listing Triples Individually. lists separately all the triples in Figure: Treating Properties as Edges Rather Than Nodes. However, there is something missing in Figure: Listing Triples Individually. Figure: Treating Properties as Edges Rather Than Nodes. clearly indicates that the Winfried Georg Sebald who is the subject of book 1 is the same Winfried Georg Sebald who is the author of book 2. In Figure: Listing Triples Individually. this relationship is not clear. How can we tell if the Winfried Georg Sebald of the third triple is the same as the Winfried Georg Sebald who appears in another triple? For that matter, how can we tell if the first three triples all involve the same book 1? This is easy to show in a diagram of the entire description graph, where we can have multiple edges attached to a node. But when we disaggregate that graph into triples, we need some way of uniquely referring to nodes. We need identifiers (“Choosing Good Names and Identifiers”). When two triples have nodes with the same identifier, we can know that it is the same node. RDF achieves this by associating URIs with nodes. (See “Resource Description Framework (RDF)”)
The need to identify nodes when we break down an RDF graph into triples becomes important when we want to “write” RDF graphs—create textual representations of them instead of depicting them—so that they can be exchanged as data. Tree structures do not necessarily have this problem, because it is possible to textually represent a tree structure without having to mention any node more than once. Thus, one price paid for the generality and flexibility of graph structures is the added complexity of recording, representing or writing those structures.
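The role of identifiers becomes concrete when the graph is written down as a set of triples. The sketch below uses plain Python tuples rather than an RDF library, and every URI in it lives under a made-up `http://example.org/` namespace. Real RDF also distinguishes URIs from literal values, a distinction this sketch glosses over by using strings for both.

```python
EX = "http://example.org/"  # hypothetical namespace for this sketch

BOOK1 = EX + "book/1"
BOOK2 = EX + "book/2"
SEBALD = EX + "person/sebald"

# Each triple is (subject, predicate, object). The same identifier appearing
# in two triples means the same node.
triples = {
    (BOOK1, EX + "author", "Mark Richard McCulloch"),
    (BOOK1, EX + "title", "Understanding W. G. Sebald"),
    (BOOK1, EX + "subject", SEBALD),
    (BOOK2, EX + "author", SEBALD),
    (BOOK2, EX + "title", "Die Ringe des Saturn"),
    (SEBALD, EX + "name", "Winfried Georg Sebald"),
}

def match(subject=None, predicate=None, obj=None):
    """Return the triples that agree with every part that is not None."""
    return {
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }

# Books whose subject is also the author of some book.
about_authors = {
    s
    for (s, _, person) in match(predicate=EX + "subject")
    if match(predicate=EX + "author", obj=person)
}
print(about_authors)  # {'http://example.org/book/1'}
```

Because both books point at the identifier `SEBALD` rather than at a name string, the link between them no longer depends on two names happening to match.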
Choosing Your Constraints
This tradeoff between flexibility and complexity illustrates a more general point about constraints. In the context of managing and interacting with resource descriptions, constraints are a good thing. As discussed above, a tree is a graph with very specific constraints. These constraints allow you to do things with trees that are not possible with graphs in general, such as representing them textually without repeating yourself, or uniquely identifying nodes by the path from the root of the tree to that node. This can make managing descriptions and the resources they describe easier and more efficient—if a tree structure is a good fit to the requirements of the organizing system. For example, an ordered tree structure is a good fit for the hierarchical structure of the content of a book or book-like document, such as an aircraft service manual or an SEC filing. On the other hand, the network of relationships among the people and organizations that collaborated to produce a book might be better represented using a graph structure. XML is most often used to represent hierarchies, but is also capable of representing network structures.
Modeling within Constraints
A metamodel imposes certain constraints on the structure of our resource descriptions. But in organizing systems, we usually need to further specify the content and composition of descriptions of the specific types of resources being organized. For example, when designing a system for organizing books, it is not sufficient to say that a book’s description is structured using XML, because the XML metamodel constrains structure and not the content of descriptions. We need also to specify that a book description includes a list of contributors, each entry of which provides a name and indicates the role of that contributor. This kind of specification is a model to which our descriptions of books are expected to conform. (See “Abstraction in Resource Description”.)
When designing an organizing system we may choose to reuse a standard model. For example, ONIX for Books is a standard model (conforming to the XML metamodel) developed by the publishing industry for describing books.[22]
If no such standard exists, or existing standards do not suit our needs, we may create a new model for our specific domain. But we will not usually create a new metamodel: instead we will make choices from among the metamodels, such as JSON, XML, or RDF, that have been formally recognized and incorporated into existing standards. Once we have selected a metamodel, we know the constraints we have to work with when modeling the resources and collections in our specific domain.[23]
Specifying Vocabularies and Schemas
Creating a model for descriptions of resources in a particular domain involves specifying the common elements of those descriptions, and giving those elements standard names. (See “The Process of Describing Resources”.) The model may also specify how these elements are arranged into larger structures, for example, how they are ordered into lists nested into trees. Metamodels vary in the tools they provide for specifying the structure and composition of domain-specific models, and in the maturity and robustness of the methods for designing them.[24] RDF and XML each provide different, metamodel-specific tools to define a model for a specific domain. But not every metamodel provides such tools.
In XML, models are defined in separate documents known as schemas. An XML schema defining a domain model provides a vocabulary of terms that can be used as element and attribute names in XML documents that adhere to that model. For example, the ONIX for Books schema specifies that an author of a book should be called a Contributor, and that the page count should be called an Extent. An XML schema also defines rules for how those elements, attributes, and their content can be arranged into higher-level structures. For example, the ONIX for Books schema specifies that the description of a book must include a list of Contributor elements, that this list must have at least one element in it, and that each Contributor element must have a ContributorRole child element.
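To make the two rules about contributors concrete, here is a minimal Python sketch that checks them by hand. It is not the ONIX for Books schema and does not use a schema validator. The wrapper element `Product`, the `PersonName` element, the role value, and the sample documents are simplifying assumptions made for illustration; only the names `Contributor` and `ContributorRole` come from the text above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified descriptions (not real ONIX records).
GOOD = """
<Product>
  <Contributor>
    <ContributorRole>A01</ContributorRole>
    <PersonName>Example Author</PersonName>
  </Contributor>
</Product>
"""

MISSING_ROLE = """
<Product>
  <Contributor>
    <PersonName>Example Author</PersonName>
  </Contributor>
</Product>
"""


def check_contributors(xml_text):
    """Return a list of rule violations; an empty list means both rules hold."""
    root = ET.fromstring(xml_text)
    contributors = root.findall("Contributor")  # direct children only
    problems = []
    if not contributors:
        problems.append("at least one Contributor element is required")
    for number, contributor in enumerate(contributors, start=1):
        if contributor.find("ContributorRole") is None:
            problems.append(f"Contributor {number} has no ContributorRole child")
    return problems
```

With a schema language, rules like these are declared once in the schema and enforced by a general-purpose validator, rather than being coded by hand for each model as they are here.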
If an XML schema is given an identifier, XML documents can use that identifier to indicate that they use terms and rules from that schema. An XML document may use vocabularies from more than one XML schema.[25] Associating a schema with an XML instance enables validation: automatically checking that vocabulary terms are being used correctly.[26]
If two descriptions share the same XML schema and use only that schema, then combining them is straightforward. If not, it can be problematic, unless someone has figured out exactly how the two schemas should “map” to one another. Finding such a mapping is not a trivial problem, as XML schemas may differ semantically, lexically, structurally, or architecturally despite sharing a common implementation form. (See Describing Relationships and Structures.)
Tree structures can vary considerably while still conforming to the XML Infoset metamodel. Users of XML often specify rules for checking whether certain patterns appear in an XML document (document-level validation). This is less often done with RDF, because graphs that conform to the RDF metamodel all have the same structure: they are all sets of triples. This shared structure makes it simple to combine different RDF descriptions without worrying about checking structure at the document level. However, sometimes it is desirable to check descriptions at the document level, as when part of a description is required. As with XML, if consumers of those descriptions want to assert that they expect those descriptions to have a certain structure (such as a required property), they must check them at the document level.
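A short Python sketch may help show both halves of this point. Assuming descriptions are held as sets of `(subject, predicate, object)` triples, combining them is just set union, with no structural checking needed. A consumer who requires certain properties must still check for them separately. All URIs and values below are hypothetical.

```python
TERMS = "http://example.org/terms/"   # hypothetical vocabulary
BOOK = "http://example.org/book/1"    # hypothetical resource identifier

description_a = {
    (BOOK, TERMS + "title", "An Example Title"),
    (BOOK, TERMS + "pages", 300),
}
description_b = {
    (BOOK, TERMS + "pages", 300),      # repeats a statement from description_a
    (BOOK, TERMS + "language", "de"),
}


def merge(*descriptions):
    """Combine descriptions; identical triples collapse because sets hold no duplicates."""
    return set().union(*descriptions)


def missing_required(triples, subject, required_predicates):
    """Return the required predicates for which `subject` has no statement."""
    present = {p for s, p, o in triples if s == subject}
    return sorted(set(required_predicates) - present)
```

Merging the two descriptions above yields three distinct triples. Calling `missing_required` on the result with `title` and `author` as required predicates reports that `author` is absent, which is a check the set-of-triples structure alone does not provide.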
Because the RDF metamodel already defines structure, defining a domain-specific model in RDF mainly involves specifying URIs and names for predicates. A set of RDF predicate names and URIs is known as an RDF vocabulary. Publication of vocabularies on the web and the use of URIs to identify and refer to predicate definitions are key principles of Linked Data and the Semantic Web. (Also see “The Semantic Web and Linked Data”, as well as later in this chapter.)[27]
For example, the Resource Description and Access (RDA) standard for cataloging library resources includes a set of RDF vocabularies defining predicates usable in cataloging descriptions. One such predicate is:
<http://rdvocab.info/Elements/extentOfText>
which is defined as “the number and type of units and/or subunits making up a resource consisting of text, with or without accompanying illustrations.” The vocabulary further specifies that this predicate is a refinement of a more general predicate:
<http://rdvocab.info/Elements/extent>
which can be used to indicate “the number and type of units and/or subunits making up a resource,” regardless of whether it is textual or not.
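One practical consequence of a refinement relationship is that a request for the general predicate can also be answered by statements that use the more specific one. The Python sketch below illustrates this with a plain dictionary standing in for the vocabulary's refinement statements (in RDF these are typically expressed with `rdfs:subPropertyOf`). The book URI and the value are hypothetical; the two predicate URIs are the ones quoted above.

```python
EXTENT = "http://rdvocab.info/Elements/extent"
EXTENT_OF_TEXT = "http://rdvocab.info/Elements/extentOfText"

# Stand-in for the vocabulary: specific predicate -> the predicate it refines.
REFINES = {EXTENT_OF_TEXT: EXTENT}

BOOK = "http://example.org/book/1"  # hypothetical resource identifier
triples = [(BOOK, EXTENT_OF_TEXT, "300 pages")]  # hypothetical value


def with_generalizations(predicate):
    """Return the predicate followed by each more general predicate it refines."""
    chain = [predicate]
    # Stop at the top of the chain, or if the table accidentally contains a cycle.
    while chain[-1] in REFINES and REFINES[chain[-1]] not in chain:
        chain.append(REFINES[chain[-1]])
    return chain


def values_for(triples, subject, predicate):
    """Find values stated with `predicate` or with any predicate that refines it."""
    return [o for s, p, o in triples
            if s == subject and predicate in with_generalizations(p)]
```

Asking for `EXTENT` finds the statement made with `EXTENT_OF_TEXT`, but not the other way around: a statement made with the general predicate says nothing about whether the resource is textual.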
JSON lacks any standardized way to define which terms can be used. That does not mean one cannot use a standard vocabulary when creating descriptions using JSON, only that there is no agreed-upon way to use JSON to communicate which vocabulary is being used, and no way to automatically check that it is being used correctly.
Controlling Values
So far, we have focused on how models specify vocabularies of terms and how those terms can be used in descriptions. But models may also constrain the values or content of descriptions. Sometimes, a single model will define both the terms that can be used for property names and the terms that can be used for property values. For example, an XML schema may enumerate a list of valid terms for an attribute value.[28]
Often, however, there are separate, specialized vocabularies of terms intended for use as property values in resource descriptions. Typically these vocabularies provide values for use within statements that describe what a resource is about. Examples of such subject vocabularies include the Library of Congress Subject Headings (LOC-SH) and the Medical Subject Headings (MeSH).[29] Other vocabularies may provide authoritative names for people, corporations, or places. Classification schemes are yet another kind of vocabulary, providing the category names for use as the values in descriptive statements that classify resources.
Because different metamodels take different approaches to specifying vocabularies, there will usually be different versions of these vocabularies for use with different metamodels. For example, the Library of Congress Subject Headings are available both as XML conforming to the Metadata Authority Description Standard (MADS) schema, and as RDF using the Simple Knowledge Organization System (SKOS) vocabulary.
Specifying a vocabulary is just one way models can control what values can be assigned to properties. Another strategy is to specify what types of values can be assigned. For example, a model for book descriptions may specify that the value of a pages property must be a positive integer. Or it could be more specific; a course catalog might give each course an identifier that contains a two-letter department code followed by a 1-3 digit course number. Specifying a data type like this with a regular expression narrows down the set of possible values for the property without having to enumerate every possible value. (See the sidebar.)
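As a rough Python sketch of these two kinds of type constraint, the pattern below accepts a two-letter department code followed by a one- to three-digit course number. It assumes uppercase ASCII letters and no separator between the parts, details the description above leaves open. The sample identifiers in the note that follows are invented.

```python
import re

# Assumed form: two uppercase ASCII letters, then one to three ASCII digits.
COURSE_ID = re.compile(r"[A-Z]{2}[0-9]{1,3}")


def is_valid_course_id(value):
    """True if the whole string, not just part of it, matches the pattern."""
    return COURSE_ID.fullmatch(value) is not None


def is_valid_page_count(value):
    """True if the value is a positive integer (booleans are deliberately excluded)."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0
```

Under these assumptions `is_valid_course_id("CS61")` is true, while `"CS1234"`, `"C61"`, and `"cs61"` are rejected. The pattern describes roughly 750,000 possible identifiers without listing any of them.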
In addition to or in lieu of specifying a type, a model may specify an encoding scheme for values. An encoding scheme is a specialized writing system or syntax for particular types of values. For example, a model like Atom for describing syndicated web content requires a publication date. But there are many different ways to write dates: 9/2/76, 2 Sept. 1976, September 2nd 1976, etc. Atom also specifies an encoding scheme for date values. The encoding scheme is RFC 3339, a standard for writing dates. When using RFC 3339, one always writes a date using the same form: 1976-09-02.[30]
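The value of a single agreed form can be sketched with Python's standard library. The snippet reads the informal string 9/2/76 under two different assumptions about field order, which yields two different dates, and then writes a date in the year-month-day form shown above, which Python's `date.isoformat()` happens to produce. Treating 9/2/76 as month-first is itself an assumption; the string alone does not tell us.

```python
from datetime import date, datetime


def write_full_date(d):
    """Write a date as YYYY-MM-DD, the form used in the example above."""
    return d.isoformat()


def read_full_date(text):
    """Read a YYYY-MM-DD string back into a date; raises ValueError if malformed."""
    return date.fromisoformat(text)


# The same informal string, read under two different conventions.
month_first = datetime.strptime("9/2/76", "%m/%d/%y").date()
day_first = datetime.strptime("9/2/76", "%d/%m/%y").date()
```

Here `month_first` is 2 September 1976 and `day_first` is 9 February 1976. Once the value is written as `1976-09-02`, any reader that follows the encoding scheme recovers the same date.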
Encoding schemes are often defined in conjunction with standardized identifiers. (See “Make Names Informative”.) For example, International Standard Book Numbers (ISBN) are not just sequences of Arabic numerals: they are values written using the ISBN encoding scheme. This scheme specifies how to separate the sequence of numerals into parts, and how each of these parts should be interpreted. The ISBN 978-3-8218-4448-0 has five parts, the first three of which indicate that the resource with this identifier is 1) a product of the book publishing industry, 2) published in a German-speaking country, and 3) published by the publishing house Eichborn.
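A small Python sketch shows what it means to treat an ISBN as structured rather than as a bare string of numerals. It splits a hyphenated 13-digit ISBN into its five parts and verifies the final part, which is a check digit computed from the other twelve digits with alternating weights of 1 and 3. The field names in the code are our own labels, chosen for illustration. Interpreting a part (for example, learning which publisher `8218` denotes) would additionally require the registration agencies' lookup tables, which are not modeled here.

```python
# Our own labels for the five hyphen-separated parts of a 13-digit ISBN.
PART_NAMES = ("prefix", "group", "publisher", "title", "check_digit")


def split_isbn13(isbn):
    """Split a hyphenated ISBN into its five named parts."""
    parts = isbn.split("-")
    if len(parts) != len(PART_NAMES):
        raise ValueError("expected five hyphen-separated parts")
    return dict(zip(PART_NAMES, parts))


def has_valid_check_digit(isbn):
    """Weight the 13 digits 1, 3, 1, 3, ...; the weighted sum must be a multiple of 10."""
    digits = [int(ch) for ch in isbn if ch in "0123456789"]
    if len(digits) != 13:
        return False
    total = sum(d * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0
```

For the ISBN above, `split_isbn13("978-3-8218-4448-0")` gives the parts `978`, `3`, `8218`, `4448`, and `0`, and the check digit is valid: the weighted sum of the thirteen digits is 130.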
Encoding schemes can be viewed as very specialized models of particular kinds of information, such as dates or book identifiers. But because they specify not only the structure of this information, but also how it should be written, we can also view them as specialized writing systems. That is, encoding schemes specify how to textually represent information.
In the second half of this chapter, we will focus on the issues involved in textually representing resource descriptions—writing them. Graphs, trees, dictionaries, lists, and sets are general types of structures found in different metamodels. Thinking about these broad types and how they fit or do not fit the ways we want to model our resource descriptions can help us select a specific metamodel. Specific metamodels such as the XML Infoset or RDF are formalized and standardized definitions of the more general types of structures discussed above. Once we have selected a metamodel, we know the constraints we have to work with when modeling the resources and collections in our specific domain. But because metamodels are abstract and exist only on a conceptual level, they can only take us so far. If we want to create, store, and exchange individual resource descriptions, we need to make the structures defined by our abstract metamodels concrete. We need to write them.
-
This discussion of Batten’s cards is based on (Lancaster 1968, pages 28-32). Batten’s own explanation is in (Batten 1951).
-
The technique of diagramming sentences was invented in the mid-19th century by Stephen W. Clark, a New York schoolmaster; (Clark 2010) is an exact reprinting of a nearly 100-year-old edition of his book A Practical Grammar. A recent tribute to Clark is (Florey 2012).
-
It is easy to underestimate the incredible power of the human perceptual and cognitive systems to apply neural computation and knowledge to enable vision and hearing to seem automatic. Computers are getting better at extracting features from visual and auditory signals to identify and classify inputs, but our point here is that none of these features are explicitly represented in the input “blob” or “stream.”
-
As we commented earlier, an oral description of a resource may not be especially useful in an organizing system because computers cannot easily understand it. On the other hand, there are many contexts in which an oral description would be especially useful, such as in a guided tour of a museum where visitors can use audio headsets.
-
What was lost was the previously invisible structure provided by the grammar, which made us assign roles to each of these terms to create a semantic interpretation.
-
It is rarely practical to make things as simple as possible. According to Einstein, we should endeavor to “Make everything as simple as possible, but not simpler.”
-
This structural metamodel only allows one value for each property, which means it would not work for books with multiple authors or that discuss multiple subjects.
-
Going the other direction is not so easy, however: just as real dictionaries do not support finding a word given a definition, neither do dictionary structures support finding a key given a value.
-
The XML Information Set (Cowan 2004)
RDF/XML is one example where metamodels meet. In Document Design Matters, (Wilde and Glushko 2008b) point out that “If the designer of an exchange format uses a non-XML conceptual metamodel because it seems to be a better fit for the data model, XML is only used as the physical layer for the exchange model. The logical layer in this case defines the mapping between the non-XML conceptual model, and any reconstruction of the exchange model data requires the consumer to be fully aware of this mapping. In such a case, it is good practice to make users of the API aware of the fact that it is using a non-XML metamodel. Otherwise they might be tempted to base their implementation on a too small set of examples, creating implementations which are brittle and will fail at some point in time.”
-
Technically, what is described here is referred to as a “rooted tree” by mathematicians, who define trees more generally. Since trees used as data structures are always rooted trees, we do not make the distinction here.
-
This feature relies upon the existence of an XML schema, whether an XML DTD or one of the many schema languages that have been developed under the auspices of the W3C or ISO. An XML schema can declare that certain attributes are of type ID, IDREF or IDREFS.
-
The XML Infoset is one of many metamodels for XML, including the DOM and XPath. Typically, an XML Infoset is created as a by-product of parsing a well-formed XML document instance. An XML document may also be informed by its DTD or schema with information about the types of attribute values, and their default values. Attributes of type ID, IDREF and IDREFS provide a mechanism for intra-document hypertext linking and transclusion. An XML document instance may contain entity definitions and references that get expanded when the document is parsed, thereby offering another form of transclusion.
-
A well-formed XML document instance, when processed, will yield an XML Information Set, as described here. Information sets may also be constructed by other means, such as transforming from another information set. See the section on Synthetic Infosets at http://www.w3.org/TR/xml-infoset/#intro.synthetic for details.
-
The Infoset contains knowledge of whether all related declarations have been read and processed, the base URI of the document instance, information about attribute types, comments, processing instructions, unparsed entities and notations, and more.
A well-formed XML document instance for which there are associated schemas, such as a DTD, may contribute information to the Infoset. Notably, schemas may associate data types with element and attribute information items, and they may also specify default or fixed values for attributes. A DTD may define entities that are referenced in the document instance and are expanded in-place when processed. These contributions can affect the truth value of the document.
-
The SGML standard explicitly stated that documentation describing or explaining a DTD is part of the document type definition. The implication is that a schema defines not just syntax but also semantics. Moreover, since DTDs do not make it possible to describe all possible constraints, such as co-occurrence constraints, the documentation could serve as human-consumable guidance for implementers as well as content creators and consumers.
-
Attribute types may be declared in an XML DTD or schema. Attributes whose type is ID must have a valid XML name value that is unique within that XML document; an attribute of type IDREF whose value corresponds to a unique ID has a “references” property whose value is the element node that corresponds to the element with that ID. An attribute of type IDREFS whose value corresponds to a list of unique IDs has a “references” property whose value is a list of the element nodes that correspond to the elements with matching IDs.
-
XML Inclusions (XInclude) is (Marsh, Orchard, and Veillard 2006).
-
XML Linking Language (XLink) is (DeRose, Maler, Orchard, and Walsh 2010).
-
Within the document’s DTD, one simply declares the entity and its corresponding value, which could be anything from an entire document to a phrase; the entity may then be referenced in place within the XML document instance. The entity reference is replaced by the entity value in the XML Infoset. Entities, as nameable wrappers, effectively disappear on their way into the XML Infoset.
-
Online Information Exchange (ONIX) is the international standard for representing and communicating book industry product information in electronic form: http://www.editeur.org/11/Books/.
-
Do not take on the task of creating a new XML model lightly. Literally thousands of XML vocabularies have been created, and some represent hundreds or thousands of hours of effort. See (Bray 2005) for advice on how to reduce the risk of vocabulary design if you cannot find an existing one that satisfies your requirements.
-
See (Glushko and McGrath 2005) for a synthesis of best practices for creating domain-specific languages in technical publishing and business-to-business document exchange contexts. You need best practices for big problems, while small ones can be attacked with ad hoc methods.
-
Unless an XML instance is associated with a schema, it is fair to say that it does not have any model at all because there is no way to understand the content and structure of the information it contains. The assignment of a schema to an XML instance requires a “Document Type Declaration.” If some of the same vocabulary terms occur in more than one XML schema, with different meanings in each, using elements from more than one schema in the same instance requires that they be distinguished using namespaces. For example, if an element named “title” means the “title of the book” in one schema and “the honorific associated with a person” in another, instances might have elements with namespace prefixes like <book:title>The Discipline of Organizing</book:title> and <hon:title>Professor</hon:title>. Namespaces are a common source of frustration in XML, because they seem like an overly complicated solution to a simple problem. But in addition to avoiding naming collisions, they are important in schema composition and organization.
-
What “correctly” means depends on the schema language used to encode the conceptual model of the document type. The XML family of standards includes several schema languages that differ in how completely they can encode a document type’s conceptual model. The Document Type Definition (DTD) has its origins in publishing and enforces structural constraints well; it expresses strong data typing through associated documentation resources. XML Schema Definition Language (XSD) is better for representing transactional document types but its added expressive power tends to make it more complex.
-
For example, see Linked Open Vocabularies at http://lov.okfn.org/dataset/lov/index.html.
-
Attribute values can be constrained in a schema by specifying a data type, a default value, and a list of potential values. Data types allow us to specify whether a value is supposed to be a name, a number, a date, a token or a string of text. Having established the data type, we can further constrain the value of an attribute by specifying a range of values, for a number or a date, for example. We can also use regular expression patterns to describe a data type such as a postal code, telephone number or ISBN number. Specifying default values and lists of legal values for attributes simplifies content creation and quality assurance processes. In Schematron, a rule-based XML schema language for making test assertions about XML documents, we can express constraints between elements and attributes in ways that other XML schema languages cannot. For example, we can express the constraint that if two <title> elements are provided, then each must contain a unique string value and different language attribute values.
-
See LOC-SH at http://id.loc.gov/authorities/subjects.html; MeSH at http://www.nlm.nih.gov/mesh/.
-
The Atom Publishing Protocol is IETF RFC 5023 (https://tools.ietf.org/html/rfc5023); a good introduction is (Sayre 2005). IETF RFC 3339 is http://www.ietf.org/rfc/rfc3339.txt.
-
There is no single authority on the subject of regular expressions or their syntax. A good starting point is the Wikipedia article on the subject: http://en.wikipedia.org/wiki/Regular_expression.