The Process of Describing Resources

Robert J. Glushko

33 The Process of Describing Resources

We prefer the general concept of resource description over the more specialized ones of bibliographic description and metadata because it makes it easier to see the issues that cut across the domains where those terms dominate. In addition, it enables us to propose more standard process that we can apply broadly to the use of resource descriptions in organizing systems. A shared vocabulary enables the sharing of lessons and best practices.

The process of describing resources involves seven interdependent and iterative steps. We begin with a generic summary of the process to set the stage for a detailed step-by-step discussion.

Identifying resources to describe is the first step; this topic is covered in detail in “Resource Identity”. The resource domain and scope circumscribe the describable properties and the possible purposes that descriptions might serve. The resource focus determines which are primary information resources and which ones are treated as the corresponding resource descriptions. Two important decisions at this stage are granularity of description—are we describing individual resources or collections of them?—and the abstraction level—are we describing resource instances, parts of them, or resource types?
Generally, the purpose of resource description is to support the activities common to all organizing systems: selecting, organizing, interacting with, and maintaining resources, as we saw in Activities in Organizing Systems. The particular resource domain and the context in which descriptions are created and used imposes more specific requirements and constraints on the purposes that resource description can serve.
Once the purposes of description in terms of activities and interactions have been determined, the specific properties of the resources that are needed to enable them can be identified. The goal of description is not to be exhaustive; there are always more possible properties than can be reasonably described. Instead, the challenge is to use the properties that are most robust and reliable for supporting the desired interactions.
This step includes several logical and semantic decisions about how the resource properties will be described. What terms or element names should be used to identify the resource properties we have chosen to describe? Are there rules or constraints on the types of data or values that the property descriptions can assume? When dealing with numerical descriptions, their data types and levels of measurement constrain the kinds of processing to which they may submit. Nominal, ordinal, interval, and ratio data each are limited to particular transformations based on what they represent. A good description vocabulary will be easy to assign when creating resource descriptions and easy to understand when using them.
The logical and semantic decisions about the description vocabulary are reified by decisions about the notation, syntax, and structure of the descriptions. Taken together, these decisions collectively determine what we call the form or encoding of the resource descriptions. The implementation of the descriptions involves decisions about how and where they are stored and the technology used to create, edit, store, and retrieve them.
Resource descriptions are created by individuals, by informal or formal groups of people, or by automated or computational means. Some types of descriptions can only be created by people, some types of descriptions can only be created by automated or algorithmic techniques, and some can be created in either manner.
The resource descriptions must be evaluated with respect to their intended purposes. The results of this evaluation will help determine which or the preceding steps need to be redone.

The next seven sub-sections discuss each of these steps in detail. A quick reference guide is Figure: The Process of Describing Resources.

The Process of Describing Resources

This chart summarizes the process of describing resources, listing seven steps and some of the related decision points.

The process of describing resources consists of seven steps: Determining the scope and focus, determining the purpose, identifying resource properties, designing the description vocabulary, designing the description form and implementation, creating the descriptions, and evaluating the descriptions.

How explicit and systematic each step needs to be depend on the resource domain and scope, and especially on the intended users of the organizing system. If we look carefully, we can see most of these steps taking place even in very informal contexts, like the kids playing with Lego blocks with which we started this chapter. The goal of building things with the blocks leads the boys to identify which properties are most useful to analyze. They develop descriptions of the blocks that capture the specific values of the relevant properties. Finally, they evaluate their descriptions by using them when they play together; it becomes immediately obvious that a description is not serving its purpose when one boy hands a block to another that was not the one he thought he had asked for.

In contrast, a picture-taking scenario involves a much more explicit and systematic process of resource description. The resource properties, description vocabulary, and description form used automatically by a digital camera were chosen by an industry association and published as a technical specification implemented by camera and mobile phone manufacturers worldwide.

The resource descriptions used by libraries, archives, and museums are typically created in an even more explicit and systematic manner. Like the descriptions of the digital photo, the properties, vocabulary, and form of the descriptions used by their organizing systems are governed by standards. However, there is no equivalent to the digital camera that can create these descriptions automatically. Instead, highly trained professionals create them meticulously.

A great many resources and their associated descriptions in business and scientific organizing systems are created by automated or computational processes, so the process of describing individual resources is not at all like that in libraries and other memory institutions. However, the process for designing the data models or schemas for the class of resources that will be generated is equally systematic and is typically performed by highly skilled data analysts and data modelers.

Determining the Scope and Focus

Which resources do we want to describe? As we saw in Resources in Organizing Systems, determining what will be treated as a separate resource is not always easy, especially for resources with component parts and for information resources where the most important property is their content, which is not directly perceivable. Identifying the thing you want to describe as precisely as practical is the first step to creating a useful description.

In “Resource Focus”, we introduced the contrast between primary resources and description resources, which we called resource focus. Determining the resource focus goes hand in hand with determining which resources we intend to describe; these often arbitrary decisions then make a huge difference in the nature and extent of resource description. One person’s metadata is another person’s data.

For a librarian, the price of a book might be just one more attribute that is part of the book’s record.
For an accountant at a bookstore, the price of that book—both the cost to buy the book and the price at which it is then sold to customers—is critical information for staying in business.
In a medical records context, a patient’s insurance provider isn’t of much concern to the doctor, but to the person responsible for billing, it is central. For the nurse, the patient’s current vital signs may be of most importance, while for the doctor, it may be most important to understand how those data in aggregate serve to indicate a longer-term prognosis of the patient’s health.
A scientist studying comparative anatomy preserves animal specimens and records detailed physical descriptions about them, but a scientist studying ecology or migration discards the specimens and focuses on describing the context in which the specimen was located.

Describing Instances or Describing Collections

It is simplest to think of a resource description as being associated with another individual resource. As we discussed in Resources in Organizing Systems, it is challenging to determine what to treat as an individual resource when resources are themselves objects or systems composed of other parts or resources. For example, we sometimes describe a football team as a single resource and at other times we focus on each individual player. However, after deciding on resource granularity, the question remains whether each resource needs a separate description.

Libraries and museums specialize in curating resource descriptions about the instances in their collections. Resource descriptions are also applied to classes or collections of resources, because a collection is also a resource (“The Concept of “Collection””). Archives and special collections of maps are typically assigned resource descriptions, but each document or map contained in the collection does not necessarily have its own bibliographic description. Similarly, business and scientific datasets are invariably described at collection-level granularity because they are often analyzed in their entirety.

Furthermore, the granularity of description for a collection of resources tends to differ for different users or purposes. An investor who owns many different stocks focuses on their individual prices, while other investors put their money in index funds that combine all the separate prices into a single value.

Many web pages, especially e-commerce product catalogs and news sites, are dynamically assembled and personalized from a large number of information resources and services that are separately identified and described in content management and content delivery systems. However, a highly complex collection of resources that comes together in a single page is treated as a single resource when that page appears in a list of search engine results. Moreover, all of the separately generated pages can be given a single description when a user creates a bookmark to make it easy to return to the home page of the site.

Abstraction in Resource Description

We can also associate resource descriptions with an entire type or domain of resources. (See “Preserving Resource Types” and “Resource Domain”.) A collection of resource descriptions is vastly more useful when every resource is described using common description elements or terms that apply to every resource. A schema (or model, or metadata standard) specifies the set of descriptions that apply to an entire resource type. Sometimes this schema, model, or standard is inferred from or imposed on a collection of existing resources to ensure more consistent definitions, but more often, it is used as a specification when the resources are created or generated in the first place. (See What about “Creating” Resources? in “Introduction”.)

A relational database, for example, is easily conceptualized as a collection of records organized as one or more tables, with each record in its own row having a number of fields or attributes that contain some prescribed type of content. Each record or row in the database table is a description of a resource—an employee, a product, anything—and the individual attribute values, organized by the columns and rows of the table, are distinct parts of the description for some particular resource instance, like employee 24 or product 8012C.^[1]

The information resources that we commonly call documents are, by their nature, less homogeneous in content and structure than those that can be managed in databases. Document schemas, commonly represented in SGML or XML, usually allow for a mixture of data-like and textual descriptive elements.

XML schema languages have improved on SGML and XML by expressing the description of the document schema in XML itself, making it easy to create resources using the metadata as a template or pattern. XML schemas are often used as the specifications for XML resources created and used by information-intensive applications; in this context, they are often called XML vocabularies. XML schemas can be used to define web forms that capture resource instances (each filled-out form). XML schemas are also used to describe the interfaces to web services and other computational resources.^[2]

Scope, Scale, and Resource Description

If we only had one thing to describe, we could use a single word to describe it: “it.” We would not need to distinguish it from anything else. A second resource implies at least one more term in the description language: “not it.” However, as a collection grows, descriptions must become more complex to distinguish not only between, but also among resources.

Every element or term in a description language creates a dimension, or axis, along which resources can be distinguished, or it defines a set of questions about resources. Distinctions and questions that arise frequently need to be easy to address, such as:

What is the name of the resource?
Who created it?
What type of content or matter does it contain?

Therefore, as a collection grows, the language for describing resources must become more rigorous, and descriptions created when the collection was small often require revision because they are no longer adequate for their intended purposes.^[3]

This co-evolution of descriptive scope and description complexity is easy to see in the highly complex bibliographic descriptions created by professional catalogers. The commonly used Anglo-American Cataloguing Rules(AACR2) cataloging standards distinguish 11 different categories of resources and specify several hundred descriptive elements. AACR2 has recently been superseded by the Resource Description and Access(RDA) standards, which make fine-grained distinctions about content, media type, and carrier (technology).^[4]

Because the task of library resource description has been standardized at national and international levels, cataloging work is distributed among many describers whose results are shared. The principle of standardization has been the basis of centralized bibliographic description for a century.

Centralized resource description by skilled professionals works for libraries, but even in the earliest days of the web many library scientists and web authoring futurists recognized that this approach would not scale for describing web resources. In 1995, the Dublin Core(DC) metadata element set with only 15 elements was proposed as a vastly simpler description vocabulary that people not trained as professional catalogers could use. Since then, the Dublin Core initiative has been highly influential in inspiring numerous other communities to create minimalist description vocabularies, often by simplifying vocabularies that had been devised by professionals for use by non-professionals.^[5]

Of course, a simpler description vocabulary makes fewer distinctions than a complex one; replacing “author,” “artist,” “composer” and many other descriptions of the person or non-human resource responsible for the intellectual content of a resource with just “creator” (as Dublin Core does) results in a substantial loss of precision when the description is created and can cause misunderstanding when the descriptions are reused.^[6]

The negative impacts of growing scope and scale on resource description can sometimes be avoided if the ultimate scope and scale of the organizing system is contemplated when it is being created. It would not be smart for a business with customers in six US states to create an address field in its customer database that only handled those six states; a more extensible design would allow for any state or province and include a country code. In general, however, just as there are problems in adapting a simple vocabulary as scope and scale increase, designing and applying resource descriptions that will work for a large and continuously growing collection might seem like too much work when the collection at hand is small.

The challenges that arise with large description vocabularies are transformed when resource descriptions are created and assigned by computer algorithms. A large dataset might contain many thousands of descriptions for each resource, but clearly the computer does not have cognitive difficulty generating or using them. However, computer models with this many features can be hard for people to understand and trust.

Determining the Purposes

Resource description serves many purposes, and the mix of purposes and the resulting kinds of descriptions in any particular organizing system depends on the scope and scale of the resources being organized. We can identify and classify the most common purposes using the four activities that occur in every organizing system: selecting, organizing, interacting with, and maintaining resources (see Activities in Organizing Systems). Resource description also has a more open-ended purpose in sensemaking and science (see “Resource Description for Sensemaking and Science”); we observe and describe the world to make sense of our experiences and to predict future observations.

Resource Description to Support Selection

Defining selection as the process by which resources are identified, evaluated, and then added to a collection in an organizing system emphasizes resource descriptions created by someone other than the person who is using them. We can distinguish several different ways in which resource description supports selection:

Discovery: What available resources might be added to a collection? New resources are often listed in directories, registries, or catalogs. Some types of resources are selected and acquired automatically through subscriptions or contracts.
Capability and Compatibility: Will the resource meet functional or interoperability requirements? Technology-intensive resources often have numerous specialized types of descriptions that specify their functions, performance, reliability, and other “-ilities” that determine if they fit in with other resources in an organizing system. ^[7] Some services have qualities of service levels, terms and conditions, or interfaces documented in resource descriptions that affect their compatibility and interoperability. Some resources have licensing or usage restrictions that might prevent the resources from being used effectively for the intended purposes. Decisions about “people selection” are becoming more data-driven, and sports teams, business employers, and dating sites now rely on predictive statistics to find the best person.
Authentication: Is the resource what it claims to be? (“Authenticity”) Resource descriptions that can support authentication include technological ones like time stamps, watermarking, encryption, checksums, and digital signatures. The history of ownership or custody of a resource, called its provenance (“Provenance”), is often established through association with sales or tax records. Import and export certificates associated with the resource might be required to comply with laws designed to prevent the theft of antiquities or the transfer of technology or information with national security or foreign policy implications.
Appraisal: What is the value of this resource? What is its cost? At what rate does it depreciate? Does it have a shelf life? Does it have any associated ratings, rankings, or quality measures? Moreover, what is the quality of those ratings, rankings, and measures?

We also consider the perspective of the person creating the resource description and his or her primary purpose, which is often to encourage the selection of the resource by someone else. Product marketing is about devising names and descriptions to make a resource distinctive and attractive compared to alternatives. For many years prunes were promoted as a dietary supplement that people (especially old ones) need to “maintain regularity.” But after the California Prune Board (the world’s biggest supplier) re-branded them as “dried plums” and started marketing them as a snack food (and simultaneously renaming itself as the California Dried Plum Board) sales increased significantly.^[8]

Many countries require that imported goods are labeled with their country or origin. Consumers often use this property in resource descriptions as an indicator of high quality, as they might with Swiss watches, French or Italian fashions, or Canadian bacon. Alternatively, consumers might want to buy domestic or locally-sourced goods out of economic patriotism or to comply with procurement regulations. Not surprisingly, when consumers view origin in a positive light, this information is conspicuous and easy to read. In contrast, when consumers view origin less positively, perhaps as a warning of low-quality goods, the supplier is likely to make the origin information as inconspicuous as legally possible, or might even misrepresent the goods as domestic ones.^[9]

This misrepresentation is also ubiquitous in online dating, though the amount of misrepresentation must be balanced with the goals of the relationship and chances of the deception being discovered.^[10]

Resource Description to Support Organizing

We have defined organizing as specifying the principles for describing and arranging resources to create the capabilities upon which interactions are based. This definition treats the creation of resource descriptions and their use to organize resources for interactions as separate and sequential activities. This is easiest to see when people assign keywords and classifications to documents, or when sensors produce data, and these resource descriptions are later used to enable document retrieval or data analysis. A department store clerk might sort dress shirts on a display table using labels that describe their brands, sizes, and other properties. Rules governing the collection, integration, and analysis of personal information are also resource descriptions that influence the organization of information resources.

However, even if resource description and resource organization are logically separable, at times they are intertwined. When you arrange your own clothes, you don’t use explicit resource descriptions and instead rely on implicit ones about easily perceived properties like color, shape, and material of composition. When algorithms rather than people analyze texts to identify descriptive features for applications like information retrieval, spam classification, and sentiment analysis, resource descriptions and resource organization co-evolve, often continuously as the algorithm adapts and learns with each new resource it describes. This tight connection between resource description and resource organization is also exploited in organizing systems that use usage records from session logs, browsing, or downloading activities as interaction resources, tying them to payments for using the resources or analyzing them to influence the selection and organizing of resources in future personalized interactions. (See “The Concept of “Interaction Resource””)

Resource Description to Support Interactions

Most discussions of the purposes of resource descriptions and metadata emphasize the interactions that are based on resource descriptions that have been intentionally and explicitly assigned. The Functional Requirements for Bibliographic Records(FRBR), defined by library scientists, specifies the four interactions of Finding, Identifying, Selecting, and Obtaining resources, but these apply generically to organizing systems, not just those in libraries.^[11]

Finding: What resources are available that “correspond to the user’s stated search criteria” and thus can satisfy an information need? Before there were online catalogs and digital libraries, we found resources by referencing catalogs of printed resource descriptions incorporating the title, author, and subject terms as access points into the collection; the subject descriptions were the most important finding aids when the user had no particular resource in mind. Modern users accept that computerized indexing makes search possible over not only the entire description resource, but often over the entire content of the primary resource. Businesses search directories for descriptions of company capabilities to find potential partners, and they also search for descriptions of application interfaces (APIs) that enable them to exchange information in an automated manner.
Identifying: Another purpose of resource description is to enable a user to confirm the identity of a specific resource or to distinguish among several that have some overlapping descriptions. In bibliographic contexts, this might mean finding the resource that is identified by its citation. Computer processable resource descriptions like bar codes, QR codes, or RFID tags are also used to identify resources. In Semantic Web contexts, URIs serve this purpose. Color can be used as resource descriptions when physical resources need to be identified quickly.^[12]
Selecting: Selecting in this context means the user activity of using resource descriptions to support a choice of resource from a collection, not the institutional activity of selecting resources for the collection in the first place. Search engines typically use a short “text snippet” with the query terms highlighted as resource descriptions to support selection. People often select resources with the least restrictions on uses as described in a Creative Commons license.^[13] A business might select a supplier or distributor that uses the same standard or industry reference model to describe its products or business processes because it is almost certain to reduce the cost of doing business with that business partner.^[14]
Obtaining: Physical resources often require significant effort to obtain after they have been selected. Catching a bus or plane involves coordinating your current location and time with the time and location the resource is available. With information resources in physical form, obtaining a selected resource usually meant a walk through the library stacks. With digital information resources, a search engine returns a list of the identifiers of resources that can be accessed with just another click, so it takes little effort to go from selecting among the query results to obtaining the corresponding primary resource.^[15]

Elaine Svenonius proposed that a fifth task called Navigation be added to the FRBR list, and in 2016 that happened but it was renamed as “Explore”:^[16]

Navigation or Explore

If users are not able to specify their information needs in a way that the finding functionality requires, they should be able to use relational and structural descriptions among the resources to navigate from any resource to other ones that might be better. Svenonius emphasizes generalization, aggregation, and derivational relationships.^[17] But in principle, any relationship or property could serve as the navigation “highway” between resources.

What some authors call “structural metadata” can be used to support the related tasks of moving within multi-part digital resources like electronic books, where each page might have associated information about previous, next, and other related pages. Documents described using XML models can use Extensible Stylesheet Language Transformations(XSLT) and XPath to address and select data elements, sub-trees, or other structural parts of the document.^[18]

Resource Description to Support Maintenance

Many types of resource descriptions that support selection (“Resource Description to Support Selection”) are also useful over time to support maintenance of specific resource and the collection to which they belong. In particular, technical information about resource formats and technology (software, computers, or other) needed to use the resources, and information needed to ensure resource integrity is often called “preservation metadata” in a maintenance context.^[19]

Resource descriptions that are more exclusively associated with maintenance activities include version information and effectivity, or useful life information. Equipment maintenance schedules are typically related to the number of miles driven (indicated by a car’s odometer), number of hours operated (stored by many engines), number of pages printed, or other easily recorded information about resource use or interactions. With smart resources now capable of capturing, analyzing, and communicating more data about real-time performance, more sophisticated prediction and scheduling of maintenance work is now possible. It is also easier to identify resources that are not being used as much as expected, which might imply that they are no longer needed and can thus be safely archived or discarded.

Resource Description for Sensemaking and Science

Up to now in “Determining the Purposes”, we have discussed how resource descriptions are used to perform well-defined tasks within an existing organizing system. However, there is a broader and less well-defined purpose of resource description that is older and more fundamental: the use of resource descriptions as the raw material for making sense of the world.

For thousands of years, even before the invention of written language, people have systematically collected things, information about those things, and observations of all kinds to understand how their world works. Paleolithic humans made cave paintings depicting the results of hunts and animal migrations; ancient Egyptians recorded the annual floods of the Nile River in stone carvings; and Babylonian, Egyptian, Chinese, and Mesoamerican astronomers organized lunar, solar, and planetary observations as calendars starting about five thousand years ago.

These diverse efforts to impose meaning on experience by recording, analyzing, organizing, and reorganizing observations can be collectively described as sensemaking. (See the sidebar, Sensemaking and Organizing.)

People organize to make sense of equivocal inputs and enact this sense back into the world to make it more orderly (Weick 2005)

Sensemaking and organizing are intertwined. Ancient cultures recorded time-based observations and analyzed patterns among crop cycles, commodity prices, weather conditions, and astronomical sightings. Think back to the early astronomers, who oriented temple buildings to align with astronomical events and who decorated temple walls with zodiac imagery.

Which of the planets and stars in the night sky should they observe and how should they record the details of those observations?
What mathematical and statistical techniques should be used to analyze and describe these observations?
What subset of observations are most useful in predicting the onset of the Nile River floods, caused by unobserved rainfall thousands of miles away?

Every choice about what to observe and how to describe it reflects a set of assumptions and potential hypotheses that are often implicit and unstated. Choices that increase understanding are built upon, and those that fail to provide insight are abandoned, but there is no guarantee that the iterative process of choosing what to observe and describe will lead to a correct understanding.

The principle that an accurate or comprehensive dataset is insufficient on its own to yield a correct model is exemplified in the interlocking efforts of Tycho Brahe and Johannes Kepler. Brahe was a 16th-century Danish nobleman astronomer who spent decades collecting data about the positions of hundreds of stars and the planets. However, because of prevailing religious and scientific biases, Brahe accepted the incorrect assumptions that the sun and planets revolved around the earth in circular orbits. After Brahe died in 1601, Kepler spent a decade analyzing Brahe’s data, and then rejected the idea of earth-centric and circular planetary orbits in favor of elliptical ones with the sun at one focus. These new organizing principles for Brahe’s data made the model of the solar system vastly simpler, and Kepler was able to discover laws of planetary motion that are part of the foundation of modern astronomy and physics.^[20]

Some aspects of sensemaking are hard-wired by evolution, which has given our brains powerful mechanisms that automatically simplify and organize the perceptual data we obtain from the world (see the sidebar Gestalt Principles). But this automatic sensemaking is dominated and amplified by intentional sensemaking.

Intentional sensemaking takes place when systematic statistical, experimental, and scientific methods are consciously followed to extract and organize knowledge from collections of samples, observations, or measurements. It is critical to recognize here that the contents of these collections represent choices made about what to collect, because most things and most phenomena have a great many descriptions or properties that could be recorded about them.

After things or data have been collected, statistical methods summarize the values of properties in a collection or dataset and the relationships among them. Making sense of a single collection or dataset by determining the properties that contrast and classify the instances is the start toward the more important goal of understanding the larger set or population from which the initial collection is just a sample. There is no better example of this than the periodic table of elements developed by Mendeleev in 1869, who organized known elements on the basis of their common chemical properties and then successfully predicted some properties of yet undiscovered ones.

Computational models developed from the initial dataset can predict future observations. Classification models assign a new instance to a category (e.g., spam or not spam message, Madison or Hamilton as author, outdoor or indoor scene); regression models predict a specific value of some measurement (given a description of a new movie, how much money will it make?); ordinal regression models predict values for non-metric measures (how much will you like the movie?). Experimental methods for hypothesis testing help develop and refine models of any type by systematically varying the conditions under which observations are made to discover how the results change in different situations.

A fundamental challenge in sensemaking and modeling is finding a balance between the competing goals of understanding a particular collection or dataset and being able to apply that understanding to new instances. Models can differ in the number of resource descriptions they use as parameters, and it is easy and tempting to overfit a model by using more parameters that capture random variations in observations. Overfitting produces spurious accuracy in reproducing the original observations, but it makes models less generalizable.

The highest level of sensemaking is the creation of scientific models or theories that propose interpretable and causal mechanisms for the observations. And just as automatic sensemaking creates simple explanations, scientists generally prefer simpler theories, a heuristic known as Occam’s Razor, or the law of parsimony. Even though complex theories can sometimes be more accurate, simpler theories produce more testable predictions, making it easier to verify or refine the theory. Occam’s famous principle, expressed eight centuries ago, is to prefer models that make the fewest assumptions, often measured in terms of the number of parameters or variables needed to make a prediction.^[21]^[22]^[23]

Identifying Properties

Once the purposes of description have been established, we need to identify the specific properties of the resources that can satisfy those purposes. There are four reasons why this task is more difficult than it initially appears.

First, any particular resource might need many resource descriptions, all of which relate to different properties, depending on the interactions to be supported and the context in which they take place. Selecting people for a basketball team focuses on their physical properties such as height, strength, leaping ability, and coordination. Selections for a debate team will be more concerned with their verbal and intellectual properties.
Second, different types of resources need to incorporate different properties in their descriptions. For resources in a museum, these might include materials and dimensions of pieces of art; for files and services managed by a network administrator, these include access control permissions; for electronic books or DVDs, they would include the digital rights management (DRM) code that expresses what you can and cannot do with the resource.
Third, as we briefly touched on in “Scope, Scale, and Resource Description”, which properties participate in resource descriptions depends on who is doing the describing. It makes little sense to expect fine-grained distinctions and interpretations about properties from people who lack training in the discipline of organizing. We will return to this tradeoff in “Creating Resource Descriptions” and again in “Describing Museum and Artistic Resources”.
Fourth, what might seem to be the same property at a conceptual level might be very different at an implementation level. Many resources have a resource description that is a surrogate or summary of the primary resource. For photos, paintings, and other resources whose appearance is their essence, an appropriate summary description can be a smaller, reduced resolution photo of the original. This surrogate is simple to create and easy for users to relate to the primary resource. On the other hand, distilling a text down to a short summary or abstract is a skill unto itself. Time-based resources provide greater challenges for summary. Should the summary of a movie be a textual summary of the plot, a significant clip from the movie, a video summary, or something else altogether?

This implementation gap is often very large for properties about people because people are not as easy to measure as most types of resources. Businesses need to quantify a person’s interest in their products to predict what price they would be willing to pay, but “interest” cannot be measured directly. Instead, predictions rely on proxy measures for “interest” like how long the customer looked at the product web page and whether they also looked at a competitor’s web page.

Two important dimensions for understanding and contrasting resource properties used in descriptions and organizing principles are: property essence—whether the properties are intrinsically or extrinsically associated with the resource, and; property persistence—whether the properties are static or dynamic. Taken together these two dimensions yield four categories of properties, as illustrated in Figure: Property Essence x Persistence: Four Categories of Properties. These four categories provide a useful framework for thinking about resource properties, even if, at times, the classification of properties is debatable.^[24]

Property Essence x Persistence:

Four Categories of Properties

A matrix of “Property Essence” (Intrinsic vs Extrinsic) and “Property Persistence” (Dynamic vs Static). Definitions and examples from the text are summarized here.

The distinctions of property persistence and property essence combine to distinguish four categories of properties: intrinsic static, extrinsic static, intrinsic dynamic, and extrinsic dynamic properties.

Intrinsic Static Properties

Intrinsic or implicit properties are inherent in the resource and can often be directly perceived or experienced. Static properties do not change their values over time. The species of an animal, the material of composition of a wooden chair, and the diameter of a wheel are all static properties that do not change their values over time. Static properties like color or shape are often used to describe and organize physical resources.

The spots on a Dalmatian dog are intrinsic static properties that appear shortly after birth, and they are so distinctive that it is impossible to describe the breed without acknowledging the spots.

A Dalmatian dog; white with black spots.

This particular Dalmatian is the “greeter” at the Viader Winery in Deer Park, California. The dog is nice and the wines are excellent.

(Photo by R. Glushko.)

Intrinsic physical properties are usually just part of resource descriptions. In many cases, physical properties describe only the surface layer of a resource, revealing little about what something is or its original intended purpose, what it means, or when and why it was created. The author of a song and the context of its creation are other examples of intrinsic and static resource properties that are not directly perceivable.

Intrinsic descriptions are often extracted or calculated by computational processes. For example, a computer program might calculate the frequency and distribution of words in some particular document. Similarly, visual signatures or audio fingerprints are intrinsic descriptions (“Describing Non-text Resources”).

Some relationships among resources are intrinsic and static, like the parent-child relationship or the sibling relationship between two children with the same parents. Part-whole or compositional relationships for resources with parts are also intrinsic static properties often used in resource descriptions. However, it is better to avoid treating resource relationships as properties, and instead express them as relations. Describing Relationships and Structures discusses part-whole and other semantic relationships in great detail.

Extrinsic Static Properties

Extrinsic or explicit properties are assigned to a resource rather than being inherent in it. The name or identifier of a resource is often arbitrary but once assigned does not usually change. Arranging resources according to the alphabetical or numerical order of their descriptive identifiers is a common organizing principle. Classification numbers and subject headings assigned to bibliographic resources are extrinsic static properties, as are the serial numbers stamped on or attached to manufactured products.

For information resources that have a digital form, the properties of their printed or rendered versions might not be intrinsic. Some text formats completely separate content from presentation, and as a result, style sheets can radically change the appearance of a printed document or web page without altering the primary resource in any way. For example, were a different style applied to this paragraph to highlight it in bold or cast in 24-point font, its content would remain the same.

Intrinsic Dynamic Properties

Intrinsic dynamic properties change over time. Developmental personal characteristics like a person’s height and weight, skill proficiency, or intellectual capacity, for example. Because these properties are not static, they are usually employed only to organize resources whose membership in the collection is of limited duration. Sports programs or leagues that segregate participants by age or years of experience are using intrinsic dynamic properties to describe and organize the resources.

Extrinsic Dynamic Properties

Extrinsic dynamic properties are in many ways arbitrary and can change because they are based on usage, behavior, or context. The current owner or location of a resource, its frequency of access, the joint frequency of access with other resources, its current popularity or cultural salience, or its competitive advantage over alternative resources are typical extrinsic and dynamic properties that are used in resource descriptions. A topical book described as a best seller one year might be found in the discount sales bin a few years later. A student’s grade point average is an extrinsic dynamic property.

A painting. The subjects are two men in fine dress, whose elbows rest on a table laden with various objects of interest.

An example of a cultural category that has far outlasted its motivation is that of the Holbein carpet. A particular type of geometrically patterned Ottoman rug came to be known as a “Holbein carpet” after the German Renaissance painter Hans Holbein, who often depicted the rugs in his work (probably to show off his extremely meticulous technique). Holbein was famous in his time, and his commissioned paintings of the English King Henry VIII have Henry standing on such rugs. This painting, called “The Ambassadors,” was painted in 1533 and now hangs in The National Gallery, London.

(Source: Google Art Project)

Extrinsic dynamic properties are useful features for data scientists making prediction or classification models. Your current location, the thing you just bought, and the place you bought it can be viewed as manifestations of unobservable preferences and values. Fingerprints found on a doorknob at a crime scene are an extrinsic dynamic property associated with the door, and clever detectives would analyze them along with other interaction resources they discovered with the goal of identifying the person for whom the fingerprints are intrinsic static properties.

Many relationships between resources are extrinsic and dynamic properties, like that of best friend.

Contextual properties are those related to the situation or context in which a resource is described. Dey defines context as “any information that characterizes a situation related to the interactions between users, applications, and the surrounding environment.” ^[25] This open-ended definition implies a large number of contextual properties that might be used in a description; crisper definitions of context might be “location + activity” or “who, when, where, why.” Since context changes, context-based descriptors might be appropriate when assigned but can have limited persistence and effectivity (“Resources over Time”); the description of a document as “receipt of a recent purchase” will not be useful for very long.

Recent advances in computing technology and data science techniques are making it possible to discover or create resource properties that are called “latent” because they are inferred rather than observed. Many such features are used by businesses to segment customers or make recommendations to them based on their recent behavior, so these features are also extrinsic and dynamic.

Your own movie preferences prove that easy to identify properties like sex and age do not differentiate movie watchers enough to make good recommendations, even if you combine them to create a category like “single male students between 18 and 25.” Netflix found that it was necessary to combine demographic properties, viewing history, and browsing behavior with very detailed ratings of dozens of movie properties to make good recommendations. It takes enormous computing power to discover a category of Netflix users who typically like action movies, yet consistently hover their mouse over romance movies, and to use this latent feature to recommend a sub-genre of western movies (one of nearly 100,000) that it calls “Romantic Action Adventure Movies.”^[26]

Citations of one information resource by another are extrinsic static descriptions when they are in print form, but when they are published in digital libraries it is usually the case that “cited by” is a dynamic resource description. Similarly, any particular link from one web page to another is an extrinsic static description, but because many web pages themselves are highly dynamic, we can also consider links as dynamic as well. Citations and web links are discussed in more detail in Describing Relationships and Structures.

Resources are often described with cultural properties that derive from conventional language or culture, often by analogy, because they can be highly evocative and memorable. ^[27]

Sometimes a cultural description outlives its salience, losing its power to evoke anything other than puzzlement about what it might mean.^[28]

For the Lego boys, current with the latest Star Wars movies, “light saber” was just the obvious description for a long, neon tube with a handle. However, someone unfamiliar with the Star Wars franchise might not understand“light saber,” and would describe the piece some other way.

Designing the Description Vocabulary

After we have determined the properties to use in resource descriptions, we need to design the description vocabulary: the set of words or values that represent the properties. “Naming Resources” discussed the problems of naming and proposed principles for good names, and since names are a very important resource description, much of what we said there applies generally to the design of the description vocabulary.

However, because the description vocabulary as a whole is much more than just the resource name, we need to propose additional principles or guidelines for this step. In addition, some new design questions arise when we consider all the resource descriptions as a set whose separate descriptions are created by many people over some period of time.

Principles of Good Description

In The Intellectual Foundation of Information Organization, Svenonius proposes a set of principles or “directives for design” of a description language.^[29] Her principles, framed in the narrow context of bibliographic descriptions, generally apply to the broad range of resource types we consider in this book.

User Convenience: Choose description terms with the user in mind; these are likely to be terms in common usage among the target audience.
Representation: Use descriptions that reflect how the resources describe themselves; assume that self-descriptions are accurate.
Sufficiency and Necessity: Descriptions should have enough information to serve their purposes and not contain information that is not necessary for some purpose; this might imply excluding some aspects of self-descriptions that are insignificant.
Standardization: Standardize descriptions to the extent practical, but also use aliasing to allow for commonly used terms.
Integration: Prefer the same properties and terms for all types of resources.

Any set of general design principles faces two challenges.

The first is that implementing any principle requires many additional and specific context-dependent choices for which the general principle offers little guidance. For example, how does the principle of Standardization apply if multiple standards already exist in some resource domain? Which of the competing standards should be adopted, and why?
The second challenge is that the general principles can sometimes lead to conflicting advice. The User Convenience recommendation to choose description terms in common use fails if the user community includes both ordinary people and scientists who use different terms for the same resources; whose “common usage” should prevail?

Who Uses the Descriptions?

Focus on the user of the descriptions. This is a core idea that we cannot overemphasize because it is implicit in every step of the process of resource description. All of the design principles in the previous section share the idea that the design of the description vocabulary should focus on the user of the descriptions. Are the resources being organized personal ones, for personal and mostly private purposes? In that case, the description properties and terms can be highly personal or idiosyncratic and still follow the design principles.

Similarly, when resource users share relevant knowledge, or are in a context where they can communicate and negotiate, if necessary, to identify the resources, their resource descriptions can afford to be less precise and rigorous than they might otherwise need to be. This helps explain the curious descriptions in the Lego story with which we began this chapter. The boys playing with the blocks were talking to each other with the Legos in front of them. If they had not been able to see the blocks the others were talking about, or if they had to describe their toys to someone who had never played with Legos before, their descriptions would have been quite different.

More often, however, resource descriptions can not assume this degree of shared context and must be designed for user categories rather than individual users: library users searching for books, business employees or customers using part and product catalogs, scientists analyzing the datasets from experiments or simulations. In each of these situations resource descriptions will need to be understood by people who did not create them, so the design of the description vocabulary needs to be more deliberate and systematic to ensure that its terms are unambiguous and sufficient to ensure reliable context-free interpretation. A single individual seldom has the breadth of domain knowledge and experience with users needed to devise a description vocabulary that can satisfy diverse users with diverse purposes. Instead, many people working together typically develop the required description vocabulary. We call the results institutional vocabularies, to contrast them with individual or cultural ones. (We will discuss this contrast more fully in Categorization: Describing Resource Classes and Types)

Everyone knows something about trees, but some people know more than others, and their particular experience and perspective influences how they describe trees. What kind of properties and descriptions would be used by university students? By research botanists? By landscape designers? By park maintenance workers? By indigenous people who live in tropical rain forests?

Some resource descriptions are designed for use by machines, which seemingly reduces the importance of design principles that consider user preferences or common uses. However, even if resources are described and organized by algorithms, when people need to explain the classifications and predictions that the algorithms produce, resource descriptions that are comprehensible and easily communicated are preferable to statistically optimal ones. Moreover, standardization and integration principles become more important for inter-machine communication to enable efficient processing, reuse of data and software, and increased interoperability among organizing systems.^[30]

Controlled Vocabularies and Content Rules

As we defined in “Use Controlled Vocabularies”, a controlled vocabulary is a fixed or closed set of description terms in some domain with precise definitions that is used instead of the vocabulary that people would otherwise use. For example, instead of the popular terms for descriptions of diseases or symptoms, medical researchers and teaching hospitals can use the National Library of Medicine’s Medical Subject Headings(MeSH) controlled vocabulary.^[31]

We can distinguish a progression of vocabulary control: a glossary is a set of allowed terms; a thesaurus is a set of terms arranged in a hierarchy and annotated to indicate terms that are preferred, broader than, or narrower than other terms; an ontology expresses the conceptual relationships among the terms in a formal logic-based language so they can be processed by computers. We will say more about ontologies in Describing Relationships and Structures.

Content rules are similar to controlled vocabularies because they also limit the possible values that can be used in descriptions. Instead of specifying a fixed set of values, content rules typically restrict descriptions by requiring them to be of a particular data type (integer, Boolean, Date, and so on). Possible values are constrained by logical expressions (e.g., a value must be between 0 and 99) or regular expressions (e.g., must be a string of length 5 that must begin with a number). Content rules like these are used to ensure valid descriptions when people enter them in web forms or other applications.

Vocabulary Control as Dimensionality Reduction

In most cases, a controlled vocabulary is a subset of the natural or uncontrolled vocabulary, but sometimes it is a new set of invented terms. This might sound odd until we consider that the goal of a controlled vocabulary is to reduce the number of descriptive terms assignable to a resource. Stated this way the problem is one of dimensionality reduction, transforming a high-dimensional space into a lower-dimensional one. Reducing the number of components in a multidimensional description can be accomplished by many different statistical techniques that go by names like “feature extraction,” “principle components analysis,” “orthogonal decomposition,” “latent semantic analysis,” “multidimensional scaling,” and “factor analysis.” ^[32]

These techniques might sound imposing and they are computationally complex, but they all have the same simple concept at their core, that the features or properties that describe some resource are often highly correlated. For example, a document that contains the word “car” is more likely to contain the words “driver” and “traffic” than a document that does not. Similar correlations exist among the visual features used to describe images and the acoustic features that describe music. Dimensionality reduction techniques analyze the correlations between resource descriptions to transform a large set of descriptions into a much smaller set of uncorrelated ones. In a way this implements the principle of Sufficiency and Necessity we mentioned in “Principles of Good Description” because it eliminates description dimensions or properties that do not contribute much to distinguishing the resources.

Here is an oversimplified example that illustrates the idea. Suppose we have a collection of resources, and every resource described as “big” is also described as “red,” and every “small” resource is also “green.” This perfect correlation between color and size means that either of these properties is sufficient to distinguish “big red” things from “small green” ones, and we do not need clever algorithms to figure that out. But if we have thousands of properties and the correlations are only partial, we need the sophisticated statistical approaches to choose the optimal set of description properties and terms, and in some techniques the dimensions that remain are called “latent” or “synthetic” ones because they are statistically optimal but do not map directly to resource properties.

Designing the Description Form

By this step in the process of resource description we have made numerous important decisions about which resources to describe, the purposes for which we are describing, them, and the properties and terms we will use in the descriptions. As much as possible we have described the steps at a conceptual level and postponed discussion of implementation considerations about the notation, syntax, and deployment of the resource descriptions separately or in packages. Separating design from implementation concerns is an idealization of the process of resource description, but is easier to learn and think about resource description and organizing systems if we do. We discuss these implementation issues in The Forms of Resource Descriptions.

Sometimes we have to confront legacy technology, existing or potential business relationships, regulations, standards conformance, performance requirements, or other factors that have implications for how resource descriptions must or should be implemented, stored, and managed. We will take this more pragmatic perspective in The Organizing System Roadmap, The Organizing System Roadmap, but until then, we will continue to focus on design issues and defer discussion of the implementation choices.

Creating Resource Descriptions

Resource descriptions can be created by professionals, by the authors or creators of resources, by users, or by computational or automated means.

From the traditional perspective of library and information science with its emphasis on bibliographic description, these modes of creation imply different levels of description complexity and sophistication; Taylor and Joudrey suggest that professionals create rich descriptions, untrained users at best create structured ones, and automated processes create simple ones.

This classification reflects a disciplinary and historical bias more than reality. “Simple” resource descriptions are “no more than data extracted from the resource itself… the search engine approach to organizing the web through automated indexing techniques.”^[33]

It might be fair to describe an inverted index implementation of a Boolean information retrieval model as simple, but it is clearly wrong to consider what Google and other search engines do to describe and retrieve web resources as simple.^[34]

A better notion of levels of resource description is one based on the amount of interpretation imposed by the description, an approach that focuses on the descriptions themselves rather than on their methods of creation. We will discuss this sort of approach in “Describing Museum and Artistic Resources” in the context of describing museum and artistic resources.

Professionally-created resource descriptions, author- or user-created descriptions, and computational or automated descriptions each have strengths and limitations that impose tradeoffs. A natural solution is to try to combine desirable aspects from each in hybrid approaches. For example, the vocabulary for a new resource domain may arise from tagging by end users but then be refined by professionals, lay classifiers may create descriptions with help from software tools that suggest possible terms, or software that creates descriptions can be improved by training it with human-generated descriptions, a form of supervised learning (see “Categories Created by Clustering”).

Often existing resource descriptions can or must be transformed or enhanced to meet the ongoing needs of an organizing system, and sometimes these processes can be automated. We will defer further discussion of those situations to Interactions with Resources. In the discussion that follows we focus on the creation of new resource descriptions where none yet exist.

Resource Description by Professionals

Before the web made it possible for almost anyone to create, publish, and describe their own resources and to describe those created and published by others, resource description was generally done by professionals in institutional contexts. Professional indexers and catalogers described bibliographic and museum resources after having been trained to learn the concepts, controlled descriptive vocabularies, and the relevant standards. In information systems domains professional data and process analysts, technical writers, and others created similarly rigorous descriptions after receiving analogous training. We have called these types of resource descriptions institutional ones to highlight the contrast between those created according to standards and those created informally in ad hoc ways, especially by untrained or undisciplined individuals.^[35]

Resource Description by Authors or Creators

The author or creator of a resource can be presumed to understand the reasons why and the purposes for which the resource can be used. And, presumably, most authors want to be read, so they will describe their resources in ways that will appeal to and be useful to their intended users. However, these descriptions are unlikely to use the controlled vocabularies and standards that professional catalogers would use.

Resource Description by Users

Today’s web contains a staggering number of resources, most of which are primary information resources published as web content, but many others are resources that stand for “in the world” physical resources. Most of these resources are being described by their users rather than by professionals or by their authors. These “at large” users are most often creating descriptions for their own benefit when they assign tags or ratings to web resources, and they are unlikely to use standard or controlled descriptors when they do so.^[36] The resulting variability can be a problem if creating the description requires judgment on the tagger’s part. Most people can agree on the length of a particular music file but they may differ wildly when it comes to determining to which musical genre that file belongs. Fortunately most web users implicitly recognize that the potential value in these “Web 2.0” or “user-generated content” applications will be greater if they avoid egocentric descriptions. In addition, the statistics of large sample sizes inevitably leads to some agreement in descriptions on the most popular applications because idiosyncratic descriptions are dominated in the frequency distribution by the more conventional ones.^[37]

We are not suggesting that professional descriptions are always of high quality and utility, and socially produced ones are always of low quality and utility.^[38] Rather, it is important to understand the limitations and qualifications of descriptions produced in each way. Tagging lowers the barrier to entry for description, making organizing more accessible and creating descriptions that reflects a variety of viewpoints. However, when many tags are associated with a resource, it increases recall while decreasing precision. (See “Resource Description by Users”)

Automated and Computational Resource Description

A picture’s EXIF file created by a digital camera records properties associated with the camera and its settings, as well as some properties of the photo-taking context. (See Figure: Contrasting Descriptions for a Work of Art. for an example.) Creating this highly detailed description by hand would be nearly impossible. The downside, however, is that the automated description does not capture the meaning of the photo; an automated picture description captures the time and place, but not that it is a picture of a honeymoon vacation. The difference between automated and human description is called the semantic gap (“The Semantic Gap”).

Any resource that is smart enough to collect data about its state or environment is creating resource descriptions automatically (See “Active or Operant Resources”). Resources with computational capabilities can process the raw sensor data to identify important events and create more interpretable descriptions.

Some computational approaches create resource descriptions that are similar in purpose to those created by human describers. Text mining and summarization systems for customer comments about products can reduce thousands of comments to a list of the most important features.^[39] People shopping for books at Amazon.com get insights about a book’s content and distinctiveness from the statistically improbable phrases that it has identified by comparing all the books for which it has the complete text.^[40]

Computational descriptions can use any observable or latent variable (see the sidebar, Latent Feature Creation and Netflix Recommendations) except some that are prohibited by law, such as race, religion, national origin, and marital status, to prevent discrimination. In practice, however, this prohibition is easily circumvented because these properties can usually be predicted using other ones. For example, race can often be reliably predicted using residence address and surname.^[41]

Of course, all information retrieval systems compare a description of a user’s needs with descriptions of the resources that might satisfy them. IR systems differ in the resource properties they emphasize; word frequencies and distributions for documents in digital libraries, links and navigation behavior for web pages, acoustics for music, and so on. These different property descriptions determine the comparison algorithms and the way in which relevance or similarity of descriptions is determined. We say a lot more about this in “Describing Non-text Resources” and in Interactions with Resources.

Evaluating Resource Descriptions

Evaluation is implicit in many of the activities of organizing systems we described in Activities in Organizing Systems and is explicit when we maintain a collection of resources over time. In this section, we focus on the narrower problem of evaluating resource descriptions.

Evaluating means determining quality with respect to some criteria or dimensions. Many different sets of criteria have been proposed; for repositories of digital resources, the most commonly used ones are accuracy, completeness, and consistency.^[42] Other typical criteria are timeliness, interoperability, and usability. It is easy to imagine these criteria in conflict; efforts to achieve accuracy and completeness might jeopardize timeliness; enforcing consistency might preclude modifications and personalizations that would enhance usability.

What characteristics or criteria would you use to determine the quality of a car? Of food? Of clothing? Of a place to live? Which of these criteria are domain-specific, and which ones apply more generally to many types of resources?

The quality of the outcome of the multi-step process proposed in this chapter is a composite of the quality created or squandered at each step. A scope that is too granular or abstract, overly ambitious or vague intended purposes, a description vocabulary that is hard to use, or giving people inadequate time to create good descriptions can all cause quality problems, but none of these decisions is visible at the end of the process where users interact with resource descriptions.

Evaluating the Creation of Resource Descriptions

When professionals create resource descriptions in a centralized manner, which has long been the standard practice for many resources in libraries, there is a natural focus on quality at the point of creation to ensure that the appropriate controlled vocabularies and standards have been used. However, the need for resource description generalizes to resource domains outside of the traditional bibliographic one, and other quality considerations emerge in those contexts.

Resource descriptions in private sector firms are essential to running the business and in interacting efficiently with suppliers, partners, and customers. Compared to the public sector, there is much greater emphasis on the economics and strategy of resource description.^[43] What is the value of resource description? Who will bear the costs of producing them? Which of the competing industry standards will be followed? Some of these decisions are not free choices as much as they are constraints imposed as a condition of doing business with a dominant economic partner, which is sometimes a governmental entity.

For example, a firm like Wal-Mart with enormous market power can dictate terms and standards to its suppliers because the long-term benefits of a Wal-Mart contract usually make the initial accommodation worthwhile. Likewise, governments often require their suppliers to conform to open standards to avoid lock-in to proprietary technologies. ^[44]

In both the public and private sectors there is increased use of computational techniques for creating resource descriptions because the number of resources to be described is simply too great to allow for professional description. A great deal of work in text data mining, web page classification, semantic enrichment, and other similar research areas is already under way and is significantly lowering the cost of producing useful resource descriptions. Some museums have embraced approaches that automatically create user-oriented resource descriptions and new user interfaces for searching and browsing by transforming the professional descriptions in their internal collections management systems.^[45] Google’s ambitious project to digitize millions of books has been criticized for the quality of its algorithmically extracted resource descriptions, but we can expect that computer scientists will put the Google book corpus to good use as a research test bed to improve the techniques.^[46]

Web 2.0 applications that derive their value from the aggregation and interpretation of user-generated content can be viewed as voluntarily ceding their authority to describe and organize resources to their users, who then tag or rate them as they see fit. In this context the consistency of resource description, or the lack of it, becomes an important issue, and many sites are using technology or incentives to guide users to create better descriptions.

Evaluating the Use of Resource Descriptions

Regardless of, or in addition to, any quality criteria applied to the creation and selection of resource descriptions, at some point the resource descriptions meet their intended users. The most important quality criterion at that point is whether the resource descriptions satisfy their intended purposes in a usable way. In many ways, the answer is a disappointing no.

For example, in one of the earliest revisions to the original HTML specification, a <META> tag was added to allow creators of web resources to define a set of key terms to describe a website or web page. This well-motivated resource description was to be used by search engines to improve the relevance of retrieved pages. However, it soon became obvious that it was possible to “game” the META tag by adding popular terms even though they did not accurately describe the page. Today search engines ignore the <META> tag for ranking pages, but many other techniques that use false resource descriptions continue to plague web users. (See “Social and Web Curation”.)

The design of a description vocabulary circumscribes what can be said about a resource, so it is important to recognize that it implicitly determines what cannot be said as well, with unintended negative consequences for users. The resource description schema implemented in a physician’s patient management system defines certain types of recordable information about a patient’s visit—the date of the visit, any tests that were ordered, a diagnosis that was made, a referral to a specialist. The schema, and its associated workflow, impose constraints that affect the kinds of information medical professionals can record and the amount of space they can use for those descriptions. Moreover, such a schema might also eliminate vital unstructured space that paper records can provide, where doctors communicate their rationale for a diagnosis or decision without having to fit it into any particular box.

However, when resource descriptions are the data used to train models for prediction or classification, the focus of evaluation is not on the descriptions, which are often assumed to be accurate observations about the world. Instead, evaluation focuses on the model, and “model selection” is the task of choosing which of several competing models best fit the original data while also generalizing well to new data. In any event, any quality problems or selection biases with the original data will undermine the value of whatever model is selected.

The Importance of Iterative Evaluation

The inevitable conflicts between quality goals mean that there will be compromises among the quality criteria. Furthermore, increasing scale in an organizing system and the steady improvements of computational techniques for resource description imply that the nature of the compromise will change over time. As a result, a single evaluation of resource descriptions at one moment in time will not suffice.

This makes usage records, navigation history, and transactional data extremely important kinds of resource descriptions because they enable you to focus efforts on improving quality where they are most needed. Furthermore, for organizing systems with many types of resources and user communities, this information can enable the tailoring of the nature and extent of resource description to find the right balance between “rich and comprehensive” and “simple and efficient” approaches. Each combination of resource type and user community might have a different solution.

The idea that quality is a property of an end-to-end process is embodied in the “quality movement” and statistical process control for industrial processes but it applies equally well to resource description. The central idea is that quality cannot be tested in by inspecting the final products. Instead, quality is achieved through process control—measuring and removing the variability of every process needed to create the products.^[47] Explicit feedback from users or implicit feedback from the records of their resource interactions needs are essential as we iterate through the design process and revisit the decisions made there.

Because the relational database schema serves as a model for the creation of resource descriptions, it is designed to restrict the descriptions to be simple and completely regular sets of attribute-value pairs. The database schema specifies the overall structure of the tables and especially their columns, which will contain the attribute values that describe each resource. An employee table might have columns for the attributes of employee ID, hiring date, department, and salary. A date attribute will be restricted to a value that is a date, while an employee salary will be restricted according to salary ranges established by the human resources department. This makes the name of the attribute and the constraints on attribute values into resource descriptions that apply to the entire class of resources described by the table.

It is often necessary to associate some descriptions with individual resources that are specific to that instance and other kinds of descriptions that reflect the abstract class to which the instance belongs. When a typical car comes off the assembly line, it has only one instance-level description that differentiates it from its peers: its vehicle identification number (VIN). Specific cars have individualized interior and exterior colors and installed options, and they all have a date and location of manufacture. Other description elements have values that are shared with many other cars of the same model and year, like suggested price and the additional option packages, or configurations that can be applied to it before it is delivered to a customer. Alternatively, any descriptive information that applies to multiple cars of the same model year could be part of a resource description at that level that is referred to rather than duplicated in instance descriptions.

↵
Web services are generally implemented using XML documents as their inputs and outputs. The interfaces to web services are typically described using an XML vocabulary called Web Services Description Language(WSDL). See (Erl 2005b), especially Ch. 3, Introduction to Web Services Technologies.

↵
Creating descriptions that can keep pace with the growth of a collection has been an issue for librarians for years, as libraries moved away from describing simply “whatever came across a cataloger’s desk” to cataloging resources for a national and even international audience (Svenonius 2000, p. 31).

↵
The AACR2 includes rules for books, pamphlets, and printed sheets; cartographic materials; manuscripts and manuscript collections; music; sound recordings; motion pictures and video recordings; graphic materials; electronic resources; 3-D artifacts; microforms; and continuing resources. The Concise AACR2 (4th Edition) is the most accessible treatment of these very complex rules (Gorman 2004). The Resource Description and Access(RDA) vocabularies are the successor to AACR2 and make even finer distinctions among resource types. See the RDA content, media, and carrier value lists.

↵
We can also view the Dublin Core as part of the intellectual foundations for the “crowdsourcing” or “community curation” of resource descriptions by non-professionals (“Social and Web Curation”). See the Dublin Core Metadata Initiative (DCMI) at http://dublincore.org/.

↵
The semantic “bluntness” of a minimalist vocabulary is illustrated by the examples for use of the “creator” element in an official Dublin Core user guide (Hillmann 2005) that shows “Shakespeare, William” and “Hubble Telescope” as creators.

↵
The Intel Core 2 Duo Processor has detailed specifications (http://www.intel.com/products/processor/core2duo/specifications.htm) and seven categories of technical documentation: application notes, datasheets, design guides, manuals, updates, support components, and white papers (http://www.intel.com/design/core2duo/documentation.htm).

↵
Real estate advertisements are notorious for their creative descriptions; a house “convenient to transportation” is most likely next to a noisy highway, and a house in a “secluded location” is in a remote and desolate part of town.

↵
In its early days. when US consumers were generally unaware that Sony was a Japanese company and the quality of Japanese products was viewed in a negative light, Sony would make the “Made in Japan” label as inconspicuous as it could get away with. (John 1999)

In the summer of 2015, the consumer advocacy organization Truth in Advertising reported finding on Walmart’s website over 100 product descriptions inaccurately presenting the products as being made in the United States. (See https://www.truthinadvertising.org/walmart-made-in-usa/)

↵
Findings from a study of four online dating services (Toma et al 2008) found that 81% of people lied about at least one characteristic. Men were more likely to lie about height, while women lied more about weight, and the further their actual heights and weights were from the mean, the more they lied. A later study (Hall et al 2010) confirmed the finding for women and weight but also found that men are highly likely to misrepresent their personal assets.

↵
We encountered FRBR several times in previous chapters, especially in “Identity and Bibliographic Resources”, where we asked “What is this thing we call ‘Macbeth’?” and which describes FRBR four-level abstraction hierarchy of the work, depicted in Figure: The FRBR Abstraction Hierarchy.

The FRBR framework is the most recent formalization of the purposes of resource description that started in nineteenth-century libraries. This long history means it is not surprising that how we think about resource description still shows some bias toward interactions with physical bibliographic resources and the descriptions needed to obtain them.

There are three members of the “FRBR family.” One member is the “Group 1 entities” —Work, Expression, Manifestation, and Item—which are used to define classes of intellectual products and their relationships to each other. The “Group 2 entities” are responsible for the creation or custodianship of the Group 1 entities — Person, Family, and Corporate Body. The Group 2 model and requirements have been developed further as the Functional Requirements for Authority Data(FRAD) to enable catalogs to answer questions about the relationships between corporate entities or collaborators; see http://www.ifla.org/publications/functional-requirements-for-authority-data and http://archive.ifla.org/VII/d4/wg-franar.htm#Authority. The “Group 3 entities” are the subject descriptions for Group 1 and Group 2; see the Functional Requirements for Subject Authority Data(FRSAD) at http://www.ifla.org/node/1297.

For bibliographic resources these interactions and the models of resource descriptions needed to support them have been formalized as the Functional Requirements for Bibliographic Records(FRBR). We encountered FRBR several times in previous chapters (especially in “Bibliographic Resources, Information Components, and “Smart Things” as Resources”) where we asked “What is this thing we call ‘Macbeth’?” and described the FRBR four-level abstraction hierarchy of the work.

↵
In the very busy and dangerous environment of an aircraft carrier flight deck, the sailors wear vests and shirts that are color-coded to their jobs. For example, red shirts handle munitions, purple shirts handle fuel, green shirts run the catapults and hooks that launch and land the jets, and yellow shirts manage the flights. Color makes it faster and takes less attention for people to see if the right people are where they are supposed to be

The official Navy color chart for aircraft carrier personnel is available at http://www.navy.mil/navydata/ships/carriers/rainbow.asp

A similar principle is used in some sports; goalies wear different color jerseys to make it easier to enforce position-specific rules, and football quarterbacks wear distinctive practice jerseys to remind defensive players not to tackle them and possibly injure them.

It is worth noting that color blindness affects approximately 7% of the population.

↵
The Creative Commons nonprofit organization defines six kinds of copyright licenses that differ in the extent they allow commercial uses or modifications of an original resource (see http://creativecommons.org/licenses/ ). The Flickr photo sharing application is a good example of a site where a search for reusable resources can use the Creative Commons licenses to filter the results (http://www.flickr.com/creativecommons/ ).

↵
Using the same standards to describe products or to specify the execution of business processes can facilitate the implementation and operation of information-intensive business models because information can then flow between services or firms without human intervention. In turn this enables the business to become more demand or event-driven rather than forecast driven, making it a more “adaptive,” “agile,” or “on-demand” enterprise. See (Glushko and McGrath 2005), especially Ch. 5, How Models and Patterns Evolve.

↵
For new resources, the labor-intensive cost of traditional bibliographic description is less justifiable when you can follow a link from a resource description to the digital resource it describes and quickly decide its relevance. That is, web search engines demonstrate that algorithmic analysis of the content of information resources can make them self-describing to a significant degree, reducing the need for bibliographic description.

↵
See the FRBR-Library Reference Model

↵
(Svenonius 2000, pages 18-19).

↵
Ken Holman’s Definitive XSLT and XPath (Holman 2001) is the book to get started on with XPath, and no one has taught more people about XPath than Holman. The first five hours of a 24-hour video course on Practical Transformation Using XSLT and XPath is available for free at http://www.udemy.com/practical-transformation-using-xslt-and-xpath.

↵
(PREMIS 2011). The Preservation Metadata Implementation Strategies(PREMIS) standard for preservation metadata is maintained by the US Library of Congress at http://www.loc.gov/standards/premis/. A good place to start is the 2011 PREMIS Data Dictionary(http://www.loc.gov/standards/premis/v2/premis-2-1.pdf ).

↵
(Lockyer 1893) and (Bell 1970)

The joint story of Brahe’s data collecting and Kepler’s analysis and theorizing is told in an entertaining manner in (Ferguson 2002). An equally fascinating analysis that interprets Kepler’s conceptual shifts with a model of analogical reasoning is (Gentner et al., 1997).

↵
The concept of sensemaking originated from business school research in management and organizational theory (Weick 1995) but has been widely employed by ethnographers in many contexts including emergency rooms, classrooms with minority students, airline safety inspections, and crime investigation. See (Weick 2005) and (Chater 2016)

↵
Occam’s Razor has a long tradition in scientific philosophy, but some people have argued that it is overrated as a heuristic for choosing among alternative explanations or theories, particularly because it depends on how you define simplicity.

↵
One way to make simplicity more useful as a guide for choosing between mathematical models is to explicitly penalize those that are more complex by adding error to the predictions, a technique that computer scientists have given the non-intuitive name of regularization. This penalty requires complex models to be significantly better at explaining the data than simpler ones because they have to overcome the added errors.

↵
For example, the composition of a chair is presented here as a static intrinsic property but in fact a wooden chair might deteriorate over time as a result of exposure to sunlight, heat, or biological agents that attack it. A skill can be considered intrinsic and dynamic, but it might also be highly dependent on context, making it extrinsic. The subject category assigned to a book is extrinsic and static, but if the classification system is revised the book might be reclassified. Finally, while the location of a resource can be extrinsic and dynamic, the location history of the resource at some specific point in time is a fact, an intrinsic and static property.

↵
(Dey 2001) further defines the “environment” of context as places, people, and things, and for each of “entities” there are four categories of context information: location, identity, status (or activity), and time. This framework thus yields 12 dimensions for describing the context of an environment.

↵
A fascinating story about the Netflix's design and use of tens of thousands of movie sub-genres in its recommendation system is (Madrigal 2014).

↵
Consider how many events are named by appending a “-gate” suffix to imply that there is something scandalous or unethical going on that is being covered up. This cultural description is not immediately meaningful to anyone who does not know about the break-in at the headquarters of the Democratic National Committee headquarters at the Watergate hotel and subsequent cover-up that led to the 1974 resignation of US President Richard Nixon. A list of “-gate” events is maintained at http://en.wikipedia.org/wiki/List_of_scandals_with_%22-gate%22_suffix.

↵
(http://en.wikipedia.org/wiki/Holbein_carpet ).

↵
(Svenonius 2000, Ch. 5).

↵
(Laskey 2005).

↵
http://www.ncbi.nlm.nih.gov/mesh/.

↵
We cannot cite all of mathematical statistics in one short endnote, but if you are inclined to learn more, (Mardia, Kent, and Bibby 1980) and (Lee and Verleysen 2007) are the kindest and gentlest resources. If we look very generously at “dimensionality reduction” we might even consider the indexing step of eliminating “stop words” to be a form of dimensionality reduction. Stop words appear with such high frequency that they have no discriminating power, so they are discarded from queries and not part of the description of the indexed documents.

↵
(Taylor and Joudrey 2009), p. 91.

↵
See Ch. 4 of (Buttcher, Clarke, and Cormack 2010) for a description of a simple Boolean information retrieval model and Chapter 14 and 15 for descriptions of Google-scale ones. For a popular discussion of the Google algorithm see (Levy 2010).

↵
Many institutional organizing systems are subject to a single centralized or governmental authority that can impose principles for describing and arranging resources. Examples of organizing systems where resources are described using standard centralized principles are:

Libraries that use national bibliographic standards to satisfy requirements set by industry associations or other accreditation bodies such as the Association of College and Research Libraries (ACRL). (http://ala.org/acrl )

Companies that follow industry standards for information or process models, product classification or identification to be eligible for government business (Shah and Kesan 2006).

Legislative documents that conform to National or European Community standards for structure, naming, and description (Biasiotti 2008).

The Internet Corporation for Assigned Names and Numbers(ICANN) and its policies for operating the Domain Name System(DNS) make it possible for every website to be located using its logical name (like “berkeley.edu” rather than using an IP address like 169.229.131.81). (http://www.icann.org/ )

In other domains multiple organizations or institutions have the authority to impose principles of resource description. Sometimes this authority derives from the voluntary collaboration of multiple autonomous parties who set and conform to standards because they benefit from being able to share resources or information about resources. Examples of organizing systems where resources are described using standardized decentralized principles are:

Firms that establish company-wide standards for their information resources, typically including the organization and management of source content, document type models, and a style guide that applies to print and web documents.

Firms that participate in the OASIS (http://www.oasis-open.org/ ) or the W3C (http://www.w3.org/ ) industry consortia to establish specifications or technical recommendations for their information systems or web services).

↵
Many organizing systems describe and arrange their physical or information resources in ad hoc ways because the person or institution determining the arrangement is completely autonomous. This is the domain of organizing systems embraced by David Weinberger in Everything is Miscellaneous (Weinberger 2007).

↵
(Sen et al. 2006) analyze the effects of four tag selection algorithms used in sites that allow user tags on vocabulary evolution (more often called “tag convergence” in the literature), tag utility, tag adoption, and user satisfaction.

↵
But in an often-cited essay (Doctorow, 2001) provocatively titled Metacrap: Putting the torch to seven straw-men of the meta-utopia, Cory Doctorow argues that much human-created metadata is of low quality because “people lie, people are lazy, people are stupid, mission impossible—know thyself, schemas are not neutral, metrics influence results, (and) there is more than one way to describe something.”

↵
(Hu and Lui 2004).

↵
http://www.amazon.com/gp/search-inside/sipshelp.html/.

↵
The title says it all: Predictive analytics: The power to predict who will click, buy, lie, or die. (Siegel 2013). The Bayesian Surname and Geocoding technique for predicting race is described by (Elliott et al. 2008).

↵
(Park 2009).

↵
However, these concerns are rapidly becoming more important in the public sector. In particular, many public universities in the US are struggling with cuts in state and federal funding that are affecting library services and practices.

↵
More generally, economists use the concept of the “mode of exchange” in a business relationship to include the procedures and norms that govern routine behavior between business partners. An “exit” mode is one in which the buyer makes little long-term commitment to a supplier, and problems with a supplier cause the buyer to find a new one. In contrast, in “voice” mode there is much greater commitment and communication between the parties, usually leading to improved processes and designs. See (Helper and McDuffie 2003).

↵
(Schmitz and Black 2008).

↵
(Nunberg 2009) called the quality of Google’s metadata “a disaster for scholars,” but (Sag 2012) argues that the otherwise neglected “orphan works” in the Google corpus are “grist for the data mill.”

↵
The modern “quality movement” grew out of the efforts of the US to rebuild Japan after the Second World War and its “Bible” was Juran’s 1951 Quality Control Handbook (Juran 1951).

↵

License

Icon for the Creative Commons Attribution-NonCommercial 4.0 International License

The Discipline of Organizing: 4th Professional Edition Copyright © 2020 by Robert J. Glushko is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.