51 Introduction (VIII)
In Describing Relationships and Structures we discussed different types of semantic relationships and contrasted abstract relationships between categories that define a semantic hierarchy like
Meat → is-a → Food
with concrete relationships involving specific people like members of the Simpson family:
Homer Simpson → is-a → Husband
When we make an assertion that a particular instance like Homer Simpson is a member of a class, we are classifying the instance.
Classification, the systematic assignment of resources to intentional categories, is the focus of this chapter. In Categorization: Describing Resource Classes and Types, we described categories created by people as cognitive and linguistic models for applying prior knowledge and we discussed a set of principles for creating categories and category systems. We explained how cultural categories serve as the foundations upon which individual and institutional categories are based. Institutional categories are most often created in abstract and information-intensive domains where unambiguous and precise categories enable classification to be purposeful and principled. Computational categories inherited by supervised learning techniques are usually as interpretable as those created by people, but categories created by unsupervised machine learning techniques are statistical patterns that might or might not be interpretable.
A system of categories and its attendant rules or access methods is typically called a classification scheme or just the classifications. A system of categories captures the distinctions and relationships among its resources that are most important in a domain and for a particular context of use, creating a reference model or conceptual roadmap for its users. This classification creates the structure and support for the interactions that human or computational agents perform. For example, research libraries and bookstores do not use the same classifications to organize books, but the categories they each use are appropriate for their contrasting types of collections and the different kinds of browsing and searching activities that take place in each context. Likewise, the scientific classifications for animals used by biologists contrast with those used in pet stores because the latter have no need for the precise differentiation enabled by the former.
Classification vs. Categorization
Classification requires a system of categories, so not everyone distinguishes classification from categorization. Batley, for example, says classification is “imposing some sort of structure on our understanding of our environment,” a vague definition that applies equally well to categorization.
In the discipline of organizing, the definition of classification is narrower and more formal. The contrasts among cultural, individual, and institutional categories in “The What and Why of Categories” yield a precise definition of classification: The systematic assignment of resources to a system of intentional categories, often institutional ones. This definition highlights the intentionality behind the system of categories, the systematic processes for using them, and implies the greater requirements for governance and maintenance that are absent for cultural categories and most individual ones.
Classification vs. Tagging
Precise and reliable classification is possible when the shared properties of a collection of resources are used in a principled and systematic manner. This method of classification is essential to satisfy institutional and commercial purposes. However, this degree of rigor might be excessive for personal classifications and for classifications of resources in social or informal contexts.
Instead, a weaker approach to organizing resources is to use any property of a resource and any vocabulary to describe it, regardless of how well it differentiates it from other resources to create a system of categories. This method of organizing resources is most often called tagging (“Tagging of Web-based Resources”), but it has also been called social classification.
Tagging is often used in personal organizing systems, but is social when it serves goals to convey information, develop a community, or manage reputation. Regardless of its name, however, tagging is popular for organizing and rating photos, websites, email messages, or other web-based resources or web-based descriptions of physical resources like stores and restaurants.
The distinction between classification and tagging was blurred when Thomas Vander Wal coined the term “folksonomy” —combining “folk” and “taxonomy” (which is a classification; see “Inclusion”) —to describe the collection of tags for a particular web site or application. Folksonomies are often displayed in the form of a tag cloud, where the frequency with which the tag is used throughout the site determines the size of the text in the tag cloud. The tag cloud emerges through the bottom-up aggregation of user tags and is a statistical construct, rather than a semantic one.
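The way a tag cloud turns tag frequency into text size can be sketched in a few lines. This is a minimal illustration, not how any particular site does it: the function name `tag_cloud_sizes`, the linear scaling, the point-size limits, and the sample tags are all assumptions made for the example.

```python
from collections import Counter

def tag_cloud_sizes(tags, min_pt=10, max_pt=32):
    """Map each tag to a font size that grows linearly with its frequency."""
    counts = Counter(tags)
    low, high = min(counts.values()), max(counts.values())
    span = high - low or 1  # avoid dividing by zero when all counts are equal
    return {
        tag: round(min_pt + (n - low) * (max_pt - min_pt) / span)
        for tag, n in counts.items()
    }

# Hypothetical tags aggregated from many users of one site.
site_tags = ["car", "red", "car", "sunset", "car", "red"]
sizes = tag_cloud_sizes(site_tags)
```

The sizes depend only on counts, which is why the resulting cloud is a statistical construct: nothing in the computation knows that "red" is a color and "car" is a vehicle.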
Tagging seems insufficiently principled to be considered classification. Tagging a photo as “red” or “car” is an act of resource description, not classification, because the other tags that would serve as the alternative classifications are unspecified. Furthermore, when tagging principles are followed at all, they are likely to be idiosyncratic ones that were not pre-determined or arrived at through an analysis of goals and requirements.
Nevertheless, some uses of tags treat them as category labels, turning tagging into classification. Many websites and resources encourage users to assign “Like” or “+1” tags to them, and because these tags are pre-defined, they are category choices in an implied classification system; for example, we can consider “Like” as an alternative to a “Not liked enough” category.
(See “Category Design Issues and Implications”.) Some people use multiple user accounts for the same application to establish distinct personas or contexts (e.g., personal vs. business photo collections) as a way to make their tagsonomies more distinct. For example, a tagsonomy could predetermine tags as categories to be assigned to particular contents of a blog post, or specify the level of abstraction and granularity for assigning tags without predetermining them.
Making these decisions about tagging content and form and applying them in the tagging process transforms an ad hoc set of tags into a principled tagsonomy. When tagging is introduced in a business setting, more pragmatic purposes and more systematic tagging (for example, using tags from lists of departments or products) also tend to create tagsonomic classification.
“Tagging documents by computer,” or multi-label classification, is a glib way to describe topic modeling, an unsupervised learning technique for organizing and summarizing collections of unstructured documents by discovering patterns or clusters in the words they contain. The basic intuition behind topic modeling is that the words in a document are probabilistic indications of what the document is about; a document that contains words like “election, government, and candidate” is probably about the “politics” topic, while words like “adore, wedding, and marriage” are good indications of a “love” topic. Topic models are not quite tagging because the words they identify to describe documents are not atomic tags or labels explicitly assigned to individual documents. Instead, topics are more like themes that different documents are more or less likely to contain.
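Topic models typically start from a table of how often each word occurs in each document. The sketch below builds only that starting table using the Python standard library; it is not a topic model, and the two sample documents and the function name `document_term_matrix` are invented for illustration.

```python
from collections import Counter

# Hypothetical documents built from the example words in the text.
docs = {
    "doc1": "the candidate won the election and will form a government",
    "doc2": "they adore each other and the wedding began a happy marriage",
}

def document_term_matrix(documents):
    """Count how often each vocabulary term occurs in each document."""
    counts = {name: Counter(text.lower().split()) for name, text in documents.items()}
    vocabulary = sorted(set().union(*counts.values()))
    # One row per document, one column per vocabulary term (0 if absent).
    matrix = {name: [c[term] for term in vocabulary] for name, c in counts.items()}
    return vocabulary, matrix

vocabulary, matrix = document_term_matrix(docs)
```

A topic model would then look for groups of columns (words) that tend to rise and fall together across rows (documents), which is what makes topics themes rather than labels assigned to individual documents.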
Topic models have been used to implement user interfaces for browsing large document collections because they let a user explore using themes instead of specific search terms. In digital humanities, topic models have been used to discover changes in “what’s written about” by some author or resource (like a newspaper) over time. Web commerce companies use topic models to organize books or products for their recommendation engines.
Classification vs. Physical Arrangement
We have often stressed the principle in the discipline of organizing that logical issues must be separated from implementation issues. (See “The Concept of ‘Organizing Principle’”, “Designing the Description Form”, and “The Implementation Perspective”.) With classification we separate the conceptual act of assigning a resource to a category from the subsequent but often incidental act of putting it in some physical or digital storage location. This focus on the logical essence of classification is elegantly expressed in a definition by Gruenberg: Classification is “a higher order thinking skill requiring the fusion of the naturalist’s eye for relationships… with the logician’s desire for structured order… the mathematician’s compulsion to achieve consistent, predictable results… and the linguist’s interest in explicit and tacit expressions of meaning.”
Taking a conceptual or cognitive perspective on classification contrasts with much conventional usage in library science, where classification is mostly associated with arranging tangible items on shelves, emphasizing the “parking” function that realizes the “marking” function of identifying the category to which the resource belongs.
From a library science or collection curation perspective, it seems undeniable that when the resources being classified are physical or tangible things such as books, paintings, animals, or cooking pots, the end result of the classification activity is that some resource has been placed in some physical location. Moreover, the placement of physical resources can be influenced by the physical context in which they are organized. Once placed, the physical context often embodies some aspects of the organization when similar or related resources are arranged in nearby locations. In libraries and bookstores, this adjacency facilitates the serendipitous discovery of resources, as anyone well knows who has found an interesting book by browsing the shelves.
It might seem natural to identify storage locations with the classes used by the classification system. Just as we might think of a location in the zoo as the “lion habitat,” we can put a “QC” sign on a particular row of shelves in a library where books about physics are arranged.
However, once we broaden the scope of organizing to include digital resources, it is clear that we rely on their logical classifications when we interact with them, not whether they reside on a computer in Berkeley or Bangalore. It is better to emphasize that a classification system is foremost a specification for the logical arrangement of resources because there are usually many possible and often arbitrary mappings of logical references to physical locations.
A classification scheme is a realization of one or more organizing principles. Physical resources are often classified according to their tangible or perceivable properties. As we discussed in “Single Properties” and “Multiple Properties”, when properties take on only a small set of discrete values, a classification system naturally emerges in which each category is defined by one property value or some particular combination of property values. Classification schemes in which all possible categories to which resources can be assigned are defined explicitly are enumerative. For example, the enumerative classification for a personal collection of music recorded on physical media might have categories for CDs, DVDs, vinyl albums, 8-track cartridges, reel-to-reel tape, and tape cassettes; every music resource fits into one and only one of these categories.
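An enumerative scheme like the music example can be sketched as a closed set of categories that is fixed before any resource is classified. The category names below come from the example; the `Medium` type, the `classify` function, and the record layout with a `medium` field are assumptions made for illustration.

```python
from enum import Enum

class Medium(Enum):
    """Every category is listed explicitly in advance (enumerative)."""
    CD = "CD"
    DVD = "DVD"
    VINYL = "vinyl album"
    EIGHT_TRACK = "8-track cartridge"
    REEL_TO_REEL = "reel-to-reel tape"
    CASSETTE = "tape cassette"

def classify(recording):
    """Assign a recording to exactly one predefined category.

    Raises ValueError if the recording's medium is not an enumerated category.
    """
    return Medium(recording["medium"])

# Hypothetical record from a personal collection.
category = classify({"title": "Album 1", "medium": "CD"})
```

Because the set is closed, a resource with an unlisted medium cannot be classified until someone revises the scheme, which is the maintenance cost that enumerative schemes carry.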
When multiple resource properties are considered in a fixed sequence, each property creates another level in the system of categories and the classification scheme is hierarchical or taxonomic. (See “Inclusion”.)
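The effect of considering properties in a fixed sequence can be sketched as nested grouping, with one level of nesting per property. This is only an illustration under assumed data: the `genre` property, the sample albums, and the function name `build_hierarchy` are hypothetical.

```python
def build_hierarchy(resources, properties):
    """Nest resources under one level per property, in the given fixed order."""
    if not properties:
        return [r["title"] for r in resources]
    first, rest = properties[0], properties[1:]
    groups = {}
    for r in resources:
        groups.setdefault(r[first], []).append(r)
    return {value: build_hierarchy(members, rest) for value, members in groups.items()}

# Hypothetical music collection.
albums = [
    {"title": "Album 1", "medium": "CD", "genre": "jazz"},
    {"title": "Album 2", "medium": "vinyl", "genre": "jazz"},
    {"title": "Album 3", "medium": "CD", "genre": "rock"},
]

by_medium_then_genre = build_hierarchy(albums, ["medium", "genre"])
by_genre_then_medium = build_hierarchy(albums, ["genre", "medium"])
```

The two calls classify the same three albums but produce different trees, so the order in which properties are considered is itself a design decision.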
For information resources, their aboutness is usually more important than their physical properties. For example, a professor planning a new course might organize candidate articles for the syllabus in a fixed set of categories, one for each potential lecture topic. But it is more challenging to enumerate all the subjects or topics that a larger collection of resources might be about. The Library of Congress Classification (LCC) is a hierarchical and enumerative scheme with a very detailed set of subject categories because books can be about almost anything. We discuss the LCC more in “Bibliographic Classification”.
In addition to or instead of their aboutness, information resources are sometimes organized using intrinsic properties like author names or creation dates. Our professor might primarily organize his collection of articles by author name, and when he plans a new course, he might put those he selects for the syllabus into a classification system with one category for every scheduled lecture.
Because names and dates can take on a great many values, an organizing principle like alphabetical or chronological ordering is unlikely to enumerate in advance an explicit category for each possible value. Instead, we can consider these organizing principles as creating an implicit or latent classification system in which the categories are generated only as needed. For example, the Q category only exists in an alphabetical scheme if there is a resource whose name starts with Q.
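A latent classification of this kind can be sketched as grouping that creates a category only when a resource needs it. The function name `alphabetical_categories` and the sample names are assumptions made for the example.

```python
from collections import defaultdict

def alphabetical_categories(names):
    """Group names by first letter; a category exists only if something falls in it."""
    categories = defaultdict(list)
    for name in sorted(names):
        categories[name[0].upper()].append(name)
    return dict(categories)

# Sample author names, used here only as illustrative data.
authors = ["Blei", "Batley", "Gruenberg", "Shapiro"]
shelf = alphabetical_categories(authors)
```

No Q category appears in `shelf` because no name starts with Q; adding such a name would create the category without any change to the scheme.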
Many resource domains have multiple properties that might be used to define a classification scheme. For example, wine can be classified by type of grape (varietal), color, flavor, price, winemaker, region of origin (appellation), blending style, and other properties. Furthermore, people differ in their knowledge or preferences about these properties; some people choose wine based on its price and varietal, while others studiously compare winemakers and appellations. Each order of considering the properties creates a different hierarchical classification, and using all of them would create a very deep and unwieldy system. Moreover, many different hierarchies might be required to satisfy divergent preferences. An alternative classification scheme for domains like these is faceted classification, a type of classification system that takes a set of resource properties and then generates only those categories for combinations that actually occur.
The most common types of facets are enumerative (mutually exclusive); Boolean (yes or no); hierarchical or taxonomic (logical containment); and spectrum (a range of numerical values). We discuss faceted classification in detail (in “Faceted Classification”) because it is very frequently used in online classifications. Faceted schemes enable easier search and browsing of large resource collections like those for retail sites and museums than hierarchical enumerative schemes. In library science a classification system that builds categories by combination of facets is sometimes also called analytico-synthetic.
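The idea that a faceted scheme generates only the combinations that actually occur can be sketched as follows. The wine records, their facet values, and the function names `facet_categories` and `filter_by` are hypothetical; real faceted systems add much more (hierarchical facets, numeric ranges, counts per value).

```python
from collections import defaultdict

# Hypothetical wine records; facet names follow properties listed in the text.
wines = [
    {"name": "Wine A", "varietal": "Pinot Noir", "color": "red", "region": "Sonoma"},
    {"name": "Wine B", "varietal": "Pinot Noir", "color": "red", "region": "Burgundy"},
    {"name": "Wine C", "varietal": "Chardonnay", "color": "white", "region": "Sonoma"},
    {"name": "Wine D", "varietal": "Pinot Noir", "color": "red", "region": "Sonoma"},
]

def facet_categories(resources, facets):
    """Return only the facet-value combinations that actually occur."""
    categories = defaultdict(list)
    for r in resources:
        key = tuple(r[f] for f in facets)
        categories[key].append(r["name"])
    return dict(categories)

def filter_by(resources, **wanted):
    """Faceted browsing: keep resources that match every requested facet value."""
    return [r["name"] for r in resources if all(r[f] == v for f, v in wanted.items())]

occurring = facet_categories(wines, ["color", "region"])
sonoma_pinot = filter_by(wines, varietal="Pinot Noir", region="Sonoma")
```

Two colors and two regions allow four combinations, but only three occur in this data, so only three categories are generated. A user can also apply facets in any order with `filter_by`, which is what a single fixed hierarchy cannot offer.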
The Dewey Decimal Classification (DDC) is a highly enumerative classification system that also uses faceted properties; we will discuss it more in “Bibliographic Classification”.
Classification and Standardization
Classifications impose order on resources. Standards do the same by making distinctions, either implicitly or explicitly, between “standard” and “nonstandard” ways of creating, organizing, and using resources. Classification and standardization are not identical, but they are closely related. Some classifications become standards, and some standards define new classifications. Institutional categories (“Institutional Categories”) are of two broad types.
Institutional taxonomies are classifications designed to make it more likely that people or computational agents will organize and interact with resources in the same way. Among the thousands of standards published by the International Organization for Standardization (ISO) are many institutional taxonomies that govern the classification of resources and products in agriculture, aviation, construction, energy, healthcare, information technology, transportation, and almost every industry sector.
Institutional taxonomies are especially important in libraries and knowledge management. The Dewey Decimal Classification (DDC) and Library of Congress Classification (LCC) enable different libraries to arrange books in the same categories, and the Diagnostic and Statistical Manual of Mental Disorders (DSM) in clinical psychology enables different doctors to assign patients to the same diagnostic and insurance categories.
Systems of institutional semantics offer precisely defined abstractions or information components (“Identity and Information Components”) needed to ensure that information can be efficiently exchanged and used. Organizing systems that use different information models often cannot share and combine information without tedious negotiation and excessive rework.
Automating transactions with suppliers and customers in a supply chain requires that all the parties use the same data format or formats that can be transformed to be interoperable. Retrofitting or replacing these applications to enable efficient interoperability is often possible, and it is usually desirable for the firm to develop or adopt enterprise standards for information exchange models rather than pay the recurring transaction costs to integrate or transform incompatible formats.
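What it means to transform incompatible formats into an interoperable one can be sketched with a simple field mapping. Everything here is hypothetical: the partner names, the field names, and the shared model are invented, and real integrations must also reconcile units, codes, and structure rather than just names.

```python
# Hypothetical mappings from two partners' order formats to one shared model.
FIELD_MAPS = {
    "supplier_a": {"cust": "buyer", "sku": "item", "qty": "quantity"},
    "supplier_b": {"customerName": "buyer", "productCode": "item", "units": "quantity"},
}

def to_shared_model(record, source):
    """Rename a partner's fields to the shared names; reject unmapped fields."""
    mapping = FIELD_MAPS[source]
    unknown = set(record) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped fields from {source}: {sorted(unknown)}")
    return {mapping[field]: value for field, value in record.items()}

order_a = to_shared_model({"cust": "Acme", "sku": "X1", "qty": 3}, "supplier_a")
order_b = to_shared_model(
    {"customerName": "Acme", "productCode": "X1", "units": 3}, "supplier_b"
)
```

Each additional partner format needs its own mapping to be written and maintained, which is one way to picture the recurring transaction cost that a shared enterprise standard avoids.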
Standard semantics are especially important in industries or markets that have significant network effects where the value of a product depends on the number of interoperable or compatible products—these include much of the information and service economies.
An example of a system of institutional semantics is the Universal Business Language (UBL), a library of about 2000 semantic “building blocks” for common concepts like “Address,” “Item,” “Payment,” and “Party,” along with nearly 100 document types assembled from the standard components. UBL is widely used to facilitate the automated exchange of transactional documents in procurement, logistics, inventory management, collaborative planning and forecasting, and payment.
Specifications vs. Standards
Implementing an organizing system of significant scope and complexity in a robust and maintainable fashion requires precise descriptions of the resources it contains, their formats, the classes, relations, structures and collections in which they participate, and the processes that ensure their efficient and effective use. Rigorous descriptions like these are often called “specifications” and there are well-established practices for developing good ones.
There is a subtle but critical distinction between “specifications” and “standards.” Any person, firm, or ad hoc group of people or firms can create a specification and then use it or attempt to get others to use it. In contrast, a standard is a published specification that is developed and maintained by consensus of all the relevant stakeholders in some domain by following a defined and transparent process, usually under the auspices of a recognized standards organization. In addition, implementations of standards often are subject to conformance tests that establish the completeness and accuracy of the implementation. This means that users can decide either to implement the specification themselves or choose from other conforming implementations.
The additional rigor and transparency when specifications are developed and maintained through a standards process often makes them fairer and gives them more legitimacy. Governments often require or recommend these de jure standards, especially those that are “open” or “royalty free” because they are typically supported by multiple vendors, minimizing the cost of adoption and maximizing their longevity.
For example, work on UBL has gone on for over a decade in a technical committee under the auspices of a standards development consortium called the Organization for the Advancement of Structured Information Standards (OASIS), which has developed scores of standards for web services and information-intensive industries.
Despite these important distinctions between “specifications” and “standards,” however, in conventional usage “standard” is often simply a synonym for “dominant or widely-adopted specification.” These de facto standards, in contrast with the de jure standards created by standards organizations, are typically created by the dominant firm or firms in an industry, by a new firm that is first to use a new technology or innovative method, or by a non-profit entity like a foundation that focuses on a particular domain.
De facto standards and de jure standards often co-exist and compete in “standards wars,” especially in information-intensive domains and industries with rapid innovation. Standards wars tend to occur when different firms or groups of firms develop two or more standards that address the same needs. Not surprisingly, the competing standards are often incompatible on purpose. At first this lets each standard attract customers with features not enabled by the other, but it ends up locking them in by imposing switching costs. Current examples include Google vs. Apple on mobile phones and Kindle vs. Apple on ebook readers.
For example, the Dewey Decimal Classification (DDC) is the world’s most widely used library classification system, and most people treat it as a standard. In fact, the DDC is proprietary and it is maintained and licensed for use by the Online Computer Library Center (OCLC). Similarly, the DSM is maintained and published by the American Psychiatric Association (APA) and it earns the APA many millions of dollars a year.
In contrast, de jure standards include the Library of Congress Classification (LCC), developed under the auspices of the US government, the familiar MARC record format used in online library catalogs (ISO 2709), and its American counterpart ANSI Z39.2.
As a result, even though it would be technically correct to argue that “while all standards are specifications, not all specifications are standards,” this distinction is hard to maintain in practice.
Standards are often imposed by governments to protect the interests of their citizens by coordinating or facilitating activities that might otherwise not be possible or safe. Some of them primarily concern public or product safety and are only tangentially relevant to systems for organizing information. Others are highly relevant, especially those that specify the formats and content of information exchange; many European governments require firms doing business with the government to adopt UBL.
Other government standards that are important in organizing systems are those that express requirements for classification and retention of auditing information for financial activities, such as the Sarbanes-Oxley Act, or for non-retention of personal information, such as HIPAA and FERPA.
(Hammond et al. 2004) note that the “unstructured (or better, free structured) approach to classification with users assigning their own labels is variously referred to as a folksonomy, folk classification, ethnoclassification, distributed classification, or social classification.”
Thomas Vander Wal invented the term “folksonomy” in 2004, and the term quickly gained traction. His personal account of the creation and dispersion of the term is (Vander Wal 2007).
See (Halvey and Keane 2007) and (Sinclair and Cardew-Hall 2007) for analyses of the usability of different presentations, and (Kaser and Lemire 2007) for algorithms for drawing tag clouds.
See (Millen, Feinberg, and Kerr 2006), (John and Seligmann 2006).
The statistical techniques used in topic models are intimidating; to vastly oversimplify, topic models start with a document-by-term matrix and extract topics by reducing its dimensionality through linear algebra techniques. (Blei 2012) is a relatively easy introduction.
Gruenberg wrote this definition over a decade ago as a University of Illinois PhD student in an unpublished paper titled Faceted Classification, Facet Analysis, and the Web that was found by a web search by the first author of this chapter in 2005. When this chapter was being written several years later, the paper was no longer on the web, but a copy was located at Illinois by Matthew Beth on a backup disk.
This is reflected in library call numbers, which assign a unique number to books to designate the order in which they are shelved. Most American libraries use a classification system as part of their call number, composing it from a class number of the classification and a unique identifier (derived from the author name and title), which identifies the book within the class, often using a system called Cutter numbers. See
The most “standard” of all standards organizations is the International Organization for Standardization (ISO), whose members are themselves national standards organizations, which as a result gives the nearly 20,000 ISO standards the broadest and most global coverage. See http://ISO.org. In addition, there are scores of other national and industry-specific standards bodies whose work is potentially relevant to organizing systems of the sorts discussed in this book. We encounter these kinds of standards every day in codes for countries, currencies, and airports, in file formats, in product barcodes, and in many other contexts.
Dewey Decimal Classification:
(OASIS 2006). All the finished work of OASIS is freely available at https://www.oasis-open.org; the UBL committee is at
A small number of people can often informally agree on an organizing system that meets the needs of each participant. But each new person often brings new requirements and it is not feasible to resolve every disagreement between every pair of participants. Instead, for a large-scale organizing system, decisions are usually made by entities that have the authority to coordinate actions and prevent conflicts by imposing a single solution on all the participants. (Rosenthal, Seligman, and Renner 2004) call this the “person-concept” tradeoff, which we can paraphrase as “a few people can agree on a lot, but a lot of people can only agree on a little.”
This authority can come from many different sources, but they can be roughly categorized as “authority from power” and “authority from consensus.” Often the economic dominance of a firm allows it to control how business gets done in its industry. One key part of that is establishing specifications for data formats and classification schemes in organizing systems, which usually means requiring other firms to use the ones developed by the dominant firm for its own use. This ensures the continued efficiency of their own business processes while making it harder for other firms to challenge their market power.
In contrast, consensus is the authority mechanism embodied in the workings of the open source community, where the freedom to view and change data formats and code that uses them encourages cooperation and adoption. Consensus also underlies the authority of voluntary standards activities, where firms work together under the auspices of a standards body and agree to follow its procedures for creating, ratifying, and implementing standards.
International and national standards bodies derive their authority from the authority of the governments that created them. But standards organizations arguably derive most of their authority from the collective power of their members, because many influential standards organizations like OASIS, W3C, OMG, and IETF are not chartered or sponsored by governments. In addition, firms often create ad hoc “quasi-standards” organizations or “communities of interest” to facilitate relatively short-term cooperative standards-making activities that in the former case would otherwise be prohibited by anti-trust considerations. Finally, at the extreme “lightweight” end of the standards-making continuum, the codification of simple and commonly used information models as “microformats” depends on authority that emerges from the collaboration of individuals rather than firms.
Often a standard evolves from an existing specification submitted to a standards organization by the firm that created it. In other cases, the specification used by a dominant firm becomes a de facto standard for other firms in its industry, and it is never submitted to a formal standards-making process.
See (Shapiro and Varian, 1998).
Even so, the LCC is not an “open” standard. You can browse the classifications on the LOC site, but to get them packaged as a book or complete digital resource you have to pay for them.
Governments have inherently long time horizons for their actions, they need to serve all citizens fairly and without discrimination, and they (should seek to) minimize cost to taxpayers. Each of these principles is an independent argument for standards and taken together they make a very strong one. Indeed, one of the founding goals in the US Constitution is to protect the public interest, and this is enabled in Article I, Section 8 by granting Congress the power to set standards “of Weights and Measures” to facilitate commerce. Setting standards is a key role of the National Institute of Standards and Technology (NIST), part of the Department of Commerce, and other departments have similar standards-setting responsibilities and agencies, like the Food and Drug Administration (FDA) in the Department of Health and Human Services. In addition, independent government agencies like the Federal Communications Commission (FCC) and Federal Trade Commission (FTC) set numerous standards that are relevant to information organizing systems. And of course, the Library of Congress (LOC) maintains procedures and standards needed “to sustain and preserve a universal collection of knowledge... for future generations” (LOC.gov/about).
The Sarbanes-Oxley Act is US Public Law 107-204,
The definitive source for the Health Insurance Portability and Accountability Act (HIPAA) is the US Department of Health & Human Services,
The definitive source for the Family Educational Rights and Privacy Act (FERPA) is the US Department of Education,
Complying with government regulations like these can be expensive and difficult, and many companies, especially smaller ones, complain about the cost. On the other hand, the argument can be made that investing in a rigorous system for organizing information can provide competitive advantages, turning the compliance burden into a competitive weapon (Taylor 2006).