Reorganizing Resources for Interactions

Robert J. Glushko

Once the scope and range of interactions are defined according to requirements and constraints, the resources and the technology of the organizing system have to be arranged to enable the implementation of the desired interactions.

Commonly, interactions are determined at the beginning of a development process of the organizing system. It follows that most required resource descriptions (which properties of a resource are documented in an organizing system) need to be clarified at the beginning of the development process as well; that is, resource descriptions are determined based on the desired interactions that an organizing system should support. Most of these processes have been described in detail in Resource Description and Metadata, Describing Relationships and Structures and The Forms of Resource Descriptions.

Resources from different organizing systems are often aggregated to be accessed within one larger organizing system (warehouses, portals, search engines, union catalogs, cross-brand retailers), which requires resources and resource descriptions to be transformed in order to adapt to the new organizing system with its extended interaction requirements.^[1] Elsewhere, legacy systems often need to be updated to accommodate new standards, technologies, and interactions (e.g., mobile interfaces for digital libraries). That means that the necessary resources and resource descriptions for an interaction need to be identified, and, if necessary, changes have to be made in the description of the resources. Sometimes, resources are merged or transformed in order to perform new interactions.

Identifying and Describing Resources for Interactions

Individual and collection resource descriptions need to be carefully considered in order to record the necessary information for the designed interactions. (See The Forms of Resource Descriptions.) The type of interaction determines whether new properties need to be derived or computed with the help of external factors and whether these properties will be represented permanently in the organizing system (e.g., an extended topical description added due to a user comment) or created on the fly whenever a transaction is executed (e.g., a frequency count).

Determining which resources or resource descriptions will be used in an interaction is simple when all resources are included (e.g., in a simple search interaction over all resources in a data warehouse). Sometimes resources need to be identified according to more selective criteria such as resources exhibiting a certain property (e.g., all restaurants in your neighborhood with four stars on Yelp in an advanced search interaction).

Transforming Resources for Interactions

Resources and their descriptions need to be transformed when an organizing system and its interactions are designed with resources or resource descriptions from legacy systems with outdated formats or from multiple organizing systems, or when the new organizing system has a different purpose and requires different resource properties. The processing and transformation steps required to produce the expected modification can be applied at different layers:

Infrastructure or notation transformation: When resources are aggregated, the organizing systems must have a common basic infrastructure to communicate with one another and speak the same language. This means that participating systems must have a common set of communication protocols and an agreed upon way of representing information in digital formats, i.e., a notation (“Notations”), such as the Unicode encoding scheme.^[2]
Writing system transformation: During a writing system transformation (The Forms of Resource Descriptions), the syntax or vocabulary—also called the data exchange format—of the resource description will be changed to conform to another model, e.g., when library records are mapped from the MARC21 standard to the Dublin Core format in order to be aggregated, or when information in a business information system is transformed into an EDI or XML format so that it can be sent to another firm.^[3] Sometimes customized vocabularies are used to represent certain types of properties. These vocabularies were probably introduced to reduce errors or ambiguity or abbreviate common organizational resource properties. These customized vocabularies need to be explained and agreed upon by organizations combining resources to prevent interoperability problems.
Semantic transformation: Agreeing on a category or classification system (Categorization: Describing Resource Classes and Types & Classification: Assigning Resources to Categories) is crucial so that organizing systems agree semantically—that is, so that resource properties and descriptions share not only technology but also meaning. For example, because the US Census has often changed its system of race categories, it is difficult to compare data from different censuses without some semantic transformation to align the categories.^[4]
Resource or resource description transformation: Resources or resource descriptions are often directly transformed, as when they are converted to another file format. In computer-based interactions like search engines, text resources are often pre-processed to remove some of the ambiguity inherent in natural language. These steps, collectively called text processing, include decoding, filtering, tokenization, normalization, stopword elimination, and stemming. (See the sidebar, Text Processing.)

Decoding: A digital resource is first a sequence of bits. Decoding transforms those bits into characters according to the encoding scheme used, extracting the text from its stored form. (See “Notations”.)
Filtering: If a text is encapsulated by formatting or non-semantic markup, these characters are removed because this information is rarely used as the basis of further interactions.
Tokenization: Segments the stream of characters (in an encoding scheme, a space is also a character) into textual components, usually words. In English, a simple rule-based system can separate words using spaces. However, punctuation makes things more complicated. For example, periods at the end of sentences should be removed, but periods in numbers should not. Other languages introduce other problems for tokenization; in Chinese, a space does not mark the divisions between individual concepts.
Normalization: Normalization removes superficial differences in character sequences, for example, by transforming all capitalized characters into lower-case. More complicated normalization operations include the removal of accents, hyphens, or diacritics and merging different forms of acronyms (e.g., U.N. and UN are both normalized to UN).
Stopword elimination: Stopwords are those words in a language that occur very frequently and are not very semantically expressive. Stopwords are usually articles, pronouns, prepositions, or conjunctions. Since they occur in every text, they cannot help distinguish one text from another and can be removed. Of course, in some cases, removing stopwords might remove semantically important phrases (e.g., “To be or not to be”).
Stemming: These processing steps normalize inflectional and derivational variations in terms, e.g., by removing the “-ed” from verbs in the past tense. This homogenization can be done by following rules (stemming) or by using dictionaries (lemmatization). Rule-based stemming algorithms are easy to implement, but can result in wrongly normalized word groups, for example when “university” and “universe” are both stemmed to “univers.”
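The text processing steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not a production implementation: the stopword list is a tiny hypothetical sample, the tokenizer is a single regular expression suited only to simple English text, and the `stem` function is a deliberately crude suffix stripper standing in for a real stemming algorithm.

```python
import html
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"a", "an", "and", "be", "in", "is", "not", "of", "or", "the", "to"}


def decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decoding: turn the stored bytes into characters."""
    return raw.decode(encoding)


def filter_markup(text: str) -> str:
    """Filtering: drop tags, then resolve character entities."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def tokenize(text: str) -> list[str]:
    """Tokenization: keep runs of word characters. Periods inside a token
    (3.5, U.N) are kept; a period followed by a space or the end is dropped."""
    return re.findall(r"\w+(?:\.\w+)*", text)


def normalize(token: str) -> str:
    """Normalization: lower-case, and merge acronym forms (U.N -> un)."""
    token = token.lower()
    if not any(ch.isdigit() for ch in token):
        token = token.replace(".", "")
    return token


def stem(token: str) -> str:
    """Stemming: crude rule-based suffix removal (illustrative only)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(raw: bytes) -> list[str]:
    """Run all steps in order and return the processed terms."""
    text = filter_markup(decode(raw))
    tokens = [normalize(t) for t in tokenize(text)]
    return [stem(t) for t in tokens if t not in STOPWORDS]


terms = process("<p>The U.N. indexed 3.5 million records.</p>".encode("utf-8"))
# terms is ["un", "index", "3.5", "million", "record"]
```

With this sample stopword list, `process(b"To be or not to be")` returns an empty list, which illustrates how stopword elimination can discard a semantically important phrase.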

Transforming Resources from Multiple or Legacy Organizing Systems

The traditional approach to enabling heterogeneous organizing systems to be accessed together has been to fully integrate them, which has allowed the “unrestricted sharing of data and business processes among any connected applications and data sources” in the organization.^[5] This can be a strategic approach to improving the management of resources, resource descriptions, and organizing systems as a whole, especially when organizations have disparate systems and redundant information spread across different groups and departments. However, it can also be a costly approach, as integration points may be numerous, with vastly different technologies needed to get one system to integrate with another. Maintenance also becomes an issue, as changes in one system may entail changes in all systems integrating with it.^[6]

Planning the transformation of resources from different organizing systems to be merged in an aggregation is called data mapping or alignment. In this process, aspects of the description layers (most often writing system or semantics) are compared and matched between two or more organizing systems. The relationship between each component may be unidirectional or bidirectional.^[7] In addition, resource properties and values that are semantically equivalent might have different names (the vocabulary problem of “The Vocabulary Problem”). The purpose of mapping may vary from allowing simple exchanges of resource descriptions, to enabling access to longitudinal data, to facilitating standardized reporting.^[8] The preservation of version histories of resource description elements and relations in both systems is vital for verifying the validity of the data map.
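Whether a map can be used in both directions can be checked mechanically. The following is a minimal sketch; the concept and code names are hypothetical placeholders, not entries from any real vocabulary.

```python
from collections import defaultdict


def invert(mapping: dict[str, str]) -> dict[str, list[str]]:
    """Reverse a source -> target map, collecting every source per target."""
    inverse: dict[str, list[str]] = defaultdict(list)
    for source, target in mapping.items():
        inverse[target].append(source)
    return dict(inverse)


def is_reversible(mapping: dict[str, str]) -> bool:
    """A map is usable in both directions only if no two sources share a target."""
    return all(len(sources) == 1 for sources in invert(mapping).values())


# Hypothetical many-to-one map: two source concepts share one target code.
FORWARD_MAP = {
    "concept_a": "code_1",
    "concept_b": "code_1",
    "concept_c": "code_2",
}

ambiguous = {t: s for t, s in invert(FORWARD_MAP).items() if len(s) > 1}
# ambiguous is {"code_1": ["concept_a", "concept_b"]}
```

Here `code_1` cannot be translated back without guessing between two source concepts, so the map is unidirectional.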

Similar to mapping, a straightforward approach to transformation is the use of crosswalks, which are equivalence tables that relate resource description elements, semantics, and writing systems from one organizing system to those of another.^[9] Crosswalks not only enable systems with different resource descriptions to interchange information in real-time, but are also used by third-party systems, such as harvesters and search engines to generate union catalogs and perform queries on multiple systems as if they were one consolidated system.^[10]
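A crosswalk can be pictured as a lookup table applied to each element of a source description. The sketch below assumes a hypothetical, heavily simplified table loosely modeled on MARC field tags and Dublin Core element names; it is not an authoritative crosswalk, and the sample record is invented for illustration.

```python
# Hypothetical, simplified crosswalk: source element -> target element.
MARC_TO_DC = {
    "245a": "title",
    "260b": "publisher",
    "260c": "date",
    "650a": "subject",
}


def apply_crosswalk(
    record: dict[str, str], crosswalk: dict[str, str]
) -> tuple[dict[str, list[str]], list[str]]:
    """Return (converted, unmapped). `converted` uses target element names;
    `unmapped` lists source elements that have no entry in the crosswalk."""
    converted: dict[str, list[str]] = {}
    unmapped: list[str] = []
    for element, value in record.items():
        target = crosswalk.get(element)
        if target is None:
            unmapped.append(element)
        else:
            converted.setdefault(target, []).append(value)
    return converted, unmapped


# Invented sample record for illustration.
sample = {
    "245a": "The Discipline of Organizing",
    "260c": "2020",
    "300a": "1 online resource",
}
converted, unmapped = apply_crosswalk(sample, MARC_TO_DC)
# converted is {"title": ["The Discipline of Organizing"], "date": ["2020"]}
# unmapped is ["300a"]
```

Reporting the unmapped elements rather than silently dropping them makes it visible where the target description is less expressive than the source.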

In the digital library space, WorldCat allows users to access many library databases to locate items in their community libraries and, depending on patron privileges, to request items through their local libraries from libraries all over the world. For this powerful tool to accurately locate holdings in each library, two resource description standards are involved. At the book publisher, wholesaler, and retailer end, the international standard Online Information Exchange (ONIX) is used to standardize books and serials metadata throughout the supply chain.^[11] ONIX is implemented in book suppliers’ internal and customer-facing information systems to track products and to facilitate the generation of advance information sheets and supplier catalogs.^[12] At the library end, the Machine-Readable Cataloging (MARC) formats manage and communicate bibliographic and related information.^[13] When a member library acquires a title, information in ONIX format is sent from the supplier to the Online Computer Library Center (OCLC), where it is matched with a corresponding MARC record in the WorldCat database by using an ONIX to MARC crosswalk.^[14] This enables WorldCat to provide accurate real-time holdings information of its member libraries.

As the number of organizing systems increases, crosswalks and mappings become increasingly impractical if each pair of organizing systems requires a separate crosswalk. A more efficient approach would be the use of one vocabulary or format as a switching mechanism (also called a pivot or hub language) for all other vocabularies to map towards.^[15] Another possibility, which is often used in asymmetric power relationships between organizing systems, is to force all systems to adhere to the format that is used by the most powerful party.
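The saving from a switching vocabulary can be made concrete with a little arithmetic and a two-step lookup. In this sketch the element names follow the CDWA example in this chapter's notes, while the table layout and function names are assumptions made for illustration.

```python
from typing import Optional


def pairwise_crosswalks(n: int) -> int:
    """Crosswalks needed if every pair of n systems gets its own table."""
    return n * (n - 1) // 2


def hub_crosswalks(n: int) -> int:
    """Crosswalks needed if one of the n systems serves as the hub."""
    return n - 1


# Each non-hub vocabulary maps its elements to the hub vocabulary only.
TO_HUB = {
    "MARC": {"260c": "Creation Date"},
    "DC": {"Date.Created": "Creation Date"},
}


def translate(element: str, source: str, target: str) -> Optional[str]:
    """Two-step lookup: source element -> hub element -> target element."""
    hub_element = TO_HUB[source].get(element)
    if hub_element is None:
        return None
    for target_element, hub_name in TO_HUB[target].items():
        if hub_name == hub_element:
            return target_element
    return None


# For ten systems: 45 pairwise crosswalks versus 9 through a hub.
counts = (pairwise_crosswalks(10), hub_crosswalks(10))
```

The price of the smaller number of tables is the extra lookup step, and any detail the hub vocabulary cannot express is lost on the way through it.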

Modes of Transformation

The conceptual relationships between different descriptions can be mapped out manually when creating simple maps. This, however, becomes more difficult as maps become more complex, due to the number of properties being mapped or when there are more structural or granularity issues to consider.

The use of automatic tools to create these alignments becomes vital in ensuring their accuracy and robustness. Graphical mapping tools provide users with a graphical user interface to connect description elements from source to target by drawing a line from one to the other.^[16] Other tools perform automatic mappings based on predetermined rules and criteria.^[17]

We often perform manual run-time transformations for decisions that require consulting more than one organizing system in our daily lives. For example, when planning a vacation, we use a variety of systems to negotiate a wide set of ad hoc requirements such as our resources and time, our fellow travelers and their availability, and the bookings for hotel and transportation, as well as desirable destinations and their various offerings. We somehow reconcile the different descriptions used in each of the systems and match these against each other so that the relevant information can be combined and compared. Even though the systems use different formats, vocabularies and structures, they are targeted toward human users and are relatively easy to interpret. For automatic run-time transformations, which need to be handled computationally, designers face the challenge of creating more structured processes for merging information from different systems.^[18]

The time of the transformation, whether at design time when organizing system resources are merged or at run time when a certain interaction is performed, depends on the nature of the collaboration between organizing systems. Design-time transformations depend on highly cooperative environments where specific design requirements (like mapping rules and criteria) can be negotiated ahead of the system implementation. Where high flexibility or ad hoc, real-time transformation is needed, or where design-time agreements are not possible due to a lack of cooperation (as with ShopStyle.com), run-time transformation processes may provide appropriate alternatives. Some low-level incompatibilities between organizing systems, such as syntactical, encoding, and particular structural and content issues, can also be rectified by implementing run-time transformation techniques, creating more loosely coupled interoperating systems.

Granularity and Abstraction

Within writing system and semantic transformations, issues of granularity and level of abstraction (“Determining the Scope and Focus” and “Category Abstraction and Granularity”) pose the most challenges to cross-organizing system interoperability.^[19] Granularity refers to the level of detail or precision for a specific information resource property. For instance, the postal address of a particular location might be represented as several different data items, including the number, street name, city, state, country and postal code (a high-granularity model). It might also be represented in one single line including all of the information above (a low-granularity model). While it is easy to create the complete address by aggregating the different information components from the high-granularity model, it is not as easy to decompose the low-granularity model into more specific information components.

This does not mean, however, that a high-granularity model is always the best choice, especially if the context of use does not require it, as there are corresponding tradeoffs in terms of efficiency and speed in assembling and processing the resource information. (See the sidebar, AccuWeather Request Granularity)
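The asymmetry in the address example can be sketched in a few lines: aggregating a high-granularity description is one formatting step, while decomposing the single-line form needs parsing rules that a simple split does not provide. The field names and the sample address are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Address:
    """High-granularity model: one field per address component."""
    number: str
    street: str
    city: str
    state: str
    postal_code: str
    country: str


def to_single_line(a: Address) -> str:
    """Aggregation: the low-granularity form is a simple concatenation."""
    return f"{a.number} {a.street}, {a.city}, {a.state} {a.postal_code}, {a.country}"


def naive_split(line: str) -> list[str]:
    """Decomposition attempt: splitting on commas recovers only some parts."""
    return [part.strip() for part in line.split(",")]


# Hypothetical sample address.
addr = Address("12", "Main Street", "Springfield", "IL", "62701", "USA")
line = to_single_line(addr)   # "12 Main Street, Springfield, IL 62701, USA"
parts = naive_split(line)     # four parts, although the source had six fields
```

The split leaves the number fused with the street and the state fused with the postal code; separating them requires further rules that depend on national address conventions.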

The level of abstraction is the degree to which a resource description is abstracted from the concrete use case in order to fit a wider range of resources. For example, many countries have an address field called state, but in some countries, a similar regional division is called province. In order to accommodate both concepts, we can abstract from the original concrete concepts and establish a more abstract description of administrative region. Granularity and abstraction differences can occur at every resource property layer when resources need to be transformed; therefore, they need to be recognized and analyzed at every layer.

Requests for AccuWeather data have exploded in recent years, due to automated requests from mobile devices to keep weather apps updated. The company has dealt with this challenge by truncating the GPS coordinates sent by the mobile device when it requests weather data (a transformation to lower granularity). If the request with the truncated coordinates is identical to one recently made, a cached version of the content is served, resulting in 300 million to 500 million fewer requests a day.^[20]
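The idea of lowering granularity to raise the cache hit rate can be sketched as follows. This is not AccuWeather's implementation: the class name, the one-decimal precision, the ten-minute lifetime, and the `fetch` callback are all assumptions chosen for illustration.

```python
import math
import time
from typing import Callable


def truncate(value: float, decimals: int = 1) -> float:
    """Truncate (not round) toward zero to the given number of decimals."""
    factor = 10 ** decimals
    return math.trunc(value * factor) / factor


class WeatherCache:
    """Serve cached content for requests whose truncated coordinates match."""

    def __init__(self, fetch: Callable[[float, float], str],
                 ttl_seconds: float = 600.0) -> None:
        self._fetch = fetch            # hypothetical upstream lookup
        self._ttl = ttl_seconds        # how long an entry counts as recent
        self._entries: dict[tuple[float, float], tuple[float, str]] = {}
        self.upstream_calls = 0

    def get(self, lat: float, lon: float) -> str:
        key = (truncate(lat), truncate(lon))
        now = time.monotonic()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]           # recent identical request: reuse it
        self.upstream_calls += 1
        content = self._fetch(*key)
        self._entries[key] = (now, content)
        return content


cache = WeatherCache(lambda lat, lon: f"forecast for {lat},{lon}")
first = cache.get(37.871, -122.272)
second = cache.get(37.879, -122.268)   # nearby device, same truncated key
```

Both requests resolve to the key `(37.8, -122.2)`, so only the first one reaches the upstream service. The tradeoff is that every device within the same truncated cell receives the same content.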

Accuracy of Transformations

Automatic mapping tools can only be as accurate as the specifications and criteria that are included in the mapping guidelines. Intellectual checks and tests performed by humans are almost always necessary to validate the accuracy of the transformation. Because description systems vary in expressive power and complexity, challenges to transformations may arise from differences in semantic definitions, rules regarding whether an element is required or requires multiple values, hierarchical or value constraints, and controlled vocabularies.^[21] As a result of these complexities, absolute transformations that ensure exact mappings will result in a loss of precision if the source description system is substantially richer than the target system.

In practice, relative crosswalks where all elements in a source description are mapped to at least one target, regardless of semantic equivalence, are often implemented. This lowers the quality and accuracy of the mapping and can result in “down translation” or “dumbing down” of the system for resource description. As a result of mapping compromises due to different granularity or abstraction levels, transformations from different organizing systems usually result in less granular or specific resource descriptions. Consequently, whereas some interactions are now enabled (e.g., cross-organizing system search), others that were once possible can no longer be supported. For example, conflating geographical and person subject fields from one system (e.g., geographical subject = Alberta, person subject = Virginia) into a joint subject field (e.g., subject = Alberta, Virginia) to match the resource description of another system means that searches can no longer distinguish between these specific categories.
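The subject-field example can be sketched directly. The field names are hypothetical; the values are the ones used in the example above.

```python
def merge_subjects(record: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map both source subject fields onto the target's single subject field."""
    merged = dict(record)
    geographical = merged.pop("geographical_subject", [])
    person = merged.pop("person_subject", [])
    merged["subject"] = geographical + person
    return merged


source = {"geographical_subject": ["Alberta"], "person_subject": ["Virginia"]}
target = merge_subjects(source)
# target is {"subject": ["Alberta", "Virginia"]}

# Search for resources about the place Virginia:
in_source = "Virginia" in source["geographical_subject"]   # False, which is correct
in_target = "Virginia" in target["subject"]                # True, a false match
```

After the merge, nothing in the target record indicates which value was a place and which was a person, so the distinction cannot be recovered by a later transformation.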

Can you think of an example where resource description elements from one system are available for interaction in another due to a transformation, where the target system does not retain all the details of the descriptions in the source?

Major library system vendors now market so-called discovery portals to their customers, which allow libraries to integrate their local catalogs with central indexes of journal and other full-text databases. The advantages of discovery portals are the seamless access for patrons to all the library’s electronic materials (including externally licensed databases) while maintaining a local and customized look and feel. By providing out-of-the-box solutions, vendors on the other hand bind libraries more closely to their products.

For examples, see Ex Libris Primo (http://www.exlibrisgroup.com/category/PrimoOverview/) or OCLC WorldCat Local (http://www.oclc.org/worldcatlocal/default.htm).

↵
While data encoding describes how information is represented, and data exchange formats describe how information is structured, communication protocols refer to how information is exchanged between systems. These protocols dictate how documents are enclosed within messages and how these messages are transmitted across the network. They address matters such as message format, error detection and reporting, security, and encryption. A number of communication protocols are used over networks today, including File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP), commonly used on the Internet, Post Office Protocol (POP), commonly used for e-mail, and other protocols under the Transmission Control Protocol/Internet Protocol (TCP/IP) suite. Product manufacturers often also employ proprietary protocols, including the Apple Computer Protocols Suite and Cisco Protocols. In addition, different types of networks have corresponding protocols, such as Mobile Wireless Protocols.

↵
Electronic Data Interchange (EDI) is used to exchange formatted messages between computers or systems. Organizations use this format to conduct business transactions electronically without human intervention, such as sending and receiving purchase orders or exchanging invoice information. Four main standards have been developed for EDI: the UN/EDIFACT standard recommended by the United Nations (UN), the ANSI ASC X12 standard widely used in the US, the TRADACOMS standard widely used in the UK, and the ODETTE standard used in the European automotive industry. These standards include formats for a wide range of business activities, such as shipping notices and fund transfers. EDI messages are highly formatted, and the meaning of the information being transmitted depends heavily on its position in the document. For instance, a line in an EDI document with BEG*00*NE*MOG009364501**950910*CSW11096^ corresponds to a line in the X12 standard for Purchase Orders (standard 850). “BEG” specifies the start of a Purchase Order Transaction Set. The asterisk (*) delimits the items in the line, with each value corresponding to a particular field or information component described in the standard. “NE,” for example, corresponds to the Purchase Order Type Code, which in this instance is “New Order.” As can be seen in the example, the description of the information being transmitted is not readily available within the document. Instead, parties exchanging the information must agree on these formats beforehand, and need to ensure that each information item is at the right position within the document so that the receiving party can correctly interpret it.

*EDI samples come from http://miscouncil.org.

The American National Standards Institute (ANSI) can be found at http://www.ansi.org.

↵
This and more examples for difficult categorizations can be found in: (Bowker and Star 2000).

↵
(Linthicum 1999).

↵
Allowing unrestricted access to data and business processes also becomes a problem when working across organizations. Fully integrating systems between two companies, for instance, may entail the exposure of business intelligence and information that should be kept private. This type of exposure is too much for most businesses, regardless of whether the relationship with the other business is collaborative rather than competitive. There are security issues to be considered, as collaborating organizations would need to access private networks and secure servers. The heterogeneity in supporting organizing systems, along with the need to quickly evolve with the rapid changes in an organization’s competitive and collaborative environment, has pushed organizations to shift from more vertical, isolated structures to a more loosely coupled, ecosystem paradigm. This has led to more componentized and modularized systems that need only to exchange information or transform resources when an interaction requires it.

The emerging paradigm then is to enable independent systems to interoperate, or to have “the ability of two or more systems or components to exchange information and to use the information that has been exchanged.” Because the focus is on the exchange of resources or resource descriptions, independent systems need not necessarily know other systems’ underlying logic or implementation, for example, how they store resources. What is important is knowing what kind of resource is expected and in what format (notation, writing system, semantics), and what kind of information is returned for a particular interaction. This is a strategic approach to exchanging resources, as systems can remain highly independent of each other. Changes in one system need not necessarily affect how other systems work as long as the information that is sent and received through an interface stays the same. This allows greater adaptability, as changes to system logic or business processes can be done in self-contained modules without necessarily affecting others. The transformation then happens in an intermediate space where the agreements on resource descriptions are fixed.

↵
To illustrate the difference between a unidirectional and bidirectional map, consider two systems, the Systematized Nomenclature of Medicine – Clinical Terms (SNOMED-CT) and the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM).

SNOMED-CT is a medical language system for clinical terminology maintained by the International Health Terminology Standards Development Organization (IHTSDO) and a designated electronic exchange standard for clinical health information for US Federal Government systems (http://www.nlm.nih.gov/research/umls/Snomed/snomed_main.html).

The ICD-10-CM, on the other hand, is an international diagnostic classification system for general epidemiological, health management, and clinical use maintained by the World Health Organization (WHO) and used for coding and classifying morbidity data from inpatient/outpatient records, physicians’ offices, and most National Center for Health Statistics (NCHS) surveys (http://www.who.int/classifications/icd/en/).

Because many different SNOMED-CT concepts can be mapped to a single ICD-10-CM code, a map in this direction cannot be used in reverse without introducing confusion and ambiguity.

↵
(McBride et al. 2006).

↵
(NISO 2004).

↵
http://journal.code4lib.org/articles/54 (Section 1.), http://www.dlib.org/dlib/june06/chan/06chan.html.

↵
(EDItEUR 2009a).

↵
(EDItEUR 2009b).

↵
http://www.loc.gov/marc/.

↵
(Godby, Smith, and Childress 2008), Sections 1 and 2.

↵
Toward element-level interoperability in bibliographic metadata (Godby, Smith, and Childress 2008), Sec. 4.4, “Switching-Across.” Consider how the Getty has created a crosswalk called Categories for the Description of Works of Art (CDWA) to switch between eleven metadata standards, including Machine-Readable Cataloging/Anglo-American Cataloging Rules (MARC/AACR) and Dublin Core (DC). In this instance, the “Creation Date” element in CDWA is mapped to “260c Imprint – Date of Publication, Distribution, etc.” in MARC/AACR and to “Date.Created” in DC. Although this creates a two-step look-up in real-time, a direct mapping of this element from MARC/AACR to DC is no longer necessary for systems to interoperate.

↵
More commonly, graphical data mapping tools are included in an extract, transform, and load (ETL) database suite that provides additional powerful data transformation capabilities. Whereas data mapping is the first step in capturing the relationships between different systems, data transformation entails code generation that uses the resulting maps to produce an executable transformational program that converts the source data into the target format. ETL databases extract the information needed from the outside sources, transform it into information that can be used by the target system using the necessary data mappings, and then load it into the end system.

↵
Languages such as XSLT and Turing eXtender Language (TXL) ease data transformation, while various commercial data warehousing tools provide functionality such as single- or multiple-source acquisition, data cleansing, and statistical and analytical capabilities. Based on XML, XSLT is a declarative language designed for transforming XML documents into other documents. For example, XSLT can be used to convert XML data into HTML documents for web display or PDF for print or screen display. XSLT processing entails passing an input document in XML format and one or more XSLT style sheets through a template-processing engine to produce a new document.

↵
(Carney et al. 2005).

↵
For an in-depth discussion of interoperability challenges, see Chapter 6 of (Glushko and McGrath 2005).

↵
(AT&T 2011).

↵
(Chan and Zeng 2006), Section 4.3.

↵

License

The Discipline of Organizing: 4th Professional Edition Copyright © 2020 by Robert J. Glushko is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.