When we talk about organizing systems, we often do so in terms of the contents of their collections. This implies that the most fundamental decision for an organizing system is determining its resource domain, the group or type of resources that are being organized. This decision is usually a constraint, not a choice; we acquire or encounter some resources that we need to interact with over time, and we need to organize them so we can do that effectively.
Selecting is the process by which resources are identified, evaluated, and then added to a collection in an organizing system. Selection is first shaped by the domain and then by the scope of the organizing system, which can be analyzed through six interrelated aspects:
the number and nature of users
the time span or lifetime over which the organizing system is expected to operate
the size of the collection
the expected changes to the collection
the physical or technological environment in which the organizing system is situated or implemented
the relationship of the organizing system to other ones that overlap with it in domain or scope.
(In The Organizing System Roadmap, we discuss these six aspects in more detail.)
Selection must be an intentional process because, by definition, an organizing system contains resources whose selection and arrangement were determined by human or computational agents, not by natural processes. And given the broad definition of resource as “anything of value that can support goal-oriented activity,” it follows that resources should be selected by an implicit or explicit assessment to determine whether they can be used to achieve those goals. So even though particular selection methods and criteria vary across resource domains, their common purpose is to determine how well the resource satisfies the specifications for the properties or capabilities that enable a person or nonhuman agent to perform the intended activities. “Fitness for use” is a common and concise way to summarize this idea, and while it highlights the need to have activities in mind before resources are selected to enable them, it also explains why precise selection criteria are harder to define for organizing systems that have diverse sets of stakeholders or users with different goals, like those in public libraries.
Many resources are evaluated and selected one-at-a-time. This makes it impossible to specify in advance every property or criterion that might be considered in making a selection decision, especially for unique or rare resources like those being considered by a museum or private collector. In general, when resources are treated as instances, organizing activities typically occur after selection takes place, as in the closet organizing system with which we began this chapter.
When the resources being considered for a collection are more homogeneous and predictable, it is possible to treat them as a class or set, which enables selection criteria and organizing principles to be specified in advance. This makes selection and organizing into concurrent activities. This would be the case in the data warehouse organizing system, the other example at the beginning of this chapter, because each data source can be described by a schema whose structure is reflected in the organization of the data warehouse. Put another way, as long as subsequent datasets from a specific source do not differ in structure, only in temporal attributes like their creation or acquisition dates, the organization imposed on the initial dataset can be replicated for each subsequent one.
Well-run companies and organizations in every industry are highly systematic in selecting the resources that must be managed and the information needed to manage them. “Selecting the right resource for the job” is a clichéd way of saying this, but this slogan nonetheless applies broadly to raw materials, functional equipment, information resources and datasets, and to people, who are often called “human resources” in corporate-speak.
For some types of resources, the specifications that guide selection can be precise and measurable. Precise specifications are especially important when an organizing system will contain or make use of all resources of a particular type, or if all the resources produced from a particular source become part of the organizing system on some regular schedule. Selection specifications can also be shaped by laws, regulations or policies that require or prohibit the collection of certain kinds of objects or types of information.
For example, when a manufacturer of physical goods selects the materials or components that are transformed into its products, it carefully evaluates the candidate resources and their suppliers before making them part of its supply chain. The manufacturer would test the resources against required values of measurable characteristics like chemical purity, strength, capacity, and reliability. A business looking for transactional or demographic data to guide a business expansion strategy would specify different measurable characteristics; data files must be valid with respect to a schema, must contain no duplicates or personally identifiable information, and must be less than one month old when they are delivered. Similarly, employee selection has become highly data-intensive; employers hire people after assessing the match between their competencies and capabilities (expressed verbally or in a resume, or demonstrated in some qualification test) and what is needed to do the required activities.
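The data-file criteria in this example can be expressed as an executable check. The following is a minimal sketch, not a definitive implementation: the schema in `REQUIRED_FIELDS`, the field names in `PII_FIELDS`, the 31-day reading of "one month," and the function name `evaluate_data_file` are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"customer_id": int, "region": str, "amount": float}
# Hypothetical field names treated as personally identifiable information.
PII_FIELDS = {"name", "email", "ssn"}
# "Less than one month old" is approximated here as at most 31 days.
MAX_AGE = timedelta(days=31)


def evaluate_data_file(records, created_on, delivered_on):
    """Return a list of reasons the file fails the criteria (empty if it passes)."""
    problems = []
    seen = set()
    for i, rec in enumerate(records):
        # Schema validity: every required field is present with the right type.
        for field, ftype in REQUIRED_FIELDS.items():
            if field not in rec or not isinstance(rec[field], ftype):
                problems.append(f"record {i}: invalid or missing '{field}'")
        # No personally identifiable information.
        if PII_FIELDS.intersection(rec):
            problems.append(f"record {i}: contains PII fields")
        # No exact duplicates (values are assumed to be hashable).
        key = tuple(sorted(rec.items()))
        if key in seen:
            problems.append(f"record {i}: duplicate")
        seen.add(key)
    # Freshness at delivery time.
    if delivered_on - created_on > MAX_AGE:
        problems.append("file is more than one month old at delivery")
    return problems
```

A file would be selected only when the returned list is empty; a real system would probably also need to detect near-duplicates, which this exact-match check does not attempt.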
Selection is an essential activity in creating organizing systems whose purpose is to combine separate web services or resources to create a composite service or application according to the business design philosophy of Service Oriented Architecture (SOA). When an information-intensive enterprise or application combines its internal services with ones provided by others via Application Programming Interfaces (APIs), the resources are selected to create a combined collection of services according to the “core competency” principle: resources are selected and combined to exploit the first party’s internal capabilities and those of its service partners better than any other combination of services could. For example, instead of writing millions of lines of code and collecting detailed maps to build an interactive map in an application, you can get access to the Google Maps organizing system with just a few lines of code. (See the sidebar, Selection of Web-based Resources.)
Scientific and business data are ideally selected after assessments of their quality and their relevance to answering specific questions. But this is easy to say and hard to do. It is essential to assess the quality of individual data items to find data entry problems such as misspellings and duplicate records, or data values that are illegal, statistical outliers, or otherwise suspicious. It is also essential to assess the quality of data as a collection to determine if there are problems in what data was collected, by whom or how it was collected and managed, the format and precision in which it is stored, whether the schema governing each instance is rigorous enough, and whether the collection is complete. In addition, copyright, licensing, consumer protection laws, competitive considerations, or simply the lack of incentives to share resources make it difficult to obtain the best or most appropriate resources. (See the sidebar, Assessing and Addressing Data Quality)
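As one illustration of item-level quality assessment, the sketch below flags illegal values and statistical outliers in a list of numbers. It is a sketch under stated assumptions: the text does not prescribe a method, so the outlier test used here (the modified z-score based on the median absolute deviation, with the conventional 0.6745 constant and 3.5 cutoff) is a swapped-in technique, and the function name `flag_suspicious` and the legal range are hypothetical.

```python
from statistics import median


def flag_suspicious(values, legal_range, threshold=3.5):
    """Return (illegal, outliers) for a list of numeric data values.

    illegal  : values outside the inclusive legal_range (lo, hi)
    outliers : values whose modified z-score exceeds the threshold
    """
    lo, hi = legal_range
    illegal = [v for v in values if not lo <= v <= hi]

    med = median(values)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    outliers = []
    if mad > 0:  # if all values are (nearly) identical, nothing is flagged
        outliers = [v for v in values if 0.6745 * abs(v - med) / mad > threshold]
    return illegal, outliers


readings = [10, 11, 9, 10, 12, 10, 95, -4]
illegal, outliers = flag_suspicious(readings, legal_range=(0, 100))
print(illegal)   # [-4]
print(outliers)  # [95, -4]
```

Flagged values are only candidates for review: a value can be legal but suspicious (95 here), and whether it is an error or a genuine observation still requires judgment about how the data was collected.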
In some domains, the nature of the resources or the goals they are intended to satisfy imply selection criteria that are inherently less quantifiable and more subjective. This is easy to see in personal collections, where selection criteria can be unconventional, idiosyncratic, or otherwise biased by the subjective perspective and experience of the collector. Most of the clothes and shoes you own have a reason for being in your closet, but could anyone else explain the contents of your closet and its organizing system, and why you bought that crazy-looking dress or shirt?
Both libraries and museums typically formalize their selection principles in collection development policies that establish priorities for acquiring resources that reflect the people they serve and the services they provide to them. The diversity of user types in public libraries and many museums implies that narrowly-targeted criteria would produce a collection of resources that would fail to satisfy many of the users. As a result, libraries typically select resources on the basis of broader criteria like their utility and relevance to their user populations, and try to choose resources that add the most value to their existing collections, given the cost constraints that most libraries are currently facing. Museums often emphasize intrinsic value, scarcity, or uniqueness as selection criteria, even if the resources lack any contemporary use.
Even when selection criteria can be measured and evaluated in isolation, they are often incompatible or difficult to satisfy in combination. It would be desirable for data to be timely, accurate, complete, and consistent, but these criteria trade off against one another, and any prioritization that values one criterion over another is somewhat subjective. In addition, explicitly subjective perceptions of resource quality are hard to ignore; people are inclined to choose resources that come in attractive packages or that are sold and supported by attractive and friendly people.
Many of the examples in this section have involved selection principles whose purpose was to create a collection of desirable, rare, skilled, or otherwise distinctive resources. After all, no one would visit a museum whose artifacts were ordinary, and no one would watch a sports team made up of randomly chosen athletes because it could never win. However, choosing resources by randomly sampling from a large population is essential if your goal is to make inferences about it without having to study all its instances. Sampling is especially necessary with very large populations when timely decisions are required. A good sample for statistical purposes is one in which the selected resources are not different in any important way from the ones that were not selected.
Sampling is also important when large numbers of resources need to be selected to satisfy functional requirements. A manufacturer cannot test every part arriving at the factory, but might randomly test some of them from different shipments to ensure that parts satisfy their acceptance criteria.
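The manufacturer's approach of randomly testing some parts from each shipment can be sketched as follows. The function names, the `per_shipment` sample size, and the `max_defects` acceptance threshold are assumptions for illustration; real acceptance-sampling plans choose these numbers from statistical tables or standards.

```python
import random


def sample_for_inspection(shipments, per_shipment, seed=None):
    """Pick a simple random sample (without replacement) from each shipment.

    shipments: dict mapping a shipment id to a list of parts.
    """
    rng = random.Random(seed)  # a seed makes the selection reproducible
    picked = {}
    for shipment_id, parts in shipments.items():
        k = min(per_shipment, len(parts))  # small shipments are tested in full
        picked[shipment_id] = rng.sample(parts, k)
    return picked


def shipment_passes(sampled_parts, meets_criteria, max_defects=0):
    """Accept the shipment if the sample has at most max_defects failures."""
    defects = sum(1 for part in sampled_parts if not meets_criteria(part))
    return defects <= max_defects
```

Sampling within each shipment, rather than across all parts at once, guarantees that every shipment is represented, which matters when quality varies from one delivery to the next.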
Looking “Upstream” and “Downstream” to Select Resources
As we have seen, selection principles and activities differ across resource domains, and there is another important difference in selection that considers resources from the perspective of their history or the future.
In “Selection Criteria” we discussed the activity of selecting resources by assessing their conformance with specifications for required properties or capabilities. However, if you can determine where the resources come from, you can make better selection decisions by evaluating the people, processes, and organizing systems that create them. Using the analogy of a river, we can follow a resource “upstream” from us until we find the “headwaters.” Physical resources might have their headwaters in a factory, farm, or artist’s studio. Digital resources might have headwaters in a government agency, a scientist’s laboratory, or a web-based commerce site.
When interaction resources (“The Concept of “Interaction Resource””) are incorporated into the organizing system that creates them, as when records of a person’s choices and behaviors are used to personalize subsequent information, the headwaters are obviously easy to find. However, even though finding the headwaters where resources come from is often not easy and sometimes not possible, that is where you are most likely to find the people best able to answer the questions, described in Design Decisions in Organizing Systems, that define any organizing system. The resource creators or producers will know the assumptions and tradeoffs they made that influence whether the resources will satisfy your requirements, and you can assess what they (or their documents that describe the resources) tell you and the credibility they have in telling it. You should also try to evaluate the processes or algorithms that produce the resources, and then decide if they are capable of yielding resources of acceptable quality.
The best outcome is to find a credible supplier of good quality resources. However, if an otherwise desirable supplier does not currently produce resources of sufficient quality, it is worth trying to improve the quality by changing the process using instruction or incentives. Advocates for open government have succeeded in getting numerous US government entities to publish data for free in machine-readable formats, but it was partly as a result of somewhat subversive demonstration projects and shaming that the government finally created data.gov in 2009. A clear lesson from the “quality movement” and statistical process control is that interventions that fix quality problems at their source are almost always a better investment than repeated work to fix problems that were preventable. But if you cannot find the headwaters or you are not able to address quality problems at their source, you can sometimes transform the resources to give them the characteristics or quality they need. (See the sidebar, Assessing and Addressing Data Quality, and “Transforming Resources for Interactions”.)
When you cannot obtain resources directly from their source, even if you have confidence in their quality at that point, it is important to analyze any evidence or records of their use or interactions as they flow downstream. (See “Resources over Time”) Physical resources are often associated with printed or digital documents that make claims about their origin and authenticity, and often have bar codes, RFID tags, or other technological mechanisms that enable them to be tracked from their headwaters to the places where they are used. Tracking is very important for data resources because they can often be added to, derived from, or otherwise changed without leaving visible traces. Just as the water from melted mountain snow becomes less pure as it flows downstream, a data resource can become “dirty” or “noisy” over time, reducing its quality from the perspective of another person or computational agent further downstream. Data often gets dirty when it is combined with other datasets that contain duplicate or seemingly-duplicate information. Data can also become dirty when the hardware or software that stores it changes. Subtle differences in representation formats, transaction management, enforcement of integrity constraints, and calculations of derived values can change the original data.
In addition, a data resource can become inaccurate or obsolete simply because the world that the data describes has changed with the passage of time. People move, change jobs, get married or divorced, or die. Likewise, companies move, merge, get spun off, or go out of business. A poll taken a year before an election is often not a good predictor of the ultimate winner.
Other selection processes look “downstream” to select resources on the basis of predicted rather than current properties, capability, or suitability. Sports teams often sign promising athletes for their minor league teams, and businesses hire interns, train their employees, and run executive development programs to prepare promising low-level managers for executive roles. Businesses sometimes conduct experiments with variable product offers and pricing to collect data they will need in the future to power predictive models that will repay the investment in data acquisition many times over.
Some governments attempt to preserve and prevent misappropriation of cultural property by enforcing import or export controls on antiquities that might be stolen from archaeological sites (Merryman 2006). For digital resources, privacy laws prohibit the collection or misuse of personally identifiable information about healthcare, education, telecommunications, and video rental, and might soon restrict the information collected during web browsing.
The popular LinkedIn site, which has hundreds of millions of resumes that it data mines to find statistically superior job candidates, is a gold mine for the company because it makes money by referring those candidates to potential employers. Data-intensive hiring practices in baseball are entertainingly presented in the book Moneyball (Lewis 2003) or the 2011 movie starring Brad Pitt. Pro football teams have begun to assess college football players by comparing them statistically with the best pro players (Robbins 2016).
Many examples of business strategies that required significant investment to acquire data assets with no current value are reported in (Provost and Fawcett 2013).
See (Cherbakov et al. 2005), (Erl 2005a). The essence of SOA is to treat business services or functions as components that can be combined as needed. An SOA enables a business to quickly and cost-effectively change how it does business and whom it does business with (suppliers, business partners, or customers). SOA is generally implemented using web services that exchange Extensible Markup Language (XML) documents in real-time information flows to interconnect the business service components. If the business service components are described abstractly, it can be possible for one service provider to be transparently substituted for another (a kind of real-time resource selection) to maintain the desired quality of service. For example, a web retailer might send a Shipping Request to many delivery services, one of which is selected to provide the service. It probably does not matter to the customer which delivery service handles the package, and it might not even matter to the retailer.
The idea that a firm’s long-term success can depend on just a handful of critical capabilities that cut across current technologies and organizational boundaries makes a firm’s core competency a very abstract conceptual model of how it is organized. This concept was first proposed by (Prahalad and Hamel 1990), and since then there have been literally hundreds of business books that all say essentially the same thing: you cannot be good at everything; choose what you need to be good at and focus on getting better at it; let someone else do things that you do not need to be good at doing.
See (Borgman 2000) on digitization and libraries. But while shared collections benefit users and reduce acquisition costs, if a library has defined itself as a physical place and emphasizes its holdings (the resources it directly controls), it might resist anything that reduces the importance of its physical reification, the size of its holdings, or the control it has over resources (Sandler 2006). A challenge facing conventional libraries today is to make the transition from emphasizing creation and preservation of physical collections to facilitating the use and creation of knowledge regardless of its medium and the location from which it is accessed.
(Arasu et al. 2001), (Manning et al. 2008). The web is a graph, so all web crawlers use graph traversal algorithms to find URIs of web resources and then add any hyperlink they find to the list of URIs they visit. The sheer size of the web makes crawling its pages a bandwidth- and computation-intensive process, and since some pages change frequently and others not at all, an effective crawler must be smart about how it prioritizes the pages it collects and how it re-crawls pages. A web crawler for a search engine can determine the most relevant, popular, and credible pages from query logs and visit them more often. For other sites, a crawler adjusts its “revisit frequency” based on the “change frequency” (Cho and Garcia-Molina 2000).
Web resources are typically discovered by computerized “web crawlers” that find them by following links in a methodical automated manner. Web crawlers can be used to create topic-based or domain-specific collections of web resources by changing the “breadth-first” policy of generic crawlers to a “best-first” approach. Such “focused crawlers” only visit pages that have a high probability of being relevant to the topic or domain, which can be estimated by analyzing the similarity of the text of the linking and linked pages, terms in the linked page’s URI, or locating explicit semantic annotation that describes their content or their interfaces if they are invokable services (Bergmark et al. 2002), (Ding et al. 2004).
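The “best-first” policy can be illustrated with a priority queue in place of the first-in, first-out queue that a breadth-first crawler would use. This is a minimal sketch that runs over an in-memory dictionary rather than the live web; the `pages` structure, the function names, the term-overlap relevance measure, and the `min_score` cutoff are all assumptions for illustration, not the methods of the cited papers.

```python
import heapq
import re


def relevance(text, topic_terms):
    """Fraction of topic terms that appear in the text (a crude similarity)."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words & topic_terms) / len(topic_terms)


def focused_crawl(pages, seed, topic_terms, min_score=0.5, limit=10):
    """Visit pages in order of estimated relevance, starting from seed.

    pages: dict mapping a URI to (page_text, list_of_linked_URIs),
           standing in for fetching and parsing real web pages.
    """
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    queued = {seed}
    visited = []
    while frontier and len(visited) < limit:
        _, uri = heapq.heappop(frontier)  # most promising URI first
        text, links = pages[uri]
        visited.append(uri)
        page_score = relevance(text, topic_terms)
        for link in links:
            if link in queued or link not in pages:
                continue
            # Estimate relevance before visiting, from the linking page's
            # text and from the terms in the linked page's URI.
            score = max(page_score, relevance(link, topic_terms))
            if score >= min_score:  # unpromising links are never queued
                heapq.heappush(frontier, (-score, link))
                queued.add(link)
    return visited
```

With a seed page whose text is off-topic, only links whose URIs contain topic terms are followed; once a relevant page is reached, its outgoing links inherit its score, which reflects the observation that relevant pages tend to link to other relevant pages.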
The FTC Fair Information Practice Principles say that consumer data collected for one purpose cannot be used for other purposes without the consumer’s consent. They are sometimes called the consumer privacy bill of rights.
Large research libraries have historically viewed their collections as their intellectual capital and have policies that specify the subjects and sources that they intend to emphasize as they build their collections. See (Evans 2000). Museums are often wary of accepting items that might not have been legally acquired or that have claims on them from donor heirs or descendant groups; in the USA, much controversy exists because museums contain many human skeletal remains and artifacts that Native American groups want to be repatriated.
Adding a resource to a museum implies an obligation to preserve it forever, so many museums follow rigorous accessioning procedures before accepting it. Likewise, archives usually perform an additional appraisal step to determine the quality and value of materials offered to them.
In archives, common appraisal criteria include uniqueness, the credibility of the source, the extent of documentation, and the rights and potential for reuse. To oversimplify: libraries decide what to keep, museums decide what to accept, and archives decide what to throw away.
See (Tauberer 2014) for a history of “civic hacking” and the open data movement.
The Sunlight Foundation (http://sunlightfoundation.com/) and Code For America (https://www.codeforamerica.org/) are good sources for keeping up with open government issues and initiatives.
For a classification and review of data cleaning problems and methods, see (Rahm and Do 2000). A recent and popular analysis that describes data cleaning as “data wrangling, data munging, and data janitor work” is (Lohr 2014). For a survey of anomaly detection, see (Chandola 2009).