13 Artificial Intelligence Part 1: How is AI Used in Research?
Rachael Samberg and Katie Zimmerman
Because licensing artificial intelligence (AI) usage and training rights can be so complex, and is best approached with a grounding in how scholars use AI in their research, we have divided this training section into four chapters:
- Part 1: How is AI used in research? (this chapter)
- Part 2: How does the law govern AI use and training?
- Part 3: How can we license AI uses and training rights?
- Part 4: An AI negotiation case study
In the text and data mining chapter, we addressed the ways that some TDM or other computational research methodologies may need to incorporate artificial intelligence (AI) tools to yield results. We gave the example there of a scholar needing to train an AI tool with parameters of what a “happy” character looks like so that the tool could then search for other instances of happy characters in a larger body of literature. But there are countless other examples we could have given because, for more than a decade, scholars have relied on TDM with AI modeling, including for studies on: changes in gender significance in fiction; slave narratives and perceptions of religion; the social and economic impact of Hurricane Katrina on Black New Orleans; the spread of conspiracy theories; racial disparities in policing; the representation of race, gender, and place in film and television; the discovery of new chemical syntheses; and much more. AI tools are also likely seeing increased use in non-TDM research methodologies.
Yet an increasingly fraught discussion has recently emerged on academic and library listservs about licensing scholarly content when scholars’ research methodologies rely on AI tools, because publishers have begun heavily restricting such usage. Indeed, libraries are now being presented with content license agreements that prohibit AI tools and training entirely, irrespective of scholarly purpose.
Conversely, publishers, vendors, and content creators—a group we’ll call “rightsholders” here—have expressed valid concerns about how their copyright-protected content is used in AI training, particularly in a commercial context unrelated to scholarly research. Rightsholders fear that their livelihoods are being threatened when generative AI tools are trained and then used to create new outputs that they believe could infringe upon or undermine the market for their works.
We believe that within the context of non-profit academic research, rightsholders’ fears about allowing AI training, and especially non-generative AI training, are misplaced. Publishers and libraries can and do implement sufficient safeguards in e-resource licenses to support scientific research while protecting rightsholders’ interests.
But what should that language be? It depends, because the scope of AI needs will vary by institution. Some institutions will wish to preserve and negotiate for rights to use “homegrown” (i.e. institution- or scholar-developed) AI tools, and others may also or instead wish to preserve rights to use commercial AI tools developed by third parties. In both the “homegrown” and third-party AI tool contexts, an institution should also decide whether it wants to preserve non-generative AI tool usage, generative AI tool usage, or both. (We discuss the distinction between non-generative and generative AI more below.) Ultimately, deciding what to preserve or negotiate for, and how hard to push for it, depends on the institution, but it pays to think broadly in order to anticipate scholars’ needs.
It is safe to say that in all events, newly emerging content license agreements that prohibit usage of AI entirely, or charge exorbitant fees for it as a separately licensed right, will be devastating for scientific research and the advancement of knowledge. We aim in these AI chapters to empower academic librarians with legal information about why those licensing outcomes are unnecessary, and to equip them with alternative licensing language that adequately addresses rightsholders’ concerns.
We will start by explaining how AI is currently being used in research.
What is AI?
The most common way we’ve seen AI used in research methodologies is in text and data mining—which refers generally to the reliance on computational tools, algorithms, and automated techniques to extract information from copyrighted works, or categorize or classify factors or relationships in or between such data sources. AI can help in this process.
AI refers to the capability of a computer system to perform cognitive-equivalent functions such as learning and problem-solving, through reliance on math and logic. Several subsets of AI reflect the component stages and processes by which the computer system gets to the point where it can perform those cognitive-equivalent functions. For instance, machine learning is “the process of using mathematical models of data to help a computer learn without direct instruction.” In other words, if an “intelligent” computer can perform cognitive-equivalent functions, machine learning is how that computer system develops this intelligence. Likewise, natural language processing is a branch of AI that develops a computer’s ability to understand and communicate in human language through recognition, understanding, and generation of text and speech.
Some publishers may try to distinguish AI from its component processes, but it’s helpful to think of all of these learning steps and phases as part of the computer system’s overall “intelligence.”
How is AI used in research?
AI is being used in research in both a non-generative and generative capacity across the humanities, social sciences, and sciences.
Non-generative AI
In 2018, researchers trained an AI tool to understand whether a character is “masculine” or “feminine” by looking at the tacit assumptions expressed in words associated with that character. That tool can then look at other texts and identify masculine or feminine characters based on its prior training, and scholars can therefore use texts from different time periods to study representations of masculinity and femininity over time. No licensed content, no licensed or copyrighted books, and no licensed text from a publisher is released to the world by the tool. The tool is simply performing the same analysis a person would, but on a much larger body of works. Scholars have created thousands of such trained tools to do things like detect gender, recognize faces, and more.
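For a sense of the mechanics, here is a minimal, hypothetical sketch of this kind of training using scikit-learn. The word lists, labels, and two-class framing are all invented for illustration; the actual 2018 study used far richer features and methods.

```python
# A hypothetical sketch of training a classifier on words associated
# with characters, then applying it to characters from other texts.
# All data here is invented. Requires scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Bags of words associated with each training character
# (e.g., verbs and adjectives attached to them in the text).
character_words = [
    "sword rode commanded gruff",
    "parlor embroidered gentle blushed",
    "hunted shouted fought stern",
    "sang gown graceful tender",
]
labels = ["masculine", "feminine", "masculine", "feminine"]

# The model learns word-label associations from labeled examples;
# no rules are hand-written.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(character_words)
clf = LogisticRegression().fit(X, labels)

# Apply the trained model to a character from another text. Only the
# learned word weights are used -- no training text is stored or
# redistributed by the model.
new_character = vectorizer.transform(["stern commanded hunted"])
print(clf.predict(new_character))  # -> ['masculine']
```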
In 2023, a researcher published a paper demonstrating that, after a 2008 California Supreme Court opinion forced the state’s parole board to justify decisions to keep a person in prison by focusing more on the person’s current and future level of dangerousness, the 9,842 parole hearing transcripts analyzed do indeed reflect a more rehabilitative way of discussing whether parole is justified. The scholar determined this by using a natural language processing technique known as word2vec. Essentially, in a word embedding model like word2vec, each unique word in the vocabulary of a corpus—in this case the transcripts—is represented as a vector of numbers based on how frequently it co-occurs alongside other “context” words. These vectors provide coordinates for locating (or “embedding”) words in a continuous, multidimensional embedding space. Words that are used in similar contexts tend to appear close to each other within this embedding space. For instance, as the author points out, when analyzing newspapers we would expect “cloudy” and “sunny” to appear in sentences discussing weather, meaning they would have a substantial overlap in their contexts—and be placed nearer each other in the embedding space—even though they don’t have similar definitions. This makes it easy to determine whether a passage is, indeed, discussing the weather. In this paper, the scholar used word2vec with the parole hearing transcripts to create word embeddings, trained the embedding model on the portions of the transcripts in which the parole commissioners announced and justified their decisions, and found that these discussions did indeed become more focused on rehabilitation concepts after 2008.
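To make the word2vec workflow concrete, here is a minimal sketch using the gensim library and a tiny invented corpus; the study itself worked with thousands of full transcripts and a more involved training setup.

```python
# A minimal sketch of the word2vec workflow described above, using
# the gensim library (pip install gensim). The four-sentence corpus
# is invented for illustration; real work would tokenize thousands
# of full transcripts.
from gensim.models import Word2Vec

corpus = [
    ["the", "morning", "was", "sunny", "and", "warm"],
    ["the", "afternoon", "turned", "cloudy", "and", "cool"],
    ["the", "board", "discussed", "rehabilitation", "and", "remorse"],
    ["the", "board", "discussed", "insight", "and", "rehabilitation"],
]

# Train embeddings: each unique word becomes a vector positioned
# according to the contexts (windows of nearby words) it appears in.
model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, seed=42)

# Words used in similar contexts ("sunny"/"cloudy") should sit closer
# together than unrelated words. On a toy corpus these scores are
# noisy; the pattern only becomes reliable at scale.
print(model.wv.similarity("sunny", "cloudy"))
print(model.wv.similarity("sunny", "rehabilitation"))
```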
Generative AI
Each of the two examples above involves using non-generative AI tools in computational research—the AI tool isn’t creating new content. But scholars are also performing computational research by training generative AI tools.
For instance, all the way back in 2017, chemists trained a generative tool on 12,000 published research papers regarding synthesis conditions for metal oxides, so that the tool could identify anticipated chemical outputs and reactions for any given set of synthesis conditions entered into it. The tool they created is not capable of reproducing or redistributing any licensed content from the papers; it has merely learned conditions and outcomes and can predict chemical reactions based on those conditions and outcomes.
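As a rough illustration of the underlying idea, here is a hypothetical sketch of a model that learns condition-to-outcome relationships. The features, encodings, and data are invented, and the actual 2017 tool was generative and far more sophisticated, but the key point holds: the trained model stores learned parameters, not text from the papers.

```python
# A hypothetical, highly simplified sketch: train on
# (synthesis conditions -> outcome) pairs extracted from the
# literature, then predict outcomes for unseen conditions.
# All data below is invented. Requires scikit-learn.
from sklearn.ensemble import RandomForestClassifier

# Features: [temperature_C, time_hours, precursor_code]
# (precursor_code is a made-up categorical encoding).
conditions = [
    [450, 2, 0],
    [900, 6, 0],
    [500, 3, 1],
    [950, 8, 1],
]
outcomes = ["nanoparticle", "bulk_oxide", "nanoparticle", "bulk_oxide"]

# The fitted model holds only learned decision rules -- nothing of
# the source papers' text survives in it.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(conditions, outcomes)

# Predict the expected product for unseen synthesis conditions.
print(model.predict([[880, 5, 0]]))  # -> likely 'bulk_oxide'
```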
In 2023, scholars trained a third-party-created open-source natural language processing (NLP) tool called Chemical Data Extractor (CDE). Among other things, CDE can be used to extract chemical information and properties identified in scholarly papers. In this case, the scholars wanted to teach CDE to parse a specific type of chemical information: metal-organic frameworks, or MOFs. Generally speaking, the CDE tool works by breaking sentences into “tokens” like parts of speech and referenced chemicals. By correlating tokens, one can determine that a particular chemical compound has certain synthetic properties, topologies, reactions with solvents, etc. The scholars trained CDE specifically to parse MOF names, synthesis methods, inorganic precursors, and more—and then exported the results into an open-source database that identifies the MOF properties for each compound. Anyone can now use both the trained CDE tool and the database of MOF properties to ask different chemical property questions or identify additional MOF production pathways—thereby improving materials science for all. Neither the CDE tool nor the MOF database reproduces or contains the underlying scholarly papers that the tool learned from.
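To illustrate the token-and-extract workflow, here is a minimal sketch assuming the basic documented Document interface of the ChemDataExtractor package. The example sentence is invented, and the MOF-specific parsing described above required custom training beyond what is shown.

```python
# A minimal sketch of tokenization and extraction with the
# ChemDataExtractor package (pip install chemdataextractor),
# using only its basic documented Document interface. The input
# sentence is invented; the MOF-specific work described above
# required additional custom parsers and training.
from chemdataextractor import Document

doc = Document(
    "HKUST-1 was synthesized from copper nitrate and "
    "benzene-1,3,5-tricarboxylic acid in ethanol at 110 °C."
)

# CDE tokenizes the text and tags chemical entity mentions, from
# which it assembles structured records -- facts about compounds,
# not reproductions of the source document.
print(doc.cems)                 # chemical entity mentions
print(doc.records.serialize())  # extracted records as plain data
```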
Moreover, this work could be replicated in a closed-hosted environment in which a trained third-party tool is never distributed (such as with an enterprise version of ChatGPT). All that would be disseminated is the output of the tool’s determination—i.e. the feature assessments of any given chemical compound—and that output contains only facts, nothing of the original expressive content, which is the part actually protected by copyright.
Indeed, there are hundreds of AI tools that scholars have trained and disseminated—tools that do not reproduce licensed content—and that scholars have created or fine-tuned to extract chemical information, recognize faces, decode conversations, infer character types, and so much more. Restrictive licensing language suppresses the research inquiries and societal benefits that these tools make possible. It may also disproportionately affect the advancement of knowledge in or about developing countries, which may lack the resources to secure licenses or be forced to rely on open-source or poorly coded public data—hindering journalism, language translation, and language preservation.
Lawfulness
In undertaking all of this work, these scholars are making reproductions of typically copyright-protected content to undertake their analyses. Reproduction is one of the exclusive rights that only copyright holders can exercise. So why can scholars reproduce other people’s work for their research processes? Because their use is a fair use—an exception to the exclusive rights granted to copyright owners. We’ll address the legality of AI use and training in research in Part 2.