"

12 Text and Data Mining

Rachael Samberg and Katie Zimmerman

Desired result

Text and data mining (or TDM) refers to research methodologies that rely on computational tools, algorithms, and automated techniques to extract revelatory information from large sets of unstructured or thinly structured digital content. Courts have addressed fair use in the context of TDM and determined that reproducing copyrighted works to create a collection of copyright-protected works, and to conduct TDM on that collection, is a fair use. This means that a typical fair use savings clause should preserve Authorized Users’ existing rights to conduct TDM. Therefore, if you are concerned that you might create friction by negotiating specifically for TDM rights, you should trust that silence about TDM in the agreement does not preclude the right to conduct TDM according to the law.

However, not all publishers understand or agree with what the court rulings have determined. And when publishers detect that users are conducting automated downloads or processes with the licensed content, they may try to treat these acts as breaches of the license agreement and terminate campus access.

Thus, for the avoidance of doubt, it can be helpful to address TDM directly in the license to ensure that your users may undertake it. Indeed, some publishers wish to regulate TDM uses separately from fair use anyway, given that TDM practices involve the downloading or reproduction of large quantities of licensed content that could (if released to the world) jeopardize a publisher’s business model. So, it is valuable (and sometimes necessary) to include a specific TDM clause in an agreement in order to protect users’ rights to conduct TDM and share the results of their TDM analysis in some fashion.

Keep in mind, though, that a TDM clause (or separate TDM agreement) will then establish the specific scope of allowable TDM activities. So you need to be careful that these rights are not actually narrower than what fair use would have allowed your users to do. It’s a balancing act: If you have a TDM clause, you may lose something in terms of the breadth that fair use would have allowed. But it also provides certainty with respect to authorizing TDM (as measured in terms of actual and foreseeable user community needs) to avoid any dispute or doubt about the scope of what a publisher will permit.

Finally, you also need to consider whether any TDM definition or license is broad enough to encompass training artificial intelligence, if such an outcome is desirable for your institution. We provide tips on how to do that within a TDM clause below, but we also address artificial intelligence as a stand-alone concept and clause in a separate chapter.

What it means:

What is TDM?

Imagine you have a book like “Pride and Prejudice.” There are nearly infinite volumes of computable information stored inside that book, depending on your scholarly inquiry, such as how many female vs. male characters there are, what types of words the female characters use as opposed to the male characters, what types of behaviors the female characters display relative to the males, etc. TDM allows researchers to identify and analyze patterns, trends, and relationships across volumes of data that would otherwise be impossible to sift through on a close examination of one book or item at a time.
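
To make the idea concrete, here is a minimal sketch of the simplest kind of TDM: counting how often chosen words appear in a text. The sample passage, the function name `word_frequencies`, and the list of target words are all hypothetical and chosen only for illustration; a real project would run the same kind of count across an entire corpus.

```python
import re
from collections import Counter


def word_frequencies(text, targets):
    """Count how often each target word appears in text, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Counter returns 0 for words that never appear.
    return {word: counts[word] for word in targets}


# Hypothetical stand-in for a passage from a work in the corpus.
sample = (
    "She was happy, and her sister was happy too. "
    "He was proud; she thought his pride was insufferable."
)

frequencies = word_frequencies(sample, ["happy", "pride", "proud", "she", "he"])
print(frequencies)
```

Note that the output is a set of counts, not a copy of the passage: the result describes the text without re-expressing it.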

TDM is a fair use

Over the past decade, fair use has been interpreted by the courts[1] and the Copyright Office[2] to permit the reproduction of copyrighted works to create and mine a corpus of copyright-protected works. These authorities further hold that making derived data, results, abstractions, metadata, or analysis from the copyright-protected corpus available to the public is also fair use, as long as the research methodologies or data distribution processes do not re-express the underlying works to the public in a way that could supplant the consumer market for the originals.[3]

Using artificial intelligence to conduct TDM

For the same reasons that the TDM process is fair use of copyrighted works, the training of artificial intelligence (AI) tools (e.g. through a process called “machine learning”) to facilitate that TDM should also be fair use, in large part because training does not reproduce or communicate the underlying copyrighted works to the public. We say “should also be fair use” because the courts and the Copyright Office are facing this issue right now.[4] For our part, we have encouraged the Copyright Office to protect the fair use rights of scholars and researchers to make these uses of copyright-protected works when training AI.[5]

Why are we bringing up AI?

It is helpful first to understand that not all TDM research methodologies necessitate the usage of AI systems to extract information. For instance, as in the “Pride and Prejudice” example above, sometimes TDM can be performed by developing algorithms to detect the frequency of certain words within a corpus, or to parse sentiments based on the proximity of various words to each other. In other cases, though, scholars must employ machine learning techniques to train AI models before the models can make a variety of assessments.

Here is an illustration of the distinction: Imagine a scholar wishes to assess the prevalence with which 20th century fiction authors write about notions of happiness. The scholar likely would compile a corpus of thousands or tens of thousands of works of fiction, and then run a search algorithm across the corpus to detect the occurrence or frequency of words like “happiness,” “joy,” “mirth,” “contentment,” and synonyms and variations thereof. But if a scholar instead wanted to establish the presence of fictional characters who embody or display characteristics of being happy, the scholar would need to employ discriminative modeling (a classification and regression technique) that can train AI to recognize the appearance of happiness by looking for recurring patterns in the indicia of character psychology, behavior, attitude, conversational tone, demeanor, appearance, and more. And to undertake this type of AI training, a scholar would need to use a large volume of licensed works.
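
As a rough illustration of what “training” involves, the sketch below fits a perceptron, one very simple discriminative classifier, to a handful of labeled sentences. Everything in it is an assumption made for illustration: the four hand-labeled training sentences, the labels (+1 for “happy,” -1 for “not happy”), and the function names. Actual research would use far richer models, and it would need far more text, which is why a large volume of licensed works matters.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def train_perceptron(examples, epochs=10):
    """Learn one weight per word from (text, label) pairs; label is +1 or -1."""
    weights = {}
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            words = tokenize(text)
            score = bias + sum(weights.get(w, 0.0) for w in words)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Nudge the weights of the words in this example toward its label.
                for w in words:
                    weights[w] = weights.get(w, 0.0) + label
                bias += label
    return weights, bias


def predict(model, text):
    """Return +1 (happy) or -1 (not happy) for a new piece of text."""
    weights, bias = model
    score = bias + sum(weights.get(w, 0.0) for w in tokenize(text))
    return 1 if score > 0 else -1


# Hypothetical hand-labeled examples: +1 = happy, -1 = not happy.
training = [
    ("she laughed and danced across the room", 1),
    ("he smiled warmly at his old friend", 1),
    ("she wept alone in the cold dark house", -1),
    ("he sighed and stared at the empty chair", -1),
]

model = train_perceptron(training)
print(predict(model, "they laughed and smiled"))
print(predict(model, "she wept and sighed"))
```

Unlike a keyword search, the trained model never looks for the word “happy” at all. It learns from the labeled examples which words tend to accompany happy or unhappy characters, and it stores only numeric weights, not the sentences it was trained on.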

So, if it is important on your campus that scholars be able to undertake machine learning or train AI as part of their TDM methodologies, then you’ll want to ensure that any TDM clause or language is broad enough to encompass these activities. References to “machine learning” and “computational analysis and modeling” can help encompass AI training activities.

When you need to specify particular access methods

It may also be helpful to specify how Authorized Users will gain access to the volume of content needed for TDM.  If this is left unspecified, then end users (or all of campus) may find themselves blocked when a researcher attempts to download large quantities of material for TDM use. Vendors who are supportive of TDM use in principle may still not have tools or workflows in place to enable access to large volumes of licensed material all at once (e.g. it could slow vendor services), and it can be helpful to provide language around what this process will look like.  In general it is most desirable for TDM use to not require direct mediation by the library or a vendor, but a TDM access method that requires facilitation by library staff may still be preferable to getting publisher approval for each TDM project. We provide language below with options for several different common methods of TDM access.
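
For readers curious what a negotiated download rate looks like in practice, here is a minimal sketch of a throttled download loop. It assumes a hypothetical limit of one request per second; the function names, the `licensor.example` addresses, and the stand-in `fake_fetch` function are all invented for illustration and do not correspond to any vendor's actual interface.

```python
import time


def fetch_all(urls, fetch, min_interval=1.0, sleep=time.sleep, clock=time.monotonic):
    """Call fetch(url) for each URL, spacing call starts at least min_interval seconds apart."""
    results = []
    last_start = None
    for url in urls:
        if last_start is not None:
            # Wait out whatever remains of the interval since the previous request.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        results.append(fetch(url))
    return results


# Hypothetical stand-in for a real download function (no network access here).
def fake_fetch(url):
    return f"contents of {url}"


documents = fetch_all(
    [
        "https://licensor.example/article/1",
        "https://licensor.example/article/2",
    ],
    fake_fetch,
    min_interval=0.01,  # shortened for the demonstration; a license might specify 1.0
)
print(documents)
```

A researcher's script that paces itself this way is much less likely to slow vendor services or trigger automated blocking than one that requests everything at once.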

Addressing post-TDM research project data retention

For e-journal and e-book content it likely is not necessary to separately specify data retention for content acquired by Authorized Users for TDM, particularly if there is already a perpetual access clause in the license.  Some vendors, however, are more sensitive about retention of licensed content used for TDM because of the volumes of downloads involved. This can be particularly challenging with business databases and other vendors who are not primarily engaged with the academic market. Some vendors will, therefore, try to require that the content acquired for TDM be deleted after a specified period of time, or after the completion of a specific TDM project. This is not viable in academic research for several reasons.

  • First, most academic projects do not have a discrete start and end – one “project” will result in multiple publications, the result of one analysis will suggest the next which will require the same data, etc.  Vendors who more frequently interact with business or industry may be assuming that the data can be downloaded and used towards a deliverable and then will no longer be needed, which is not how academic research progresses.
  • Perhaps more importantly, researchers also need to maintain datasets for purposes of replication and validation of their results. Reviewers and other researchers will need to be able to determine that the research methodology used in the study is valid and that results are accurate, which generally requires access to the source data. Depending on the research needs, the source data that is shared may not need to be fully public; it could, for example, consist of metadata and relevant snippets.[6] Generally, however, a full copy of the dataset needs to be retained in some manner.
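
As an illustration of what “metadata and relevant snippets” might look like, the sketch below builds a small shareable record for one document. The record layout, the function name `describe_document`, and the sample text are assumptions made for illustration only. The checksum is one possible way to let another researcher who lawfully holds the same text confirm they are working from identical source data; it does not replace retaining the full dataset.

```python
import hashlib
import re


def describe_document(doc_id, text, keyword, context_chars=30):
    """Return metadata and short keyword-in-context snippets for one document."""
    snippets = []
    for match in re.finditer(re.escape(keyword), text, flags=re.IGNORECASE):
        # Keep only a small window of text around each occurrence.
        start = max(0, match.start() - context_chars)
        end = min(len(text), match.end() + context_chars)
        snippets.append(text[start:end])
    return {
        "id": doc_id,
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "length_chars": len(text),
        "keyword": keyword,
        "snippets": snippets,
    }


# Hypothetical stand-in for the full text of a licensed document.
full_text = (
    "The ledger showed a profit in March. "
    "By June the profit had vanished entirely."
)

record = describe_document("doc-001", full_text, "profit", context_chars=12)
print(record["snippets"])
```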

We provide language below to clarify that datasets downloaded for TDM research may be retained as needed for the scope of the project.

Desired language:

For a TDM clause included in the “main” agreement

[CDL Model Language]

Text and Data Mining. Authorized Users may use the Licensed Materials to perform and engage in text and/or data mining activities for academic research, scholarship, and other educational purposes and may utilize and share the results of text and/or data mining in their scholarly work and make the results available for use by others, so long as the purpose is not to create a product for use by third parties that would substitute for the Licensed Materials. Licensor will, upon receipt of written request, cooperate with Licensee and Authorized Users as reasonably necessary in making the Licensed Materials available in a manner and form most useful to the Authorized User.  Licensor shall provide to Licensee, upon request, copies of the Licensed Materials for text and data mining purposes without any extra fees.

[Use the following, if the last sentence is not accepted.]

If Licensee or Authorized Users request the Licensor to deliver or otherwise prepare copies of the Licensed Materials for text and data mining purposes, any fees charged by Licensor shall be solely for preparing and delivering such copies on a time and materials basis.

For TDM rights negotiated separately or as an amendment

DEFINITIONS

“Authorized Users” include full and part time employees (including faculty, staff, and independent contractors) and students of Licensee, regardless of the physical location of such persons. Authorized Users also include patrons not affiliated with Licensee who are physically present at Licensee’s site(s) (“walk-ins”).

“Licensed Materials” are the materials identified in Appendix A subject to this TDM License.

“Text and Data Mining” or “TDM” means to download, extract, analyze, classify, model, or index the Licensed Materials, or information from the Licensed Materials, using computational tools, algorithms, machine learning, artificial intelligence, or automated techniques.

“TDM Outputs” mean the result(s) of any TDM activity or operation, capable of fixation, reproduction and/or communication in any form. This may include but is not limited to: the creation of an index, reference, abstract, description, model, or representation of the Licensed Materials; an algorithm, formula, metric, method, standard, or taxonomy describing or based on the Licensed Materials; a relational expression or measurement of the Licensed Materials; or an extraction, representation, expression, or discussion of any extracts from Licensed Materials upon which TDM has been performed.

GRANT OF LICENSE: Licensee and Authorized Users may conduct TDM on the Licensed Materials for non-profit scholarly, research, or educational purposes. Licensee and Authorized Users may utilize and share the TDM Outputs, or the analysis or derived data from conducting TDM, in their scholarly work and make such TDM Outputs, analysis, or results available for use by others, except to the extent that doing so would substantially reproduce or redistribute the original Licensed Materials for third parties, or create a product for use by third parties that would substitute for the Licensed Materials.

LIMITATIONS ON LICENSEE: Unless otherwise provided for in writing by Licensor, the Licensee and Authorized Users shall:

  1. Use commercially reasonable information security standards to undertake TDM, and to mount, load, or integrate TDM Outputs on Licensee’s or Authorized Users’ servers or equipment.

  2. Refrain from creating a competing commercial product or service for use by third parties.

  3. Refrain from conducting TDM in a way that unreasonably disrupts the functionality of the Licensed Materials, or substantially interferes with Licensor’s ability to provide the Licensed Materials to customers.

When you need to specify particular access methods for TDM

Here are some optional clauses from which you could choose to convey an access method. (Choose only one.)

Authorized Users shall have access to an API provided by Licensor.  API documentation can be found at [link] and Licensor shall provide reasonable customer service support for API users.

Authorized Users may download or extract information from the Licensed Content, by manual or automated means, for TDM through Licensor’s online interface [specify here, e.g. licensor.com]. [Optional, or as negotiated: Licensee will inform Licensor of TDM downloading no less than twenty-four hours in advance. / Automated downloading of Licensed Content for TDM shall not exceed a rate of [negotiated rate, e.g. one download per second].]

Upon request, Licensor shall provide copies of Licensed Content for TDM.  Licensee shall provide sufficient information to identify the Licensed Content required for TDM, and Licensor shall use commercially reasonable efforts to fill the request promptly and in a mutually agreed standard file format.

Licensee and Licensor shall in good faith mutually determine the method of TDM access on a case-by-case basis.

When you need to specify data retention

It is mutually understood that Licensed Content provided under this clause may be retained by Authorized Users throughout the lifecycle of the TDM project and as necessary for replication and validation of research results. Licensed Content retained under this clause shall remain subject to the terms of this Agreement.

Tricks and Traps:

Be mindful of narrowness vis-a-vis fair use savings clauses

We address further in the Appendix the fact that more specific language typically controls over general language in a contract in the event of a conflict between terms.[7] The upshot of this rule is that if you have both a fair use savings clause and a TDM clause, the TDM clause will be what establishes the scope of allowable TDM activities.

Here’s an example of what we mean: Say you have an agreement that provides that “nothing in this agreement shall be interpreted to restrict fair use rights.” But then you also have a TDM clause that provides, “Licensed content may be downloaded and analyzed using automated processes but results of TDM may not be shared,” or “Licensed content may be analyzed using automated processes, but the results may not be stored.” Those TDM terms are (arguably) more restrictive than what a researcher would have otherwise been able to do if only the general fair use clause were present. Under fair use, a scholar would have been able to download and analyze licensed materials and to store and share the results, but now can’t under this more restrictive TDM provision.

Because more specific terms control over the general on a given topic, it’s important to take care that the TDM rights you negotiate are at least as broad as, and if needed even broader than, what fair use would have allowed, if that is important to your users.

For a TDM clause included in the “main” agreement

If you can include a TDM clause in the main license agreement, it should provide for Authorized Users to:

  • Conduct TDM for research, scholarship, or other educational purposes. You need not define or limit what constitutes TDM, however. This leaves room for developments in TDM research methodologies, such as scholars using machine learning to train algorithms to conduct the TDM.
  • Share and make available the results of TDM (or abstractions, analysis, or derived data from the results) so long as doing so wouldn’t substitute for the licensed materials or create a competing/commercial product.

And ideally, you could achieve both of the above outcomes without the payment of additional fees to the publishers.

For TDM rights negotiated separately or as an amendment

Some publishers prefer stand-alone TDM agreements or amendments, and in doing so may endeavor to impose stringent limitations on what TDM is defined as, and what can be performed or shared as part of the TDM process.

In these situations, try to ensure that these stand-alone TDM agreements sufficiently:

  • Define TDM and TDM processes in a way that covers the full range of TDM activities sought to be undertaken by campus users, including machine learning and artificial intelligence training if relevant;
  • Avoid unduly limiting particular TDM acts otherwise protected by fair use;
  • Encompass mechanical or logistical processes that align with how researchers undertake TDM. Publishers may require that TDM be conducted using the publisher’s application programming interface (API). If so, you should confirm that the limitations of the API do not diminish its utility; and
  • Permit users to utilize and share the TDM outputs or results, again to the extent doing so wouldn’t substantially redistribute the underlying Licensed Materials, or create a competing product.

“Robots” and crawlers

Publishers sometimes try to prohibit automatic downloading of content, which could have outsized impacts on TDM. You can aim for a middle ground with language like:

Robots, spiders, crawlers or other automated downloading programs, tools, or devices to search, scrape, extract, deep link, or index the Subscribed Products may be used only to the extent reasonably necessary to conduct the TDM.

Getting support

It may help to get your institution’s faculty on board both to understand the importance of preserving TDM rights, and to get their support on record for your negotiations. An explainer like this one [8] from the University of California can help them understand that if they want to be able to conduct the research they desire using TDM (and AI), these rights must be preserved. In addition, your faculty senate or university president may wish to consider issuing a statement in support of rights preservation, which can help convey to publishers that the support of the university is behind you. You can check out this example from University of California’s Academic Senate [9], which was then affirmed by the UC President and Provost [10].

Importance and risk:

TDM methodologies (and the use of artificial intelligence in undertaking them) may not be essential to the research activities conducted at your institution. And a standard fair use savings clause should be sufficient to preserve your users’ rights to undertake TDM regardless. However, if you know that your eResources will indeed be used for TDM and/or in conjunction with AI, then it is advisable to avoid potential dispute with the publisher about whether the TDM (and AI) is permitted by addressing TDM and AI directly in a TDM clause or within a separate TDM agreement. If you don’t, you risk the publisher treating automated TDM acts as breach, and having the publisher terminate a user’s—or the entire campus’—access to the resources.

 


  1. See Authors Guild v. Google, Inc., 804 F.3d 202, 215 (2d Cir. 2015); Authors Guild, Inc. v. HathiTrust, 755 F.3d 87, 105 (2d Cir. 2014); A.V. ex rel. Vanderhye v. iParadigms, LLC, 562 F.3d 630 (4th Cir. 2009).
  2. In evaluating the proposed DMCA § 1201 exemption to circumvent technological protection measures on DVDs and eBooks for the purpose of conducting TDM, the USCO writes: “Balancing the four fair use factors, with the limitations discussed, the Register concludes that the proposed use is likely to be a fair use.” See U.S. Copyright Office. Section 1201 Rulemaking: Eighth Triennial Proceeding to Determine Exemptions to the Prohibition on Circumvention – Recommendation of the Register of Copyrights – October 2021. https://cdn.loc.gov/copyright/1201/2021/2021_Section_1201_Registers_Recommendation.pdf
  3. The findings in these matters also reinforce that copying done as part of a process to produce non-expressive or non-infringing content (such as patterns or data) is not an infringement in part because copyright protection does not extend to facts or ideas. See Google LLC v. Oracle Am., Inc., 141 S. Ct. 1183, 1187 (2021).
  4. For the Copyright Office study, see https://www.federalregister.gov/documents/2023/08/30/2023-18624/artificial-intelligence-and-copyright. For a summary of court cases, see https://chatgptiseatingtheworld.com/
  5. https://www.regulations.gov/comment/COLC-2023-0006-8194
  6. Although ideally it would be. See, e.g. Hussey, I. (2023, May 8). Data is not available upon request. https://doi.org/10.31234/osf.io/jbu9r.
  7. 17A C.J.S. Contracts § 433; 11 Williston on Contracts § 32:10 (4th ed.)
  8. https://osc.universityofcalifornia.edu/2024/03/fair-use-tdm-ai-restrictive-agreements/
  9. https://senate.universityofcalifornia.edu/_files/reports/js-ac-statement-on-ai-and-licensing.pdf
  10. https://ucnet.universityofcalifornia.edu/employee-news/president-drake-and-provost-newman-affirm-the-universitys-commitment-to-protect-author-researcher-and-reader-rights/

License

E-Resource Licensing Explained Copyright © 2024 by Sandra Enimil, Rachael Samberg, Samantha Teremi, Katie Zimmerman, Erik Limpitlaw is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.