10 Reflections

Rachael Samberg and Timothy Vollmer

In this chapter, we cover the pedagogical takeaways from the four-day Institute (held June 2020), and reflect on the lasting impact it made eight months later, as evidenced by our observations from the plenary post-Institute check-in.

Design thinking is effective for teaching LLTDM

Participants felt empowered after the Institute to understand the basic contours of the legal literacies for text data mining and to apply them to their own work, whether that be developing their own TDM projects, advising DH researchers, or working with TDM issues in libraries and archives. The participants’ own words say it best:

  • “I can say with confidence that I understand the four literacies better”
  • “I really feel that I am coming out with much more both theoretical and practical knowledge than I expected.”
  • “I will be much more intentional at the outset of any TDM project about working through all of the pertinent literacies in a systematic way…the way the Institute was structured into different literacies provides a repeatable framework to treat potential problems prospectively.”
  • “I am taking home a lot of new insights from this Institute in combination with a feeling of empowerment that will allow me to reach out to the specialists and directors at my institutions in order to push for more TDM collaboration and a bolder approach concerning materials and datasets for international cooperation. I know now what the important legal issues are and how to use them to form my arguments and that is more than I could have wished for. Also, the Institute broadened my perspective with regards to issues that I did not have on the radar that much at the beginning and I am looking forward to engaging with these topics in the future, to integrate them into my teaching, and to advocate for them where I can.”

The pivot from our initial plan to host an in-person Institute to a virtual one was met with applause. In particular, the participants valued the interactive format fostered by the design thinking model, with different touch points and small group discussions. Again, in their own words:

  • “The deliberately thought through breakdown and mix fostered incredibly valuable discussions and I would hope this kind of framework is used as a best practice for future DH institutes of all kinds going forward. Also, thank you for such an amazing virtual experience which I can only imagine took a tremendous amount of work to coordinate and plan with limited time to shift to an entirely different format–I was overjoyed to critically engage with complex subjects and for the chance to get out of my everyday pandemic routines.”
  • “I found this to be the best example of how to manage hands-on learning in a virtual environment. I think the planning team did a FANTASTIC JOB pivoting to a fully online environment without losing the feel of an in-person intensive.”
  • “The multi-modal communication (Slack, Mural, Zoom) enabled far more interaction than I anticipated.”
  • “This is by far the best organized event that I have ever attended. The content was by far the most substantive. The faculty were by far the most engaged. A+ across the board.”
  • “The flipped learning approach, combined with design learning elements, really worked well. The lecture/video materials and reading in particular were well presented and selected, and I really appreciated that we could do that at our own pace. The overall topic of this gathering was well chosen in that it could allow for us to do focused seeking of answers to questions but in a way that had real practical consequences for how we could change the world of TDM research.”

We are hopeful that the literacies and methodology developed and shared by the Institute will find a place in broader DH curricula and empower DH researchers to build and analyze their text corpora without fear, thanks to their being more secure in their knowledge of the law.

Lessons for the instructors

The conversations during the Institute and the participant feedback gave us much food for thought. We’d like to expand our commitment to diversity and ensure that the demographics of both faculty and participants reflect those of the broader population, and that the kinds of questions and examples that animate our discussions engage with issues of ethics, equity, and representation.

As we repurpose the Institute training and materials in the future, we will also consider additional ways to emphasize and create discussions around ethics, and perhaps foreground ethics as the first step when thinking through DH projects. We believe questions of ethics loomed large not only because of the critical importance of ethics when addressing data at scale, but also because of the relative absence of guidelines and best practices to help guide us in this area.

We also learned a few specific things that may shape how we approach immersive LLTDM trainings in the future:

Copyright isn’t a sticking point (or even that intimidating!)

Questions about using material still under copyright were at the forefront of participants’ minds when they entered the Institute, but those concerns evaporated quickly. The copyright portion of the curriculum addressed copyright and the fair use exception extensively. Among other cases, we discussed the Google Books case, which established that running algorithmic analyses on text was transformative and that using the entirety of the books in Google’s corpus was necessary. (One of the authors of a widely cited amicus brief in the Authors Guild v. Google and Authors Guild v. HathiTrust cases was a member of our faculty.) We also discussed risk and risk tolerance. To the surprise of many, copyright issues turned out to be relatively straightforward, and participants felt empowered to perform analyses on copyrighted materials. One participant said, “I also feel compelled now to do my own research and take advantage of the expansive idea of fair use to examine contemporary, creative works,” and another “was mainly relieved that my TDM project was transformative enough to not violate copyright.” The real sticking point was instead how to educate our communities about the possibilities that fair use might allow.

Building a corpus is tough!

Our pre-Institute research and experience indicated that researchers may choose frictionless materials for their corpora, such as materials already in the public domain, or, if they use materials under copyright, they may be unwilling to reveal the process by which they acquired those materials. The former limits the kinds of questions that can be asked, makes certain time periods easier to study, and may result in bias. The latter makes reproducibility difficult.

The experiences of the participants in the Institute indeed confirmed these challenges. Participants shared their frustrations with finding content and their discomfort with using materials that were under copyright or licensing restrictions. Such challenges limited their work and constituted a major roadblock to their research, one that sometimes exceeded even the technical difficulties of doing the analysis itself. Participants weren’t always comfortable sharing how they acquired those materials.

Weave literacies into projects

Another lesson that came up repeatedly was that we should build a legal literacies workflow into DH project planning from the very beginning and refer to it throughout the project lifecycle. Too often, copyright and other legal considerations go unchallenged or are brushed aside, to the detriment of our work. This is partly owing to a lack of expertise in these areas or to fear of reprisal. Institute participants suggested ways of addressing these considerations, from trainings, to online documentation, to building legal questions into the project management process for DH work. One participant said, “In our library’s center for digital scholarship, we need to develop a better charter/MOU/agreement system for digital projects that will at least touch on data management (DMPs), legal implications (copyright, etc), collaborator expectations, and ethics.”

International issues need future institutes

Although we had initially intended to focus mainly on US law, in the end we realized that international issues are unavoidable given the broad range of humanities research our cohort represented: scholars are either working with materials published under different legal frameworks or collaborating with others working in those environments. This complicates the legal picture, so rather than offering clear answers to every question (many of which simply don’t have clear answers), we offered strategies for assessing and mitigating risk. At the same time, we did offer a high-level view of copyright regimes around the world, which participants seemed to appreciate. Cross-border research collaborations emerged as a clear example of an area where we believe follow-on training is necessary.

TDM-friendly licenses

Sometimes licenses with publishers, vendors, museums, and other content providers can further restrict uses that would otherwise be allowed under copyright law. While licensing restrictions can be frustrating when terms stand in the way of assembling corpora and running analyses on them, participants learned what a TDM-friendly license might look like, such as one with terms that specifically allow for TDM uses or one that contains a fair use clause. The California Digital Library’s model license was shared as an example. Licensing emerged as an area where participants could intervene directly through education, advocacy, and negotiation.

Ethics front and center

Ethics emerged as a major focus of concern for participants in the Institute. Indeed, we quickly realized that although we discussed ethics last, it was difficult to even begin thinking about copyright, licensing, and other legal issues before ethical considerations were addressed, especially given the Institute’s care for questions of social justice. A preferred workflow that emerged for the Institute participants might foreground ethical concerns before moving on to other literacies.

While participants entered the Institute focused on questions of copyright, many reported leaving with their copyright questions solved and their ethical questions awakened. As one participant wrote, the Institute “erased my anxieties in target areas and introduced whole new considerations in areas like ethics. It answered my questions and left me thinking.”

Unlike the other literacies, ethics must often be navigated without reliance on the law or clear guidelines. Even IRB guidelines may not always help, particularly as many TDM projects do not have “subjects” in the way that traditional surveys and studies do. Instead, researchers may need to turn to community expectations, other specialists, or disciplinary principles. Sometimes, there may not be any guidance at all, and few solid models for ethics in TDM research are available. In many cases, it will be up to the researchers to determine their own best practices for considering ethics.

One model that resonated with the group was an Ethics of Care approach, which takes into account the relationships between research participants and acknowledges structures of power. Ethics of Care offers an alternative to an individualist consent-based ethical model. In TDM contexts, consent may not always be available or scalable, or the kinds of implied consent (for example, individuals publishing posts to Twitter) may not satisfy the ethical standards of researchers.

Overall, the participants left energized to continue this conversation and contribute to developing ethics models that might guide TDM researchers in the future.

Impact, eight months on

We analyzed participant update videos and observed not only the lasting impact of the LLTDM literacies, but also a persistent sense of community (or at a minimum, shared experience).

Confidence abounds

One of the themes that arose back in June was the pervasive feeling of imposter syndrome among participants. It seems to permeate this work, perhaps because, as one participant so rightly observed, no one person can possibly be a deep expert across the entire landscape of issues in text data mining, from corpus building and computation to legal and ethical issues and all of the many technical, intellectual, and labor issues that underpin the work. But no one mentioned feeling like an imposter in their update videos. Instead, we heard about how much more confident they felt integrating the literacies into their work. That confidence has taken many forms, from licensing negotiations to establishing best practices in their labs. The biggest struggle has shifted from not knowing what to do to finding the time to do it.

Ethics of care

Our closing reflections from the Institute in June included strong advocacy for taking an ethics-first approach to teaching the literacies and implementing text data mining projects. It was heartening to see the many ways that participants are living these values by structuring ethics as a key component of their work:

  • One scholar added a dedicated ethics section to a paper she submitted that involved the use of YouTube data.
  • Another centered ethics in her application of the literacies to a racial reckoning project at her home institution.
  • A librarian has adjusted consultations with researchers to take an ethics-first approach.
  • A faculty member has shifted toward an ethics of care framework in working with students in the classroom and in his research lab.
  • Several participants developed workshops and related materials that focus on ethical considerations when doing this work.

They also turned an eye toward institutional gaps where ethics are concerned. One update reflected on the lack of oversight of privacy and ethical issues in TDM research and the need for structures and education that will help with that intervention within our institutions.


Across our institutions, expertise is both shared and distributed. It would be exceedingly rare to find any one person or even any one office prepared to address the technical, legal, ethical, and logistical nuances of text data mining. Several participants mentioned that it’s difficult to build community, due in large part to the nature of the work. And living and working through a global pandemic certainly hasn’t made that any easier!

Some participants nevertheless made real gains in community building, and we’d like to celebrate that. One participant described how they initiated conversations across their institution about text data mining to start thinking at an organizational level, and they also noted that they had formed relationships with the sponsored research office and with the faculty working group on data science. Another participant has taken up the idea of the Data Ombudsperson and is working to introduce it to the scholarly communication group at their library. Yet another participant has established a new research cluster on Critical Practice in Text Data Mining under the auspices of their humanities research center. These kinds of connections hold the potential to make real forward progress within institutions that are notoriously complex.

Institutional risk aversion

One participant described institutional conservatism and risk aversion as their ongoing struggle. Another had hoped to push their institution to be bolder and braver, but found that it wasn’t as easy as they had hoped. Seeding institutional change is long-term work, and it begins with small acts of relationship building. It’s important to celebrate these gains while striving for much bigger shifts in practice and perception.


One of the most striking things we noticed while watching the update videos was participants’ clever use of forms and documentation as tools to help kick-start conversations that can ultimately shape practice. One participant described developing an MOU template for use in the digital scholarship lab that includes a section on the legal and ethical implications of the work. The template helps foreground these issues during the negotiation and ensures that they are addressed in the final agreement. In a similar vein, another participant has been developing a rubric for designing new digital projects that incorporates the literacies and is grounded in the insight that it is best to begin by planning for the end. This presumably helps front-load conversations not just about data collection and corpus building but also about representation and distribution for publication and long-term preservation. To socialize these practices with graduate students, another participant has started requiring a data management plan for student research projects conducted as part of his research lab, to ensure everyone in the lab is thinking deeply about ethics in data collection, dehydration, and eventual destruction for social media research. This approach generates deep and thoughtful conversations while also making them expected and routine.


Several participants have been working to break up their institution’s licensing routines with various approaches to addressing TDM, or not addressing it explicitly. One participant has been looking at the possibility of regularly including TDM language in institutional licenses, which is in keeping with the approach taken in the California Digital Library’s model license agreement. Another participant started working on licensing terms and setting up contracts with vendors at their institution; they ultimately preferred the use of a “Fair Use Escape Clause” rather than outlining specific terms for TDM. They discovered that, when the terms tried to be explicit, the versions vendors found acceptable were too confining.

Another piece of the licensing puzzle is making the negotiated terms legible to researchers. One participant has been taking that on with a database evaluation matrix that outlines who is eligible to use each resource, how the data may be used, and what content is available. Even when full licenses aren’t readily shared with the campus community, this kind of matrix can do a lot of work to help users assess their options when working with content licensed through the libraries.


Another way participants have been working with their local communities is by integrating the literacies into their workshops and courses. One participant conducted an hour-and-a-half workshop and has already shared her materials online for those seeking models for their own efforts on campus. Two other participants collaborated on a workshop foregrounding privacy and ethics in DH projects, which is also available online. And yet another participant has put together a suite of relevant workshops associated with their research cluster.

One challenging thing that came up in an exchange between a participant and a faculty member was the fact that teaching copyright can lead to a lot of fear, uncertainty, and doubt, even when the intention is to empower people to understand their rights. It would be helpful to discuss potential strategies for mitigating that effect as part of our ongoing conversations.

Corpus building

An area where teaching and research appear to intersect is corpus building, and several participants have been applying the lessons from the Institute to their own corpora. One participant has amassed 18,000 YA novels as part of a comparison dataset for use with a digital scholarship project and has also been working to create a standard corpus for each language program in their department so that graduate students have uniform access to a shared dataset right from the beginning of their studies. Another participant has been looking to expand their use of text datasets in their own teaching and has expressed interest in building out a “Law in Literature” text dataset to that end. A third participant has been working on a corpus-building workaround that focuses on helping users run queries that return URLs, which can then be downloaded to personal machines. This strategy allows an institution to facilitate TDM while pushing the legal burden to the end user.



To the extent possible under law, Rachael Samberg and Timothy Vollmer have waived all copyright and related or neighboring rights to Building Legal Literacies for Text Data Mining, except where otherwise noted.
