Sometimes the works you would like to analyze using text data mining tools are already available in a high-quality digital form. You may be able to get what you need from e-books acquired from Amazon or journal articles downloaded from a publisher’s website, or you might simply be able to scrape user-generated content from a social media site. Or, even better, perhaps someone else has already done this work and is happy to share their source materials.
These modes of acquisition all sound very promising, but they raise a new set of questions that take us beyond the parameters of traditional copyright law. Some of the actions I have just described might involve circumventing technological protection measures or possibly illegally gaining unauthorized access to someone else’s computer.
In the following sections, we are going to take a look at the issues raised by the anti-circumvention provisions of a law known as the Digital Millennium Copyright Act and the application of the Computer Fraud and Abuse Act and similar “anti-hacking” laws.
The problem of digital locks
What are technological protection measures and digital rights management?
Works in digital form may be protected by technological protection measures (often just called “TPMs”) that control access to copyrighted works. These technological protection measures are also referred to as digital rights management (or “DRM”). We will use the terms TPM and DRM interchangeably here, but the simplest way to think about them is as digital locks. Like physical locks, digital locks can be used to control access to a thing or to limit what can be done with it.
Such digital locks are a potential problem for text data mining initiatives because often the cleanest and simplest way to build a corpus is to get access to authorized copies of the original works in digital form.
In the world of books, for example, cracking the encryption on an e-book sold by Amazon would give the researcher access to a much cleaner copy than could be achieved through OCR (optical character recognition). This mode of acquisition is also preferable in some cases because it overcomes coverage limitations in existing repositories. For those of you working with large volumes of audiovisual material, defeating encryption may be the only way to get content into a text mining database without taking decades.
Breaking digital locks is generally illegal in the United States
However, in spite of its attractions, building a research corpus by breaking DRM has at least one very significant disadvantage: in the United States at least, it’s illegal.
(a) The anti-circumvention provisions of the DMCA
In 1998 Congress added some special provisions to the Copyright Act which made breaking digital locks that protect copyrighted works a civil, and potentially also a criminal, offense.
These “anti-circumvention rules” apply separately from, and independently of, any underlying copyright infringement. In the Digital Millennium Copyright Act, Congress added section 1201 to the Copyright Act. Section 1201 prohibits the circumvention of technological measures that restrict access to, or copying of, copyrighted works. It also prohibits the creation or distribution of tools that facilitate circumvention. The various parts of Section 1201 are generally referred to together as the “anti-circumvention” provisions of the DMCA. The DMCA creates civil remedies and criminal sanctions for violations of the anti-circumvention provisions.
(b) There is no fair use exemption under the anti-circumvention laws in the United States
The hardest thing to accept about the anti-circumvention provisions of the DMCA is that they make breaking digital locks illegal, even when the copying/access that this allows would be covered by the fair use doctrine.
Arguably, this shouldn’t be the case, but to date, courts in the United States have not been convinced. Thus, although the anti-circumvention provisions of the DMCA were not intended to limit or restrict fair use, courts have not treated fair use as a defense to the anti-circumvention provisions either.
This means that although copying e-books for the purpose of text data mining research would be protected by the fair use doctrine, breaking the DRM on those e-books to make that copying possible would still be unlawful.
(c) Possible future exemptions to the DMCA
The DMCA contains exceptions for reverse engineering and encryption research, but there are no similar provisions for text mining. This may change. The Copyright Act authorizes an administrative procedure whereby the Librarian of Congress may grant temporary, three-year exemptions to the DMCA anti-circumvention rules.
At the time of recording, a group based in the Samuelson Law, Technology & Public Policy Clinic at UC Berkeley is pursuing this, but they have a lot of work to do. To make the case for a text mining exemption, they will have to show that the underlying use is non-infringing and that the absence of an exemption adversely affects users or is likely to do so in the near future.
(d) Text data mining by research organizations and cultural heritage institutions appears to be exempt from anti-circumvention rules in the European Union
In April 2019, the European Union adopted the Digital Single Market Directive (“DSM Directive”) featuring two mandatory exceptions for text and data mining. EU members have until June 7, 2021 to implement the directive in national legislation and our current assessment of the impact of the EU directive may change once we see exactly how that implementation proceeds.
It appears that the mandatory exception for text data mining by “research organisations and cultural heritage institutions” under Article 3 of the DSM Directive preempts otherwise applicable anti-circumvention laws, and also overrides contract or license terms that would otherwise restrict the ability to circumvent digital locks.
Individuals and organizations relying on the narrower exemption under Article 4 — i.e., anyone who is not a “non-profit educational institution or cultural heritage institution” — remain subject to European anti-circumvention laws and do not get the benefit of contractual override.
But note, we have yet to see how the members of the EU plan to implement the DSM Directive, so the analysis above is preliminary.
Researchers in the United States need to make their own assessment as to whether the risks of potential civil and criminal penalties under the DMCA for violating the anti-circumvention rules are worth the rewards. We are aware that this practice is relatively common and that in many contexts the chances of enforcement action being taken are fairly low, but we are not in a position to recommend it.
Dealing with liberated works
Is DRM an issue for those who receive unlawfully “liberated” copies of works that were once protected by DRM?
(a) Lawful access in the United States
Many TDM researchers face the issue of whether they should take advantage of access to copyrighted works that have been initially copied illegally, or have had their digital locks broken in violation of the applicable rules under the DMCA.
There is no United States case law directly on point and none of the precedents confirming the fair use status of reproduction for the purpose of TDM suggests that lawful access is a precondition to fair use. Consequently, we can only address this difficult question by reasoning from first principles.
The overwhelming weight of authority rejects any notion that lawful access is an absolute per se precondition to fair use, and the more persuasive view is that the question of whether the work was subject to prior unlawful acts by third parties is irrelevant to the fair use analysis.
Furthermore, although there is mixed authority on the question, it is doubtful that the defendant’s own morality and propriety should influence the question of fair use. The fair use doctrine does not come down to questions of individual moral or artistic virtue; it defines the outer boundary of copyright protection. Case law suggesting that fair use is premised on “good faith” erroneously conflates the fair use doctrine with the rules developed by English courts of equity. The fair use doctrine began as a matter of statutory interpretation, not as an equitable doctrine. Thus, although it is not beyond argument, the better view is that “a user’s good faith is irrelevant to the fair use analysis.” Moreover, even if good faith is relevant in some circumstances, we believe that (1) it would be simplistic to equate good faith with access to a legally made copy and (2) good faith likely has no real significance in the face of an otherwise compelling fair use argument.
However, caution is warranted. It is entirely plausible that US courts will be influenced by the prevalence of a lawful access precondition to the right to engage in TDM research in other jurisdictions (see below) and adopt that requirement here.
(b) Lawful access to the work in the EU and elsewhere
Article 3 of the DSM Directive is limited to “reproductions and extractions … of works or other subject matter to which they have lawful access;” Article 4 is likewise limited to “reproductions and extractions of lawfully accessible works and other subject matter.” There is some ambiguity about the scope of this requirement, but it seems likely that otherwise lawful text data mining would be rendered unlawful in Europe if the source material was copied illegally. At this point, we can only speculate as to how the lawful access requirement is meant to interact with the provision in Article 3 that appears to preempt otherwise applicable anti-circumvention laws.
It’s also worth noting that several other jurisdictions have adopted a similar “lawful access” requirement.
(c) Risk assessment and mitigation
This is an area where the applicable law may change and where specific factual permutations may be highly relevant. We recommend seeking advice before relying on these materials to design a research program. With those caveats in place, we believe that the better view, on balance, is that the fact that a third party illegally copied a work, or illegally circumvented a technological protection measure relating to the work, should not alter the fair use analysis. We also believe, for similar reasons, that even if a researcher unlawfully bypassed DRM herself, that should not affect the fair use analysis. However, because there is no US authority directly on point, we can only express a moderate level of confidence about these conclusions, and we should note that US courts may be influenced by the law in other jurisdictions that goes in the opposite direction.
In terms of a hierarchy of risk, we think that the risk of merely obtaining works from a third party who broke DRM is low; and that the risk of obtaining works from a third party who obtained them in violation of the copyright owner’s exclusive rights is moderate. In contrast, the risk of breaking DRM oneself (or encouraging someone to do it for you) is moderate in terms of how it might affect the fair use analysis, but relatively high in terms of potential liability under the anti-circumvention provisions of the DMCA.
In the European Union (and other countries outside the U.S.) the hierarchy of risk may be different. Researchers who benefit from Article 3 may be able to break DRM without liability (depending on how the DSM Directive is implemented), but it’s unclear whether obtaining a work from a third party who had broken DRM would be regarded as violating the “lawful access requirement.” Researchers in Europe would also be prohibited from conducting text mining research using source material that had been copied illegally.
Liability under anti-hacking laws for violating terms of service
Text data mining researchers often want to analyze texts and other primary materials that are available online in one sense, but are not necessarily “available” to them, or at least, not for the purpose of text data mining. We might be talking about journal articles hosted by commercial publishers, or social media content hosted by Facebook, classified ads hosted by Craigslist, or even company press releases on a corporate website.
This content may be hidden behind a paywall, not shared at all, or access may be subject to terms and conditions that do not permit text data mining. The basic contract law issues in this scenario are dealt with elsewhere, but in addition to those issues, researchers also need some familiarity with anti-hacking laws such as the Computer Fraud and Abuse Act (“CFAA”).
These laws make it illegal to “access” someone else’s computer system without authorization. I’m sure that we can all imagine some scenarios where access is clearly authorized, or clearly unauthorized, but there is a substantial gray area in between that we need to address.
Websites protected by a password, a paywall or similar devices
The CFAA is a pre-Internet law aimed at preventing computer hacking. It has been around for a while, but there is still some ambiguity about the scope of conduct it prohibits. As the Supreme Court has explained, the statute “provides two ways of committing the crime of improperly accessing a protected computer: (1) obtaining access without authorization; and (2) obtaining access with authorization but then using that access improperly.”
Let’s start with something simple: accessing a password-protected computer system without authorization, or when authorization has been specifically revoked, violates the CFAA.
Working around authentication controls or permission requirements (such as usernames and passwords), using stolen usernames and passwords, or somehow defeating payment requirements, are all examples of conduct that would violate the CFAA in most circumstances.
Such conduct should be strictly avoided.
There is a distinction between violating terms and conditions and computer hacking
Most courts recognize that there is a critical distinction between violating the terms and conditions of access and accessing a computer system without authorization.
Whether merely violating conditions of access to a computer system that is not open to the public triggers CFAA liability is a matter of contention. The better view, adopted in the Ninth Circuit and the Fourth Circuit, is that it does not. However, the First, Seventh and Eleventh Circuits take a broader view of what it means to “exceed authorized access” under the CFAA.
The difference largely comes down to whether the court sees the CFAA as an anti-intrusion statute, or embraces a more expansive contract-based interpretation of the CFAA’s “without authorization” provisions.
The emerging consensus appears to favor interpreting the CFAA as an anti-intrusion statute. This interpretation is particularly favored in cases where the computer system is available to the public at large without registration or password protection.
In the recent case of hiQ Labs, Inc. v. LinkedIn Corp., the Ninth Circuit Court of Appeals held that accessing a computer system that is available to the public at large does not trigger liability under the CFAA, even if permission to access has been specifically revoked. The Ninth Circuit reasoned that “the CFAA is best understood as an anti-intrusion statute and not as a misappropriation statute,” and thus obtaining information by scraping that was “available to anyone with a web browser” fell outside the scope of the CFAA.
In early 2020 the District Court for the District of Columbia addressed potential liability under the CFAA in a research context. The court in Sandvig v. Barr held that accessing online hiring websites for the purpose of conducting academic research would not violate the access provisions of the CFAA, even though such access would clearly violate the websites’ terms of service.
The researchers were conducting audit testing on employment websites by submitting fake resumes in order to determine whether the algorithms used by the websites were racially biased. This deception clearly violated the applicable terms of service. Nonetheless, the court concluded that “the CFAA does not criminalize mere terms-of-service violations on consumer websites and, thus, that plaintiffs’ proposed research plans are not criminal under the CFAA.”
At the time of recording, the US Supreme Court had agreed to hear a case addressing these issues, but the hearing date had not yet been set.
To avoid civil and criminal liability under the CFAA, researchers should not defeat access controls to non-public computer systems.
Researchers in the First, Seventh and Eleventh Circuits (i.e., Maine, Massachusetts, New Hampshire, Puerto Rico, Rhode Island, Illinois, Indiana, Wisconsin, Alabama, Florida, and Georgia) should also refrain from violating the terms of service that govern access to non-public computer systems to avoid liability under the CFAA. Researchers in those jurisdictions planning to violate the terms of service for access to computer systems open to the public are in a slightly better position, but they still face considerable risk.
Outside the First, Seventh and Eleventh Circuits, we believe that the view that the CFAA is an anti-intrusion statute should hold sway, and that mere violations of terms of service will not trigger liability under the CFAA. Of course, the Supreme Court may hold otherwise and we will be watching the case of Van Buren v. United States with great interest.
At the time of recording, this is clearly the law in the Ninth Circuit (Alaska, Arizona, California, Hawaii, Idaho, Montana, Nevada, Oregon, Washington), the Fourth Circuit (Maryland, North Carolina, South Carolina, Virginia, West Virginia), and the District of Columbia.
- The DMCA contains three provisions targeted at the circumvention of technological protections. The first is subsection 1201(a)(1)(A), the anticircumvention provision. The second and third provisions are subsections 1201(a)(2) and 1201(b)(1), the anti-trafficking provisions. Subsection 1201(a)(1) differs from both of these anti-trafficking subsections in that it targets the use of a circumvention technology, not the trafficking in such a technology. The anti-trafficking provisions are targeted to both access and copy control, but it is important to note that the DMCA does not contain a ban on the act of circumventing copy controls themselves. The DMCA makes it unlawful to circumvent a TPM that “effectively controls access” to a copyrighted work. 17 U.S.C. § 1201(a)(1)(A) (2012). The law does not prohibit circumventing a TPM that controls specific uses of a work without denying access altogether. However, it is unlawful to distribute any tool or device that would be primarily used for either of these purposes, i.e., circumventing access or use TPMs. Id. § 1201(a)(2), (b)(1).
- See §1203 (civil), §1204 (criminal). It also authorizes a court to grant temporary and permanent injunctions on such terms as it deems reasonable to prevent or restrain a violation of the anti-circumvention provisions. See §1203(b)(1) (injunctions).
- Id. § 1201(c)(1) (“Nothing in this section shall affect rights, remedies, limitations, or defenses to copyright infringement, including fair use, under this title.”). See Universal City Studios v. Reimerdes, 273 F.3d 429 (2d Cir. 2001); MDY Indus., LLC v. Blizzard Entm’t, Inc., 629 F.3d 928 (9th Cir. 2010). The Federal Circuit requires that the act of circumvention has some potential nexus to copyright infringement, but does not go so far as to make fair use a defense to the anti-circumvention rules. See Chamberlain Grp., Inc. v. Skylink Techs., Inc., 381 F.3d 1178, 1203 (Fed. Cir. 2004); Storage Tech. Corp. v. Custom Hardware Eng’g & Consulting, Inc., 421 F.3d 1307 (Fed. Cir. 2005).
- I am assuming here the e-book DRM “effectively controls access” to a copyrighted work.
- 17 U.S.C. § 1201(f) (2012) (discussing reverse engineering); id. § 1201(g) (discussing encryption research).
- U.S. Copyright Office, Section 1201 of Title 17 114-15 (2017), https://www.copyright.gov/policy/1201/section-1201-full-report.pdf. While temporary exemptions must be renewed every three years, the Copyright Office has instituted streamlined procedures to allow for the renewal of previously granted exemptions on the existing evidentiary record. Id. at 143-46.
- Article 3(3) provides that "Rightholders shall be allowed to apply measures to ensure the security and integrity of the networks and databases where the works or other subject matter are hosted. Such measures shall not go beyond what is necessary to achieve that objective."
- The only court to hold to the contrary is the Federal Circuit in Atari Games Corp. v. Nintendo of America Inc., 975 F.2d 832, 843 (Fed. Cir. 1992). However, that court’s reasoning is inconsistent with later Supreme Court precedent and has been expressly rejected by subsequent courts. See NXIVM Corp. v. Ross Institute, 364 F.3d 471 (2d Cir. 2004).
- The Supreme Court has recently reiterated its “skepticism about whether bad faith has any role in a fair use analysis.” See Google LLC v. Oracle America, Inc., 141 S. Ct. 1183 (2021).
- Predicting Fair Use, 73 OHIO STATE LAW JOURNAL 47–91 (2012).
- Michael Carroll, Copyright and the Progress of Science: Why Text and Data Mining Is Lawful, 53 UC Davis L. Rev. 893, 898 (2019). See also, Mark A. Lemley, The Fruit of the Poisonous Tree in IP Law, 103 Iowa L. Rev. 245, 248 (2017).
- Singapore, for one.
- 18 U.S.C. § 1030 (2012).
- Musacchio v. United States, 136 S. Ct. 709, 713 (2016).
- Facebook, Inc. v. Power Ventures, Inc., 844 F.3d 1058, 1067 (9th Cir. 2016).
- See Brown Jordan Int’l, Inc. v. Carmicle, 846 F.3d 1167, 1174-75 (11th Cir. 2017); United States v. John, 597 F.3d 263, 272 (5th Cir. 2010); Int’l Airport Ctrs., L.L.C. v. Citrin, 440 F.3d 418, 420-21 (7th Cir. 2006); EF Cultural Travel BV v. Explorica, Inc., 274 F.3d 577, 583-84 (1st Cir. 2001). The Eleventh Circuit has acknowledged criticism of its decision in Rodriguez in a way that clearly invites Supreme Court review, but continues to adhere to it nevertheless. See EarthCam, Inc. v. OxBlue Corp., No. 15-11893, 2017 WL 3188453, at *9 n.2 (11th Cir. July 27, 2017).
- hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985, 1001 (9th Cir. 2019) (concluding for the purpose of a preliminary injunction that hiQ Labs had “raised a serious question as to whether the reference to access ‘without authorization’ limits the scope of the statutory coverage to computer information for which authorization or access permission, such as password authentication, is generally required.”)
- Id.
- The Question Presented in Van Buren v. United States is “Whether a person who is authorized to access information on a computer for certain purposes violates Section 1030(a)(2) of the Computer Fraud and Abuse Act if he accesses the same information for an improper purpose.” The case was argued in December 2020 and had not been decided as of May 14, 2021. For a review of the argument, see https://www.scotusblog.com/2020/12/argument-analysis-justices-seem-wary-of-breadth-of-federal-computer-fraud-statute/