2 International and cross-border copyright

Sean Flynn and Matthew Sag


Suppose that you are managing a collection of 1970s environmental catastrophe themed fiction and making it available for text data mining research in the United States. Here are some basic questions to think about:

  • Should you allow foreign researchers to query the corpus?
  • Should you accept new additions to the collection from an overseas library?
  • Are you in a position to send a copy of the corpus to overseas researchers?
  • Does it matter if those researchers are housed in a university, a corporate sponsored think tank, or a for-profit corporation?

These questions illustrate some of the issues raised by text data mining research in an international or cross-border environment.

In the materials that follow, we are going to introduce some of the conceptual building blocks that you will need to be able to understand and address these kinds of issues. Our aim isn’t to make you experts in comparative and international copyright law, but we hope to give you enough information so that you can identify potential areas of concern and understand how to structure cross-border collaboration in TDM research without taking on unnecessary risks.

The relation between domestic and international copyright law

The first step in appreciating the kinds of international and cross-border copyright law issues that might be relevant to text data mining research is understanding the relationship between domestic and international copyright law.

Copyright law is harmonized across the globe by virtue of various international agreements. The most relevant international copyright treaties are the Berne Convention and the World Trade Organization Agreement on Trade Related Aspects of Intellectual Property Rights (or the TRIPs Agreement, for short). These agreements establish minimum standards for copyright protection that more or less every country in the world has agreed to adopt as part of its domestic copyright law.

There is a lot of agreement about many aspects of copyright law around the world, but that agreement is often at a high level of generality. Digging a bit deeper, we find meaningful diversity in how countries choose to implement their international copyright obligations.

As a result, particularly in relation to the issues surrounding text data mining research, copyright law can vary significantly from one country to the next.

So, although international agreements provide important background principles, the law we generally need to focus on is the domestic copyright law of individual countries.

That sounds simple enough, but we have to complicate this story slightly with respect to the European Union. Copyright law in the EU is harmonized by a series of EU directives. These directives must be implemented in the national law of the various member states, but in many cases the EU directives also have direct effect. This feature of European law explains why in some cases you will hear us talk about European copyright law as though it was a single consistent body of law—sometimes this is just a helpful generalization—and yet in other cases we focus in more detail on the laws of individual countries.

Copyright protection and limitations and exceptions for TDM research

Here we want to go over the basic steps of analysis to determine whether you have a copyright issue in an international text and data mining research project. Assume for the moment that you are trying to decide whether you can locate a particular research activity in another country in which you have a research partner.

I assume here that you might want to undertake the following activities in a TDM project:

  • Reproducing whole works to create a database or corpus;
  • Sharing a database with other researchers (either in the country or across borders);
  • Finding and reporting facts through use of the database;
  • Quoting the materials mined for validation and illustration.

One or all of these activities might take place in another country or between researchers in other countries. This section will focus on what kind of laws you can expect to find in different countries.

Exercise: Keep track of what you learn in your own copy of the TDM Activities Worksheet. To use the worksheet, make a copy of it and then add your information directly into your copy.

Scope of protection

Our goal here is to give you information about which aspects of copyright law are near universal and what the main variations are, so you can do what we law profs call issue spotting: that is, spot where there is, or is not, likely to be a real legal issue that you might need to dig into more deeply. To answer a specific question with regard to a specific country, you may need to look more closely at the individual context.

As we covered with respect to US law, there are two basic stages to any copyright analysis. First, you look to whether the work and intended activity are within the scope of copyright protection. Second, if the work and activity fall within the scope of protection, then you look to whether a limitation or exception to the exclusive rights nonetheless permits the activity.
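The two stages can be sketched as a small decision function. This is only an illustration: the function name and its three yes/no inputs are hypothetical, and each input stands in for a legal judgment that in practice is rarely a simple yes or no.

```python
def needs_permission(work_protected: bool,
                     activity_restricted: bool,
                     exception_applies: bool) -> bool:
    """Return True when the planned use would still need permission.

    All three inputs are simplifying assumptions made for this sketch.
    """
    # Stage 1: are the work and the activity within the scope of protection?
    within_scope = work_protected and activity_restricted
    if not within_scope:
        return False
    # Stage 2: does a limitation or exception nonetheless permit the activity?
    return not exception_applies


# A protected work, copied in full, with no applicable exception.
print(needs_permission(True, True, False))   # True
# The same use where a research exception applies.
print(needs_permission(True, True, True))    # False
# A public domain work: stage 2 is never reached.
print(needs_permission(False, True, False))  # False
```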

Is the work protected?

By now you probably all realize that working with resources in the public domain can resolve all of your copyright concerns. However, determining what is in the public domain may be somewhat difficult.

Definition of a protected work

The definition of protected works in every copyright law is incredibly broad, in part because international law requires a broad definition of protected works.

The Berne Convention defines a protected work as “every production in the literary, scientific and artistic domain, whatever may be the mode or form of its expression.” The Convention gives an illustrative list:

  • books,
  • dramatic or choreographic or cinematographic works,
  • musical compositions,
  • drawing, painting, architecture, sculpture,
  • photography;
  • applied art;
  • maps

What about government works?

Unfortunately, you cannot assume that a work is freely usable because it is a government work – even a law.

The Berne Convention allows, but does not require, an exemption for official texts, such as laws. The US exempts these texts from copyright. But some countries, including the UK and many Commonwealth countries, protect such works.

What about old works?

The Berne Convention states a minimum required term of protection of life of the author plus 50 years. But countries can protect longer, and many do.

Most of the countries in Africa and Asia protect copyright for life plus 50 years, or sometimes less. (Not all countries have signed on to the Berne limits.) And Berne allows countries to apply lower terms to photographs—as few as 25 years.

But about half the countries in the world protect works for longer than life plus 50 years. Mexico tops that list with terms of life plus 100 years.

The result is that some older works may be subject to copyright in the U.S. but in the public domain overseas, and vice versa.
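The arithmetic behind these differences can be sketched as follows. The sketch assumes a hypothetical author who died in 1975 and assumes that a life-plus term is counted from 1 January of the year following the author's death, as the Berne Convention provides. It does not model publication-based terms, special terms for photographs, or other national rules, so treat the output as a first approximation only.

```python
def first_public_domain_year(death_year: int, term_years: int) -> int:
    """First calendar year in which the work is free of copyright.

    Assumes the term runs from 1 January of the year after the author's
    death, so protection lasts through the end of death_year + term_years.
    """
    return death_year + term_years + 1


# Hypothetical author who died in 1975, under two different national terms.
for label, term in [("life plus 50", 50), ("life plus 100", 100)]:
    print(label, first_public_domain_year(1975, term))
# life plus 50 2026
# life plus 100 2076
```

On these assumptions, the same work would be free to use in a life-plus-50 country half a century before it is free to use in a life-plus-100 country.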

Is the Activity Protected?

If you conclude – or prefer for simplicity to assume – that the material you want to use is a protected work, then the next question you will have is whether your use of that work is subject to an exclusive right of the copyright holder.

There is a fair amount of uniformity on this question.

Berne requires that copyright laws protect against reproduction “in any manner or form.”

Laws normally require that a substantial amount of the work be copied to constitute a reproduction. But there are courts that have held that as few as 11 words from a work can constitute a substantial reproduction (EU).

Countries have generally implemented the reproduction right broadly. German law, for example, covers all copies by whatever method and in whatever quantity.

So here, think about whether any or all of the activities you might want to undertake for TDM involve a reproduction of the work in any method and in any quantity.

There are more rights

The reproduction right—which is the most central and oldest right in copyright—is certainly incredibly broad. But international laws have expanded on the definition over time, adding new exclusive rights for activities that may not involve a technical reproduction at all.

First, Berne requires protection against the translation or adaptation of works. Some prominent commentators have opined that translation and adaptation rights may apply not only between human languages, but also to “translations from one computer language to another.”[1]

And later treaties require that countries protect the right to “distribute,” “communicate,” or “make available” a work.

It is generally accepted that a distribution can take place when one transfers the work to another person, whether that be a hard copy or sharing a file.


Now, some transfers are exempted from the distribution right. Copyright’s exclusive right to control the distribution of a work within the same country is “exhausted” (that is, the right ceases to bind) after the first sale of that work. This is why used book stores can exist and why you can gift a book to another person. But in some countries that exhaustion does not apply outside of the country where the first sale occurs. And in very few countries does the exhaustion rule apply to a digital copy.

Also note that making available rights can be infringed by allowing members of the public to access works from a place and at a time individually chosen by them. Can that be sharing a link to a Dropbox file? What if you allow any researcher (the broad “public,” in other words) to use your research corpus and thereby “access” the works you have made a copy of?

If we end here, the copyright environment looks pretty daunting. There may be limiting interpretations of these concepts in domestic laws or court decisions. But at least on their surface, you may be able to conclude that all of the uses of works we discussed above, and maybe some more you have since thought of, are subject to copyright law on their face. Thus, for a great many text and data mining project activities, you are going to need help from the next section—limitations and exceptions.

Universal exceptions and limitations

Recall the purpose of copyright. Copyright exists to prevent competing uses of protected works. We sometimes think of these as public uses. Uses that can substitute for the original work in a way that harms the market for the work.

Under this general theory, uses of a work that cannot substitute for the work in the market (e.g. because they are confined to the home, like copying your CD to your hard drive) should not be subject to the copyright holder’s control. Why? Because such a use does not share the work with anyone in a way that can displace demand for the original.

In the last section we showed that the definitions of the exclusive rights appear to cover many uses, such as private, at-home use. But that use is lawful in probably every country in the world. Why? Because of the presence of exceptions to copyright.

Some of the most important limitations and exceptions to copyright are required by international copyright agreements, such as the Berne Convention and TRIPs. We refer to these as “universal.”

Exclusion of facts

The first important exception required by international law—and often via freedom of expression rights—is the exclusion of facts. All copyright laws around the world apply only to original expression, not to the facts conveyed by that expression. The Berne Convention requires this distinction – expressly excluding protection of “news of the day” and “miscellaneous facts having the character of mere items of press information.”[2]

The WTO TRIPs Agreement expands on this aspect, requiring what is often referred to as the “idea-expression distinction”: “Copyright protection shall extend to expressions and not to ideas, procedures, methods of operation or mathematical concepts as such.”[3]

A basic example of the difference between facts and expression is an article about a sports tournament and the score. The score may be included in the article and may be where you got that information. The newspaper has an exclusive right over the article—the original expression of the sports writer describing the event. But the score is a fact. You can use the fact freely, even if you can’t copy the article.

The problem of course arises in how you access that fact without copying the expression. You can read the article. We all admit that. But can you mine it? If you have to copy the work to mine it for its facts you may need more.


International law also requires the right of quotation.[4] Berne does not go into a lot of detail about what the quotation right means. But we can generally assume that it means only the use of an excerpt of the work, not the whole work. So this exception does not likely give researchers a right to make whole copies of works to create a database to be mined. But it may be useful in communicating and illustrating the results of such research.

Some national copyright laws authorize quotation for any purpose;[5] some explicitly exempt research purposes.[6] The most limited quotation rights require criticism or review of the work quoted. Pause there and ask yourself—and note in your worksheet—whether a quotation exception limited to “criticism and review of the work quoted” would be sufficient to authorize the quotes you want to make for publication and validation purposes of your project.

Review your worksheet now and fill out as much of the third column as you can through application of these universal exceptions to copyright protection. What do you have left? You will need to fill in the empty spaces in your worksheet in the next session, which analyzes specific laws in specific countries. Here the law gets a little more complicated.

National approaches to copyright limitations and exceptions

You should have concluded that there are some activities that TDM researchers need to do that should be permitted in every country by virtue of the idea/expression dichotomy and the right of quotation.

But these universal exceptions are not sufficient to authorize all of the activities that TDM researchers need to do. This may be true even where that activity does not appear to compromise copyright law’s core objective of prohibiting the making of copies that can substitute for the work in the market. Unfortunately for us, the manner in which countries protect the interests of users in making non-competitive uses of works varies significantly.
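One way to picture where this leaves the worksheet is the sketch below. It pairs the four TDM activities listed earlier with the universal exception, if any, that addresses them. The dictionary, its wording, and the helper function are hypothetical, and the quotation row is a simplification, since some national quotation rights are limited to criticism or review.

```python
from typing import Dict, List, Optional

# Activity -> universal exception that addresses it (None = still open).
UNIVERSAL_COVERAGE: Dict[str, Optional[str]] = {
    "reproduce whole works to build a corpus": None,
    "share the corpus with other researchers": None,
    "find and report facts": "exclusion of facts",
    "quote mined materials": "right of quotation",
}


def open_questions(coverage: Dict[str, Optional[str]]) -> List[str]:
    """Activities that still depend on a national limitation or exception."""
    return [activity for activity, rule in coverage.items() if rule is None]


for activity in open_questions(UNIVERSAL_COVERAGE):
    print("Check national law for:", activity)
```

The two activities it prints, reproduction and sharing, are the ones the rest of this section concentrates on.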

Beyond the mandatory exceptions and limitations, international law leaves countries largely free to craft exceptions for uses that do not harm the interests of copyright protection.[7] The so-called three-step test in Berne allows countries to permit any use that “does not conflict with a normal exploitation of the work and does not unreasonably prejudice the legitimate interests of the author.” That should sound a lot like the fair use factors you learned about previously. The trick is that some, but not all, countries take full advantage of this flexibility to exempt non-competitive uses from copyright control.

Let’s start with the conclusion. A map of the world based on whether you can reproduce and share copyrighted works for the sole purpose of research, without sharing those works with the general public, looks like this:

Comparative Copyright Law on Research Exceptions, Sean Flynn, Andres Isquierdo, Mike Palmedo, PIJIP (2020)

I say “law on the books” meaning the copyright statute itself. In application, there may be other rights (such as human rights to receive and impart information) that make the rigid application of the law in these countries to ban data mining unconstitutional. This seems a likely outcome in Brazil, for example.

And so, in most countries of the world, the law appears open to the interpretation that you could make the copies needed to create a database for a “private” TDM project. But in most of the world there is also no clear right to share those copies with another researcher.

In the next part we will describe in more depth what the provisions of the law look like that we are interpreting here.

Open and General Exceptions

An exception can be general or specific; open or closed—on a continuum.

By general I mean that a single exception applies one balancing test (e.g. of fairness) to a group of different purposes. Specific exceptions apply to only one purpose of use (or sometimes a couple of related purposes).

By open I mean that the exception applies to the full scope of protection. It covers all rights and all works, and it is available to any user.

A fully open general exception applies a single balancing test to a use of any work, by any user, for any purpose. Fair use is such an exception. But it is not the only one. And a fully open research exception can be just as useful for a TDM researcher as a fully open general exception.
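The two axes can be sketched as a small data structure. The class, field names, and example entries below are hypothetical, and the yes/no fields flatten what the text describes as a continuum. The only point illustrated is that openness, not generality, decides whether a research use of any work by any user is covered.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExceptionProfile:
    name: str
    general: bool               # one balancing test across many purposes
    purposes: Tuple[str, ...]   # named purposes, if the exception is specific
    all_rights: bool
    all_works: bool
    any_user: bool

    @property
    def is_open(self) -> bool:
        # "Open" here means the full scope: all rights, all works, any user.
        return self.all_rights and self.all_works and self.any_user


def covers_tdm_research(profile: ExceptionProfile) -> bool:
    """Open exceptions cover research whether they are general or specific."""
    reaches_research = profile.general or "research" in profile.purposes
    return profile.is_open and reaches_research


fair_use = ExceptionProfile("US fair use", True, (), True, True, True)
open_research = ExceptionProfile(
    "hypothetical open research exception", False, ("research",), True, True, True)
closed_research = ExceptionProfile(
    "hypothetical reproduction-only research exception",
    False, ("research",), False, True, True)

for p in (fair_use, open_research, closed_research):
    print(p.name, covers_tdm_research(p))
```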

I am going to use this map to go through the different kinds of exceptions that could authorize the making or sharing of TDM databases.

The general and open exceptions for research are labeled in Green. In those countries, the copyright exceptions on the books are phrased broadly enough to permit both the making, and sharing between researchers, of a TDM database.

Let me start with the fair use and fair dealing countries.

Fair use and fair dealing

The US fair use right is an open general exception. It applies one basic fairness test to assess the permissibility of any utilization of a work that implicates any exclusive right, by any user, of any work, for any purpose.

General exceptions are most common in, but not exclusive to, countries from the common law tradition evolving from the United Kingdom. Such exceptions often provide a general defense for “fair use” or “fair dealing.”

I want to address what I see as a common misconception about the difference between fair use and fair dealing. The misconception is that fair use is a more open right than fair dealing. That is not universally true.

In the US and some other countries, the term for the utilization permitted by the exception is “fair use.” In the UK and many other Commonwealth countries, the historical term used for a permitted utilization is “fair dealing.” Almost always, the word “use” or “dealing” is meant to apply to the exercise of any exclusive right.[8]


Copyright and Related Rights Act, 2000

Article 50.

(1) Fair dealing with a literary, dramatic, musical or artistic work, sound recording, film, broadcast, cable programme, or non- electronic original database, for the purposes of research or private study, shall not infringe any copyright in the work.


The Copyright and Performance Rights Act, 1994

Article 21. Acts which do not constitute infringements

. . .

(a)   fair dealing with a work for private study or for the purposes of research done by an individual for his personal purposes, otherwise than for profit.

Notice that “use” and “dealing” mean the same thing. They both apply to any type of utilization of the work, that is—a utilization that implicates any exclusive right of the copyright holder.

In this example, the Australian fair dealing right is subject to a closed list of purposes and the US fair use right has an open list. The magic words to look for here are “such as.”

But it is not true that “fair use” rights are open and “fair dealing” rights are closed. Look at these two examples.

The Uganda fair use right is not open. And the Malaysia fair dealing right is not closed.

This distinction is unlikely to matter here since most fair use and fair dealing rights explicitly apply to “research” purposes.

Other general exceptions

There are also general exceptions that are not fair use or fair dealing rights. Indonesia has a general exception for any “use” of a work for research or other purposes.


Law of the Republic of Indonesia No. 28 of September 16, 2014

Article 44.

(1) The use, retrieval, duplication, and amendment of a copyright work or a related right in whole or in part is not considered as a violation of copyright if the source is stated or stated in full for the purposes of:

  1. education, research, writing scientific papers, preparing reports, writing criticisms or reviewing a problem without harming the reasonable interests of the Creator or Copyright Holder

Thailand simply makes the entire scope of the Berne three-step test a general exception.


Section 32. Exceptions to Infringement of Copyright

An act against a copyright work under this Act of another person which does not conflict with normal exploitation of the copyright work by the owner of copyright and does not unreasonably prejudice the legitimate rights of the owner of copyright shall not be deemed an infringement of copyright.[9]

The Republic of Korea combines the Thailand approach to the three-step test with the fair use multi-factor test:

Republic of Korea

Copyright Act (Act No. 432 of January 28, 1957, as amended up to Act No. 14634 of March 21, 2017)

Article 35-3. (Fair Use of Works, etc.)

(1) Except as provided in Articles 23 through 35-2 and 101-3 through 101-5, where a person does not unreasonably prejudice an author’s legitimate interest without conflicting with the normal exploitation of works, he/she may use such works.

(2) In determining whether an act of using works, etc. falls under paragraph (1), the following shall be considered:

  1. Purposes and characters of use including whether such use is for or not-for nonprofit;
  2. Types and natures of works, etc.;
  3. Amount and substantiality of portion used in relation to the whole works, etc.;
  4. Effect of the use of works, etc. on the current or potential market for or value of such work etc.

Open research exceptions

I have also labeled in green specific exceptions for research that are sufficiently open to apply to the use of all works and to both the reproduction and sharing rights that we are most concerned with.

Some research rights are open to application to all exclusive rights. E.g.


Law on Copyright and Neighboring Rights (Copyright Law) (version as of 1 June 2016)

Article 22. Privileged uses of the work

1) Published works may be used for special purposes. A special purpose is:

  a) any use of the work in the personal sphere and in the circle of persons who are closely related, such as relatives or friends;

  b) the use of the work for illustration in class or for scientific research insofar as this is justified for the pursuit of non-commercial purposes and if possible the source and the name of the author are given;

  c) the reproduction of the work on paper or a similar medium by means of photomechanical processes or other processes with a similar effect for educational purposes, for scientific research or for internal information and documentation in companies, public administrations, institutes, commissions and similar institutions;

  d) digital reproduction for educational purposes and for scientific research without any direct or indirect economic or commercial purpose.

Some of the specific exceptions for data mining are also framed openly. Japan’s exception applies to any “exploitation,” including for data analysis.


Article 30-4. Exploitations not for enjoying the ideas or emotions expressed in a work

It is permissible to exploit work, in any way and to the extent considered necessary, in any of the following cases or other cases where such exploitation is not for enjoying or causing another person to enjoy the ideas or emotions expressed in such work; provided, however that this does not apply if the exploitation would unreasonably prejudice the interests of the copyright owner in light of the natures and purposes of such work, as well as the circumstances of such exploitation:

(i) exploitation for using the work in experiments for the development or practical realization of technologies concerning the recording of sounds and visuals or other exploitations of such work;

(ii) exploitation for using the work in a data analysis (meaning the extraction, comparison, classification, or other statistical analysis of language, sound, or image data, or other elements of which a large number of works or a large volume of data is composed; the same applies in Article 47-5, paragraph (1), item (ii));

(iii) in addition to the cases set forth in the preceding two items, exploitation for using the work in the course of computer data processing or otherwise that does not involve perceiving the expressions in such work through the human sense (in regard of works of computer programming, the execution of such work on a computer shall be excluded).

Other research exceptions, although not open to every “use,” nonetheless specifically make provision for both reproduction and sharing. E.g.


Law of April 18, 2004, amending Law of April 18, 2001 on Copyright, Neighboring Rights and the Databases

Article 10.

When the work has been lawfully made available to the public, the author may not prohibit:

2. The reproduction and communication to the public of works by way of illustration of teaching or scientific research and to the extent justified by the aim to be achieved and provided that such use is in accordance with good practice.

Germany makes similar provision in its recent law focused specifically on authorizing text and data mining:


Section 60d. Text and data mining

(1) In order to enable the automatic analysis of large numbers of works (source material) for scientific research, it shall be permissible:

  1. to reproduce the source material, including automatically and systematically, in order to create, particularly by means of normalisation, structuring and categorisation, a corpus which can be analysed and
  2. to make the corpus available to the public for a specifically limited circle of persons for their joint scientific research, as well as to individual third persons for the purpose of monitoring the quality of scientific research.

As we discuss below, most current TDM laws in the EU do not make this provision for sharing and the EU directive does not require it.

We have labeled all the laws in this section GREEN. These are laws that, on their face at least, appear to authorize reproduction and limited sharing between researchers of all works by any user for a research purpose.

Non-expressive uses as fair practice

The work in all these exceptions is done in the balancing test used to determine if a particular use is permitted. Sometimes there is a multi-factor test like US fair use. Sometimes it is a single test like “fair practice.” In any case, the balancing factor gives an opportunity for calibration of exclusive rights to promote copyright’s purposes. A central question in each will be whether the use unfairly competes with the original.

If you are making a copy of works into a private database that will not be released to the public in any way, then the test should be readily passed. This was the holding in US courts in the Google Books, HathiTrust and other cases.

Reproduction for research

Now we move to the countries I have marked in Blue in the map. The difference from the last category is that blue countries only authorize reproduction, not distribution or communication. As a result, whether a researcher can copy and transfer a whole database to another researcher in these countries is either very unclear or clearly prohibited.
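The difference between the green and blue labels comes down to two yes/no questions about the exception on the books, which can be sketched as below. The function and its inputs are hypothetical and cover only the two categories described so far. It is not the method used to produce the map.

```python
def map_colour(permits_reproduction: bool, permits_sharing: bool) -> str:
    """Rough label for a research exception, following the text's colours."""
    if permits_reproduction and permits_sharing:
        return "green"   # making and sharing a TDM database between researchers
    if permits_reproduction:
        return "blue"    # making a database only; sharing unclear or prohibited
    return "other"       # categories not covered by this sketch


print(map_colour(True, True))    # green
print(map_colour(True, False))   # blue
```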

The simplest of these provisions are exceptions for reproduction for research. The key here is that they allow only reproduction, not distribution or communication.[10]


Law No. 2-00 on Copyright and Related Rights (2000)

Article 54. Free Uses (Research)

Notwithstanding the provisions of Articles 50 to 53, the following acts shall be permitted without the authorization of the successors in title mentioned in these articles and without the payment of a fee:

(b) reproduction solely for the purposes of scientific research;


Section 29.

Section 25,26,27 and 28 shall not apply where the acts referred to in those sections are related to:


(b)  reproduction solely for scientific research;

Sometimes the research right is included within a private use or private study right, as in Samoa. What we are looking for is a connector like “or” that makes clear the research right is separate from the private use right.[11]


Copyright Act 1998 (as consolidated in 2011)

Section 8A. Reproduction for purposes of research or private study

(1) Despite section 6(1)(a), but subject to subsection (2), a person reproducing a work for the purposes of research or private study is not to be regarded as infringing any of the copyright in that work.

(2) Despite subsection (1), if a person reproducing the work knows or has reason to believe that it will result in copies of substantially the same material being provided to more than one person at substantially the same time, that person will not be regarded as reproducing the work for the purposes of subsection (1).

As we will discuss further below, the EU directive on text and data mining only requires that EU countries have an exception for reproduction, not for distributions and communications even between researchers.[12]

European Union (EU)

Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market (DSM Directive)

Article 3. Text and data mining for the purposes of scientific research

  1.   Member States shall provide for an exception to the rights provided for in Article 5(a) and Article 7(1) of Directive 96/9/EC, Article 2 of Directive 2001/29/EC, and Article 15(1) of this Directive for reproductions and extractions made by research organisations and cultural heritage institutions in order to carry out, for the purposes of scientific research, text and data mining of works or other subject matter to which they have lawful access.

Private reproduction

Another category of exception that may be useful in authorizing TDM research activities is private use rights.

These rights generally allow researchers and others to make a copy (often just one) of a work, including for a research purpose. Often these rights apply to making copies of whole works. Where broadly phrased, private use rights may thus permit the making of a database for TDM. E.g.


Copyright Act, 2016 (Act No. 26 of 2016), https://wipolex.wipo.int/en/text/446811

Article 38. (Personal or Private)

The reproduction, translation, adaptation, arrangement or other transformation of a work exclusively for the user’s own personal or private use of a work which has already been lawfully made available to the public shall be permitted:  Provided that it is made on the basis of a representation that the authorized under this Act at the initiative of the user and not for the purpose of gain and only in single copies.


Article 17. Free Use of Works and Phonograms for Personal Purposes

  1. It shall be permissible to reproduce one copy of works previously published lawfully for personal purposes without the consent of author or other copyright owner and without payment of author’s remuneration, on nonprofit base.

There are several common restrictions in private use rights. First, as in the example above, often these rights contain express prohibitions of commercial or for-profit use. Even where such express limitations are not provided, they may be implied by the definition of “private.”

Similarly, the definition of “private” is often expressly limited to a natural or physical person. A corporation, university or research institution cannot normally rely on a private use exception to create a TDM database unless there is a separate right for such institutions.

Private use rights do not generally extend to sharing of the copied work. The rights may limit sharing by extending only to a reproduction – not a distribution or communication of the work. Or sometimes the rights include an internal restriction making clear that sharing is not permitted.[14]

Finally, many private use rights explicitly forbid making copies of a “database,” and sometimes specifically an electronic database. We have already seen that private use rights are unlikely to authorize copying a TDM database in order to share it with other researchers; some laws make the prohibition on copying databases explicit. E.g.

Burkina Faso

Law No. 032-99/AN of December 22, 1999, on the Protection of Literary and Artistic Property

Article 21. Private/personal use

Where a work has been legally disclosed, the author may not prohibit: …

– copies or reproductions reserved strictly for the private use of the copier and not intended for collective use, with the exception of: … the total or substantial reproduction of databases;

Thus, in the best case, private use rights may be sufficient in many countries to authorize an individual researcher to create a corpus of works for TDM activities. But they are not likely to be sufficient to authorize the sharing of the database between researchers in ways that require reproduction of the database itself.

Restricted private use rights (yellow)

Some private use rights are further restricted in ways that would allow the creation of only some kinds of TDM databases. We have flagged these countries in yellow.

The most prominent example here is the relatively frequent restriction against using private use rights to copy a whole book. E.g.

Russian Federation

Civil Code of the Russian Federation (Part Four, as amended up to Federal Law No. 549-FL of December 27, 2018, and Federal Law No. 177-FL of July 18, 2019)

Article 1273. Free Reproduction for Personal Purposes

  1. A citizen may reproduce, if necessary and exclusively for personal purposes a legally promulgated work without the author’s or other right holder’s consent and without paying a fee, except for the following:

2) the reproduction of databases or significant parts thereof, except as provided for by Article 1280 of this Code;

4) the reproduction of books (in full) and musical notation texts (Article 1275), that is the facsimile reproduction with the help of technical facilities for the purposes other than publication;

Excerpts only (red)

Finally, some private use rights are not useful for TDM projects at all because they are limited to the use of excerpts, and therefore function in reality as quotation rights.

My favorite example here is from Argentina, which has the most restrictive copyright exceptions I have ever seen. There is just one exception to copyright and it is only for quotation.


Law No. 11.723 of September 28, 1933, on Legal Intellectual Property Regime (Copyright Law, as amended up to Law No. 26.570 of November 25, 2009)

Article 10. Any person may publish, for didactic or scientific purposes, comments, criticisms or notes referring to intellectual works, including up to 1,000 words for literary or scientific works, or eight bars in musical works and, in all cases, only the parts of the text essential for that purpose.

This provision shall cover educational and teaching works, collections, anthologies and other similar works.

Where inclusions from works by other people are the main part of the new work, the courts may fix, on an equitable basis and in summary judgment, the proportional amount to which holders of the rights in the works included are entitled.

So there you have the world.

There are a number of countries for which we could not find or translate the law. They are left in grey.

The number of countries where you cannot make a TDM database at all is relatively small, but clustered in some huge and important countries to our South.

On the other hand, the number of countries where you can both make and share a TDM database with other researchers is also relatively small, although it includes some very large and important places.

The question for the next section is how to approach the matter when you are in a green country but want to do a project with a colleague in a blue, yellow or red one. Does the law there restrain you here?

Library and research institution exceptions

One final source of copyright exception that may extend to the creation of a text and data mining database is in exceptions for libraries and research institutions. Many national copyright laws contain special exceptions for uses by libraries which may contain rights to make copies for third party research projects. It’s possible that such exceptions could be helpful in relation to text data mining research, but again, we would have to look at these country-by-country to say much more than this.

Temporary reproductions

A significant number of more recently amended national copyright laws allow for temporary reproductions to carry out technical processes. Depending on the technical process being utilized, a limited right to make temporary reproductions may be enough to engage in text data mining research. Storing copyrighted works in a database is not likely to qualify as a temporary reproduction. But an exemption for temporary reproduction should apply where copyrighted works are stored briefly (briefly as in seconds, not weeks), analyzed to derive relevant metadata and then deleted.
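As a purely technical illustration of the store-briefly, analyze, delete pattern described above, the following Python sketch writes a work to a temporary file, derives word counts, and deletes the file before returning. The function name `derive_word_counts` and the `fetch_text` callback are hypothetical, introduced only for this example; whether any particular pipeline qualifies as a temporary reproduction depends on the law of the relevant country.

```python
import os
import tempfile
from collections import Counter


def derive_word_counts(fetch_text):
    """Hold a work only for the duration of the analysis, then delete it.

    `fetch_text` is a hypothetical callback that returns the full text
    of one work as a string.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # The only stored copy of the work lives in this temporary file.
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(fetch_text())
        # Derive non-expressive metadata (here, simple word counts).
        with open(path, encoding="utf-8") as handle:
            counts = Counter(handle.read().lower().split())
    finally:
        # Delete the stored copy whether or not the analysis succeeded.
        os.remove(path)
    # The path is returned only so a caller can confirm the deletion.
    return dict(counts), path
```

Only the word counts survive the call; the stored copy of the work exists for the seconds it takes to run the analysis.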

Specific exceptions for TDM research

One reason why copyright law treats text data mining research differently in different countries is that some jurisdictions have amended their copyright laws with text data mining in mind, whereas most have not. But even where legislative accommodations have been made, the text and intent of the relevant provisions vary.

Only a handful of countries have specific exceptions for TDM research. In 2009, Japan became the first country to adopt an express exemption for text data mining. Between 2014 and 2018, the United Kingdom,[13] France,[14] Estonia,[15] and Germany[16] also enacted laws specific to text data mining. In 2019, the European Union adopted the Digital Single Market Directive which includes two separate provisions meant to enable TDM research under different conditions.

None of these laws are exactly the same, and they probably all differ from the legal position in the United States to some degree.

Because of this lack of uniformity, even cross-border research collaborations between jurisdictions that both support TDM research might run into obstacles.

To give you a sense of what these obstacles might be, we are going to summarize some of the key points of differentiation between the law as we understand it in the United States and those jurisdictions that have enacted copyright exceptions meant to enable TDM research.

Exclusion of “commercial” research

There doesn’t appear to be any relevant commercial/non-commercial distinction with respect to TDM research and fair use in the United States.[17] In contrast, the UK text mining provision is limited to non-commercial research, and the European DSM Directive takes a bifurcated approach: the robust text mining rights in Article 3 only apply to non-commercial research institutions; whereas the weaker rights in Article 4 are available to all.

It’s possible that when other jurisdictions address the question of text data mining and “fair use” or “fair dealing,” they might draw a distinction between commercial and non-commercial users. We don’t think that this is how the law should be interpreted, but courts don’t always do what we think they should do.

Finally, on this point of commercial use, it’s also worth repeating that some of the general research rights we discussed before only apply to non-commercial research.

Exclusion of some exclusive rights

In the United States, the non-expressive use of a work in relation to text mining will not infringe any of the copyright owner’s exclusive rights. The situation is not so clear overseas.

The text mining provisions in Articles 3 and 4 of the European Union Digital Single Market Directive apply to the reproduction right, but they don’t apply to the European right of “communication to the public,” the right of “making available to the public,”[18] or the right of adaptation.[19] Although the reproduction right will usually be the primary concern of a text mining researcher trying to establish a corpus, these other rights could be triggered by subsequent uses of the corpus.[20]

Lawful access

The EU Directive and some other laws require that TDM databases be made only with works to which the researcher has “lawful access.” This is not required by any of the U.S. precedents on text data mining.[21]

Overriding contractual and technological restrictions

Article 3 of the DSM Directive does not allow private contracts (e.g. a publisher’s license) to override the data mining right. There is no rule like this in the United States. The fact that a US researcher violated a contract that limited her ability to engage in text mining is unlikely to detract from her assertion of fair use; but her fair use argument is equally unlikely to count for much in a breach of contract suit.

We don’t yet have any guidance on how the EU contractual override provision interacts with their “lawful access” requirement.[22]

The rights under Article 3 of the DSM Directive are also not subject to the usual restrictions that apply to overcoming “technological protection measures” or “digital rights management” restrictions on access. Again, this is not the law in the US.

Security measures and retention of copies

In the United States, the fair use status of TDM research may be contingent on taking reasonable security measures to protect the corpus from unauthorized use beyond the parameters of fair use.

Article 3 of the DSM Directive deals with the retention of works copied as part of a text mining process in a similar way. Under the Article 3 exemption, the covered organization must adopt an “appropriate level of security” and may retain the works “for the purposes of scientific research, including for the verification of research results.”[23]

However, researchers relying on Article 4 face much more restrictive conditions. Under Article 4, the works may be retained only “for as long as is necessary for the purposes of text and data mining.”[24]

Territorial rights in a globally networked world

Determining which territory’s law applies

By now it should be clear that although the broad outlines of copyright law are fairly consistent from one country to the next, there are, nonetheless, some important differences that might be relevant to TDM research. The question we need to grapple with now is, to what extent are these differences a problem for TDM research in a world of cross-border data flows and international collaboration?

Copyright law is inherently territorial. United States copyright law wouldn’t take any interest in an unauthorized reproduction or performance that takes place entirely overseas. A pirated DVD sold on the streets of London doesn’t violate US copyright law unless and until someone tries to bring it into the US. As far as we know, other countries feel the same way. By the same token, if a movie was in the public domain in the United States, but still subject to copyright in Italy, you couldn’t sell pirated DVDs of that movie in the streets of Rome and expect to have US law applied. Indeed, because copyright law is inherently territorial, the advice “When in Rome, do as the Romans do” makes a lot of sense.[25]

However, the problem with global communications networks is that, as far as copyright law is concerned, you might simultaneously be in Rome, Sydney, Chicago, and Beijing.

Because the “harm” of copyright infringement consists simply of trespassing on the copyright owner’s exclusive rights in a given jurisdiction, it is possible that simply making a work available on a server in one country could constitute copyright infringement in multiple countries.

Usually, foreign courts won’t be interested in trivial or incidental cross-border infringements.[26] Generally courts only take an interest in infringers that intentionally target their jurisdiction in the sense that they deliberately engage with an audience there. However, whether courts require intentional targeting of their jurisdiction, and how they interpret that requirement, both vary considerably.

The details of the activity matter

One of the most important things people tend not to understand about copyright law is that the details matter. Copyright is not a general right of exclusive advantage; copyright is a bundle of exclusive rights in relation to specific actions. In the vocabulary of the United States Copyright Act, copyright owners have the exclusive right to reproduce the work, make derivative works, distribute the work, and publicly perform or publicly display the work.

It’s important to understand what is not included in the copyright owner’s exclusive rights. Unless one of those exclusive rights is triggered, there is nothing wrong with “using” a copyrighted work, “learning” from it, or gaining some other advantage from it.

So, when we are thinking about international and cross-border copyright issues in relation to text data mining, we have to carefully evaluate which technical actions are being performed and what the copyright implications of those actions might be in different jurisdictions. We also need to think about the sometimes strange and metaphysical question of exactly where the action takes place.

We will go over some specific technical acts with respect to copyrighted works and explain their jurisdictional implications. Then we will take these basic principles and apply them to some common scenarios you might encounter in text data mining research.

Reproduction and making available


Reproduction is one of the core exclusive rights of the copyright owner. It is safe to assume that any reproduction made across a communications network can be thought of as taking place at either end. Thus, electronically transferring a file from country A to country B may well infringe the reproduction right at the source, and at the destination.

Making available

In jurisdictions that recognize a “making available to the public” right as part of copyright, simply making a work accessible online constitutes infringement, even if no one actually takes advantage of that accessibility. There is no “making available” right in the US (there is some disagreement here, but we are 99.9% sure) but this right is fairly common overseas.[27] If a copyrighted work is hosted on a server in country A and is accessible in country B, it has been “made available” in country B and could infringe the making available right in country B.

Distribution, performance and display


Technically, a digital download of a copyrighted work is both a reproduction and a distribution. However, the distribution right is essentially redundant in the online context because the reproduction right can do all of the heavy lifting.

The distribution right is also potentially triggered by simply transferring possession of a physical copy of the work from one person to the next. In general, the distribution right is infringed in the place where the work is received.

The distribution right sounds incredibly broad, but it is limited by the “first sale doctrine” (other countries call this the doctrine of “exhaustion”). Once the copyright owner has sold or given away a particular copy of the work, she no longer has any right to control any subsequent distribution of that particular copy. She still has the right to control copying, but the copy she just sold should be free from post-sale restrictions.

In some countries, the principle of exhaustion only applies to a sale within that country. The United States takes a much broader view. Under US law, the copyright owner’s rights are exhausted by the first sale no matter where it takes place. The European Union takes a regional approach to exhaustion. So, a physical book sold in Paris can be resold in Berlin without further authorization, but a book sold in Pittsburgh couldn’t be.

In the United States, the right to import and export copies of works is treated as a subset of the distribution right. Importing a work into, or exporting a work from, the U.S. infringes the distribution right if it is done “without the authority of the owner of copyright” under U.S. law and the making of the relevant copies either “constituted an infringement of copyright” under U.S. law or “would have constituted an infringement of copyright” if U.S. law had applied. It is worth emphasizing that U.S. law, not foreign law, is the benchmark here.

Performance and Display

Even in the absence of a reproduction, copyright can be infringed by transmitting the work as a public performance or a public display. In the EU and many other jurisdictions, this would be a “communication to the public.” Streaming video and broadcast radio are both examples of public performance/communication through transmission.

For the purpose of thinking about cross-border issues, it seems safe to assume that a work is performed/communicated either in the place where the transmission was initiated, or in the place where it was received. However, only the person making the transmission violates the performance right. So, if a work is streamed from country A to an audience in country B, the person making the transmission may be liable in both jurisdictions, but the person receiving the transmission wouldn’t be liable in either.

The use of data derived from copyrighted works

The distinction between protectable original expression and unprotectable facts and ideas is one of the universal building blocks of copyright law. The non-expressive metadata that results from text data mining research doesn’t, in and of itself, infringe the copyright in any of the underlying works from which it was derived. This is important. Building a research corpus usually involves substantial amounts of copying. However, once the corpus has been created, the computational process of querying the database to produce metadata may have no copyright significance.

Derived metadata does not infringe copyright because the derived data is not, in any relevant sense, a copy of the underlying works.

This means that there should be no copyright issue with exporting derived data to another jurisdiction, even if the copying that was necessary to build the research corpus in the first place would not have been allowed there. It also means that there shouldn’t be any issue with allowing overseas researchers to query a U.S. corpus, so long as the results of those queries are confined to derived data.
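One way to picture a query interface whose results are confined to derived data is the following minimal Python sketch. The class name `DerivedDataCorpus` and its `term_frequency` method are hypothetical, chosen only for illustration; the point is that the full texts stay inside the object and a query returns nothing but counts.

```python
from collections import Counter


class DerivedDataCorpus:
    """Hypothetical corpus wrapper that answers queries with counts only."""

    def __init__(self, works):
        # `works` maps a work identifier to its full text. The texts are
        # held privately and no method returns them.
        self._works = dict(works)

    def term_frequency(self, term):
        """Return how often `term` appears in each work (derived data only)."""
        term = term.lower()
        return {
            work_id: Counter(text.lower().split())[term]
            for work_id, text in self._works.items()
        }
```

A remote researcher calling `term_frequency("flood")` receives a mapping from work identifiers to numbers, not any of the underlying expression.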

Risk management

By now it should be clear to you that there are some theoretical cross-border copyright risks related to text data mining projects based in the United States that interact with the rest of the world. Our focus is primarily on how to identify and minimize those risks.

We can distinguish between theoretical risk and practical risk.

Here we use theoretical risk to refer to the technical application of the law on the books to the action in question to determine whether—if litigated—a court would likely find liability. We use the term practical risk to refer to the chance that the issue in question might actually be litigated. The two risks can operate separately from each other.

Sometimes there might be a high theoretical risk, but very low practical risk. Imagine a colleague emails you a copy of an article that you were missing from your database. There are countries where that appears illegal. But is the rule ever enforced?

On the other hand, there may be cases where the theoretical risk is very low but the practical risk is very high. The Google Books Project was a new, very public, and very large scale use of copyrighted works. Google knew its design of the project was compliant with fair use. But it surely also knew that if it wanted to carry the project through, it would have to budget in substantial litigation costs.

At the end of the day, you need to make your own judgment about practical risk, based on what we can tell you about theoretical risk. How you want to balance these risks and what you think is an acceptable level of risk are questions we can’t answer for you.

The distinction between theoretical risk and practical risk is quite important in the cross-border copyright context. Even if a US institution was judged to have violated copyright law in some overseas jurisdiction, the practical risk of litigation may be incredibly low. Assuming that the US defendant has no assets in the foreign jurisdiction, the foreign plaintiff would need to take legal action in their own jurisdiction, and then undertake a separate action in the United States to have the judgment enforced.

This might be especially challenging if the conduct complained of would be fair use under U.S. law because of the quasi-constitutional status of fair use. The Supreme Court has indicated that at least some aspects of the fair use doctrine and the idea-expression distinction are critical to the constitutionality of copyright law in light of the First Amendment. If a foreign judgment condemns activity that would be permissible under the fair use doctrine, the US defendant would be well placed to argue that the final judgment should not be enforced due to its conflict with public policy, namely the First Amendment.[28]

The outcome here is far from certain: the defendant would have to show much more than the simple fact that an American court would have come to a different conclusion; it would have to show that a finding in favor of the plaintiff would be repugnant to the First Amendment.[29] Nonetheless, this is a significant obstacle for a foreign plaintiff to overcome.


In this section we will work our way through TDM scenarios with the potential to raise cross-border issues. Our aim is to identify when overseas copyright law would be relevant and when it wouldn’t, and to address potential best practices in risk identification and mitigation.

We will also identify where there is potential to lobby for changes to copyright law at a national or international level that would improve research opportunities without undermining the legitimate interests of copyright owners.

We will try to focus here on use cases that are arguably within the boundaries of United States copyright law but might raise questions in other jurisdictions, or at least require us to know something about the law in other countries.

Building a corpus

Reproducing copyrighted works for the purpose of TDM in the US

Reproducing copyrighted works for the purpose of text data mining will be treated as fair use in the United States. As long as the reproduction takes place in the United States, there are no international or cross-border issues, even if the copyright is held by a foreign author or a foreign corporation. Foreign copyright owners have at least the same rights as American copyright owners under our system, but if they are objecting to something that happened in this country they are, in effect, asserting their United States rights and thus, US law will apply.

Receiving physical copies from abroad

Suppose an institution in the United States receives physical copies of works from overseas. For example, someone might send TextPot (our hypothetical academic text mining institution) a box full of old science fiction books or a box of French sitcoms recorded on DVD.

If these copies were made legally overseas, then under the first sale doctrine, there should be no problem under U.S. law with importing them into the US. Because of the way the import/export provisions of the Copyright Act (Section 602) are written, the relevant question is whether the making of the copies to be imported “would have constituted an infringement of copyright” if U.S. law had applied. If it would have, then importing those copies without the authority of the copyright owner infringes their US rights. If not, there is no U.S. infringement.

Suppose the copies were specifically made for the purpose of inclusion in a text mining corpus in a country where that would violate copyright law. Clearly this has legal significance for the person(s) who made those copies overseas, but importing those copies would not violate the US Copyright Act because the relevant question is whether the making of the copies to be imported “would have constituted an infringement of copyright” if U.S. law had applied. This makes sense because the right to distribute the work, like all of the copyright owner’s exclusive rights, is subject to the fair use doctrine as well as other more specific limitations and exceptions.

However, the export from the foreign source might infringe the overseas jurisdiction’s distribution right: it depends on how that jurisdiction implements its own first sale doctrine (i.e. whether it has national or international exhaustion).

If the relevant copies were not lawfully made overseas, exporting them would most likely violate the foreign equivalent of the distribution right in the sending country.

From a U.S. perspective, the law is reasonably clear that there is no domestic liability for acts of infringement that occur overseas.[30] Nor is there domestic liability for “authorizing,” within the territorial boundaries of the United States, acts of infringement that occur entirely abroad.[31]

The final question is whether simply importing a copy that would be legal in the U.S. but unlawful in the source jurisdiction triggers liability for the U.S. receiver in the jurisdiction from which the works came. The answer depends on the US receiver’s degree of involvement in the initial copying. If the US receiver explicitly or implicitly encouraged the making of the unlawful copies, it would quite probably be liable for the overseas infringement. On the other hand, if the receiver did not play an active part in the making of the unlawful copy in the first place, liability should only attach to the exporter.

Receiving/obtaining electronic copies from abroad via a computer network (i.e., a download, not a CD or DVD)

This scenario is the same as the one above, except that the works are not imported in fixed copies, they are transmitted over the Internet. However, this difference in mechanism changes the legal analysis quite significantly.

The single action of transmitting an electronic file from a country such as Australia to the United States without the authorization of the copyright owner would implicate the reproduction right in both jurisdictions. The sending party would clearly be liable in both jurisdictions and there is a reasonable prospect that the receiver would be liable in the US as well.[32]

There would be no liability under US law for either party if the action is deemed to be fair use, applying US standards. Clearly, if the reproduction violated Australian law the sending party would be liable for copyright infringement there. What is less clear is whether an Australian court would also hold that the American receiver had violated Australian copyright law.

Retention of copies and security

Suppose Search Corp Italia (a for-profit entity) scans an archive of Italian poetry from the 1950s for text mining purposes and transmits the archive to the University of Evanston in the United States on the understanding that the works will only be used consistent with the U.S. fair use doctrine. Search Corp Italia then deletes its copies of the files. What does the University of Evanston need to know about the storage and retention of those files?

The University of Evanston would need to store the files with appropriate security to maintain its fair use status in the U.S.

How an institution manages file storage, retention, and security can have important legal implications, but it is important to understand that once a file has been copied onto a particular server, the failure to delete it does not have any independent copyright significance in the U.S.  There is no exclusive right to retain copyrighted works, and keeping something is not the same as reproducing it, distributing it, performing it, or displaying it. The same goes for security measures: failure to take adequate security measures can change how the initial copying is characterized, but simply having bad security does not trigger any of the exclusive rights of the copyright owner.

The fact that the University of Evanston has retained the files might take Search Corp Italia outside the scope of Article 4 of the DSM Directive. This is a problem for Search Corp Italia, but not for the University of Evanston.

Why would this raise an issue under the DSM Directive? If the EU text miner is not a non-profit research organization or cultural heritage institution, then it will have to rely on the more limited provisions of Article 4 of the DSM. One of the limitations of Article 4 is that the works may be retained only “for as long as is necessary for the purposes of text and data mining.”[33]

Generating and sharing data

Analytical processing by overseas researchers

Suppose that TextPot allows affiliated researchers from the EU to query the corpus. There are no copyright implications here as long as the process of turning text into data does not involve making a substantial copy of the underlying works, distributing those works, or performing or displaying them.

As we explained in the previous chapter on copyright, the distinction between protectable original expression and unprotectable facts and ideas is one of the universal building blocks of copyright law, not just in the United States but around the world. The non-expressive metadata that results from text data mining research doesn’t, by itself, infringe the copyright in any of the underlying works from which it was derived.

This is important. Building a research corpus usually involves substantial amounts of copying. However, once the corpus has been created, the computational process of querying the database to produce metadata has no copyright significance. The derived data is not in any relevant sense a copy of the underlying works.

Accordingly, there should be no cross-border problem with giving anyone the ability to query the corpus as long as the result of that query is on the right side of the idea-expression distinction.

What if the overseas researcher is getting access to more than just derived data? For example, text snippets, illustrative examples, replication subsets? We’ll come to these questions shortly, but for now it’s important to understand they are different to the data-only scenario.

Sharing and using the data

For the reasons we just discussed, there shouldn’t be any cross-border issues with publishing derived data or making it available internationally.

Adjunct uses of original expression (snippets, verification, and validation)

Sometimes metadata is not enough.

It is very unlikely that the initial results of an academic text mining process could be taken at face value without some reference to the underlying works as validation. Our understanding of US law is that limited display uses for the purpose of the verification and validation of results would be well within the parameters of fair use. In addition, as the Google Books case illustrates, some limited expressive uses are also allowed if they are made for purposes such as presenting results in context or allowing third parties to verify the accuracy or relevance of results. Classic transformative uses of this kind will be fair use so long as the amount displayed is reasonable in light of the underlying purpose and is unlikely to disrupt any cognizable market for the original work.

As discussed above, there should be no copyright law impediments to transferring data derived from an American text mining corpus overseas, but it’s possible that adjunct uses of original expression that would be considered non-infringing in the United States may violate copyright law in at least some overseas jurisdictions.

We are pretty confident that such adjunct uses would qualify as fair dealing in countries like Canada and Australia, but they seem to be beyond the scope of the TDM provisions of the new EU DSM Directive. Such adjunct uses may be allowed under the German text mining law. The German law permits making the corpus available only to a “specifically limited circle of persons for their joint scientific research, as well as to individual third persons” for quality assurance. However, other exceptions and limitations may allow for similar results in other EU countries.

Recommendations: We think the risk that limited display uses for the purpose of verifying and validating results would violate copyright law is actually quite low in many overseas jurisdictions. A text mining project seeking to eliminate this risk would have to obtain jurisdiction-specific advice or simply limit access to persons within the United States through site access restrictions or geo-blocking.

Special issues relating to machine learning and AI

Can the contents of a machine learning algorithm infringe copyright in the training data?

Suppose researchers at TextPot train a machine learning algorithm on a corpus consisting of copyrighted works. In most cases, any features derived from the training set that become embedded in the machine learning algorithm won’t look anything like the original expression in the corpus itself. Accordingly, in the run-of-the-mill scenario, machine learning algorithms and their AI cousins don’t raise any new copyright issues. As discussed above, because the data derived from a corpus is not a copy of any particular work in the corpus, it can be used for any purpose without fear of copyright liability. That analysis doesn’t change if the derived data is embedded in a machine learning algorithm.

Nonetheless, it’s worth considering a low probability scenario in which a machine learning algorithm did actually embody enough of the original expression from the training data that it constituted either an infringing reproduction, or an infringing adaptation.

This scenario is unlikely under United States copyright law given current thresholds of what it takes to conclude that one work is too similar to another work and our current understanding of the minimum amount of expression required to cross the threshold of copyrightability. Both of these thresholds appear to be somewhat lower in the EU. Consequently, the risk may be slightly greater outside the United States.

In the United States, even if the content of a machine learning/AI program did constitute a prima facie reproduction or adaptation of some underlying copyrighted work, that use would be just as protected by the fair use doctrine as the initial copying of the primary works into a database. However, the same machine learning algorithm might fall outside the narrower protections for TDM in some overseas jurisdictions.[34]

Recommendation: machine learning algorithms which embody non-trivial amounts of the original expression from copyright works should not be exported to a given jurisdiction without first ascertaining whether the algorithm might itself constitute an infringing adaptation of those works in that jurisdiction.

Works created by AI and machine learning techniques based on data derived from copyrighted works

If the output of a machine learning algorithm is too similar to one or more of the underlying works in the algorithm’s training set, that new work will infringe copyright under traditional copyright law principles.

Imagine an AI program that uses songs by Taylor Swift as a training set and produces songs that are very similar to Taylor Swift songs as the output.

If the t-AI-lor Swift songs are too similar to works of Taylor Swift, the fact that an AI was used to create them is largely beside the point. But the much more likely scenario is that the AI would produce works that are in the same genre and share features in common with the works in its training set, but that the new works don’t actually meet any of the traditional tests of infringement.

In this much more plausible example, the mere fact that a work was created using data derived from a set of copyrighted works does not make the new work itself a violation of copyright.

Sharing the corpus

Access to the works that constitute the corpus

Making the entire research corpus available to the general public would be inconsistent with the fair use rationale for text data mining articulated in HathiTrust and reiterated in Google Books. However, an institution might give qualified researchers access to the corpus for research purposes related to text mining and still fall comfortably within the parameters of fair use in the United States. The more difficult question for our purposes is whether that kind of access needs to be limited to people within the United States.

Giving overseas researchers direct access to the corpus might violate the reproduction right in their home jurisdiction, and even if nothing is downloaded, it could violate the foreign equivalent of the public display right in addition to the “making available” right. It is possible that the foreign researcher’s actions would be covered by limitations and exceptions in their own jurisdiction, but that is something that would have to be reviewed on a country by country basis. If we assume for the sake of argument that no such limitation or exception applies, the US institution would violate foreign copyright law in this particular cross-border scenario.

Recommendations: Unless the risk that limited research access would violate copyright law in a particular overseas jurisdiction has been assessed and is regarded as sufficiently low, overseas researchers should only be given direct access to the corpus from within the United States (this seemed less problematic in the pre-coronavirus era). We suggest making this a condition of access and also using geo-blocking as a backstop.
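For readers who administer a corpus, the two-layer approach in this recommendation (a condition of access, with geo-blocking as a backstop) can be illustrated with a minimal Python sketch. Everything named here is hypothetical: the `AccessRequest` record, the `ASSESSED_JURISDICTIONS` set, and the `may_access_corpus` function are illustrative names, and the sketch assumes the request's country code has already been resolved from the IP address by some upstream service. It is not legal advice and not a complete access-control system.

```python
from dataclasses import dataclass

# Hypothetical allow-list: jurisdictions where the legal risk of research
# access has been assessed and judged acceptable. Only the US by default.
ASSESSED_JURISDICTIONS = frozenset({"US"})


@dataclass(frozen=True)
class AccessRequest:
    researcher_id: str
    accepted_terms: bool  # researcher agreed to the conditions of access
    country_code: str     # ISO 3166-1 alpha-2 code, resolved upstream (assumption)


def may_access_corpus(request: AccessRequest,
                      allowed: frozenset = ASSESSED_JURISDICTIONS) -> bool:
    """Return True only if both layers of protection are satisfied."""
    # Layer 1: the contractual condition of access.
    if not request.accepted_terms:
        return False
    # Layer 2: geo-blocking as a backstop.
    return request.country_code.upper() in allowed
```

A jurisdiction would be added to the allow-list only after the kind of country-by-country review described above. Geo-blocking based on IP addresses is imperfect, which is why the sketch treats it as a backstop to the condition of access rather than a substitute for it.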

Reproducing the corpus overseas

There may be legal, technical, and policy reasons to want to reproduce or mirror a research corpus in a second location. Assuming that the corpus was built in the United States for TDM purposes, we are confident that reproducing it at a second location within the United States for a similar TDM purpose would also be fair use.[35] The US fair use analysis would not change if the second location was in a foreign jurisdiction, even if this violated foreign law.

Conversely, the fact that the original corpus was constructed within the parameters of American fair use would not prevent the reproduction of the corpus in some foreign country being characterized as infringement if that country has not made any accommodation for the practice within its copyright law.

The legal rules and standards applicable to text data mining outside the United States are in a state of flux. Relatively few jurisdictions have passed relevant legislation or addressed the issue through case law or administrative regulation. Members of the European Union are required to enact legislation implementing the Digital Single Market Directive by June 7, 2021[36] and it is not yet clear how broadly or narrowly the individual EU members will choose to follow that directive.

Articles 3 and 4 of the DSM Directive require “lawful access” to the underlying work. Our position would be that lawful access means that the particular copy used as source material was not created unlawfully under the laws of the jurisdiction where it was created. However, we can easily imagine a more restrictive interpretation that limits the right to research under the Directive to copies made with the actual authorization of the copyright owner.

There is an opportunity here for positive action at the international level. We faced a similar situation with the provision of accessible works to people with visual disabilities in the Marrakesh Treaty of 2013.[37] The Marrakesh Treaty established some essential minimum standards for copyright exceptions to allow accessible works to be produced for people with visual disabilities. A major question dealt with by the recent Marrakesh Treaty for the Blind[38] was similarly whether an accessible format copy lawfully made in one country (e.g. the USA under fair use) could be lawfully transferred to countries that lack clear rights to make similar copies locally. The Marrakesh Treaty solved the problem with a new international rule requiring contracting parties to allow the import and export of accessible format copies under certain conditions. The World Intellectual Property Organization (WIPO) is set to discuss research-related international limitations and exceptions at an upcoming meeting.[39] An import/export provision modeled on the Marrakesh Treaty should be part of that discussion.

To the extent possible under law, Sean Flynn and Matthew Sag have waived all copyright and related or neighboring rights to Building Legal Literacies for Text Data Mining, except where otherwise noted.

