Last year, the Library began a pilot project to extract data from the Catalog of Copyright Entries (CCE) published by the U.S. Copyright Office. For background, all books published in the United States more than 95 years ago (before 1924 at the time of this writing) are in the public domain in the U.S.; all books published in 1964 and after are likely still in copyright.
Books published between 1924 and 1964 are potentially in the public domain if their copyright terms were not renewed 28 years after publication. To find out if such a book is still in copyright, one needs to find the notice of renewal published within the many thousands of pages of the Catalog of Copyright Entries or in an online database of renewals at Stanford University.
For those of us interested in digitizing books, information about a book’s copyright allows us to identify material that can be shared digitally. Finding out whether a book is not in copyright requires searching the CCE (and unlike the renewals, there was no database of registrations), finding the copyright registration, and confirming there is no corresponding renewal. Transcribing the CCE book registrations has made it possible to identify the registration and corresponding renewal.
Back in May, as soon as we had all the records, I crunched the data and made a rough estimate that 75% of the books from this time period might no longer be in copyright (around 480,000 public domain books), which raises the question: Has the Library made a list of all the public domain books?
Copyright and Transparency
The answer is, no, we haven’t, for the simple reason that failing to match a renewal alone isn’t enough to determine the copyright status of a work. The historical records are hard to use and, once you find the record, you need to apply a lot of outside knowledge to interpret it. The records we are interested in—books published between 1924 and 1964—were created for a copyright law that no longer exists and, being paper records, they weren’t updated.
For example, many foreign-published books were registered and entered the public domain in the US (and the US alone) when their copyrights weren’t renewed. That’s the story the records tell. What the records don’t say is that almost all foreign-published books had their copyrights retroactively restored in 1995 by the Uruguay Round Agreements Act (more on this below) if they were still in copyright in their source countries. Copyright often depends not on a single act, like registration, but on a whole history of publication that can be complicated to tease out.
The complexities of copyright have led to a copyright-by-default atmosphere where only the easiest facts to prove, like publication before 1924, are accepted without extensive documentation. The work we have done to make these CCE records machine-readable does a lot to create the transparency we need, but it is only a start. This step, however, helps us figure out the next, and each step will make these historical records more transparent and easier to use.
At the Library, as we’ve been discussing the peculiarities and difficulties of this data, I thought it would be helpful to explain why identifying copyright, or lack thereof, is not quite as simple as having a computer match (or not match) a string of letters and numbers, and why we’re not at the point where we can simply generate a list of public domain books. Each step in the process (except matching registrations and renewals, which is now the easy part) demonstrates how some outside knowledge or extra work must be applied to interpret the CCE records. Sometimes you have to read between the lines.
To demonstrate this, I’m going to take all the books registered in a single month, August 1933, and show what it takes to whittle that list down to those that are possibly in the public domain. August 1933 was a slow month for the Copyright Office, with only 754 copyrights registered. (A raw version of the data used in this post is available here.)
Here's where I have to say I am not a lawyer and none of this is legal advice.
Matching Registrations and Renewals
If this is the easy part, how exactly do we match registrations and renewals? A registration entry looks like this:
[Stein, Gertrude] 1874- The autobiography of Alice B. Toklas … New York, Harcourt, Brace and company [ͨ1933] vii, 310 p. front., plates, ports., facsim. 22½ͨͫ. The life of Gertrude Stein written by herself as though it were the autobiography of her secretary, Alice B. Toklas. "First edition." © Aug. 31, 1933; 2c. and aff. Sept. 2; A 64890; Harcourt, Brace & co., inc. (33-22918)
It has a registration number, "A 64890", and a copyright date, Aug. 31, 1933 (along with some other information). To see if this has been renewed, we look for a renewal for a book with the matching number/date (in normalized forms) A64890/1933-08-31. There is one, which looks like this:
STEIN, GERTRUDE. The autobiography of Alice B. Toklas. © 31Aug33; A64890. Alice B. Toklas (E); 25Oct60; R264960.
These two pieces of information tell us that The Autobiography of Alice B. Toklas was copyrighted on Aug. 31, 1933 with the registration number A64890, and renewed on October 25, 1960 with the renewal number R264960. Because this was renewed, it is still in copyright.
If we look at another entry, for New York Madness:
Bodenheim, Maxwell, 1893- New York madness, by Maxwell Bodenheim … New York, The Macaulay company [c1933] 3 p. l., 9-250 p. 19½ͨͫ. © Aug. 11, 1933; 2c. and aff. Aug. 12; A 64707; Macaulay co. (33-21133)
… we find no corresponding renewal for A64707/1933-08-11, so New York Madness is probably in the public domain.
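The matching described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names are hypothetical, the date parsing only covers the printed forms quoted in this post, and two-digit renewal years are assumed to belong to the 1900s.

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def normalize_regnum(raw):
    """Collapse a printed number such as 'A 64890' to 'A64890'."""
    return re.sub(r"\s+", "", raw).upper()


def parse_registration_date(raw):
    """Parse a registration-style date such as 'Aug. 31, 1933'."""
    m = re.fullmatch(r"\s*([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})\s*", raw)
    if m is None:
        raise ValueError(f"unrecognized registration date: {raw!r}")
    month = MONTHS[m.group(1)[:3].lower()]
    return date(int(m.group(3)), month, int(m.group(2)))


def parse_renewal_date(raw):
    """Parse a renewal-style date such as '31Aug33' (assumed to mean 1933)."""
    m = re.fullmatch(r"\s*(\d{1,2})([A-Za-z]{3,4})(\d{2})\s*", raw)
    if m is None:
        raise ValueError(f"unrecognized renewal date: {raw!r}")
    month = MONTHS[m.group(2)[:3].lower()]
    return date(1900 + int(m.group(3)), month, int(m.group(1)))


def match_key(regnum, when):
    """Build the normalized number/date key, e.g. 'A64890/1933-08-31'."""
    return f"{normalize_regnum(regnum)}/{when.isoformat()}"


# Keys taken from the renewal entries (only the Stein renewal is shown here).
renewal_keys = {match_key("A64890", parse_renewal_date("31Aug33"))}

stein = match_key("A 64890", parse_registration_date("Aug. 31, 1933"))
bodenheim = match_key("A 64707", parse_registration_date("Aug. 11, 1933"))

print(stein, stein in renewal_keys)          # renewed
print(bodenheim, bodenheim in renewal_keys)  # no renewal found
```

A missing key only means no renewal was found in the data at hand; the rest of this post explains why that is not the same as public domain.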
This much seems simple enough. Using the printed volumes, this would be laborious, and it would be hard to be sure that you had checked every page of every volume in which you might find the renewal. Of the 754 books from August 1933, 575 (76%) have no renewal. So why can’t we just take the entire list of registrations, remove the ones with matching renewals, and call that a list of public domain books? There are a few reasons a book with no apparent renewal might still be in copyright:
There may simply be data problems.
There is copyright restoration for foreign-published books.
The entry may only be a registration for "new matter."
We’ll look at each of these, as well as a few odds and ends.
We have to be honest. There are typos.
Scanned images of the CCE pages were sent through optical character recognition, which isn’t perfect, and then subjected to automated and manual quality assurance. The data is accurate, but there is a lot of it. In just these August 1933 entries, there are 209,645 characters (not counting spaces); with 99.9% accuracy, we would expect there to be about 210 errors.
Some errors may be more or less harmless than others. A mistake in the number of pages is not as bad as a mistake in the title (which might cause a title search to fail), which is not as bad as an error in the registration number (which might cause the renewal matching to fail). There are a lot of clever things we can do to catch and correct these errors with the aid of the computer. There are also slow and laborious things we can do without that aid.
Lack of a renewal is still proving a negative but once we can match every existing renewal with the item it renews, that will give us greater confidence. For example, if we had 2,000 registrations and 1,000 renewals, but could only match 800 pairs, we would think the data needs cleaning up—essentially where we are now. If we sort out the 200 leftover renewals and find all 1,000 pairs, then we can be more certain the 1,000 unmatched registrations are truly unrenewed. This cleanup is underway, at least in small ways, though we have more CCE volumes to digitize before we can say we should have a registration, in machine-readable form, for every renewal.
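The bookkeeping in that example can be written down as simple set arithmetic. This is only a sketch with made-up keys (the "A…" and "X…" strings are placeholders, not real registration numbers), and `match_report` is a hypothetical helper name.

```python
def match_report(registration_keys, renewal_keys):
    """Count matched pairs, renewals with no registration, and unrenewed registrations."""
    registrations = set(registration_keys)
    renewals = set(renewal_keys)
    return {
        "matched": len(registrations & renewals),
        "orphan_renewals": len(renewals - registrations),
        "unrenewed_candidates": len(registrations - renewals),
    }


registrations = [f"A{n}" for n in range(1, 2001)]

# Before cleanup: 200 of the 1,000 renewals carry a key that matches nothing.
renewals_before = [f"A{n}" for n in range(1, 801)] + [f"X{n}" for n in range(1, 201)]
# After cleanup: every renewal key has been corrected and finds its registration.
renewals_after = [f"A{n}" for n in range(1, 1001)]

report_before = match_report(registrations, renewals_before)
report_after = match_report(registrations, renewals_after)
print(report_before)
print(report_after)
```

The number of orphan renewals is the useful signal: while it is above zero, some of the "unrenewed" registrations may simply be unmatched.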
While working on this blog post, I corrected about five critical errors in the records, which need to be taken into account before making any claims based on this data. And we haven’t even discussed typos in the printed volumes, which are accurately reproduced. In fact, as this is an open source project, if you find an error, we encourage you to create an issue in our repository so we can correct it. If you’re familiar with git and comfortable editing XML documents, you can even fix it yourself.
Foreign Publications and GATT
One-hundred ninety-eight of these 754 books can easily be identified from their registration number as published outside the United States. Books published outside the U.S., and not in English, were classed as "A-Foreign" and later "AF" (there are 107 in our set); English language books were classed "A ad interim" or "AI" (91). All these books should probably be considered in copyright, whether renewed or not. The reasons take a little explanation.
At the time the books were registered, these classes were important because some books had to be printed in the United States to secure U.S. copyright. This requirement is called the "Manufacturing Clause" and had nothing to do with the rights of the author; instead, it was intended to protect American printers from foreign competition.
Before 1909, it was required of all books. After 1909, the clause was relaxed so that non-English books did not need to satisfy this requirement (since they had only a small market in the U.S.), while English-language books could get a temporary or "ad interim" copyright, which gave books a short window of time to be printed in America.
These classes are important now because of something called "GATT restoration." Everywhere, copyright is a matter of national laws and, for most of its history, the U.S. has had much shorter copyright terms than other nations. Since 1886, an increasing number of countries have operated under the Berne Convention for the Protection of Literary and Artistic Works. Instigated by Victor Hugo, the Berne Convention requires nations to respect each other’s copyrights and give authors copyright protection that's at least as good as in their home countries. It also requires copyright to be automatic, and forbids any need for registration or other formalities. At a period of 28 to 56 years, the U.S. copyright term was notably shorter, and its paperwork and manufacturing requirements burdensome.
The U.S. finally joined the Berne Convention March 1, 1989 and, with the Uruguay Round Agreements Act (URAA), granted retroactive copyright in 1995 to most foreign publications. This is known as "URAA restoration" or, because the "Uruguay Round" was part of negotiations on the General Agreement on Tariffs and Trade (GATT), as "GATT restoration." These rules are now section 104A of the current copyright law.
This means that for a book originally published abroad, it is almost certainly irrelevant whether it was renewed or not. Even if the book was not renewed and was in the public domain in the U.S., if it is still in copyright in its source country, it has been placed back in U.S. copyright as of 1995. If a book is in the public domain in its source country, it could be in the public domain in the U.S. You just need to be familiar with the copyright laws of dozens of nations…
Once we take out the foreign-published AF and AI registrations from our list, that leaves 556 U.S.-published books still to consider.
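As a rough sketch, sorting entries by registration class might look like the following. The exact printed spellings of the class vary, so the patterns below are assumptions based only on the forms mentioned in this post ("A-Foreign", "AF", "A ad int.", "AI", and plain "A"), and the A-Foreign number in the example is made up.

```python
import re


def registration_class(raw):
    """Return 'AF', 'AI', 'A', or 'unknown' for a printed registration number."""
    # Drop spaces, periods, and hyphens so 'A ad int. 18070' becomes 'aadint18070'.
    compact = re.sub(r"[\s.\-]", "", raw).lower()
    if compact.startswith(("aforeign", "af")):
        return "AF"
    if compact.startswith(("aadint", "ai")):
        return "AI"
    if re.fullmatch(r"a\d+", compact):
        return "A"
    return "unknown"


samples = ["A 64890", "A ad int. 18070", "AI18070", "A-Foreign 12345", "AF12345"]
classes = [registration_class(s) for s in samples]
print(dict(zip(samples, classes)))

# Only class 'A' entries stay on the list of U.S.-published candidates.
us_candidates = [s for s, c in zip(samples, classes) if c == "A"]
```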
The next step is to look for renewals. This is now the easy part, as explained above, and we find 179 (32%) that were renewed. Notable among these are Gertrude Stein's The Autobiography of Alice B. Toklas and Booth Tarkington's Presenting Lily Mars.
Interim registration is a two-step process and, because of this, we might still have some foreign publications among our candidates. When an English-language book was published outside the United States, the author or publisher could apply for "ad interim" protection, which gave them a small window of time to print the book in the U.S. For instance, this is one of our interim registrations, AI18070:
[Alington, Argentine Francis] 1898- Gentlemen—the regiment! By Hugh Talbot [pseud.] London, J. M. Dent & sons, limited  4 p. l., 3-407,  p. 19ͨͫ. © 1c. Aug. 5, 1933; A ad int. 18070; pubd. June 8: Argentine Francis Alington Summertown, Oxford, England . (33-29651)
It was published in the U.S. later in the year (outside of our sample set) and registered as A68534:
[Alington, Argentine Francis] 1898- Gentlemen—the regiment! By Hugh Talbot [pseud.] New York and London, Harper & brothers, 1933. 4 p. l., 3-407,  p. 21ͨͫ. “First edition.” © Dec. 5, 1933; 2c. Dec. 7; aff. Dec. 8; A 68534; A. F. Alington, Summertown, Oxford, England . (33-38448)
We already excluded all the interim registrations from August and that took care of step one. The interim is enough for us to say those books are probably still in copyright, so there is no point looking any further. But now, we have to consider step two. Take, for example, this registration entry:
Knox, Alexander. Bride of quietness, by Alexander Knox. New York, The Macmillan company, 1933. 3 p. l., 302 p. 19½ͨͫ. © Aug. 8, 1933; 2c. and aff. Aug. 9; A 63848; Macmillan co. (33-20823)
This is one of our 377 books without a renewal. Could it be public domain? In fact, no. A little searching finds a previous interim registration from April 1933:
Knox, Alexander. Bride of quietness, by Alexander Knox. London, Macmillan and co., limited, 1933. 2 p. l., 302 p 19½ͨͫ. © 1c. Apr. 19, 1933; A ad int. 17702; pubd. Feb. 24; Macmillan co., New York. (33-11374)
There are at least 12 unrenewed books in our August 1933 group that have previous interim registrations. We have to disqualify them as candidates for the public domain because, as foreign publications, they are eligible for GATT restoration. I say "at least" because it's possible I missed one or two. This is an example of how a single entry is not enough to decide whether something is in copyright or not.
There isn’t much of a clue that Bride of Quietness was first published abroad, but having a fully searchable database of the CCE at least makes it easier to perform this check. Can we automate it? Mostly, I think. In later volumes of the CCE, registrations often have a note with the date and ID of a previous interim registration, though we’re not sure what percentage have those indications. Renewals almost always list both the AI and the A registration. I would like to see all entries with previous interim registrations tagged in our data with an explicit link. Once all those with some indication are tagged, title and author matching, along with checking by hand, would resolve many of the remaining AI entries, leaving only those AIs that were never actually printed in the U.S.
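To give a feel for what title and author matching could look like, here is a small sketch using Python's standard `difflib`. The record layout (dictionaries with `regnum`, `author`, `title`, and `date`), the helper names, and the 0.9 threshold are all assumptions made for illustration; they are not how the project's data is structured.

```python
import re
from datetime import date
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip punctuation so small printing differences don't matter."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def similarity(a, b):
    """Similarity ratio between 0 and 1 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def prior_interims(registration, interims, threshold=0.9):
    """Return interim registration numbers that look like earlier editions of this book."""
    hits = []
    for interim in interims:
        if interim["date"] >= registration["date"]:
            continue  # an interim registration must come first
        if (similarity(registration["author"], interim["author"]) >= threshold
                and similarity(registration["title"], interim["title"]) >= threshold):
            hits.append(interim["regnum"])
    return hits


interims = [
    {"regnum": "AI17702", "author": "Knox, Alexander.",
     "title": "Bride of quietness", "date": date(1933, 4, 19)},
    {"regnum": "AI18070", "author": "Alington, Argentine Francis",
     "title": "Gentlemen\u2014the regiment!", "date": date(1933, 8, 5)},
]
knox = {"regnum": "A63848", "author": "Knox, Alexander.",
        "title": "Bride of quietness", "date": date(1933, 8, 8)}
bodenheim = {"regnum": "A64707", "author": "Bodenheim, Maxwell",
             "title": "New York madness", "date": date(1933, 8, 11)}

print(prior_interims(knox, interims))
print(prior_interims(bodenheim, interims))
```

Anything a matcher like this flags would still need checking by hand, and pseudonyms or retitled American editions would slip past it.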
That still won’t be quite enough. This entry has no renewal and no corresponding interim registration. But the phrase "First American edition" caught my eye.
[Mercer, Cecil William] 1885- … The stolen march. New York, Minton. Balch & company, 1933. 4 p. l., 7-319 p. 19½ͨͫ. Author’s pseud., Dornford Yates, at head of title. "First American edition" © Aug. 11, 1933; 2c. Aug. 28; aff. Aug. 26; A 64841; Minton, Balch & co. (33-22163)
By looking in library catalogs, I was able to find it was first published in England in 1926. Remember the Berne Convention forbids any requirement for registration, so the existence of the earlier English edition makes this eligible for GATT restoration, whether or not there is any record of it in U.S. Copyright registrations. Because knowing the copyright status of a book may depend on information nowhere in the copyright records, but readily available in library catalogs, the real killer app will be something that brings these two sources together!
You may find it hard to believe, but there's more. If fewer than 30 days passed between the foreign publication and the U.S. publication, that is considered "simultaneous publication" and disqualifies the work for GATT restoration (none of our sample seems to be in this category). Books are also not eligible for GATT restoration if the author was a U.S. citizen. There are many minor exceptions like this. Copyright can be a wheels-within-wheels affair.
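The 30-day rule, as stated in the paragraph above, is easy to express as date arithmetic. The sketch below encodes only that one sentence and nothing else about section 104A; it assumes the © date in a registration entry can stand in for the U.S. publication date, and `is_simultaneous` is a hypothetical helper name.

```python
from datetime import date


def is_simultaneous(foreign_pub, us_pub, window_days=30):
    """True if fewer than `window_days` separate the foreign and U.S. publication dates."""
    return abs((us_pub - foreign_pub).days) < window_days


# Bride of Quietness: published in London Feb. 24, 1933; U.S. copyright date Aug. 8, 1933.
knox_gap = (date(1933, 8, 8) - date(1933, 2, 24)).days
knox_simultaneous = is_simultaneous(date(1933, 2, 24), date(1933, 8, 8))

# A made-up case with only 19 days between the two editions.
close_case = is_simultaneous(date(1933, 8, 1), date(1933, 8, 20))

print(knox_gap, knox_simultaneous, close_case)
```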
With our 12 easily identifiable foreign publications, and one secretly British book, we are now down to 364 possibly public domain titles.
Copyright registrations aren’t always for an entire book. "New Matter" is a catch-all term for things such as new editions, illustrations, or forewords. If the new matter is not renewed, it is technically not in copyright; but this makes no practical difference if the "new matter" only consists of revisions of something in copyright. For example:
Fuller, Robert Warren, 1871- First principles of physics, by Robert W. Fuller, Raymond B. Brownlee and D. Lee Baker … Boston, New York [etc.] Allyn and Bacon [c1933] viii, 799, 13 p. col. front., illus., col. plates, ports., diagrs. 19½ͨͫ. © Aug. 2, 1933; 2c. and aff. Aug. 10; A 64687: R. W. Fuller, Westport, Conn., Raymond B. Brownlee, Woodmere, N. Y., and D. Lee Baker. New Rochelle, N. Y. [© new matter] (33-20800)
There is no renewal associated with this entry, A64687, but "© new matter" means the question is really about the status of any previous editions. There is an earlier 1932 edition, and a later one from 1937. None of these was renewed, so all three editions look to be public domain—all the 1933 and 1937 revisions, along with the unchanged parts of the 1932 original.
On the other hand, this entry is registering a foreword for a previously published book.
Collins, Hubert Edwin, 1872- Warpath & cattle trail, by Hubert E. Collins; with a foreword by Daniel Carter Beard. Boys’ ed. Illustrated by Paul Brown. New York, W. Morrow & company [c1933] xix p., 1 l., 296 p. incl. front., illus., plates. 22½ͨͫ. Illustrated lining-papers. © Aug. 23, 1933; 2c. and aff. Aug. 25; A 65026: William Morrow & co., inc. [© foreword] (33-22289)
There is no renewal for A65026. But, it turns out the original publication of Warpath & Cattle Trail in 1928 (A1053468/1928-09-20 with a different foreword) was renewed. The text of the 1933 foreword may be public domain but the principal work is not.
Forty-eight of our remaining candidates have some obvious indication of new matter. At least four of those are revisions or additions to works still in copyright. As with interim registrations, entries don’t always have obvious indications of new matter or previous editions, such as this one:
Rukeyser, Merryle Stanley, 1897- The common sense of money and investments, by Merryle Stanley Rukeyser … New York, Industries publishing company [c1933] xvii, 333 p. 19ͨͫ. Bibliography: p. 317-319. © Aug. 17, 1933; 2c. Sept. 8; aff. Sept. 7; A 65157: M. S. Rukeyser, New York . (33-23667)
By searching on the title, we find this was previously published in 1924 and renewed in 1951, so the 1933 edition is still in copyright.
There is one other book in this category (Industrial geography; production, manufacture, commerce by Ray Hughes Whitbeck), so we need to take these six out of our list of candidates. We now have 358 books that don’t have renewals, and don’t seem to be foreign publications or revisions of in-copyright works. There are other factors we could examine, such as whether a book contains previously published newspaper or magazine articles, but the aim here isn’t to be exhaustive. The score so far is:
Foreign publications (AF): 107
Interim registrations (AI): 91
Found to be foreign: 13
Previous edition in ©: 6
Remaining, public domain candidates: 358
In my previous blog post, I counted only entries that were not classed AF or AI to get the estimate of 75% not renewed. Following the same method for just the August 1933 registrations, we have 377 books not obviously renewed of the 556 seemingly U.S. publications, or 68%. By looking more closely, we excluded 19 more of those, leaving us 64% as possible candidates for the public domain.
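For anyone who wants to check the arithmetic, the whole funnel can be recomputed from the counts given in this post. The variable names are just for illustration.

```python
total = 754             # all registrations from August 1933
af, ai = 107, 91        # foreign (AF) and interim (AI) registrations
renewed = 179           # U.S. registrations with a matching renewal
found_foreign = 13      # 12 with earlier interim registrations, plus The Stolen March
previous_edition = 6    # revisions of works still in copyright


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


us_published = total - af - ai
not_renewed = us_published - renewed
remaining = not_renewed - found_foreign - previous_edition

print(us_published)                                   # seemingly U.S. publications
print(not_renewed, pct(not_renewed, us_published))    # not obviously renewed
print(remaining, pct(remaining, us_published))        # public domain candidates
```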
So Those Can Be Digitized, Right?
Going back to the idea that a copyright registration is only a moment in the publishing history of a work, for some of these titles, the registration does seem to represent the whole history. About 250 of the possible public domain books on the list can be easily connected with library records and 37 of those, like New York Madness by Maxwell Bodenheim, seem to be represented by only one edition in all the library collections—good evidence there is no foreign edition or other complication to discover.
When we digitize books, though, we usually mean scanning the whole of a single physical artifact, such as a particular copy of a particular edition of The Autobiography of Alice B. Toklas. This "book" does not have a copyright, only the works it contains do. Even when you have a good candidate, there is still more work to do since the physical publication can be a bundle of different rights. As we saw in the case of "new matter" registrations, illustrations, photographs, forewords, revisions, and other "works" have separate copyrights. These types of third-party creations are generally called "inserts" and few are reflected clearly or at all in any records.
This is one of the toughest problems in clearing copyright for digitization. Most of the time, it comes down to someone flipping through the book to see if there’s anything that makes them wonder, "What are the rights on that?" The parts of books in the public domain can still be digitized, but without an automated way to recognize and remove in-copyright works and inserts, every exception slows down digitization and makes it harder to do at a large scale.
What Can We Improve?
Much of what has been discussed above demonstrates the lack of transparency we face with these records. I also mentioned one area, links between registrations and previous interim registrations, where the data could be enhanced to make improvements. That’s just one example of how a small amount of data could be added—links to any related records also in the CCE would make it easier for a researcher to tell the "copyright story" of a work.
Internal evidence is one thing, but being able to map these records to external data sources will be important. Library catalog records are the obvious place to start. Many entries up through 1937 even have Library of Congress Catalog Numbers already! For most entries, though, someone will have to figure out a clever way to match records by author, title, date, etc., as accurately as possible. Other sources, such as Books in Print, are good candidates, too.
These ideas are based on what we have learned so far just from having this data available, so we have no concrete plans for anything like this yet at NYPL. The data is all open source (public domain, in fact) and we are excited to see what others can do to use, improve, or enhance it (and we invite the public to discuss it with us, perhaps by opening an issue in the repository).
We have received a grant from the Institute of Museum and Library Services to convert more of the CCE to XML. The Library’s goal, as outlined at the start of this project, is to get beyond books and unlock this entire record of American creativity.